Big Data

Computer Science & Information Systems

Big Data is a buzzword that refers to storing and processing large amounts of data. This course covers the following Big Data topics:

  • Big Data Definition
  • Big Data Characteristics and Challenges
  • Hadoop
    • Hadoop Distributed File System (HDFS)
    • MapReduce Programming
  • Apache Spark
    • Resilient Distributed Datasets (RDDs)
    • Pair Resilient Distributed Datasets (PairRDDs)
    • Spark SQL
    • Pandas on Spark

The main datasets used in this course are listed below with their respective links.

Dataset   Link
Airports.csv   Link
Bible.txt   Link
Forest Fire   Link
JY157487.1   Link
RealEstate   Link
Transactions (sample)   Link
UK Makerspace   Link
UK Postcode   Link
Give me Loan   Link

Setup instructions for your computer are also available here.


The tools we use during this course.