Topics in Big Data


The goal of this class is to cover topics in Big Data. The focus will be on the principles and practices of data storage, data modeling techniques, data processing and querying, data analytics, and applications of machine learning using these systems. We will learn how these concepts apply to large-scale urban analytics.


Abhishek Dubey (first name . last name at

My research focuses on applying big data and machine learning to large-scale social cyber-physical systems such as transportation networks. For more details, visit my group project page.

Contact: We will use Brightspace and Piazza for communication. You can also create issues on GitHub to point out a specific problem with an example or assignment.

Office Hours: Office hours are held on Fridays from 1 PM to 2 PM at my office in the Institute for Software Integrated Systems.

I am also available by appointment as required. Send me an email if you need to meet with me outside the office hours.

Teaching Assistant

Nithin Guruswamy [Office Hours TBD]


  • Matthew Buruss
  • Thomas Mallick

Course Expectation

In this course we will rely heavily on Python for programming exercises and analysis. We will also use Google Colaboratory and Amazon Web Services. Knowledge of Python is required. You are also expected to know how to use GitHub and clone repositories.

Reading Material

For most lectures, I will assign reading material. You are expected to read it before the next class. The reading material that you should finish before lecture_{i+1} will be posted as a reading list in the lecture_{i} folder. See 01-introduction/reading for an example.

Topics to be covered

  • Applications of Big Data
  • History of Database and Big Data Systems
  • Big Data Infrastructure
    • Computing Clusters.
    • Understanding the database anatomy and optimizing access
    • Online transactions
    • Understanding NoSQL
    • Column storage vs Row Storage
    • Distributed System Resilience - ZooKeeper, HashiCorp Consul, systemd and Google Chubby, Paxos and Consensus
    • Storage Products: Bigtable, DynamoDB, Spanner, Memcached
  • Computation Models and Big Data Processing (Batch Data)
    • Classical Workflow Systems
    • MapReduce and HDFS
    • Spark and RDD
  • Computation Models and Big Data Processing (Streaming Data)
    • Pulsar and Kafka - Data Collection and Management (Pub/Sub systems)
    • Storm and Heron
  • Analytics
    • Clustering and Dimensionality Reduction
    • Link Analysis and PageRank
    • Large Scale Machine Learning
  • Practical Applications (Projects)
    • City Scooter Data Analysis
    • City Accident Data Analysis
    • Recommender Systems
    • Transit Energy systems

Exams, Quizzes, Assignments and projects

  • 4 take home quizzes.
  • 4 pop quizzes. They are primarily used to judge class participation. There will be no makeup quizzes unless there is a good reason, and you must inform me beforehand.
  • 5 programming assignments.
  • There will be a mid-term exam on 2/26/2020.
  • The final exam will be replaced by a final project and summary report to be presented and submitted by the end of the day of the final class.

No collaborations unless explicitly permitted.

The Vanderbilt Honor Code will govern all work done in this course. Any violation will be reported to the Honor Council. You are welcome to refer to online sources for your assignments. However, you must not copy code, and you must cite any source that inspired your solution. All work will be submitted via GitHub.


The following grading criteria are tentative and are subject to change. Each graded item in this course will be assigned a certain number of points. Your final grade will be computed as the total number of points you achieved divided by the number of points possible. The instructor reserves the right to apply a curve to the final result.

Grading Criteria

Category                   Percentage
Programming Assignments    30%
Class Pop Quizzes          10%
Take Home Quizzes          20%
Mid Term Exam              20%
Final Project              20%
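
As a rough illustration of how the category percentages above could combine into a single score, here is a minimal Python sketch. It assumes each category is first normalized to the fraction of points earned and then weighted by its percentage; the point totals are hypothetical, and any curve the instructor applies is not modeled.

```python
# Category weights copied from the grading table; they sum to 100.
WEIGHTS = {
    "Programming Assignments": 30,
    "Class Pop Quizzes": 10,
    "Take Home Quizzes": 20,
    "Mid Term Exam": 20,
    "Final Project": 20,
}


def weighted_score(earned, possible):
    """Return a 0-100 score; both arguments map category name -> points."""
    total = 0.0
    for category, weight in WEIGHTS.items():
        # Assumption: each category contributes weight * (fraction earned).
        total += weight * earned[category] / possible[category]
    return total


# Hypothetical point totals, for illustration only.
possible = {"Programming Assignments": 500, "Class Pop Quizzes": 40,
            "Take Home Quizzes": 80, "Mid Term Exam": 100, "Final Project": 100}
earned = {"Programming Assignments": 450, "Class Pop Quizzes": 30,
          "Take Home Quizzes": 72, "Mid Term Exam": 85, "Final Project": 90}

print(weighted_score(earned, possible))
```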

Course Policies

Submissions will be due by midnight on the day mentioned in the assignment and homework description. Late submissions will be penalized with an automatic 20 percent penalty per day (applied relative to the graded score for the submission).
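
To make the late policy concrete, the sketch below applies the 20 percent per-day penalty to a graded score. It assumes the penalty accumulates linearly (20 percent of the graded score for each whole day late, rather than compounding) and that the result never drops below zero; the function name and both assumptions are illustrative, not an official grading script.

```python
def late_penalized_score(graded_score, days_late, penalty_per_day=0.20):
    """Reduce a graded score by a fixed fraction of itself per day late."""
    if days_late <= 0:
        # On-time submissions keep their full graded score.
        return graded_score
    # Assumption: linear penalty, floored at zero after five days.
    fraction_kept = max(0.0, 1.0 - penalty_per_day * days_late)
    return graded_score * fraction_kept


# A submission graded at 90 points and turned in two days late.
print(late_penalized_score(90, 2))
```

Under these assumptions, a submission that is five or more days late receives no credit.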

Letter Grade Distribution

Score            Letter
>= 93.00         A
90.00 - 92.99    A-
87.00 - 89.99    B+
83.00 - 86.99    B
80.00 - 82.99    B-
77.00 - 79.99    C+
73.00 - 76.99    C
70.00 - 72.99    C-
67.00 - 69.99    D+
63.00 - 66.99    D
60.00 - 62.99    D-
<= 59.99         F
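
The table above can be read as a simple threshold lookup. The following Python sketch is one way to express it; it assumes a score that falls between two listed bands (for example 92.995) belongs to the lower band, and it does not model any curve.

```python
# (minimum score, letter) pairs copied from the table, highest first.
CUTOFFS = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def letter_grade(score):
    """Return the letter for a 0-100 score using the first cutoff it meets."""
    for minimum, letter in CUTOFFS:
        if score >= minimum:
            return letter
    # Anything below the lowest cutoff is a failing grade.
    return "F"


for sample in (95.2, 89.99, 73.0, 59.99):
    print(sample, letter_grade(sample))
```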

Disability Statement

Vanderbilt is committed to equal opportunity for students with disabilities. If you have a physical or learning disability, you should ask the Opportunity Development Center to assist you in identifying yourself to your instructors as having a disability, so that appropriate accommodation may be provided. Without notification, your instructors assume that you have no disabilities or seek no accommodation.

Emergency Evacuation Plan

In the event of a fire or other emergency, the occupants of this class should collect their coats and personal belongings and leave the building using the stairs. VANDERBILT UNIVERSITY POLICY FORBIDS REENTRY TO A BUILDING IN WHICH AN ALARM HAS OCCURRED WITHOUT AUTHORIZATION BY VANDERBILT SECURITY. If, in consequence of a disability, you anticipate the need for assistance, please discuss that need with the instructors.

If a tornado siren is heard, please go to the nearest interior hallway or interior rooms away from windows.

Setup Guide


Please see the AWS guide for setting up AWS and instances. You will receive an invitation for the class.

Google Colab

We will use Google Colab in the class. Start at and link your GitHub account. Select the option to include private repos. Start with for example analyses.

Accessing Homeworks

Repositories will be created for each student through GitHub Classroom. You will get a URL to accept the assignment. You should see yours at<GITHUB USERNAME>

Clone the repository to your home directory on the cluster using:

git clone<GITHUB USERNAME>.git

I may push updates to this homework assignment in the future. To set up an upstream repo, do the following:

git remote add upstream

To pull updates do the following:

git fetch upstream
git merge upstream/master

You will need to resolve conflicts if they occur.

