Topics in Big Data

Introduction

The goal of this class is to cover topics in Big Data. The focus will be on the principles and practices of data storage, data modeling techniques, data processing and querying, data analytics, and applications of machine learning using these systems. We will also learn how these concepts apply to large-scale urban analytics in cities.

Instructor

Abhishek Dubey (first name . last name at vanderbilt.edu)

My research focuses on applying big data and machine learning to create large-scale social cyber-physical systems such as transportation networks. For more details, visit my group project page: https://scope-lab-vu.github.io/

Contact: We will use Brightspace and Piazza for communication. You can also create issues on GitHub to point out a specific problem with an example or assignment.

Office Hours: Office hours will be held on Fridays from 1 PM to 2 PM at my office in the Institute for Software Integrated Systems. http://www.isis.vanderbilt.edu/contact

I am also available by appointment. Send me an email if you need to meet with me outside office hours.

Teaching Assistant

Nithin Guruswamy [Office Hours TBD]

Graders

Matthew Buruss
Thomas Mallick

Course Expectation

In this course we will make heavy use of Python for programming exercises and analysis. We will also use Google Colaboratory and Amazon Web Services. Knowledge of Python is required. You are also expected to know how to use GitHub and clone repositories.

Reading Material

I will assign reading material for most lectures. You are expected to read it before the next class. The reading material that you should finish before lecture_{i+1} will be posted as a reading list in the lecture_{i} folder. See 01-introduction/reading for an example.

Topics to be covered

Applications of Big Data
History of Database and Big Data Systems
Big Data Infrastructure
- Computing Clusters
- Understanding the database anatomy and optimizing access
- Online transactions
- Understanding NoSQL
- Column storage vs Row Storage
- Distributed System Resilience - ZooKeeper, HashiCorp Consul, systemd and Google Chubby, Paxos and Consensus
- Storage Products: Bigtable, DynamoDB, Spanner, Memcached
Computation Models and Big Data Processing (Batch Data)
- Classical Workflow Systems
- MapReduce and HDFS
- Spark and RDD
Computation Models and Big Data Processing (Streaming Data)
- Pulsar and Kafka - Data Collection and Management (Pub/Sub systems)
- Storm and Heron
Analytics
- Clustering and Dimensionality Reduction
- Link Analysis and PageRank
- Large Scale Machine Learning
Practical Applications (Projects)
- City Scooter Data Analysis
- City Accident Data Analysis
- Recommender Systems
- Transit Energy Systems

Exams, Quizzes, Assignments and projects

4 take home quizzes.
4 pop quizzes. They are primarily used to judge class participation. There will be no makeup unless there is a good reason, and you must inform me beforehand.
5 programming assignments.
There will be a mid-term exam on 2/26/2020.
The final exam will be replaced by a final project and summary report to be presented and submitted by the end of the day of the final class.

No collaborations unless explicitly permitted.

The Vanderbilt Honor Code will govern all work done in this course. Any violation will be reported to the Honor Council. You are welcome to refer to online sources for your assignments. However, you must not copy code, and you must cite any source of inspiration. All work will be submitted via GitHub.

Evaluation

The following grading criteria are tentative and are subject to change. Each graded item in this course will be assigned a certain number of points. Your final grade will be computed as the total number of points you achieved divided by the number of points possible. The instructor reserves the right to apply a curve to the final result.

Grading Criteria

Category	Percentage
Programming Assignments	30%
Class Pop Quizzes	10%
Take Home Quizzes	20%
Mid Term Exam	20%
Final Project	20%
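
As a rough illustration of how the category percentages above could combine into a final score, here is a minimal Python sketch. It assumes that each category is first reduced to a percentage (points earned divided by points possible in that category) and then weighted by the table; the names `WEIGHTS` and `weighted_score` are hypothetical, and the actual computation may differ because the criteria are tentative and a curve may be applied.

```python
# Category weights copied from the Grading Criteria table (percent of final grade).
WEIGHTS = {
    "Programming Assignments": 30,
    "Class Pop Quizzes": 10,
    "Take Home Quizzes": 20,
    "Mid Term Exam": 20,
    "Final Project": 20,
}


def weighted_score(category_percentages):
    """Combine per-category percentages (0-100) into a final score (0-100).

    Assumption: every category is scored as a percentage before weighting.
    """
    missing = set(WEIGHTS) - set(category_percentages)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    total = sum(WEIGHTS[name] * category_percentages[name] for name in WEIGHTS)
    return total / 100


# Hypothetical student, for illustration only.
example = {
    "Programming Assignments": 90,
    "Class Pop Quizzes": 100,
    "Take Home Quizzes": 80,
    "Mid Term Exam": 70,
    "Final Project": 85,
}
print(weighted_score(example))
```

For the hypothetical student above, the weighted sum is (30*90 + 10*100 + 20*80 + 20*70 + 20*85) / 100.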

Course Policies

Submissions are due by midnight on the day stated in the assignment or homework description. Late submissions incur an automatic penalty of 20 percent per day (applied relative to the graded score for the submission).
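
One possible reading of the late policy can be sketched in Python as follows. This assumes the penalty is linear (each late day removes 20 percent of the graded score, not 20 percent of what remains) and that the score never drops below zero; both are assumptions, and the function name `late_adjusted_score` is hypothetical.

```python
def late_adjusted_score(graded_score, days_late, penalty_per_day=0.20):
    """Apply a per-day late penalty relative to the graded score.

    Assumptions: the penalty accumulates linearly and is floored at zero,
    so a submission five or more days late would receive no credit.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    remaining_fraction = max(0.0, 1.0 - penalty_per_day * days_late)
    return graded_score * remaining_fraction


# A submission graded at 90 points and turned in one day late.
print(late_adjusted_score(90, 1))
```

Under these assumptions, a submission graded at 90 points keeps 80 percent of that score after one late day and 60 percent after two.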

Letter Grade Distribution

Score	Letter
>= 93.00	A
90.00 - 92.99	A-
87.00 - 89.99	B+
83.00 - 86.99	B
80.00 - 82.99	B-
77.00 - 79.99	C+
73.00 - 76.99	C
70.00 - 72.99	C-
67.00 - 69.99	D+
63.00 - 66.99	D
60.00 - 62.99	D-
<= 59.99	F
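
The table above can be read as a list of lower bounds, which the following Python sketch turns into a lookup. The names `CUTOFFS` and `letter_grade` are hypothetical. The table does not say how scores between listed values (for example 92.995) are rounded, so the sketch assumes no rounding and assigns such scores to the lower band.

```python
# Lower bounds copied from the Letter Grade Distribution table, highest first.
CUTOFFS = [
    (93.0, "A"),
    (90.0, "A-"),
    (87.0, "B+"),
    (83.0, "B"),
    (80.0, "B-"),
    (77.0, "C+"),
    (73.0, "C"),
    (70.0, "C-"),
    (67.0, "D+"),
    (63.0, "D"),
    (60.0, "D-"),
]


def letter_grade(score):
    """Return the letter for a final score on a 0-100 scale.

    Assumption: no rounding is applied before the lookup.
    """
    for lower_bound, letter in CUTOFFS:
        if score >= lower_bound:
            return letter
    return "F"


print(letter_grade(84.0))
```

A final curve, if the instructor applies one, would change the score passed to the lookup and not the table itself.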

Disability Statement

Vanderbilt is committed to equal opportunity for students with disabilities. If you have a physical or learning disability, you should ask the Opportunity Development Center to assist you in identifying yourself to your instructors as having a disability, so that appropriate accommodation may be provided. Without notification, your instructors assume that you have no disabilities or seek no accommodation.

Emergency Evacuation Plan

In the event of a fire or other emergency, the occupants of this class should collect their coats and personal belongings and leave the building using the stairs. VANDERBILT UNIVERSITY POLICY FORBIDS REENTRY TO A BUILDING IN WHICH AN ALARM HAS OCCURRED WITHOUT AUTHORIZATION BY VANDERBILT SECURITY. If, in consequence of a disability, you anticipate the need for assistance, please discuss that need with the instructors.

If a tornado siren is heard, please go to the nearest interior hallway or interior rooms away from windows.

Setup Guide

AWS

Please see the AWS guide for setting up AWS and instances. You will receive an invitation for the class.

Google Colab

We will use Google Colab in the class. Start at https://colab.research.google.com/github/ and link your GitHub account. Select "Include private repos". Start with https://github.com/vu-bigdata-2020/example-notebooks for example analyses.

Accessing Homeworks

Repositories will be created for each student through GitHub Classroom. You will get a URL to accept the assignment. You should see yours at:

https://github.com/vu-bigdata-2020/homework-2-<GITHUB USERNAME>

Clone the repository to your home directory on the cluster using:

git clone https://github.com/vu-bigdata-2020/homework-2-<GITHUB USERNAME>.git

I may push updates to this homework assignment in the future. To set up an upstream remote, do the following:

git remote add upstream https://github.com/vu-bigdata-2020/homework-2.git

To pull updates, do the following:

git fetch upstream
git merge upstream/master

You will need to resolve conflicts if they occur.
