The goal of this class is to cover topics in Big Data. The focus will be on the principles and practices of data storage, data modeling techniques, data processing and querying, data analytics, and applications of machine learning using these systems. We will learn how these concepts apply to large-scale urban analytics in cities.
Abhishek Dubey (first name . last name at vanderbilt.edu)
My research focuses on applying big data and machine learning to create large-scale social cyber-physical systems such as transportation networks. For more details, visit my group project page: https://scope-lab-vu.github.io/
Contact: We will use Brightspace and Piazza for communication. You can also create issues on GitHub to point out a specific problem with an example or assignment.
Office Hours: Office hours will be held on Fridays from 1 PM to 2 PM at my office in the Institute for Software Integrated Systems. http://www.isis.vanderbilt.edu/contact
I am also available by appointment as required. Send me an email if you need to meet with me outside the office hours.
Nithin Guruswamy [Office Hours TBD]
- Matthew Buruss
- Thomas Mallick
In this course we will rely heavily on Python for programming exercises and analysis. We will also use Google Colaboratory and Amazon Web Services. Knowledge of Python is required. You are also expected to know how to use GitHub and clone repositories.
For most lectures, I will assign reading material. You are expected to read it before the next class. The reading material that you should finish before lecture_{i+1} will be provided as a reading list in the lecture_{i} folder. See 01-introduction/reading for an example.
- Applications of Big Data
- History of Database and Big Data Systems
- Big Data Infrastructure
- Computing Clusters
- Understanding the database anatomy and optimizing access
- Online transactions
- Understanding NoSQL
- Column storage vs Row Storage
- Distributed System Resilience - ZooKeeper, HashiCorp Consul, systemd, Google Chubby, Paxos and Consensus
- Storage Products: Bigtable, DynamoDB, Spanner, Memcached
- Computation Models and Big Data Processing (Batch Data)
- Classical Workflow Systems
- MapReduce and HDFS
- Spark and RDD
- Computation Models and Big Data Processing (Streaming Data)
- Pulsar and Kafka - Data Collection and Management (Pub/Sub systems)
- Storm and Heron
- Analytics
- Clustering and Dimensionality Reduction
- Link Analysis and PageRank
- Large Scale Machine Learning
- Practical Applications (Projects)
- City Scooter Data Analysis
- City Accident Data Analysis
- Recommender Systems
- Transit Energy systems
- 4 take-home quizzes.
- 4 pop quizzes. They are primarily used to judge class participation. There will be no makeup quizzes unless there is a good reason, and it is important to inform me beforehand.
- 5 programming assignments.
- There will be a mid-term exam on 2/26/2020.
- The final exam will be replaced by a final project and summary report to be presented and submitted by the end of the day of the final class.
The Vanderbilt Honor Code will govern all work done in this course. Any violation will result in the case being reported to the Honor Council. You are welcome to refer to online sources for your assignments. However, you must not copy code, and you must cite any source of inspiration. All work will be submitted via GitHub.
The following grading criteria are tentative and are subject to change. Each graded item in this course will be assigned a certain number of points. Your final grade will be computed as the total number of points you achieved divided by the number of points possible. The instructor reserves the right to apply a curve to the final result.
Category | Percentage |
---|---|
Programming Assignments | 30% |
Class Pop Quizzes | 10% |
Take Home Quizzes | 20% |
Mid Term Exam | 20% |
Final Project | 20% |
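As an illustration of how the category weights above might combine into a final score, here is a minimal Python sketch. It assumes each category is first reduced to the fraction of points earned over points possible and then weighted by the percentages in the table. The dictionary keys, the function name `final_score`, and the sample numbers are hypothetical, and the instructor's actual computation (including any curve) may differ.

```python
# Category weights copied from the table above (they sum to 100).
WEIGHTS = {
    "programming_assignments": 30,
    "pop_quizzes": 10,
    "take_home_quizzes": 20,
    "midterm": 20,
    "final_project": 20,
}


def final_score(earned, possible):
    """Return a 0-100 score as a weighted sum of per-category fractions.

    `earned` and `possible` map each category name to a point total.
    Assumption: every category has at least one possible point.
    """
    total = 0.0
    for category, weight in WEIGHTS.items():
        total += weight * earned[category] / possible[category]
    return total


# Hypothetical point totals, for illustration only.
earned = {
    "programming_assignments": 450,
    "pop_quizzes": 30,
    "take_home_quizzes": 70,
    "midterm": 85,
    "final_project": 90,
}
possible = {
    "programming_assignments": 500,
    "pop_quizzes": 40,
    "take_home_quizzes": 80,
    "midterm": 100,
    "final_project": 100,
}
print(round(final_score(earned, possible), 2))
```

With these made-up numbers the weighted contributions are 27, 7.5, 17.5, 17, and 18 points, for a score of 87 before any curve.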
Submissions will be due by midnight on the day mentioned in the assignment and homework description. Late submissions will be penalized with an automatic 20 percent penalty per day (applied relative to the graded score for the submission).
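The late policy can be read as simple arithmetic on the graded score. The sketch below is one possible interpretation, not the official rule: it assumes that any started day counts as a full day, that the 20 percent penalty accumulates linearly (so a submission five or more days late earns nothing), and that the score never drops below zero. The function name `penalized_score` and the `hours_late` parameter are hypothetical.

```python
import math

PENALTY_PER_DAY = 0.20  # 20 percent per day, from the late policy


def penalized_score(graded_score, hours_late):
    """Apply the late penalty to an already graded score.

    Assumptions (not stated in the policy): any started day counts as a
    full day, the penalty accumulates linearly, and the result is
    floored at zero.
    """
    days_late = max(0, math.ceil(hours_late / 24))
    factor = max(0.0, 1.0 - PENALTY_PER_DAY * days_late)
    return graded_score * factor


for hours in (0, 5, 30, 200):
    print(hours, round(penalized_score(90, hours), 2))
```

Under these assumptions, a submission graded at 90 keeps 90 points if on time, about 72 points if five hours late, and about 54 points if thirty hours late.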
Score | Letter |
---|---|
>= 93.00 | A |
90.00 - 92.99 | A- |
87.00 - 89.99 | B+ |
83.00 - 86.99 | B |
80.00 - 82.99 | B- |
77.00 - 79.99 | C+ |
73.00 - 76.99 | C |
70.00 - 72.99 | C- |
67.00 - 69.99 | D+ |
63.00 - 66.99 | D |
60.00 - 62.99 | D- |
<= 59.99 | F |
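The table above can be expressed as a lookup on the lower bound of each band. This is a minimal sketch: the function name `letter_grade` is hypothetical, and it assumes scores are compared without rounding, so a value such as 92.995 (which falls between the printed bands) maps to A-. It does not model any curve the instructor may apply.

```python
# Lower bound of each band, copied from the table above, highest first.
BANDS = [
    (93.0, "A"),
    (90.0, "A-"),
    (87.0, "B+"),
    (83.0, "B"),
    (80.0, "B-"),
    (77.0, "C+"),
    (73.0, "C"),
    (70.0, "C-"),
    (67.0, "D+"),
    (63.0, "D"),
    (60.0, "D-"),
]


def letter_grade(score):
    """Map a 0-100 score to a letter using the lower bound of each band."""
    for lower_bound, letter in BANDS:
        if score >= lower_bound:
            return letter
    return "F"  # anything below 60.00


for score in (95, 92.99, 87, 59.99):
    print(score, letter_grade(score))
```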
Vanderbilt is committed to equal opportunity for students with disabilities. If you have a physical or learning disability, you should ask the Opportunity Development Center to assist you in identifying yourself to your instructors as having a disability, so that appropriate accommodation may be provided. Without notification, your instructors assume that you have no disabilities or seek no accommodation.
In the event of a fire or other emergency, the occupants of this class should collect their coats and personal belongings and leave the building using the stairs. VANDERBILT UNIVERSITY POLICY FORBIDS REENTRY TO A BUILDING IN WHICH AN ALARM HAS OCCURRED WITHOUT AUTHORIZATION BY VANDERBILT SECURITY. If, in consequence of a disability, you anticipate the need for assistance, please discuss that need with the instructors.
If a tornado siren is heard, please go to the nearest interior hallway or interior rooms away from windows.
Please see the AWS guide for setting up AWS and instances. You will receive an invitation for the class.
We will use Google Colab in the class. Start at https://colab.research.google.com/github/ and link your GitHub account. Select the option to include private repos. Start with https://github.com/vu-bigdata-2020/example-notebooks for example analyses.
Repositories will be created for each student through GitHub Classroom. You will get a URL to accept the assignment. You should see yours at:
https://github.com/vu-bigdata-2020/homework-2-<GITHUB USERNAME>
Clone the repository to your home directory on the cluster using:

    git clone https://github.com/vu-bigdata-2020/homework-2-<GITHUB USERNAME>.git

I may push updates to this homework assignment in the future. To set up an upstream repo, do the following:

    git remote add upstream https://github.com/vu-bigdata-2020/homework-2.git

To pull updates, do the following:

    git fetch upstream
    git merge upstream/master

You will need to resolve conflicts if they occur.