GitHub - seeess1/bigData: Exercises in processing big data

Big Data

Screenshot from a PySpark script analyzing SAT Math performance in NYC based on access to different subway lines. The snapshot shows the RDD transformations used to find the top 5 subway lines in NYC with the highest mean SAT Math scores.
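For readers who can't see the screenshot, here is a rough sketch of the kind of pipeline the caption describes. It is not the script from this repo: it uses plain Python stand-ins for the RDD operations (flatMap, reduceByKey, a descending take) so it runs without a Spark installation, and the school records, the helper names, and the function top_lines_by_mean_score are all made up for illustration.

```python
# Hypothetical toy records: (school, mean SAT Math score, nearby subway lines).
# These values are invented for illustration and are not from the repo's data.
schools = [
    ("School A", 520, ["A", "C"]),
    ("School B", 610, ["A", "1"]),
    ("School C", 480, ["C"]),
    ("School D", 700, ["1", "2"]),
]


def flat_map(func, records):
    """Plain-Python stand-in for RDD.flatMap: one input can yield many outputs."""
    return [out for rec in records for out in func(rec)]


def reduce_by_key(func, pairs):
    """Plain-Python stand-in for RDD.reduceByKey: merge values sharing a key."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc


def top_lines_by_mean_score(records, n=5):
    # Emit one (line, (score, 1)) pair per subway line a school has access to.
    pairs = flat_map(lambda r: [(line, (r[1], 1)) for line in r[2]], records)
    # Sum the scores and the school counts for each line.
    totals = reduce_by_key(lambda a, b: (a[0] + b[0], a[1] + b[1]), pairs)
    # Turn (total, count) into a mean per line.
    means = {line: total / count for line, (total, count) in totals.items()}
    # Keep the n lines with the highest mean score.
    return sorted(means.items(), key=lambda kv: -kv[1])[:n]


top = top_lines_by_mean_score(schools)
for line, mean_score in top:
    print(line, mean_score)
```

Carrying a (sum, count) pair through the reduce step, rather than averaging early, keeps the merge function associative, which is what a distributed reduceByKey would need.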

Overview

This repo contains exercises and projects I completed for Big Data Management at NYU in spring 2019. We used MapReduce and PySpark to stream massive datasets and find patterns in the infamous Enron email dataset, NYC taxi data, and Twitter data.

Also check out my repos nyuProjects and machineLearning for examples of other work with scikit-learn, pandas, and SQL.

Repository contents (12 commits)

data
README.md
higher_order_functions.ipynb
spark.ipynb
spark_geo.py
spark_twitter.py
streaming.py
