General Assembly Data Science Class

6/14/2016 to 8/18/2016

Instructor: Hamed Hasheminia

| Tuesdays | Thursdays |
| --- | --- |
| 6/14: Data Science - Introduction Part I | 6/16: Data Science - Introduction Part II |
| 6/21: Linear Regression Lines Part I | 6/23: Linear Regression Lines Part II |
| 6/28: Model Selection | 6/30: Missing Data and Imputation |
| 7/5: K-Nearest Neighbors | 7/7: Logistic Regression Part I |
| 7/12: Logistic Regression Part II | 7/14: In Class Project |
| 7/19: Tree-Based Models Part I | 7/21: Tree-Based Models Part II |
| 7/26: Natural Language Processing | 7/28: Time Series Models |
| 8/2: Principal Component Analysis | 8/4: Data Visualization |
| 8/9: Naive Bayes | 8/11: Course Review |
| 8/16: Final Project Presentations I | 8/18: Final Project Presentations II |

Lecture 1 Summary (Data Science - Introduction Part I)

  • Data Science - meaning
  • Continuous, Discrete and Qualitative Data
  • Supervised vs Unsupervised Learning
  • Classification vs Regression
  • Time series vs cross-sectional data
  • Numpy
  • Pandas

Resources

Set up GitHub - Self-study guide

Pre-work for second lecture

Additional Resources

Lecture 2 Summary (Data Science - Introduction Part II)

  • Measures of central tendency (Mean, Median, Mode, Quartiles, Percentiles)
  • Measures of Variability (IQR, Standard Deviation, Variance)
  • Skewness Coefficient
  • Boxplots, Histograms, Scatterplots
  • Central Limit Theorem
  • Class/Dummy Variables
  • Walkthrough describing and visualizing data in Pandas
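
As a small illustration of the summary statistics listed above, here is a minimal sketch using only Python's standard `statistics` module (the lecture walkthrough itself uses Pandas). The sample values are made up, and the quartiles use the module's default "exclusive" method, so other tools may report slightly different quartiles on the same data.

```python
import statistics

# Hypothetical sample; the values are made up for illustration.
data = [2, 4, 4, 5, 7, 9, 11]

# Measures of central tendency.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Quartiles (default "exclusive" method); the IQR is Q3 - Q1.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Measures of variability (sample versions, dividing by n - 1).
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

print(mean, median, mode, (q1, q2, q3), iqr, variance, std_dev)
```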

Resources

HW 1 is Assigned

  • Please read and follow the instructions in the readme
  • This homework is due on June 23rd, 2016 at 6:30PM

Additional Resources

  • Here you can find valuable resources for matplotlib
  • A good video on the Central Limit Theorem

Lecture 3 Summary (Linear Regression Lines - Part I)

  • Linear Regression lines
  • Single Variable and Multi-Variable Regression Lines
  • Capturing non-linearity using Linear Regression lines
  • Interpreting regression coefficients
  • Dealing with dummy variables in regression lines
  • Intro to the sklearn and seaborn libraries

Resources

Additional Resources

Lecture 4 Summary (Linear Regression Lines - Part II)

  • Hypothesis test - test of significance on regression coefficients
  • p-values
  • Capturing non-linearity using Linear Regression lines
  • R-squared
  • Interaction Effects
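
To make R-squared concrete, the following is a minimal sketch of a single-variable least-squares fit and its R-squared, written in plain Python rather than with sklearn. The helper names `fit_line` and `r_squared` and the data points are hypothetical, chosen only for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x with one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


def r_squared(x, y, b0, b1):
    """Share of the variance in y explained by the fitted line."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
r2 = r_squared(x, y, b0, b1)
print(b0, b1, r2)
```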

Resources

Additional Resources

HW 2 is Assigned

  • Please read and follow the instructions in the readme
  • Here you can find the IPython notebook for your 2nd assignment.
  • This homework is due on June 30th, 2016 at 6:30PM

Lecture 5 Summary (Model Selection)

  • Bias-Variance Tradeoff
  • Validation (Test vs Train set)
  • Cross-Validation
  • Ridge and Lasso Regression
  • (Optional) Backward Selection, Forward Selection, All Subset Selection. (If you want to use these methods you need to use R)
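
The idea behind k-fold cross-validation can be sketched as splitting the row indices into k folds and holding out one fold at a time. This is a minimal standard-library sketch, not the sklearn implementation; `k_fold_indices` is a hypothetical helper, and it uses contiguous folds without shuffling, which in practice is usually something you would add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each observation lands in the test set exactly once, so averaging the k test errors gives an estimate of out-of-sample error.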

Resources

Additional Resources

  • Preprocessing Library
  • Cross-Validation Library
  • This is an excellent book. You can find the theory of Cross-Validation in Chapter 5 and learn about Lasso and Ridge regression in Chapter 6.
  • Here you can find my video on Cross-Validation
  • Here you can find my video on Ridge and Lasso Regression
  • Here you can find my video on Best subset selection.

Lecture 6 Summary (Missing Data and Imputation)

  • Types of missing data (MCAR, MAR, NMAR)
  • Single imputation methods and their limitations
  • Imputation using regression lines and error
  • Hot deck imputation
  • Multiple imputation
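
One limitation of single imputation can be seen in a small sketch: filling every gap with the observed mean shrinks the variance of the column. The example below assumes mean imputation as the single-imputation method and uses made-up values with `None` marking missing entries.

```python
import statistics

# Hypothetical column with missing entries marked as None.
values = [2, None, 4, 6, None]

observed = [v for v in values if v is not None]
fill = statistics.mean(observed)

# Single (mean) imputation: every gap gets the same value.
imputed = [fill if v is None else v for v in values]

# Sample variance before and after imputation.
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(imputed, var_observed, var_imputed)
```

The imputed column looks less variable than the observed data, which is one motivation for adding an error term to imputed values or using multiple imputation.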

Resources

Additional Resources

  • Great video by Dr. Elizabeth A. Stuart from Johns Hopkins University

Announcements

  • HW 3 is assigned (Due at 6:30PM - July 7th)
  • Please read this before starting your assignment.
  • HW3 starter code can be found here

Lecture 7 Summary (K-Nearest Neighbors)

  • Classification Problems
  • Misclassification Error
  • KNN algorithm for Classification
  • Cross-Validation for KNN Algorithm
  • Limitations of KNN Algorithm
  • KNN algorithm for Regression
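
The KNN classification rule can be sketched in a few lines of plain Python: find the k closest training points and take a majority vote. This is a minimal sketch, not the sklearn implementation; `knn_predict` is a hypothetical helper, the data is made up, and it assumes Euclidean distance on features that are already on comparable scales.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Sort training points by Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_X, train_y)
    )
    top_k_labels = [label for _, label in distances[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]


# Hypothetical two-class training set.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (2, 2)))
print(knn_predict(train_X, train_y, (6, 5)))
```

For KNN regression, the vote would be replaced by the average of the neighbors' target values.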

Resources

Announcements

Lecture 8 Summary (Logistic Regression Part I)

  • Logistic Regression - Intro
  • Odds vs Probability
  • Using Logistic Regression to make predictions
  • Interpreting the coefficients of a Logistic Regression model
  • Strengths and weaknesses of the Logistic Regression model
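
The relationship between odds, probability, and logistic regression coefficients can be sketched as follows. The coefficients `b0` and `b1` are hypothetical values picked for illustration, not fitted to any data.

```python
import math


def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)


def sigmoid(z):
    """Logistic function: maps a linear score to a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical fitted model: log-odds = b0 + b1 * x.
b0, b1 = -3.0, 1.5

p_at_2 = sigmoid(b0 + b1 * 2.0)   # predicted probability at x = 2
p_at_3 = sigmoid(b0 + b1 * 3.0)   # predicted probability at x = 3

# A one-unit increase in x multiplies the odds by exp(b1).
observed_ratio = odds(p_at_3) / odds(p_at_2)
print(p_at_2, p_at_3, observed_ratio, math.exp(b1))
```

This is why coefficients are usually interpreted on the odds scale: the change in probability depends on where you start, but the multiplicative change in odds does not.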

Resources

Additional Resources

  • Logistic Regression video

HW 3 Solutions Posted

Lecture 9 Summary (Logistic Regression Part II)

  • Unbalanced observations and Logistic Regression
  • FP/FN/TP/TN/FPR/TPR
  • The effect of changing the Threshold
  • ROC Curves
  • Area Under Curve
  • How to compare classification algorithms
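
The effect of the threshold on TPR and FPR can be sketched with a small example. The labels and scores are made up, and `confusion_counts` and `rates` are hypothetical helpers; each threshold gives one point on the ROC curve.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def rates(y_true, scores, threshold):
    """Return (TPR, FPR) at the given threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical true labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]

for threshold in (0.5, 0.25):
    print(threshold, rates(y_true, scores, threshold))
```

Lowering the threshold here catches every positive but also raises the false positive rate, which is the tradeoff an ROC curve traces out.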

Resources

Lecture 10 (In-Class Projects)

Lecture 11 Summary (Tree-Based Models - part I)

  • Decision Tree for Regression
  • Greedy Approach
  • Decision Tree for Classification
  • Gini Index and Entropy index
  • Limitation of Simple Decision Trees
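
The two impurity measures used for classification trees can be sketched directly from their definitions. The helper names and labels below are hypothetical; entropy here uses log base 2, which is one common convention.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of observations in each class within a node."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]


def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    return 1 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: -sum of p * log2(p) over classes."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


mixed_node = ["a", "a", "b", "b"]   # evenly split node
pure_node = ["a", "a", "a", "a"]    # all one class

print(gini(mixed_node), entropy(mixed_node))
print(gini(pure_node), entropy(pure_node))
```

Both measures are zero for a pure node and largest for an even split, so a greedy tree picks the split that reduces them the most.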

Resources

Additional Resources

Lecture 12 Summary (Tree-Based Models - part II)

  • Bagging
  • Random Forest
  • Boosting
  • Tuning parameters for boosting and Random Forest

Resources

Additional Resources

Announcement

  • HW 4 is assigned and is due on July 28th 2016 at 6:30PM.
  • Please read the ReadMe file before working on your project.

Lecture 13 Summary (Natural Language Processing)

  • Definition of Natural Language Processing
  • NLP applications
  • Basic NLP practice
  • Stop words, bag-of-words, TF-IDF
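
Stop-word removal, bag-of-words counts, and TF-IDF can be sketched with the standard library alone. The stop-word list, documents, and helper names are hypothetical, and the weighting assumes the plain textbook form (term frequency times log(N / document frequency)); libraries such as sklearn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Tiny hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


docs = ["The cat sat on the mat", "The dog sat", "A cat is a cat"]

# Bag-of-words: one word-count mapping per document.
bags = [Counter(tokenize(d)) for d in docs]


def tf_idf(term, bag, all_bags):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    tf = bag[term] / sum(bag.values())
    df = sum(1 for b in all_bags if term in b)
    return tf * math.log(len(all_bags) / df) if df else 0.0


print(bags[0])
print(tf_idf("cat", bags[0], bags), tf_idf("mat", bags[0], bags))
```

A word that appears in fewer documents ("mat") gets a higher weight than one that appears in several ("cat"), even when both occur once in the same document.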

Resources

Additional Resources

Pre-Work

Lecture 14 Summary (Principal Component Analysis)

  • Principal Component Analysis
  • Computation of PCAs
  • Geometry of PCAs
  • Proportion of Variance Explained

Resources

Additional Resources

Lecture 15 Summary (Time Series Models)

  • AutoRegressive Models
  • Moving Averages
  • ARMA
  • ARIMA
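
Two building blocks of these models can be sketched briefly: first differencing (the "I" in ARIMA) and the AR(1) forecasting recursion. The series and the parameters `c` and `phi` below are hypothetical; in practice they would be estimated from data, and this sketch omits the moving-average (error) terms entirely.

```python
def difference(series):
    """First difference: change between consecutive observations."""
    return [b - a for a, b in zip(series, series[1:])]


def ar1_forecast(last_value, c, phi, steps):
    """Iterate the AR(1) recursion x_t = c + phi * x_{t-1} with no noise term."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = c + phi * value
        forecasts.append(value)
    return forecasts


# Hypothetical trending series; differencing removes the trend in level.
series = [10, 12, 15, 19]
diffs = difference(series)

# Forecast the differenced series three steps ahead with assumed parameters.
forecast = ar1_forecast(diffs[-1], c=1.0, phi=0.5, steps=3)
print(diffs, forecast)
```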

Resources

Additional Resources

Lecture 16 (Visualization - Thanks to Karla and Josh)

Lecture 17 Summary (Naive Bayes)

Summary

  • Naive Bayes Algorithm introduced
  • Gaussian NB
  • Bernoulli NB
  • Multinomial NB
  • Advantages and Disadvantages of using NB

Resources

Additional Resources

Lecture 18 Summary (Wrap Up)

Summary

  • Material covered in previous lectures was reviewed
  • We discussed when to use which model
  • Roadmap for future training and self-study

Resources