Objectives

Students will walk away with introductory & practical knowledge of the Python Data Science stack:

pandas
matplotlib
scikit-learn

Setup

Go to tmpnb.org.
Select New > Python 2 to create a Python notebook.
Follow along with me:

>>> import pandas
>>> import sklearn
>>> import matplotlib

Agenda

Python Warm Up
Intro to Pandas
Intro to Data Science with scikit-learn
Visualization with Matplotlib
More Practice

Python Warm-Up / Review (20)

This problem will give us a review of lists, for loops, and lambda functions.

Given the following list,

names = ["Michael Fassbender", "Karlie Kloss", "Taylor Swift", "Justin Bieber"]

print out the names that contain the letter "l"
turn all of the names lowercase
sort the list of names alphabetically using the built-in sorted function (HINT: Use Google)
sort the list of names by length using the built-in sorted function
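
One possible solution to the four warm-up tasks is sketched below. The variable names (with_l, lower_names, alphabetical, by_length) are just illustrative choices, and there are other reasonable ways to write each step.

```python
names = ["Michael Fassbender", "Karlie Kloss", "Taylor Swift", "Justin Bieber"]

# 1. Print the names that contain a lowercase "l", collecting them as we go.
with_l = []
for name in names:
    if "l" in name:
        print(name)
        with_l.append(name)

# 2. Lowercase every name with a list comprehension.
lower_names = [name.lower() for name in names]

# 3. Sort alphabetically. sorted returns a new list and leaves names untouched.
alphabetical = sorted(names)

# 4. Sort by length, using a lambda as the sort key.
by_length = sorted(names, key=lambda name: len(name))

print(lower_names)
print(alphabetical)
print(by_length)
```

Since len is already a function, key=len would do the same job as the lambda. The lambda is spelled out here only because the exercise is a lambda review.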

Pandas (65)

Question: What is pandas?

Here is our first data set. Let's download it and upload it to the datasets folder within the notebook.

Preliminary Exercise

With your partner,

Download and open the dataset
What is this dataset about?
What are some questions you might ask about the data?

Basic Manipulation (75)

Let's read in the data.
How do we see what columns are available?
How do we look at just the head or tail of the dataset?
How do we look at only a few rows?
How do we only look at certain columns?
How do we pull out a column and look at it as a series?
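
Here is a rough sketch of what answers to these questions could look like. The rows are made up, and the column names (Account, Rep, Product, Quantity, Price, Status) are assumptions based on the questions in this outline, not the real file. In the notebook you would pass the uploaded file's path to pd.read_csv instead of the inline text. The sketch is written for Python 3.

```python
import io

import pandas as pd

# Made-up rows standing in for the workshop dataset (hypothetical values).
csv_text = """Account,Rep,Product,Quantity,Price,Status
1001,Ana,CPU,1,30000,presented
1002,Ana,Software,1,10000,presented
1003,Ben,Maintenance,2,5000,pending
1004,Ben,CPU,1,35000,declined
1005,Cal,CPU,2,65000,won
1006,Cal,Monitor,2,5000,pending
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())     # which columns are available
print(df.head(2))              # first two rows
print(df.tail(2))              # last two rows

few_rows = df.iloc[1:4]        # a few rows by position: rows 1, 2 and 3
print(few_rows)

subset = df[["Rep", "Price"]]  # only certain columns (still a DataFrame)
print(subset)

price = df["Price"]            # a single column comes back as a Series
print(type(price))
```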

Filtering DataFrames (80)

How do we look at only those rows that have Status = won?
Exercise: How many accounts have a price greater than $12,000?
How do we get the maximum value of a certain column?
Exercise: What is the minimum account price? The mean? The sum? The standard deviation?
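
The filtering questions above could be approached along these lines. This is a sketch on a small made-up DataFrame; the column names Price and Status are assumed from the questions, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the real dataset will have different values.
df = pd.DataFrame({
    "Account": [1001, 1002, 1003, 1004, 1005, 1006],
    "Price": [30000, 10000, 5000, 35000, 65000, 5000],
    "Status": ["presented", "presented", "pending", "declined", "won", "pending"],
})

# A boolean mask keeps only the rows where the condition is True.
won = df[df["Status"] == "won"]
print(won)

# Summing a boolean Series counts the True values.
n_expensive = (df["Price"] > 12000).sum()
print(n_expensive)

# Column-level summary statistics.
print(df["Price"].max())
print(df["Price"].min())
print(df["Price"].mean())
print(df["Price"].sum())
print(df["Price"].std())   # sample standard deviation (ddof=1) by default
```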

LUNCH

Aggregating data (120)

What is the total dollar amount pending?

How do we add columns?
Let's add a column called Amount that is equal to Quantity * Price
Exercise: Let's select just those rows where status is pending and sum up those amounts.
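
A minimal sketch of adding the Amount column and totalling the pending rows, again on made-up data with assumed column names:

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    "Quantity": [1, 1, 2, 1, 2, 2],
    "Price": [30000, 10000, 5000, 35000, 65000, 5000],
    "Status": ["presented", "presented", "pending", "declined", "won", "pending"],
})

# Assigning to a new column name creates the column; the multiplication
# is applied row by row.
df["Amount"] = df["Quantity"] * df["Price"]

# Filter to pending rows, then sum their amounts.
pending_total = df[df["Status"] == "pending"]["Amount"].sum()
print(pending_total)
```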

Pivot tables (130)

Question: What are pivot tables? Why are they useful?

Let's take a look at the documentation here.

Let's pivot using one index.
Let's pivot on multiple indexes
Let's reverse those indexes
Let's specify which values we care about
Let's specify which columns we want broken down
Let's specify how we want the values to be aggregated (aggfunc)
Let's fill N/A values
Let's get subtotals
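
The steps above map onto arguments of pd.pivot_table roughly as follows. The data and the Manager, Rep and Product columns are hypothetical stand-ins for the workshop dataset. values is passed explicitly in every call so that only a numeric column gets aggregated.

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    "Manager": ["Deb", "Deb", "Deb", "Deb", "Eve", "Eve"],
    "Rep": ["Ana", "Ana", "Ben", "Ben", "Cal", "Cal"],
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU", "Monitor"],
    "Price": [30000, 10000, 5000, 35000, 65000, 5000],
})

# One index: the default aggregation is the mean.
by_rep = pd.pivot_table(df, index="Rep", values="Price")

# Multiple indexes, then the same indexes reversed, summing instead of averaging.
by_manager_rep = pd.pivot_table(
    df, index=["Manager", "Rep"], values="Price", aggfunc="sum")
by_rep_manager = pd.pivot_table(
    df, index=["Rep", "Manager"], values="Price", aggfunc="sum")

# Break the values down by Product, fill missing cells with 0,
# and add subtotals (the "All" row and column) with margins=True.
full = pd.pivot_table(
    df,
    index=["Manager", "Rep"],
    columns="Product",
    values="Price",
    aggfunc="sum",
    fill_value=0,
    margins=True,
)

print(by_rep)
print(by_manager_rep)
print(by_rep_manager)
print(full)
```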

(Creative) Exercise: with a partner, use pivot tables to play around with the data. What pivots do you find particularly interesting or useful for this dataset?

(Depending on Time) Solo Practice (150)

Read in this dataset
What is this dataset about?
Let's delete the Unnamed: 0 column.
Let's compute the duration by turning starttime and stoptime into datetime objects and computing their difference.
What is the average trip time? What is the minimum and maximum trip time? What is the standard deviation?
What is the average trip time by station? (Hint: Use pivot tables)
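
One way the solo practice could go is sketched below on four made-up trips. The station column name ("start station name") and all values are assumptions; check the real file's columns before copying this.

```python
import pandas as pd

# Hypothetical trips; the real dataset is read in with pd.read_csv.
trips = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "starttime": ["2015-07-01 08:00:00", "2015-07-01 08:05:00",
                  "2015-07-01 09:00:00", "2015-07-01 09:30:00"],
    "stoptime": ["2015-07-01 08:10:00", "2015-07-01 08:25:00",
                 "2015-07-01 09:30:00", "2015-07-01 09:50:00"],
    "start station name": ["A", "A", "B", "B"],
})

# Drop the leftover index column.
trips = trips.drop(columns=["Unnamed: 0"])

# Parse the strings into datetimes, then subtract to get a timedelta.
trips["starttime"] = pd.to_datetime(trips["starttime"])
trips["stoptime"] = pd.to_datetime(trips["stoptime"])
duration = trips["stoptime"] - trips["starttime"]
trips["duration_min"] = duration.dt.total_seconds() / 60

print(trips["duration_min"].mean())
print(trips["duration_min"].min())
print(trips["duration_min"].max())
print(trips["duration_min"].std())

# Average trip time per station (pivot_table defaults to the mean).
by_station = pd.pivot_table(
    trips, index="start station name", values="duration_min")
print(by_station)
```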

STRETCH/BIO BREAK

Intro to Data Science (180)

Think/Pair/Share: What is data science? What are some examples of data science?

Here is our data set.
Here is the description of the data.

Data Science Terminology

features
target

Examples:

spam
netflix

With a partner,

Read the data description
Discuss the data and what we could use this data for
Upload to datasets/ in the notebook and read in the data with pandas.

Together, (190)

Let's use pandas's built-in descriptive statistics method to get a statistical summary of the data.
Let's plot CRIM against MEDV
By yourself, generate the remaining 12 plots (ZN against MEDV, ..., LSTAT against MEDV)
Which feature looks to be most predictive?
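
A sketch of the summary and the feature-versus-target plots follows. It uses synthetic data with only two feature columns (CRIM and LSTAT) so that it runs without the real file; the numbers are generated, not taken from the dataset. The correlation at the end is just a rough numeric companion to eyeballing the plots.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; inside a notebook, drop this line
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.RandomState(0)
n = 100
lstat = rng.uniform(2, 35, n)
crim = rng.exponential(3, n)
medv = 35 - 0.8 * lstat - 0.2 * crim + rng.normal(0, 2, n)
df = pd.DataFrame({"CRIM": crim, "LSTAT": lstat, "MEDV": medv})

# Count, mean, std, min, quartiles and max for every numeric column.
summary = df.describe()
print(summary)

# One scatter plot per feature, each against the target MEDV.
features = [col for col in df.columns if col != "MEDV"]
fig, axes = plt.subplots(1, len(features), figsize=(4 * len(features), 3))
for ax, col in zip(axes, features):
    ax.scatter(df[col], df["MEDV"], s=10)
    ax.set_xlabel(col)
    ax.set_ylabel("MEDV")
fig.tight_layout()

# Correlation of each feature with the target.
corr = df.corr()["MEDV"].drop("MEDV")
print(corr)
```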

Think/Pair/Share: What is linear regression? (It's machine learning!) (210)

More Data Science Terminology

cross validation
training set
validation set

Together, (215)

Let's separate the data into feature and target.
Let's separate the feature and target into training and validation sets.
Let's fit the linear regression model using 3 columns.
Let's plot the linear regression model.
Let's plot the predictions.
Let's measure the accuracy.
Let's see which columns were most predictive.
Let's use cross_val_predict as a shortcut to get the predicted values.
Let's use cross_val_score as a shortcut for the R^2 values. What does cv do?
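
The walkthrough above could look something like this. The data is synthetic, using CRIM, ZN and LSTAT as the three feature columns, so none of the numbers reflect the real dataset. The imports assume a recent scikit-learn, where these helpers live in sklearn.model_selection.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import (
    cross_val_predict, cross_val_score, train_test_split)

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.RandomState(0)
n = 200
X = pd.DataFrame({
    "CRIM": rng.exponential(3, n),
    "ZN": rng.uniform(0, 100, n),
    "LSTAT": rng.uniform(2, 35, n),
})
y = (30 - 0.2 * X["CRIM"] + 0.05 * X["ZN"] - 0.7 * X["LSTAT"]
     + rng.normal(0, 2, n))

# Hold out a quarter of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# For a regressor, score returns R^2 on the data it is given.
r2 = model.score(X_val, y_val)
print(r2)

# One coefficient per column. Their sizes depend on each column's units,
# so compare them with care.
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs)

# Shortcuts: cv=5 splits the data into 5 folds, trains on 4 and
# predicts (or scores) on the held-out fold, rotating through all 5.
predicted = cross_val_predict(LinearRegression(), X, y, cv=5)
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)
```

Plotting predicted against y with a scatter plot is a quick way to see how far the predictions fall from the true values.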

On your own, (230)

Run the regression using all of the feature columns.
How does the model improve/worsen?

Using Data Science for Classification (240)

Even More Data Science Terminology

regression
classification

Question: What are some more examples of regression applications? classification applications?

Popular Classification Algorithms (245)

RandomForest
Logistic Regression (poorly named, I know!)
Support Vector Machines
Neural Networks (Deep Learning...)
k Nearest Neighbors

Overview of K Nearest Neighbors (250)

simple model
works well when there aren't too many different features
works well when the scale of each feature is similar (why?). We'll see this in our example.
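
To make the point about scale concrete, here is a small sketch with two hypothetical features, income in dollars and age in years. kNN picks neighbors by distance, so a feature measured in large units can drown out the others. The three points and the min-max rescaling are illustrative choices, not part of the workshop dataset.

```python
import numpy as np

# Hypothetical points: [income in dollars, age in years].
a = np.array([50000.0, 25.0])
b = np.array([50500.0, 60.0])   # similar income, very different age
c = np.array([51000.0, 26.0])   # larger income gap, almost the same age

# Raw Euclidean distances: income dominates, so b looks closer to a.
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
print(raw_ab, raw_ac)

# Rescale each feature to the 0-1 range so both contribute comparably.
points = np.array([a, b, c])
span = points.max(axis=0) - points.min(axis=0)
scaled = (points - points.min(axis=0)) / span

# After rescaling, the age gap counts too, and c is now the closer point.
scaled_ab = np.linalg.norm(scaled[0] - scaled[1])
scaled_ac = np.linalg.norm(scaled[0] - scaled[2])
print(scaled_ab, scaled_ac)
```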

Worksheets (260)

Dataset (275)

Here's a description of our dataset.
Here's the dataset.

Preliminary Exercises

By yourself, take 5 minutes to do the following:

Read the dataset description. What is this dataset about?
Upload the dataset to datasets/ in our notebook and read the dataset into pandas

Together (280)

Separate into feature and target
Use cross-validation to run KNeighborsClassifier
Plot these values of n_neighbors (2, 3, 4, 5, 10) against the accuracy score. How did it do?
Let's describe the data.
Let's normalize the data using normalize
Try KNeighbors again for the different values of n_neighbors. How did it do? Which value of n_neighbors was best?
Let's manually use train_test_split and compare the predicted values with the true values in the test set to more concretely see the output of the model.
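
A sketch of the whole sequence is below. Since the workshop dataset isn't included here, it uses scikit-learn's built-in wine data as a stand-in. Passing axis=0 to normalize (scale each feature column rather than each row) is an assumption, as is the helper name mean_accuracy. No particular scores are claimed; run it to see them.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize

# Stand-in dataset (NOT the workshop dataset): X is features, y is target.
X, y = load_wine(return_X_y=True)


def mean_accuracy(features, k):
    """Mean 5-fold cross-validated accuracy of kNN with k neighbors."""
    model = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(model, features, y, cv=5).mean()


ks = [2, 3, 4, 5, 10]
raw_scores = {k: mean_accuracy(X, k) for k in ks}
print(raw_scores)

# Rescale each feature column to unit norm, then try again.
X_norm = normalize(X, axis=0)
norm_scores = {k: mean_accuracy(X_norm, k) for k in ks}
print(norm_scores)

best_k = max(norm_scores, key=norm_scores.get)

# Manual split: compare predictions with the true labels in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_norm, y, test_size=0.25, random_state=0)
model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
predicted = model.predict(X_test)
test_accuracy = (predicted == y_test).mean()
print(predicted[:10], y_test[:10])
print(test_accuracy)
```

To plot n_neighbors against accuracy, pass ks and the matching scores to matplotlib's plot function, for example plt.plot(ks, [raw_scores[k] for k in ks]).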

Tuning the model

Every data science model (algorithm) has parameters you can tune to improve the accuracy of the model.
For kNN, what can/did we tune?

Saving and Sharing our Notebook (310)

Download your notebook
Open it up in a text editor
Copy all the text
Paste it into a gist
Create a secret gist
Copy the browser url
Go here and paste that url
Voila!

Possible Next Steps

Build a Django app
Run through more pandas tutorials
Run through some scikit-learn tutorials and examples
Take the GA Data Science part time course
To up your pure Python fluency, do tons of Euler problems

suneel0101/python-for-data-science-2015
