Students will walk away with introductory & practical knowledge of the Python Data Science stack:
- pandas
- matplotlib
- scikit-learn
- Go to tmpnb.org.
- Select New > Python 2 to create a Python notebook.
- Follow along with me:
>>> import pandas
>>> import sklearn
>>> import matplotlib
- Python Warm Up
- Intro to Pandas
- Intro to Data Science with scikit-learn
- Visualization with Matplotlib
- More Practice
This problem reviews lists, for loops, and lambda functions.
Given the following list,
names = ["Michael Fassbender", "Karlie Kloss", "Taylor Swift", "Justin Bieber"]
- print out the names that contain the letter "l"
- turn all of the names lowercase
- sort the list of names alphabetically using the built-in sorted function (HINT: Use Google)
- sort the list of names by length using the built-in sorted function
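One possible way to work through the warm-up is sketched below. It is only one approach among several, and the variable names (with_l, lowercased, alphabetical, by_length) are made up for illustration.

```python
names = ["Michael Fassbender", "Karlie Kloss", "Taylor Swift", "Justin Bieber"]

# Collect and print the names containing a lowercase "l" (the check is case-sensitive)
with_l = []
for name in names:
    if "l" in name:
        with_l.append(name)
        print(name)

# Lowercase every name with a list comprehension
lowercased = [name.lower() for name in names]

# sorted returns a new list and leaves names unchanged
alphabetical = sorted(names)

# The key argument takes a function; here a lambda returns each name's length
by_length = sorted(names, key=lambda name: len(name))

print(lowercased)
print(alphabetical)
print(by_length)
```

Because sorted is stable, names of equal length keep their original relative order.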
Question: What is pandas?
Here is our first data set. Let's download it and upload it to the datasets folder within the notebook.
With your partner,
- Download and open the dataset
- What is this dataset about?
- What are some questions you might ask about the data?
- Let's read in the data.
- How do we see what columns are available?
- How do we look at just the head or tail of the dataset?
- How do we look at only a few rows?
- How do we only look at certain columns?
- How do we pull out a column and look at it as a series?
- How do we look at only those rows that have Status = won?
- Exercise: How many accounts have a price greater than $12,000?
- How do we get the maximum value of a certain column?
- Exercise: What is the minimum account price? The mean? The sum? The standard deviation?
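The questions above map onto a handful of pandas calls. The sketch below uses a tiny made-up DataFrame in place of the workshop dataset, so the rows and the Account column are assumptions; with the real file you would start from pd.read_csv and a path into the datasets folder.

```python
import pandas as pd

# Hypothetical stand-in rows, not the workshop data
df = pd.DataFrame({
    "Account": [1001, 1002, 1003, 1004],
    "Quantity": [1, 2, 1, 3],
    "Price": [30000, 5000, 15000, 10000],
    "Status": ["won", "pending", "pending", "won"],
})

print(df.columns)                 # which columns are available
print(df.head(2))                 # first two rows
print(df.tail(2))                 # last two rows
print(df[1:3])                    # a few rows, sliced by position
print(df[["Account", "Price"]])   # only certain columns (still a DataFrame)

prices = df["Price"]              # a single column, returned as a Series
won = df[df["Status"] == "won"]   # a boolean mask keeps only matching rows

# Comparing a Series gives True/False values; summing them counts the Trues
expensive_count = (df["Price"] > 12000).sum()

print(prices.max(), prices.min(), prices.mean(), prices.sum(), prices.std())
```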
LUNCH
What is the total dollar amount pending?
- How do we add columns?
- Let's add a column called Amount that is equal to Quantity * Price
- Exercise: Let's select just those rows where status is pending and sum up those amounts.
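As a rough sketch of these steps, again on made-up rows rather than the workshop data, a new column can be assigned directly from an arithmetic expression on existing columns, and the pending total is then a filter followed by a sum.

```python
import pandas as pd

# Hypothetical stand-in rows, not the workshop data
df = pd.DataFrame({
    "Quantity": [1, 2, 1, 3],
    "Price": [30000, 5000, 15000, 10000],
    "Status": ["won", "pending", "pending", "won"],
})

# Column arithmetic is element-wise, so each row gets its own Quantity * Price
df["Amount"] = df["Quantity"] * df["Price"]

# Keep the pending rows, select the Amount column, and add it up
pending_total = df.loc[df["Status"] == "pending", "Amount"].sum()
print(pending_total)
```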
Question: What are pivot tables? Why are they useful?
Let's take a look at the documentation here.
- Let's pivot using one index.
- Let's pivot on multiple indexes
- Let's reverse those indexes
- Let's specify which values we care about
- Let's specify which columns we want broken down
- Let's specify how we want the values to be aggregated (aggfunc)
- Let's fill N/A values
- Let's get subtotals
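The pivot steps above can be sketched with pd.pivot_table. The Manager, Rep, and Product columns and all of the rows below are assumptions made for illustration; substitute the columns of the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in rows, not the workshop data
df = pd.DataFrame({
    "Manager": ["Ann", "Ann", "Ann", "Bob", "Bob"],
    "Rep": ["Cy", "Cy", "Di", "Ed", "Ed"],
    "Product": ["CPU", "Software", "CPU", "CPU", "Monitor"],
    "Price": [30000, 10000, 35000, 65000, 5000],
})

# One index; the default aggregation is the mean of the chosen values
by_manager = pd.pivot_table(df, index="Manager", values="Price")

# Multiple indexes, then the same indexes reversed
by_manager_rep = pd.pivot_table(df, index=["Manager", "Rep"], values="Price")
by_rep_manager = pd.pivot_table(df, index=["Rep", "Manager"], values="Price")

# values picks what to aggregate, columns breaks it down further,
# aggfunc chooses the aggregation, fill_value replaces missing cells,
# and margins=True adds "All" subtotals
summary = pd.pivot_table(
    df,
    index="Manager",
    columns="Product",
    values="Price",
    aggfunc="sum",
    fill_value=0,
    margins=True,
)

print(by_manager)
print(by_manager_rep)
print(by_rep_manager)
print(summary)
```

Passing values explicitly keeps pandas from trying to aggregate the text columns.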
(Creative) Exercise: with a partner, use pivot tables to play around with the data. What pivots do you find particularly interesting or useful for this dataset?
- Read in this dataset
- What is this dataset about?
- Let's delete the Unnamed: 0 column.
- Let's compute the duration by turning starttime and stoptime into datetime objects and computing their difference.
- What is the average trip time? What is the minimum and maximum trip time? What is the standard deviation?
- What is the average trip time by station? (Hint: Use pivot tables)
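A minimal sketch of the trip-duration steps follows. The three rows are invented, and the column names stoptime and "start station name" are assumptions about the dataset; adjust them to whatever the file actually contains.

```python
import pandas as pd

# Hypothetical stand-in rows, not the workshop data
trips = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "starttime": ["2015-01-01 00:00:00", "2015-01-01 00:10:00", "2015-01-01 01:00:00"],
    "stoptime": ["2015-01-01 00:10:00", "2015-01-01 00:30:00", "2015-01-01 01:30:00"],
    "start station name": ["A", "A", "B"],
})

# Drop the leftover index column
trips = trips.drop("Unnamed: 0", axis=1)

# Parse the text timestamps into datetime values
trips["starttime"] = pd.to_datetime(trips["starttime"])
trips["stoptime"] = pd.to_datetime(trips["stoptime"])

# Subtracting datetimes gives timedeltas; convert them to minutes for statistics
trips["duration"] = trips["stoptime"] - trips["starttime"]
trips["minutes"] = trips["duration"].dt.total_seconds() / 60

print(trips["minutes"].mean(), trips["minutes"].min(),
      trips["minutes"].max(), trips["minutes"].std())

# Average trip time per station
by_station = pd.pivot_table(trips, index="start station name",
                            values="minutes", aggfunc="mean")
print(by_station)
```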
STRETCH/BIO BREAK
Think/Pair/Share: What is data science? What are some examples of data science?
- features
- target
Examples:
- spam
- netflix
With a partner,
- Read the data description
- Discuss the data and what we could use this data for
- Upload to datasets/ in the notebook and read in the data with pandas.
Together, (190)
- Let's use pandas's built-in descriptive statistics method to get a statistical summary of the data.
- Let's plot CRIM against MEDV
- By yourself, generate the remaining 12 plots (ZN against MEDV, ..., LSTAT against MEDV)
- Which feature looks to be most predictive?
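The summary and the first plot might look roughly like the sketch below. The five rows are invented values rather than the real housing data, and the Agg backend is chosen only so the code runs without a display; inside a notebook you would more likely use %matplotlib inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in values, not the real housing data
housing = pd.DataFrame({
    "CRIM": [0.1, 0.5, 2.0, 8.0, 15.0],
    "MEDV": [30.0, 25.0, 20.0, 15.0, 10.0],
})

# Count, mean, standard deviation, min, quartiles, and max for every numeric column
print(housing.describe())

# Scatter plot of one feature against the target
fig, ax = plt.subplots()
ax.scatter(housing["CRIM"], housing["MEDV"])
ax.set_xlabel("CRIM")
ax.set_ylabel("MEDV")
ax.set_title("CRIM against MEDV")
```

Looping over the other feature columns and repeating the scatter call is one way to produce the remaining plots.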
Think/Pair/Share: What is linear regression? (It's machine learning!) (210)
- cross validation
- training set
- validation set
Together, (215)
- Let's separate the data into features and target.
- Let's separate the features and target into training and validation sets.
- Let's fit the linear regression model using 3 columns.
- Let's plot the linear regression model.
- Let's plot the predictions.
- Let's measure the accuracy.
- Let's see which columns were most predictive.
- Let's use cross_val_predict as a shortcut to get the predicted values.
- Let's use cross_val_score as a shortcut for the R^2 values. What does cv do?
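The regression steps can be sketched as follows. This uses synthetic data with made-up column names (f1, f2, f3) instead of the housing dataset, and it assumes a scikit-learn version that provides sklearn.model_selection; older releases kept these helpers in a different module.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split

# Synthetic stand-in data: the target depends on f1 and f2 but not on f3
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(100, 3), columns=["f1", "f2", "f3"])
y = 3.0 * X["f1"] - 2.0 * X["f2"] + 0.1 * rng.randn(100)

# Hold out a quarter of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit on the training rows only
model = LinearRegression()
model.fit(X_train, y_train)

# For a regressor, score returns R^2 on the rows it is given
r2 = model.score(X_val, y_val)
print(r2)

# Coefficients hint at which columns matter, provided the features share a similar scale
print(dict(zip(X.columns, model.coef_)))

# cv=5 splits the data into 5 folds; each fold is predicted by a model
# trained on the other 4
predicted = cross_val_predict(LinearRegression(), X, y, cv=5)
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)
```

cross_val_score returns one score per fold, so with cv=5 there are five R^2 values to average or compare.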
On your own, (230)
- Run the regression using all of the feature columns.
- How does the model improve/worsen?
- regression
- classification
Question: What are some more examples of regression applications? Classification applications?
- RandomForest
- Logistic Regression (poorly named, I know!)
- Support Vector Machines
- Neural Networks (Deep Learning...)
- k Nearest Neighbors
- simple model
- works well when there aren't too many different features
- works well when the scale of each feature is similar (why?). We'll see this in our example.
By yourself, take 5 minutes to do the following:
- Read the dataset description. What is this dataset about?
- Upload the dataset to datasets/ in our notebook and read the dataset into pandas
- Separate into feature and target
- Use cross-validation to run KNeighborsClassifier
- Plot these values of n_neighbors (2, 3, 4, 5, 10) against the accuracy score. How did it do?
- Let's describe the data.
- Let's normalize the data using normalize
- Try KNeighbors again for the different values of n_neighbors. How did it do? Which value of n_neighbors was best?
- Let's manually use train_test_split and compare the predicted values with the true values in the test set to more concretely see the output of the model.
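The effect of feature scale on kNN can be sketched with synthetic data, shown below. The two features are invented: one small-scale column that decides the class and one large-scale column of pure noise. The call normalize(X, axis=0) rescales each feature column; which axis the workshop used is not stated, so that choice is an assumption (the default, axis=1, rescales each row instead).

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize

# Synthetic stand-in data, not the workshop dataset
rng = np.random.RandomState(0)
n = 200
informative = rng.randn(n)       # small-scale feature that decides the class
noise = 1000.0 * rng.randn(n)    # large-scale feature unrelated to the class
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

# Rescale each column to unit norm so both features contribute to distances
X_scaled = normalize(X, axis=0)

# Mean cross-validated accuracy for each n_neighbors, before and after scaling
results = {}
for k in [2, 3, 4, 5, 10]:
    raw = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    scaled = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_scaled, y, cv=5).mean()
    results[k] = (raw, scaled)
    print(k, raw, scaled)

# Manual split: compare predictions with the true labels side by side
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
predicted = knn.predict(X_test)
print(list(zip(predicted[:10], y_test[:10])))
```

Without scaling, distances are dominated by the large-scale noise column, which is why the unscaled scores tend to sit near chance in this setup.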
- Every data science model (algorithm) has parameters you can tune to improve the accuracy of the model.
- For kNN, what can/did we tune?
- Download your notebook
- Open it up in a text editor
- Copy all the text
- Paste it into a gist
- Create a secret gist
- Copy the browser url
- Go here and paste that url
- Voila!
- Build a Django app
- Run through more pandas tutorials
- Run through some scikit-learn tutorials and examples
- Take the GA Data Science part-time course
- To up your pure Python fluency, do tons of Euler problems