Changes made to enable delivery of CeR - ML Carpentries - August 2024 #20

Merged
merged 8 commits into from
Jul 25, 2024
64 changes: 36 additions & 28 deletions _episodes/04-ensemble-methods.md
@@ -166,25 +166,35 @@ plt.show()

![random forest clf space](../fig/EM_rf_clf_space.png)

There is still some overfitting indicated by the regions that contain only single points but using the same hyper-parameter settings used to fit the decision tree classifier, we can see that overfitting is reduced.
There is still some overfitting indicated by the regions that contain only single points but using the same hyper-parameter settings used to fit the decision tree classifier, we can see that overfitting is reduced.

## Stacking a regression problem

We've had a look at a bagging approach but we'll now take a look at a stacking approach and apply it to a regression problem. We'll also introduce a new dataset to play around with.
We've had a look at a bagging approach, but we'll now take a look at a stacking approach and apply it to a regression problem. We'll also introduce a new dataset to play around with.

### The diabetes dataset
The diabetes dataset, contains 10 baseline variables for 442 diabetes patients where the target attribute is quantitative measure of disease progression one year after baseline. For more information see [Efron et al., (2004)](https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf). The useful thing about this data it is available as part of the [sci-kit learn library](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset). We'll start by loading the dataset to very briefly inspect the attributes by printing them out.
### California house price prediction
The California housing dataset for regression problems contains 8 features, such as Median Income, House Age, Average Rooms and Average Bedrooms, for 20,640 block groups (districts). The target variable is the median house value for each of those 20,640 block groups, in units of $100,000. This toy dataset is available as part of the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). We'll start by loading the dataset to very briefly inspect the attributes by printing them out.

~~~
from sklearn.datasets import load_diabetes
import sklearn
from sklearn.datasets import fetch_california_housing

print(load_diabetes())
# load the dataset
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

## Features: 8 columns, one row per block group
print(X.shape)
print(X.head())

print("Housing price as the target: ")

## Target is in units of $100,000
print(y.head())
print(y.shape)
~~~
{: .language-python}

For more details on this SKLearn dataset see [this link for details.](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset)

For the the purposes of learning how to create and use ensemble methods we are about to commit a cardinal sin of machine learning and blindly use this dataset without inspecting it any further.
For the purposes of learning how to create and use ensemble methods, and since it is a toy dataset, we will use this dataset blindly, without inspecting, cleaning or pre-processing it further.

> ## Exercise: Investigate and visualise the dataset
> For this episode we simply want to learn how to build and use an Ensemble rather than actually solve a regression problem. To build up your skills as an ML practitioner, investigate and visualise this dataset. What can you say about the dataset itself, and what can you summarise about any potential relationships or prediction problems?
@@ -193,12 +203,9 @@ For the the purposes of learning how to create and use ensemble methods we are a
Let's start by splitting the dataset into training and testing subsets:

~~~
# split into train and test sets, We are selecting an 80%-20% train-test split.
from sklearn.model_selection import train_test_split

# load in data
X, y = load_diabetes(return_X_y=True)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

print(f'train size: {X_train.shape}')
@@ -210,6 +217,10 @@ Lets stack a series of regression models. In the same way the RandomForest class

We'll apply a Voting regressor to a random forest, gradient boosting and linear regressor.

Let's stack a series of regression models. In the same way the RandomForest classifier derives a result from a series of trees, we will combine the results from a series of different models in our stack. This is done using an ensemble meta-estimator called a VotingRegressor.

We'll apply a Voting regressor to a random forest, gradient boosting and linear regressor.

> ## But wait, aren't random forests/decision tree for classification problems?
> Yes they are, but quite often in machine learning various models can be used to solve both regression and classification problems.
>
@@ -232,24 +243,23 @@ from sklearn.ensemble import (
)
from sklearn.linear_model import LinearRegression

# training estimators
# Initialize estimators
rf_reg = RandomForestRegressor(random_state=5)
gb_reg = GradientBoostingRegressor(random_state=5)
linear_reg = LinearRegression()
voting_reg = VotingRegressor([("gb", rf_reg), ("rf", gb_reg), ("lr", linear_reg)])
voting_reg = VotingRegressor([("rf", rf_reg), ("gb", gb_reg), ("lr", linear_reg)])

# fit voting estimator
# fit/train voting estimator
voting_reg.fit(X_train, y_train)

# lets also train the individual models for comparison
# lets also fit/train the individual models for comparison
rf_reg.fit(X_train, y_train)
gb_reg.fit(X_train, y_train)
linear_reg.fit(X_train, y_train)

~~~
{: .language-python}

We fit the voting regressor in the same way we would fit a single model. When the voting regressor is instantiated we pass it a parameter containing a list of tuples that contain the estimators we wish to stack: in this case the random forest, gradient boosting and linear regressors. To get a sense of what this is doing lets predict the first 20 samples in the test portion of the data and plot the results.
We fit the voting regressor in the same way we would fit a single model. When the voting regressor is instantiated we pass it a list of `(name, estimator)` tuples containing the estimators we wish to stack: in this case the random forest, gradient boosting and linear regressors. To get a sense of what this is doing, let's predict the first 20 samples in the test portion of the data and plot the results.

~~~
import matplotlib.pyplot as plt
@@ -263,9 +273,9 @@ linear_pred = linear_reg.predict(X_test_20)
voting_pred = voting_reg.predict(X_test_20)

plt.figure()
plt.plot(rf_pred, "o", color="navy", label="GradientBoostingRegressor")
plt.plot(gb_pred, "o", color="blue", label="RandomForestRegressor")
plt.plot(linear_pred, "o", color="skyblue", label="LinearRegression")
plt.plot(gb_pred, "o", color="black", label="GradientBoostingRegressor")
plt.plot(rf_pred, "o", color="blue", label="RandomForestRegressor")
plt.plot(linear_pred, "o", color="green", label="LinearRegression")
plt.plot(voting_pred, "x", color="red", ms=10, label="VotingRegressor")

plt.tick_params(axis="x", which="both", bottom=False, top=False, labelbottom=False)
@@ -278,10 +288,9 @@ plt.show()
~~~
{: .language-python}

![Regressor predictions and average from stack](../fig/EM_stacked_plot.png)

![Regressor predictions and average from stack](../fig/house_price_voting_regressor.svg)

FInally, lets see how the average compares against each single estimator in the stack?
Finally, let's see how the average compares against each single estimator in the stack:

~~~
print(f'random forest: {rf_reg.score(X_test, y_test)}')
@@ -294,11 +303,10 @@ print(f'voting regressor: {voting_reg.score(X_test, y_test)}')
~~~
{: .language-python}

Each of our models score a pretty poor 0.52-0.53, which is barely better than a coin flip. However what we can see is that the stacked result generated by the voting regressor produces a slightly improved score of 0.55, which is better than any of the three models/estimators taken individually. The whole model is greater than the sum of the individual parts. And of course, we could try and improve our accuracy score by tweaking with our indivdual model hyperparameters, or adjusting our training data features and train-test-split data.

Each of our models scores between 0.61 and 0.82 (for a regressor, the `score` method returns the coefficient of determination, R², rather than an accuracy), which is good at the high end but fairly poor at the low end. Do note that toy datasets are not representative of real-world data. The voting regressor fits the different sub-models and then averages their individual predictions to form a final prediction. The benefit of this approach is that it can reduce overfitting and improve generalisability. Of course, we could try to improve our score by tweaking the individual model hyperparameters, using more advanced boosted models, or adjusting our training data features and train-test split.
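
The averaging step described above can be checked directly. The following is a minimal sketch, not part of the lesson: it assumes a small synthetic dataset from `make_regression` as a stand-in for the housing data, and the names `demo_voting`, `individual_preds` and `manual_average` are purely illustrative. It compares the voting regressor's output with the plain mean of its fitted sub-models' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, NOT the California housing data)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=5)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5
)

# Same three estimator types as in the lesson, with no weights given
demo_voting = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=50, random_state=5)),
    ("gb", GradientBoostingRegressor(random_state=5)),
    ("lr", LinearRegression()),
])
demo_voting.fit(Xd_train, yd_train)

# estimators_ holds the fitted sub-models, in the order they were passed in
individual_preds = np.column_stack(
    [est.predict(Xd_test) for est in demo_voting.estimators_]
)
manual_average = individual_preds.mean(axis=1)

# With no weights, the ensemble prediction is the unweighted mean
print(np.allclose(manual_average, demo_voting.predict(Xd_test)))
```

Because the ensemble output is a simple mean here, one poorly performing sub-model pulls the combined prediction towards its errors; the `weights` parameter of `VotingRegressor` is one way to down-weight such a model.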

> ## Exercise: Stacking a classification problem.
> Sci-kit learn also has method for stacking ensemble classifiers ```sklearn.ensemble.VotingClassifier``` do you think you could apply a stack to the penguins dataset using a random forest, SVM and decision tree classifier, or a selection of any other classifier estimators available in sci-kit learn?
> Scikit-learn also has a method for stacking ensemble classifiers, ```sklearn.ensemble.VotingClassifier```. Do you think you could apply a stack to the penguins dataset using a random forest, SVM and decision tree classifier, or a selection of any other classifier estimators available in scikit-learn?
>
> ~~~
> penguins = sns.load_dataset('penguins')
93 changes: 41 additions & 52 deletions _episodes/06-dimensionality-reduction.md
@@ -28,7 +28,40 @@ The MNIST dataset contains 70,000 images of handwritten numbers, and are labelle
To make this episode a bit less computationally intensive, the Scikit-Learn example that we will work with is a smaller sample of 1797 images. Each image is 8x8 in size for a total of 64 pixels per image, resulting in 64 features for us to work with. The pixels can take a value between 0-15 (4bits). Let's retrieve and inspect the Scikit-Learn dataset with the following code:

~~~
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
import sklearn.cluster as skl_cluster
from sklearn import manifold, decomposition, datasets

# Let's define these here to avoid repetitive code
def plots(x_manifold):
    tx = x_manifold[:, 0]
    ty = x_manifold[:, 1]

    # without labels
    fig = plt.figure(1, figsize=(4, 4))
    plt.scatter(tx, ty, edgecolor='k',label=labels)
    plt.show()

def plot_clusters(x_manifold, clusters):
    tx = x_manifold[:, 0]
    ty = x_manifold[:, 1]
    fig = plt.figure(1, figsize=(4, 4))
    plt.scatter(tx, ty, s=5, linewidth=0, c=clusters)
    # relies on the global Kmean having been fitted before this is called
    for cluster_x, cluster_y in Kmean.cluster_centers_:
        plt.scatter(cluster_x, cluster_y, s=100, c='r', marker='x')
    plt.show()

def plot_clusters_labels(x_manifold, labels):
    tx = x_manifold[:, 0]
    ty = x_manifold[:, 1]

    # with labels
    fig = plt.figure(1, figsize=(5, 4))
    plt.scatter(tx, ty, c=labels, cmap="nipy_spectral",
                edgecolor='k', label=labels)
    plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
    plt.show()

# load in dataset as a Pandas Dataframe, return X and Y
features, labels = datasets.load_digits(return_X_y=True, as_frame=True)
@@ -49,8 +82,6 @@ As humans we are pretty good at object and pattern recognition. We can look at t
>
> > ## Solution
> > ~~~
> > import matplotlib.pyplot as plt
> > import numpy as np
> >
> > print(features.iloc[0])
> > image_1D = features.iloc[0]
@@ -107,12 +138,9 @@ For more in depth explanations of PCA please see the following links:
Let's apply PCA to the MNIST dataset and retain the two most-major components:

~~~
from sklearn import decomposition

# PCA with 2 components
pca = decomposition.PCA(n_components=2)
pca.fit(features)
x_pca = pca.transform(features)
x_pca = pca.fit_transform(features)

print(x_pca.shape)
~~~
@@ -121,16 +149,7 @@ print(x_pca.shape)
This returns an array of shape 1797x2, where the 2 remaining columns (our new "features" or "dimensions") contain the projection of each image onto the first principal component (column 0) and the second principal component (column 1). We can plot these two new features against each other:

~~~
import numpy as np
import matplotlib.pyplot as plt

tx = x_pca[:, 0]
ty = x_pca[:, 1]

# without labels
fig = plt.figure(1, figsize=(4, 4))
plt.scatter(tx, ty, edgecolor='k',label=labels)
plt.show()
plots(x_pca)
~~~
{: .language-python}
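
Reducing 64 features to 2 inevitably discards information, and it can help to check how much. The sketch below is an illustrative addition rather than part of the lesson; it reloads the digits data under hypothetical names (`digits_X`, `demo_pca`, `demo_x_pca`) so that it runs on its own, and reads the fitted PCA's `explained_variance_ratio_` attribute to see what fraction of the total variance each retained component carries.

```python
from sklearn import datasets, decomposition

# Reload the 8x8 digits data so the sketch is self-contained
digits_X, _ = datasets.load_digits(return_X_y=True, as_frame=True)

demo_pca = decomposition.PCA(n_components=2)
demo_x_pca = demo_pca.fit_transform(digits_X)

# Fraction of the total variance captured by each retained component
ratios = demo_pca.explained_variance_ratio_

print(demo_x_pca.shape)
print(f"per component: {ratios}")
print(f"total retained: {ratios.sum():.3f}")
```

A total well below 1.0 indicates that the 2D scatter plot is only a partial view of the original 64-dimensional structure.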

@@ -139,18 +158,10 @@ plt.show()
We now have a 2D representation of our 64D dataset that we can work with instead. Let's try some quick K-means clustering on our 2D representation of the data. Because we already have some knowledge about our data we can set `k=10` for the 10 digits present in the dataset.

~~~
import sklearn.cluster as skl_cluster

Kmean = skl_cluster.KMeans(n_clusters=10)

Kmean.fit(x_pca)
clusters = Kmean.predict(x_pca)

fig = plt.figure(1, figsize=(4, 4))
plt.scatter(tx, ty, s=5, linewidth=0, c=clusters)
for cluster_x, cluster_y in Kmean.cluster_centers_:
plt.scatter(cluster_x, cluster_y, s=100, c='r', marker='x')
plt.show()
plot_clusters(x_pca, clusters)
~~~
{: .language-python}

@@ -159,11 +170,7 @@ plt.show()
And now we can compare how these clusters look against our actual image labels by colour coding our first scatter plot:

~~~
fig = plt.figure(1, figsize=(5, 4))
plt.scatter(tx, ty, c=labels, cmap="nipy_spectral",
edgecolor='k',label=labels)
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.show()
plot_clusters_labels(x_pca, labels)
~~~
{: .language-python}

@@ -186,45 +193,27 @@ For more in depth explanations of t-SNE and manifold learning please see the fol
Scikit-Learn allows us to apply t-SNE in a relatively simple way. Let's apply t-SNE to the MNIST dataset in the same manner that we did for the PCA example, and reduce the data down from 64D to 2D again:

~~~
from sklearn import manifold

# t-SNE embedding
# initialising with "pca" explicitly preserves global structure
tsne = manifold.TSNE(n_components=2, init='pca', random_state = 0)
x_tsne = tsne.fit_transform(features)


fig = plt.figure(1, figsize=(4, 4))
plt.scatter(x_tsne[:, 0], x_tsne[:, 1], edgecolor='k')
plt.show()
plots(x_tsne)
~~~
{: .language-python}

![Reduction using t-SNE](../fig/tsne_unlabelled.png)


It looks like t-SNE has done a much better job of splitting our data up into clusters using only a 2D representation of the data. Once again, let's run a simple k-means clustering on this new 2D representation, and compare with the actual color-labelled data:

~~~
import sklearn.cluster as skl_cluster

Kmean = skl_cluster.KMeans(n_clusters=10)

Kmean.fit(x_tsne)
clusters = Kmean.predict(x_tsne)

fig = plt.figure(1, figsize=(4, 4))
plt.scatter(x_tsne[:,0], x_tsne[:,1], s=5, linewidth=0, c=clusters)
for cluster_x, cluster_y in Kmean.cluster_centers_:
plt.scatter(cluster_x, cluster_y, s=100, c='r', marker='x')
plt.show()

# with labels
fig = plt.figure(1, figsize=(5, 4))
plt.scatter(x_tsne[:, 0], x_tsne[:, 1], c=labels, cmap="nipy_spectral",
edgecolor='k',label=labels)
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.show()
plot_clusters(x_tsne, clusters)
plot_clusters_labels(x_tsne, labels)
~~~
{: .language-python}
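
The visual comparison between clusters and true labels can also be put into a single number. As a rough sketch (an addition that is not in the lesson), the adjusted Rand index from `sklearn.metrics.adjusted_rand_score` measures agreement between the k-means cluster assignments and the digit labels, with values near 0 for random assignments and 1 for a perfect match. The names `embeddings`, `ari_scores` and `cluster_ids` are illustrative, and `n_init=10` and `random_state=0` are choices made here for repeatability.

```python
from sklearn import datasets, decomposition, manifold
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

digits_X, digits_y = datasets.load_digits(return_X_y=True)

# Two 2D embeddings of the same 64D data
embeddings = {
    "PCA": decomposition.PCA(n_components=2).fit_transform(digits_X),
    "t-SNE": manifold.TSNE(
        n_components=2, init="pca", random_state=0
    ).fit_transform(digits_X),
}

ari_scores = {}
for name, embedding in embeddings.items():
    # Cluster each embedding into 10 groups, one per digit
    km = KMeans(n_clusters=10, n_init=10, random_state=0)
    cluster_ids = km.fit_predict(embedding)
    # Agreement with the true labels, independent of how clusters are numbered
    ari_scores[name] = adjusted_rand_score(digits_y, cluster_ids)
    print(f"{name}: adjusted Rand index = {ari_scores[name]:.3f}")
```

The index ignores how the clusters happen to be numbered, so it is suited to k-means output, where cluster 3 has no reason to correspond to the digit 3.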

89 changes: 89 additions & 0 deletions _episodes/ensemble_classification.md
@@ -0,0 +1,89 @@
## Stacking: classification
import seaborn as sns
penguins = sns.load_dataset('penguins')

feature_names = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
penguins.dropna(subset=feature_names, inplace=True)

species_names = penguins['species'].unique()

# Define data and targets
X = penguins[feature_names]

y = penguins.species

# Split data in training and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

print(f'train size: {X_train.shape}')
print(f'test size: {X_test.shape}')

from sklearn.ensemble import (
GradientBoostingClassifier,
RandomForestClassifier,
VotingClassifier,
)
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier

# training estimators
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=7, min_samples_leaf=1, random_state=5)
gb_clf = GradientBoostingClassifier(random_state=5)
gp_clf = GaussianProcessClassifier(1.0 * RBF(1.0), random_state=5)
dt_clf = DecisionTreeClassifier(max_depth=5, random_state=5)

voting_reg = VotingClassifier([("rf", rf_clf), ("gb", gb_clf), ("gp", gp_clf), ("dt", dt_clf)])

# fit voting estimator
voting_reg.fit(X_train, y_train)

# lets also train the individual models for comparison
rf_clf.fit(X_train, y_train)
gb_clf.fit(X_train, y_train)
gp_clf.fit(X_train, y_train)
dt_clf.fit(X_train, y_train)

import matplotlib.pyplot as plt

# make predictions
X_test_20 = X_test[:20] # first 20 for visualisation

rf_pred = rf_clf.predict(X_test_20)
gb_pred = gb_clf.predict(X_test_20)
gp_pred = gp_clf.predict(X_test_20)
dt_pred = dt_clf.predict(X_test_20)
voting_pred = voting_reg.predict(X_test_20)

print(rf_pred)
print(gb_pred)
print(gp_pred)
print(dt_pred)
print(voting_pred)

plt.figure()
plt.plot(gb_pred, "o", color="green", label="GradientBoostingClassifier")
plt.plot(rf_pred, "o", color="blue", label="RandomForestClassifier")
plt.plot(gp_pred, "o", color="darkblue", label="GaussianProcessClassifier")
plt.plot(dt_pred, "o", color="lightblue", label="DecisionTreeClassifier")
plt.plot(voting_pred, "x", color="red", ms=10, label="VotingClassifier")

plt.tick_params(axis="x", which="both", bottom=False, top=False, labelbottom=False)
plt.ylabel("predicted")
plt.xlabel("test samples")
plt.legend(loc="best")
plt.title("Classifier predictions and the voted result")

plt.show()

print(f'random forest: {rf_clf.score(X_test, y_test)}')

print(f'gradient boost: {gb_clf.score(X_test, y_test)}')

print(f'gaussian process: {gp_clf.score(X_test, y_test)}')

print(f'decision tree: {dt_clf.score(X_test, y_test)}')

print(f'voting classifier: {voting_reg.score(X_test, y_test)}')