Introduction to classification models by using R and tidymodels
Hello and welcome to this learning adventure! In this folder, you will find a Classification Challenge Notebook: an autograded, guided assessment notebook that will help you test your understanding of using R to create models that predict which category an item belongs to. So grab your wand and let the adventure begin!
Goal | Description |
---|---|
What will you learn | How to create classification models in R |
What you'll need | Visual Studio Code, Docker Desktop, the Remote Development extension and Git |
Duration | 1.5 to 2 Hours |
Slides | PowerPoint |
🎥 Click this image to watch Carlotta walk you through the workshop material and to gain some tips about delivering this workshop.
This workshop allows learners to use the skills learnt in the module Introduction to classification models by using R and tidymodels to create their own classification models. As such, learners are encouraged to go through the module beforehand so as to be conversant with the concepts covered in this workshop. This workshop is the third in a series designed to get you started with data science in R, so you may also want to have a look at the first and second workshops of the series, which deal with exploratory data analysis and regression.
To get you up and running and writing R code in no time, we have containerized this workshop so that you have a ready, out-of-the-box R coding environment.
A development container is a running Docker container with a well-defined tool/runtime stack and its prerequisites. You can try out development containers with GitHub Codespaces or Visual Studio Code Remote - Containers.
Follow these steps to open this workshop in a Codespace:
- Click the Code drop-down menu and select the Open with Codespaces option.
- Select + New codespace at the bottom of the pane.
For more info, check out the GitHub documentation.
This workshop is also available on Binder. To open the notebook in a Binder environment, just click the button below.
You can also go through this challenge using the Learn Sandbox environment provided in Unit 9 of the MS Learn module - Introduction to classification models by using R and tidymodels. Just sign in with your Microsoft or GitHub account and click Activate sandbox to start.
Follow these steps to open this workshop in a container using the VS Code Remote - Containers extension:
- If this is your first time using a development container, please ensure your system meets the pre-reqs (i.e. Docker is installed) in the getting started steps.
- Press F1 and select the Add Development Container Configuration Files... command for Remote-Containers or Codespaces.

  Note: If needed, you can drag and drop the `.devcontainer` folder from this sub-folder in a locally cloned copy of this repository into the VS Code file explorer instead of using the command.

- Select this definition. You may also need to select Show All Definitions... for it to appear.
- Finally, press F1 and run Remote-Containers: Reopen Folder in Container to start using the definition.
This definition includes some test code that will help you verify it is working as expected on your system. Open the `all-systems-check` folder, where you can choose to run the `.R`, `.Rmd` or `.ipynb` scripts. You should see "Hello, remote world!" in an R terminal window (for `.R` and `.Rmd`) or within a Jupyter Notebook (for `.ipynb`) after the respective script executes.
At some point, you may want to make changes to your container, such as installing a new package. You'll need to rebuild your container for your changes to take effect.
Let's say you are a data scientist at a winemaking company. The company wants to produce the best wine, and you are tasked with working with the company's wine records and wine experts to realise this.
In this challenge, you will train a classification model to analyze the chemical and visual features of wine samples and classify them based on their cultivar (grape variety).
Rarely do we find data in the right form for analysis. A great way of starting your data analysis is to clean your data to make it consistent with the functions you'll use in your analysis pipeline.
In this section you will:

- Import the data and identify the `features` (predictors) and `label` (outcome) variables that you'll be working with.
- Encode the outcome column as a categorical variable, as in the sketch below.
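Here is a minimal sketch of these steps. The file name `wines.csv` and the outcome column `wine_variety` are assumptions for illustration; swap in the names used in the challenge notebook.

```r
# A sketch of importing and cleaning the data.
# Assumption: the data live in "wines.csv" and the outcome column
# is named "wine_variety" -- adjust both to match your notebook.
library(tidyverse)

# Import the data and take a look at the predictors and outcome
wine_data <- read_csv("wines.csv")
glimpse(wine_data)

# Encode the outcome column as a categorical variable (factor)
wine_data <- wine_data %>%
  mutate(wine_variety = factor(wine_variety))
```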
The goal of data exploration is to understand the relationships between the data's attributes; in particular, any apparent correlation between the features and the label your model will try to predict. One way of doing this is with data visualization, which enables you to identify trends and raise new questions about the data.
In this section you will:

- Restructure the data so that you can easily plot it as facets: subplots that each display one subset of the data.
- Visualize and explore the relationship between the outcome and each feature using boxplots, as in the sketch below.
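One possible way to do this with `tidyr` and `ggplot2`, continuing with the hypothetical `wine_data` tibble and `wine_variety` outcome from the previous sketch:

```r
# Pivot the data into long format so every feature can be plotted
# as its own facet, then draw one boxplot per variety per feature
library(tidyverse)

wine_data_long <- wine_data %>%
  pivot_longer(!wine_variety, names_to = "feature", values_to = "value")

ggplot(wine_data_long, aes(x = wine_variety, y = value, fill = wine_variety)) +
  geom_boxplot() +
  facet_wrap(~ feature, scales = "free_y") +
  theme(legend.position = "none")
```

Setting `scales = "free_y"` lets each facet use its own y-axis range, which helps here because the features are on very different scales.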
Now, time to train some models. It's best practice in supervised learning to split the data into two subsets: a (typically larger) set with which to train the model, and a smaller "hold-back" set with which to validate the trained model. This lets you evaluate how well the model performs and get a better estimate of how it will perform on new data.
In this section you will:

- Create a data splitting specification, i.e. what proportion of the data goes to training and what goes to testing.
- Extract the training and testing sets, as in the sketch below.
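A sketch using `rsample` (part of tidymodels); the 75/25 proportion and the seed are illustrative choices, not requirements:

```r
library(tidymodels)

# Make the random split reproducible
set.seed(2056)

# 75% of the data for training, 25% held back for testing;
# stratify on the outcome so both sets have similar class proportions
wine_split <- initial_split(wine_data, prop = 0.75, strata = wine_variety)

wine_train <- training(wine_split)
wine_test  <- testing(wine_split)
```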
Now that we have a set of training features and a corresponding training label, we can fit a multiclass classification algorithm to the data to create a model. In practice, it's common to perform some preprocessing of the data to make it easier for an algorithm to fit a model to it. A recipe is an object that defines a series of steps for data processing.
In this section, you will:

- Create a multinomial model specification.
- Create a recipe that specifies that all the numeric predictors should be normalized.
- Bundle the model specification and recipe into a workflow.
- Fit the entire workflow to the data, so that the model encapsulates all of the preprocessing steps as well as the multinomial model. The sketch below walks through these steps.
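A sketch of the whole pipeline, assuming the `wine_train` set from the previous step (all object names are illustrative):

```r
library(tidymodels)

# Multinomial regression model specification, using the nnet engine
multinom_spec <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification")

# Recipe: predict wine_variety from all other columns,
# normalizing all numeric predictors
wine_recipe <- recipe(wine_variety ~ ., data = wine_train) %>%
  step_normalize(all_numeric_predictors())

# Bundle the model specification and recipe into a workflow
wine_workflow <- workflow() %>%
  add_recipe(wine_recipe) %>%
  add_model(multinom_spec)

# Fit the workflow: preprocessing and model training in one step
wine_fit <- wine_workflow %>%
  fit(data = wine_train)
```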
Once you have a trained model using the training set, you will want to evaluate how well (or not) it will perform on new data.
In this section, you will:

- Use the trained model to make wine variety predictions and generate their corresponding probability predictions for the `test set`.
- Evaluate model performance using a `confusion matrix`.
- Evaluate model performance using other metrics that arise from the confusion matrix, such as `accuracy`, `precision` and `recall`.
- Evaluate model performance using the `ROC curve` and `ROC Area Under Curve`. A sketch of these steps follows.
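One way these steps could look with `yardstick` (part of tidymodels), continuing with the hypothetical `wine_fit`, `wine_test` and `wine_variety` names from earlier:

```r
library(tidymodels)

# Class predictions and class probabilities for the test set
wine_results <- wine_test %>%
  bind_cols(predict(wine_fit, new_data = wine_test)) %>%
  bind_cols(predict(wine_fit, new_data = wine_test, type = "prob"))

# Confusion matrix: predicted versus actual varieties
wine_results %>%
  conf_mat(truth = wine_variety, estimate = .pred_class)

# Accuracy, precision, and recall (macro-averaged for multiclass)
classification_metrics <- metric_set(accuracy, precision, recall)
wine_results %>%
  classification_metrics(truth = wine_variety, estimate = .pred_class)

# ROC curve and AUC: the probability columns (.pred_*) are named after
# the levels of the outcome factor, so select them dynamically here
prob_cols <- wine_results %>%
  select(starts_with(".pred_"), -.pred_class) %>%
  names()

wine_results %>%
  roc_curve(truth = wine_variety, all_of(prob_cols)) %>%
  autoplot()

wine_results %>%
  roc_auc(truth = wine_variety, all_of(prob_cols))
```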
Overall, did your model do a great job in classifying the wine varieties?
If you are satisfied with the model performance, you can save it to be used later. You can then load it whenever you need it, and use it to predict labels for new data. This is often called scoring or inferencing.
In this section you will:
- Save your trained model, and then use it to predict the variety for the following wines:
alcohol | malic_acid | ash | alcalinity | magnesium | phenols | flavanoids | nonflavanoids | proanthocyanins | color_intensity | hue | OD280_315_of_diluted_wines | proline |
---|---|---|---|---|---|---|---|---|---|---|---|---|
13.72 | 1.43 | 2.5 | 16.7 | 108 | 3.4 | 3.67 | 0.19 | 2.04 | 6.8 | 0.89 | 2.87 | 1285 |
12.37 | 0.94 | 1.36 | 10.6 | 88 | 1.98 | 0.57 | 0.28 | 0.42 | 1.95 | 1.05 | 1.82 | 520 |
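A sketch of saving and scoring, with the two wines from the table entered as a tibble (the file name `wine_model.rds` is illustrative):

```r
library(tidymodels)

# Save the fitted workflow for later use...
saveRDS(wine_fit, "wine_model.rds")

# ...then, in a later scoring session, load it back
loaded_model <- readRDS("wine_model.rds")

# The two new wine samples from the table above
new_wines <- tibble(
  alcohol = c(13.72, 12.37),
  malic_acid = c(1.43, 0.94),
  ash = c(2.5, 1.36),
  alcalinity = c(16.7, 10.6),
  magnesium = c(108, 88),
  phenols = c(3.4, 1.98),
  flavanoids = c(3.67, 0.57),
  nonflavanoids = c(0.19, 0.28),
  proanthocyanins = c(2.04, 0.42),
  color_intensity = c(6.8, 1.95),
  hue = c(0.89, 1.05),
  OD280_315_of_diluted_wines = c(2.87, 1.82),
  proline = c(1285, 520)
)

# Predict the variety for the new samples
predict(loaded_model, new_data = new_wines)
```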
Test your knowledge with a short quiz!
Congratulations on finishing this classification challenge 🏅!
There are other workshops for creating machine learning models in R. In this workshop, you learnt how to create a machine learning model that predicts categories, or classes. The next module will show you how to create clustering models (coming soon!).
In this workshop, you trained a single model (a multinomial classification model) to analyze the chemical and visual features of wine samples and classify them based on their cultivar (grape variety). Sometimes, a data practitioner may need to try out several models. Try using other classification models to do the same task, and try tuning some model hyperparameters while you're at it. Do you obtain better evaluation metrics?
Be sure to give feedback about this workshop! Happy Learning!