This repository was created as part of the Machine Learning Bootcamp by Alexey Grigorev and has been submitted as the midterm project for the course. It takes a dataset about coffee quality with various attributes of the raw coffee, including its origin and processing, as well as the assigned quality scores. I chose this dataset because I was a coffee roaster before working as a software engineer, so I have pre-existing domain knowledge and a passion for coffee ☕ If you notice any mistakes or possible improvements to the code, feel free to open an issue 💖
Coffee quality is assessed by tastings called cuppings, where the coffee is scored on various qualities such as aroma, balance, sweetness and cleanliness of the flavour. These scores are combined into a total cupping score. Speciality coffees are normally those which score over 85 points, and coffees scoring over 90 are considered exceptionally high quality. The quality of a coffee is dependent on a number of factors, including the varietal, growing conditions and processing method, as well as the storage of the green sample before roasting.
Predicting quality before roasting could be useful for farmers, who often don't have the means to cup their own coffee, or for buyers looking to narrow down the selection of coffees they taste based on a range of scores (i.e. only tasting coffees predicted to be over 90 points, or between 85 and 90).
This project focuses on a classification problem: predicting whether a coffee sample's total cup score is above 85. It could also be interesting to look at the individual scores, for example to predict a coffee's sweetness or aroma score, and potentially to offer recommendations on coffees based on a user's preferences.
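As a rough illustration of how the classification target described above could be derived, here is a minimal sketch in plain Python. The function name `label_samples` and the example scores are hypothetical and are not taken from the project's notebooks; only the threshold of 85 comes from this README.

```python
from typing import Iterable, List

# Cut-off named in this README: a sample is a positive example
# when its total cup score is strictly above 85.
THRESHOLD = 85.0


def label_samples(total_cup_scores: Iterable[float],
                  threshold: float = THRESHOLD) -> List[int]:
    """Return 1 for scores above the threshold and 0 otherwise."""
    return [int(score > threshold) for score in total_cup_scores]


# Made-up scores, used purely for illustration.
scores = [82.5, 85.0, 86.25, 90.58]
labels = label_samples(scores)
print(labels)
```

Note that a score of exactly 85 is labelled 0 here, since the target is defined as "above 85"; the notebooks may treat the boundary differently.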
Where to find the files for evaluation :)
- 📂 Analysis
  - This directory has all the notebooks for exploring the data as well as building, tuning and evaluating the model:
    - Data preparation and EDA notebook
    - Model selection notebook
- 📂 App
  - This directory contains the code for the Flask web server, which can be used to make predictions on new samples using the final trained model. Here you will find the scripts train.py and predict.py as well as the Pipenv and Docker files for running the service.
- 📂 Deployment
  - This project has been deployed on Heroku using the instructions kindly shared by Ninad in the course Slack channel. The endpoint will remain available until the end of the evaluation period.
  - An example request to this endpoint would be:
curl --header "Content-Type: application/json" \
  --request POST \
  -d @app/test_data/coffee_sample.json \
  https://coffee-quality-prediction.herokuapp.com/predict
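For anyone who prefers to call the endpoint from Python rather than curl, the request above could be sketched with the standard library as follows. The helper names `build_request` and `predict` are hypothetical, and the sketch assumes the service replies with a JSON body, which this README does not state.

```python
import json
import urllib.request

# Endpoint taken from the curl example in this README.
PREDICT_URL = "https://coffee-quality-prediction.herokuapp.com/predict"


def build_request(sample: dict, url: str = PREDICT_URL) -> urllib.request.Request:
    """Build a JSON POST request equivalent to the curl command."""
    body = json.dumps(sample).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(sample: dict, url: str = PREDICT_URL, timeout: float = 10.0) -> dict:
    """Send the sample and decode the reply (assumed to be JSON)."""
    request = build_request(sample, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, run from the repository root while the service is available:
#   with open("app/test_data/coffee_sample.json") as f:
#       sample = json.load(f)
#   print(predict(sample))
```

Building the request separately from sending it makes it possible to inspect the headers and body without touching the network.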
I advise using a virtual environment for running this project. Below are instructions for doing so using venv, which ships with Python 3 (on Debian or Ubuntu it may need to be installed separately with sudo apt install python3-venv). Additionally, if you would like to run the analysis notebooks or the app in Docker, you will need to have Docker installed.
# create virtual environment
python3 -m venv coffee-quality
# start the virtual environment
source coffee-quality/bin/activate
# install the dependencies for local linting, plus pipenv,
# which the app uses for dependency management
pip install -r requirements.txt
Then go to the respective README for further instructions on running either the analysis notebooks or the Flask prediction application.
The data used for this project was gathered from the Coffee Quality Institute (CQI) in January 2018. The scraping was performed by James LeDoux, and more details can be found here.
The project is linted with Black and Flake8. It is recommended to run both locally before pushing code (they are already installed inside the virtual environment if you followed the instructions above), as they are enforced by the GitHub Actions (see below).
When code is pushed to GitHub, actions run to lint it. You can find, add and update these actions in .github/workflows.