Pet project / Capstone project (2nd) for DataTalks.Club MLOps ZoomCamp'24:
Several models trained and optimized on 2 Churn datasets
3 classifiers used for prediction (known to be effective for churn prediction):
- DecisionTreeClassifier
- RandomForestClassifier
- XGBClassifier
The project can be tested and deployed in a cloud virtual machine (AWS, Azure, GCP), in GitHub CodeSpaces (the easiest option, and free), or locally (a GPU is not required).
Reproducing and reviewing this project should take less than an hour. For the GitHub CodeSpaces option you don't need anything extra at all: your favorite web browser plus a GitHub account is totally enough.
Harvard Business Review: "Depending on which study you believe, and what industry you’re in, acquiring a new customer is anywhere from 5 to 25 times more expensive than retaining an existing one. It makes sense: you don’t have to spend time and resources going out and finding a new client — you just have to keep the one you have happy". Other statistics show that increasing customer retention by 5% can grow a company’s profits by 25% to around 95% over time. Churn rate is an efficient indicator for subscription-based companies, so I decided to use Machine Learning to predict customer churn using data collected from e-commerce.
This is my 2nd MLOps project started during MLOps ZoomCamp'24.
And the main goal is straightforward: build an end-to-end Machine Learning project:
- choose dataset
- load & analyze data, preprocess it
- train & test ML model
- create a model training pipeline
- deploy the model (as a web service)
- finally monitor performance
- And follow MLOps best practices!
I found the Ecommerce Customer Churn dataset on Kaggle, analyzed it and experimented with different classification models. And I was surprised how much hyperparameter optimization could improve the results! I was curious enough to find another churn dataset to experiment with as well. It has a different set of features (only a few are similar, like tenure and gender). As a result I managed to build a more or less universal pipeline for this task that can easily switch datasets, and I learned a lot while making that possible.
Thanks to MLOps ZoomCamp for the reason to learn many new tools!
- MLFlow for ML experiment tracking
- Prefect for ML workflow orchestration
- Docker and docker-compose
- Localstack as AWS S3 service emulation for development
- Flask for web application
- MongoDB + WhyLogs for performance monitoring
- Setup environment
- Dataset
- Train model
- Test prediction service
- Deployment and Monitoring
- Best practices
- Fork this repo on GitHub.
- Create a GitHub CodeSpace from the repo, then start the CodeSpace.
- Run `pip install -r requirements.txt` to install the required packages.
- If you want to play with/develop the project, you can also run `pipenv run pre-commit install` to enable code formatting before committing to the repo.
For a cloud virtual machine:
- Create a virtual machine
- Clone the repository, `cd MLOps-churn`
- Follow steps 3 and 4 from above

For a local machine:
- Clone the repository, `cd MLOps-churn`
- Follow steps 3 and 4 from above
NB Tested with Python 3.11 and 3.12. As package versions might conflict with yours, GitHub CodeSpaces could be a good solution. Or you can just create a new virtual environment using `python -m venv mlopsenv` and then `source mlopsenv/bin/activate`, or by using `pipenv install`.
Dataset files are small enough to be included in the repo; they are located in the `./train_model/data/` directory.
You can switch datasets in `train_model/settings.py`, as well as paths and other parameters.
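Just to illustrate, here is a hypothetical sketch of the kind of switches such a settings file exposes (all names assumed; check the actual `train_model/settings.py`):

```python
# Hypothetical settings sketch - names are assumed, not the real file.
DATASET = 1                       # 1 or 2: which churn dataset to use
DATA_DIR = "./train_model/data/"  # where the CSV files live
MODEL_DIR = f"./model{DATASET}/"  # where the best model and encoders go
```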
Data preprocessing includes removing some unhelpful columns, fixing missing values and encoding categorical columns. Encoders are different for each dataset as the column sets differ, so they are stored in separate directories (together with the models).
For more info check out the Jupyter notebook.
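Below is a minimal preprocessing sketch along those lines, assuming hypothetical file and column names (the real steps live in the notebook and the `train_model` code):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("train_model/data/dataset1.csv")      # file name assumed
df = df.drop(columns=["CustomerID"], errors="ignore")  # drop an unhelpful column (name assumed)
df = df.fillna(df.median(numeric_only=True))           # fix missing numeric values

encoders = {}
for col in df.select_dtypes(include="object").columns:  # encode categorical columns
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
```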
Run `bash run-train-model.sh` or go to the `train_model` directory and run `python orchestrate.py`.
This will start a Prefect workflow (sketched below) to:
- load training data (dataset 1 or 2, according to `settings.py`) and train the encoder
- call `run_experiment()` with different hyperparameters and show accuracy results (the dataset is split into train and test)
- call `run_register_model()` to register the best model, which will be saved to the `./model` sub-directory (1 or 2)
- call `test_model()` to verify accuracy on the whole dataset
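Here is a minimal Prefect sketch of the shape of such a flow; the task bodies are placeholders, not the project's actual `orchestrate.py`:

```python
from prefect import flow, task

@task
def load_training_data(dataset: int):
    ...  # read the CSV, fit and save the encoder

@task
def run_experiment(classifier: str):
    ...  # split train/test, fit the model, log metrics

@task
def run_register_model():
    ...  # register the best run, save it to ./model1 or ./model2

@task
def test_model():
    ...  # verify accuracy on the whole dataset

@flow
def train_model_flow(dataset: int = 1):
    load_training_data(dataset)
    for classifier in ["DecisionTreeClassifier"]:
        run_experiment(classifier)
    run_register_model()
    test_model()

if __name__ == "__main__":
    train_model_flow()
```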
To explore the results, go to the `train_model` directory and run `mlflow server`.
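The experiments show up in the MLflow UI because each run logs its parameters and metrics; here is a minimal sketch of such logging (experiment name and backend store are assumed):

```python
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")  # assumed local backend store
mlflow.set_experiment("churn-prediction")       # assumed experiment name

with mlflow.start_run():
    mlflow.log_param("classifier", "DecisionTreeClassifier")
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("accuracy", 0.99)
```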
NB Your local install/environment might require starting the Prefect server before running `bash run-train-model.sh`; you can do it with `prefect server start`.
The Prefect workflow is located in `orchestrate.py`.
It trains 3 classifiers: DecisionTree, RandomForest and XGBoost. DecisionTree is very fast and quite effective; the others require more time, so I disabled them for your convenience:
```python
for classifier in [
    'DecisionTreeClassifier',
    # 'RandomForestClassifier',
    # 'XGBClassifier',
]:
```
Feel free to uncomment and test in full.
Full training includes 9 experiments (3 classifiers x 3 estimator settings), plus the ability to set ranges for other hyperparameters.
Inside the experiment (`train_model()` in `train_model.py`) there are more specific hyperparameters for the classifiers.
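As an illustration (values assumed, not the project's exact grids), such per-classifier spaces could look like this:

```python
# Illustrative hyperparameter spaces per classifier - values are assumed.
param_spaces = {
    "DecisionTreeClassifier": {"max_depth": [6, 8, 12], "min_samples_leaf": [1, 2, 5]},
    "RandomForestClassifier": {"n_estimators": [100, 200, 400], "max_depth": [8, 12, None]},
    "XGBClassifier": {"n_estimators": [100, 200, 400], "learning_rate": [0.05, 0.1, 0.3]},
}
```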
The integration test is very similar to deployment.
Test parameters are set in the `test.env` file, including settings for your AWS account and S3 bucket. You need to set them correctly for your deployment; otherwise it works with the Localstack emulation of the AWS S3 service.
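As a rough sketch of that fallback (the env var, bucket and key names are assumptions), pointing boto3 at Localstack looks like this:

```python
import os
import boto3

# Localstack serves S3 on port 4566 by default; with real AWS credentials
# you would drop endpoint_url and talk to the genuine service instead.
endpoint = os.getenv("S3_ENDPOINT_URL", "http://localhost:4566")  # env var name assumed
s3 = boto3.client("s3", endpoint_url=endpoint)
s3.download_file("mlops-churn-models", "model1/model.pkl", "model/model.pkl")  # names assumed
```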
Run `bash test-service.sh` or go to the `prediction_service` directory and run `bash test-run.sh`.
This will:
- copy the best model and latest scripts,
- build the Docker image,
- run it, and
- run `test-api.py` to execute requests to the web service (see the sketch below).

Finally, the Docker container will be stopped.
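Here is a minimal sketch of the kind of request `test-api.py` sends (port, endpoint and payload fields are assumptions):

```python
import requests

customer = {"Tenure": 4, "Gender": "Male"}  # feature names are dataset-specific
resp = requests.post("http://localhost:9696/predict", json=customer, timeout=10)
print(resp.json())  # e.g. a churn label/probability, depending on the service
```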
Advanced testing can be executed by running `docker compose up --build` in `prediction_service` (check the `docker-compose.yaml` settings: MongoDB, Localstack).
To deploy the web service, set your parameters in the `test.env` file, then run `bash deploy-service.sh`.
Monitoring is done by storing requests and predictions in a MongoDB database, then using WhyLogs (`data-drift-test.py`) to check for data/prediction drift.
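A minimal sketch of such a check (DB/collection names and file paths are assumptions; whylogs v1 API):

```python
import pandas as pd
import whylogs as why
from pymongo import MongoClient
from whylogs.viz import NotebookProfileVisualizer

# Pull the logged requests (current data) and the training data (reference).
client = MongoClient("mongodb://localhost:27017")
current = pd.DataFrame(list(client["prediction_service"]["data"].find({}, {"_id": 0})))
reference = pd.read_csv("train_model/data/dataset1.csv")

# Profile both and render a per-column drift report.
viz = NotebookProfileVisualizer()
viz.set_profiles(
    target_profile_view=why.log(current).view(),
    reference_profile_view=why.log(reference).view(),
)
viz.summary_drift_report()
```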
* [x] Unit tests
* [x] Integration test (== Test prediction service)
* [x] Code formatter (isort, black)
* [x] Makefile
* [x] Pre-commit hooks
* [x] GitHub workflow for testing on push/pull request
By using 3 classifiers and tuning different hyperparameters I managed to achieve 99% accuracy for dataset 1. The best results were achieved with XGBClassifier. To be honest, I was surprised how much those hyperparameters affect prediction accuracy! You have very low chances of finding the optimal combination just by playing with Jupyter notebooks; MLFlow rules! Another surprise is that DecisionTreeClassifier can be quite close in accuracy with much faster execution! Of course, it depends on the dataset.
You can find additional information on which parameters yield better performance in the screenshots.
As I mentioned, I experimented with 2 datasets and made the web service flexible enough to
- recognize a change of dataset and redirect to the respective model's prediction
- update model files from the S3 bucket, making it possible to monitor data, retrain the model and command the service to 'upgrade' without restarting it
- report service parameters via a `/status` request (check `app.py`; a simplified sketch follows below)
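To give an idea of the shape of such a service (this is a simplified sketch, not the actual `app.py`; routes, file layout and fields are assumptions):

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
MODELS = {}  # dataset id -> model, loaded at startup or on 'upgrade'

def load_model(dataset_id: int):
    # Assumed file layout: model1/ and model2/ hold the registered models.
    with open(f"model{dataset_id}/model.pkl", "rb") as f:
        MODELS[dataset_id] = pickle.load(f)

@app.route("/status")
def status():
    # Report which models are currently loaded.
    return jsonify({"loaded_models": sorted(MODELS)})

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # The real service recognizes the dataset by its feature set; here we
    # use an explicit request field purely for illustration.
    dataset_id = payload.pop("dataset", 1)
    # In the real service the payload is encoded the same way as in training.
    prediction = MODELS[dataset_id].predict([list(payload.values())])[0]
    return jsonify({"churn": int(prediction)})
```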
That was fun!
There is more to explore in churn prediction: I found another dataset, so new experiments are ahead!
So stay tuned! (you can ⭐️star the repo to be notified about updates).
🙏 Thank you for your attention and time!
- If you experience any issue while following these instructions (or something is left unclear), please add it to Issues; I'll be glad to help/fix. Your feedback, questions & suggestions are welcome as well!
- Feel free to fork and submit pull requests.
If you find this project helpful, please ⭐️star⭐️ my repo https://github.com/dmytrovoytko/MLOps-churn-prediction to help other people discover it 🙏
Made with ❤️ in Ukraine 🇺🇦 Dmytro Voytko