This repository contains the data, code, and Jupyter notebooks for the Data Validation for Data Science tutorial. The tutorial consists of three sections, one for each step in the production data science model life cycle:
- Database management (using Great Expectations)
- Training pipeline (using Pandera)
- Model serving (using Pydantic)
Each section comes with a notebook containing explanations, code snippets, and exercises; a short illustrative sketch of the kind of checks these libraries perform follows below.
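As a taste of what the notebooks cover, here is a minimal, illustrative sketch of the kind of checks Pandera and Pydantic can express. This is not the tutorial's actual code: the column names (`LotArea`, `SalePrice`, `OverallQual`) come from the House Prices dataset, while the model name `HouseFeatures` and the example values are made up for this sketch. The Great Expectations workflow is left to its own notebook.

```python
# Illustrative only -- not the tutorial's actual code.
import pandas as pd
import pandera as pa
from pydantic import BaseModel, Field

# pandera: declare a schema for a training DataFrame and validate it.
schema = pa.DataFrameSchema(
    {
        "LotArea": pa.Column(int, pa.Check.gt(0), coerce=True),
        "SalePrice": pa.Column(int, pa.Check.ge(0), coerce=True),
    }
)
df = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
validated = schema.validate(df)  # raises a SchemaError on invalid data

# Pydantic: validate a single prediction request at serving time.
class HouseFeatures(BaseModel):
    LotArea: int = Field(gt=0)
    OverallQual: int = Field(ge=1, le=10)

request = HouseFeatures(LotArea=8450, OverallQual=7)  # raises a ValidationError if invalid
```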
If you would like to see me run through these notebooks at PyData London 2022, watch the YouTube recording: Data Validation for Data Science | PyData London 2022
The dataset used for this tutorial is taken from the House Prices prediction competition on Kaggle. Two CSV files are located in the `data` folder: `train.csv` and `test.csv`.
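For orientation, here is a minimal sketch of loading these files, assuming pandas is installed and the repository root is the working directory:

```python
import pandas as pd

# Load the House Prices data shipped with the repository.
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

print(train.shape, test.shape)  # test.csv has no SalePrice column
```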
To follow the notebooks and exercises, there are two options:
- Use your own Python environment with Jupyter installed. Run the `jupyter notebook` command, select the notebook you want from the `notebooks` folder, and follow the instructions. To run the different tools with all of their features available, it is recommended to use Python 3.8 and up (a quick environment check is sketched after this list).
- Use Google Colaboratory, with no pre-installation needed. Click the link to go to the repository's GitHub page, choose one of the notebooks in the `notebooks` folder, and from the interactive view click on the link to open in Colab.
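If you go with a local environment, an optional sanity check along these lines can confirm the Python version and that the three validation libraries are importable; it assumes they are already installed (e.g. via pip):

```python
# Quick, optional environment check -- illustrative only.
import sys

assert sys.version_info >= (3, 8), "Python 3.8+ is recommended for this tutorial"

import great_expectations
import pandera
import pydantic

print("great_expectations", great_expectations.__version__)
print("pandera", pandera.__version__)
print("pydantic", pydantic.VERSION)
```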