- Overview
- Demo videos
- Tools / Technologies
- Adopted practices
- Service ports
- How to use
- How everything works together
- From On-Premises to On-Cloud
- DB schemas
- Troubleshooting
Welcome to our comprehensive on-premises MLOps ecosystem designed specifically for Computer Vision tasks, with a primary focus on image classification. This repository equips you with everything you need, from a development workspace in Jupyter Lab/Notebook to production-level services. The best part? It only takes "1 config and 1 command" to run the whole system from building the model to deployment! We've integrated numerous best practices to ensure scalability and reliability while maintaining flexibility. While our primary use case revolves around image classification, our project structure can easily adapt to a wide range of ML/DL developments, even transitioning from on-premises to cloud!
Another goal is to show how to integrate all these tools and make them work together in one full system. If you're interested in specific components or tools, feel free to cherry-pick what suits your project's needs.
The entire system is containerized into a single Docker Compose file. To set it up, all you have to do is run `docker-compose up`! This is a fully on-premises system, which means no need for a cloud account, and it won't cost you a dime to use the entire system!
We highly recommend watching the demo videos in the Demo videos section to get a comprehensive overview and understand how to apply this system to your projects. These videos cover important details that would be too lengthy to explain clearly here.
Demo: https://youtu.be/NKil4uzmmQc
In-depth technical walkthrough: https://youtu.be/l1S5tHuGBA8
Resources in the video:
To use this repository, you only need Docker. For reference, we use Docker version 24.0.6, build ed223bc, and Docker Compose version v2.21.0-desktop.1 on a Mac M1.
- Platform: Docker
- Workspace: Jupyter Lab
- Deep Learning framework: TensorFlow
- Data versioning: DVC
- Data validation: DeepChecks
- Machine Learning platform / Experiment tracking: MLflow
- Pipeline orchestrator: Prefect
- Machine Learning service deployment: FastAPI, Uvicorn, Gunicorn, Nginx (+ HTML, CSS, JS for a simple UI)
- Databases: PostgreSQL (SQL), Prometheus (Time-series)
- Machine Learning model monitoring & drift detection: Evidently
- Overall system monitoring & dashboard: Grafana
We've implemented several best practices in this project:
- Efficient data loader/pipeline using `tf.data` for TensorFlow (see the sketch after this list)
- Image augmentation with the `imgaug` lib for greater flexibility in augmentation options than TensorFlow's core functions
- Using `os.env` for important or service-level configs
- Logging with the `logging` module instead of `print`
- Database storage for service response results
- Dynamic configuration through `.env` for variables in `docker-compose.yml`
- Using `default.conf.template` for Nginx to elegantly apply environment variables in the Nginx config (a new feature in Nginx 1.19)
- Configuration of Nginx for terminal log display
- Setting up a Prefect worker to support working on a cluster
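Below is a minimal sketch of how the first two practices (a `tf.data` pipeline plus `imgaug` augmentation) can fit together. The file paths, image size, and augmenter choices are illustrative assumptions, not this repo's exact implementation:

```python
# Sketch: tf.data pipeline with imgaug-based augmentation.
# File paths, image size, and augmenters below are illustrative assumptions.
import numpy as np
import tensorflow as tf
import imgaug.augmenters as iaa

IMG_SIZE = (224, 224)
augmenter = iaa.Sequential([iaa.Fliplr(0.5), iaa.Affine(rotate=(-15, 15))])

def load_image(path, label):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, IMG_SIZE)
    return img, label

def augment(img, label):
    # imgaug operates on numpy arrays, so wrap it with tf.numpy_function.
    def _aug(image):
        batch = image[np.newaxis].astype(np.uint8)
        return augmenter(images=batch)[0].astype(np.float32)
    img = tf.numpy_function(_aug, [img], tf.float32)
    img.set_shape((*IMG_SIZE, 3))
    return img, label

# Hypothetical file list and labels
paths = ["datasets/animals10-dvc/cat/001.jpg", "datasets/animals10-dvc/dog/001.jpg"]
labels = [0, 1]

ds = (
    tf.data.Dataset.from_tensor_slices((paths, labels))
    .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```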
Most of the ports can be customized in the `.env` file at the root of this repository. Here are the defaults:

- JupyterLab: 8888 (pw: `123456789`)
- MLflow: 5050
- Prefect: 4200
- PostgreSQL: 5432
- pgAdmin: 16543 (user: `[email protected]`, pw: `SuperSecurePwdHere`)
- Deep Learning service: 4242
- Web UI interface for Deep Learning service: 4243
- Nginx: 80
- Evidently: 8000
- Prometheus: 9090
- Grafana: 3000 (user: `admin`, pw: `admin`)
You will need to comment out the `platform: linux/arm64` lines in `docker-compose.yml` if you are not using an ARM-based computer (we use a Mac M1 for development). Otherwise, the system will not work.
- Clone this repo. There are 2 submodules in this repo, so consider using the `--recurse-submodules` flag in your command: `git clone --recurse-submodules https://github.com/jomariya23156/full-stack-on-prem-cv-mlops`
- [For users with CUDA] If you have CUDA-compatible GPU(s), you can uncomment the `deploy` section under the `jupyter` service in `docker-compose.yml` and change the base image in `services/jupyter/Dockerfile` from `ubuntu:18.04` to `nvidia/cuda:11.4.3-cudnn8-devel-ubuntu20.04` (the text is already there in the file; you just need to comment and uncomment) to leverage your GPU(s). You might also need to install `nvidia-container-toolkit` on the host machine to make it work. For Windows/WSL2 users, we found this article very helpful.
- At the root of the repo directory, run `docker-compose up` (or `docker-compose up -d` to detach from the terminal).
- The first run can take a while due to the size of the images, especially the jupyter image, since it contains a lot of packages and libraries. Generally, it takes from 5 to 20 minutes.
- Go to the DVC submodule at `datasets/animals10-dvc` and follow the steps in its How to use section.
- Open Jupyter Lab on port 8888: `http://localhost:8888/lab`
- Go to the workspace directory: `cd ~/workspace/`
- Activate the conda environment (the name is configurable in `docker-compose.yml`): `conda activate computer-viz-dl`
- Run `python run_flow.py --config configs/full_flow_config.yaml`
- Sit back and watch your brand-new classifier be built, trained, evaluated, deployed (at scale), and monitored on a fully operating system!
- There are a lot of things working together, and it's hard to go into detail on every bit. There's no better way to learn than getting your hands dirty: read the code, try to understand it, and try customizing it yourself!
- That said, there are some guidelines you can follow to make the components work together seamlessly:
- Your tasks should be created inside the `tasks` directory
- All your tasks are supposed to be called from flows, which are created inside the `flows` directory
- Your flows should be called with `run_flow.py` at the root of the repo
- In order to be called this way, you have to implement a `start(config)` function in your flow file. This function accepts the config as a Python dict and then basically calls the specific flow in that file. (A minimal sketch follows this list.)
- Datasets should live inside the `datasets` directory, and they should all have the same directory structure as the one inside this repo
- `central_storage` at `~/ariya/` should contain at least 2 subdirectories named `models` and `ref_data`. This `central_storage` serves the object-storage purpose of storing all staged files to be used across development and deployment environments. (This is one of the things you could consider changing to a cloud storage service in case you want to deploy on-cloud and make it more scalable.)
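To illustrate the `start(config)` convention mentioned above, here is a minimal, hypothetical sketch of a flow file (the flow and task names are made up for illustration; the real tasks live in the `tasks` directory):

```python
# flows/example_flow.py (hypothetical names, for illustration only)
from prefect import flow, task

@task
def prepare_dataset(dataset_cfg: dict) -> str:
    # A real task would live in the `tasks` directory and be imported here.
    return dataset_cfg.get("name", "unknown-dataset")

@flow(name="example-train-flow")
def example_train_flow(config: dict) -> str:
    dataset_name = prepare_dataset(config.get("dataset", {}))
    # ... build the model, train, log to MLflow, etc. ...
    return dataset_name

def start(config: dict):
    # Entry point expected by run_flow.py: receives the parsed config
    # as a Python dict and calls the specific flow defined in this file.
    return example_train_flow(config)
```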
IMPORTANT conventions to be SUPER EXTRA CAREFUL about if you want to change them (because these things are tied together and used in different parts of the system):

- The `central_storage` path -> inside it there should be `models/` and `ref_data/` subdirectories
- File naming in `central_storage`, e.g. `<model_name>.yaml`, `<model_name>_uae`, `<model_name>_bbsd`, `<model_name>_ref_data.parquet`
- All database schemas (columns) -> they are linked in many places (mainly dl_service, prefect_worker/repo, evidently)
- Key/Name of the Prefect Variables `current_model_metadata_file` and `monitor_pool_name` (see the sketch after this list)
- Prefect version 2.13.2: there were bugs in templating the prefect.yaml file that got fixed in this version. So, unless necessary, don't go below this version; otherwise, you will need to make changes to Prefect-related files.
- Evidently version 0.4.5: the bug in MMD, which we use as a method for embedding drift detection, was fixed in this version. Again, if you want to change the version, try not to go below 0.4.5.
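For reference, reading those Prefect Variables from Python in Prefect 2.x might look like the sketch below (the variable names come from this repo; treat the retrieval call as an assumption and check the Prefect docs for your exact version):

```python
# Sketch: reading the Prefect Variables this system relies on.
# Assumes a Prefect 2.x release where the `variables` helper is available.
from prefect import variables

# Name of the currently staged model metadata file (ends with .yaml)
current_model_metadata_file = variables.get("current_model_metadata_file")

# Work pool name used for deploying the Prefect worker and monitor flow
monitor_pool_name = variables.get("monitor_pool_name")

print(current_model_metadata_file, monitor_pool_name)
```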
- Jupyter Lab serves as your workspace for coding. It includes a pre-installed Conda environment named `computer-viz-dl` (the default value) with all the packages required for this repository. All Python commands/code are meant to be run within this Jupyter.
- Prefect orchestrates all the main execution code, including tasks and flows.
- The `central_storage` volume acts as the central file storage used throughout development and deployment. It mainly contains model files (including drift detectors) and reference data in Parquet format. At the end of the model training step, new models are saved here, and the deployment service pulls models from this location. (Note: this is an ideal place to replace with a cloud storage service for scalability.)
- These are the step-by-step explanations of what happens when you run the full flow. The full flow comprises 3 subflows: train, evaluate, and deploy, running sequentially. Each flow has its own set of config files; it can be a dedicated .yaml file for each flow, or a single .yaml file for the full flow (take a look at the files in the config folder):
- Train flow
  - Read the config.
  - Use the `model` section in the config to build a classifier model. The model is built with TensorFlow, and its architecture is hardcoded at `tasks/model.py:build_model`.
  - Use the `dataset` section in the config to prepare a dataset for training. DVC is used in this step to check the consistency of the data on disk against the version specified in the config. If there are changes, it reverts the data to the specified version programmatically. If you want to keep the changes (e.g. when you're experimenting with the dataset), you can set the `dvc_checkout` field in the config to false so that DVC won't do its thing.
  - DeepChecks then validates the prepared dataset and saves the result report. You can add conditions in this step, for example: if some serious tests fail, terminate the process so that it doesn't train a bad model.
  - Use the `train` section in the config to build a data loader and start the training process. Experiment info and artifacts are tracked and logged with MLflow. Note: the result report (an .html file) from DeepChecks is also uploaded to the training experiment on MLflow by convention.
  - Build the model metadata file from the `model` section in the config.
  - Save the trained model and its corresponding metadata file to the local disk.
  - Upload the model and model metadata files to `central_storage` (in this case, it's just a copy to the `central_storage` location; this is the step you can change to upload files to cloud storage instead).
  - Build drift detectors based on the trained model and the `model/drift_detection` section in the config.
  - Save and upload the drift detectors to `central_storage`.
  - Generate reference data using the drift detectors and the dataset.
  - Save the reference data as a .parquet file and upload it to `central_storage`.
  - Return the paths of the uploaded models and model metadata file for the next flow.
- Evaluation flow
  - Load the saved models and model metadata files.
  - Prepare a dataset for testing (reusing the same task as in the train flow).
  - Build a data loader from the config and evaluate the model.
  - Log the results to MLflow.
- Deployment flow
  - Send a PUT request to trigger the running service (served with FastAPI + Uvicorn + Gunicorn + Nginx) to fetch the newly trained model from `central_storage`. (This is one concern discussed in the tutorial demo video; watch it for more detail. A rough sketch of this step follows the list below.)
  - Create or update (if they already exist) the Prefect variables for monitoring configuration. There are mainly 2 variables: `current_model_metadata_file`, storing the model metadata file name ending with .yaml, and `monitor_pool_name`, storing the work pool name for deploying the Prefect worker and flows.
  - Deploy the Prefect monitor flow, which internally fetches data from PostgreSQL and uses Evidently to compute data-drift reports and metrics. Programmatically, this `cd`s into `deployments/prefect-deployments` and runs `prefect --no-prompt deploy --name {deploy_name}` using inputs from the `deploy/prefect` section in the config.
  - The monitor flow is scheduled to run weekly (once a week), but you can also run the deployed flow manually from the Prefect UI. Check out the official docs on how to do this (pretty simple and straightforward).
  - You can also view the data drift dashboard in the Evidently UI (port 8000 by default).
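For the deployment flow's first and third steps, a rough sketch is shown below. The service endpoint path and payload keys are assumptions (the actual API is defined in `services/dl_service`); the `prefect deploy` invocation mirrors the command quoted above:

```python
# Sketch of two deployment-flow steps: triggering a model reload on the
# running service, and deploying the Prefect monitor flow.
# The endpoint path and payload keys are illustrative assumptions.
import subprocess
import requests

def trigger_model_reload(service_url: str, model_metadata_file: str) -> None:
    # PUT request asking the running DL service to pull the newly trained
    # model (and its metadata) from central_storage.
    resp = requests.put(
        f"{service_url}/update_model",  # hypothetical endpoint
        json={"model_metadata_file": model_metadata_file},
        timeout=30,
    )
    resp.raise_for_status()

def deploy_monitor_flow(deploy_name: str) -> None:
    # Run `prefect deploy` non-interactively from the prefect-deployments
    # directory, as described in the deployment flow above.
    subprocess.run(
        ["prefect", "--no-prompt", "deploy", "--name", deploy_name],
        cwd="deployments/prefect-deployments",
        check=True,
    )
```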
Since everything is already dockerized and containerized in this repo, converting the service from on-prem to on-cloud is pretty straightforward. When you finish developing and testing your service API, you can simply take `services/dl_service`, build the container from its Dockerfile, and push it to a cloud container registry service (AWS ECR, for example). That's it!
Note: There is one potential problem in the service code if you want to use it in a real production environment. I have addressed it in the in-depth video, and I recommend spending some time watching the whole video.
We have three databases inside PostgreSQL: one for MLflow, one for Prefect, and one that we've created for our ML model service. We won't delve into the first two, as they are self-managed by those tools. The database for our ML model service is the one we've designed ourselves.
To avoid overwhelming complexity, we've kept it simple with only two tables. The relationships and attributes are shown in the ERD below. Essentially, we aim to store essential details about incoming requests and our service's responses. All these tables are created and manipulated automatically, so you don't need to worry about manual setup.
Noteworthy: `input_img`, `raw_hm_img`, and `overlaid_img` are base64-encoded images stored as strings. `uae_feats` and `bbsd_feats` are arrays of embedding features for our drift detection algorithms.
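As a rough, hypothetical sketch of what one of these tables could look like (only the columns called out above come from this repo; the table name and remaining columns are assumptions, and the actual schema lives in the service code):

```python
# Hypothetical SQLAlchemy sketch of a service-response table, reconstructed
# only from the columns mentioned above; not the repo's actual schema.
from sqlalchemy import Column, DateTime, Float, Integer, String, Text, func
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class PredictionRecord(Base):  # table/class name is an assumption
    __tablename__ = "prediction_records"

    id = Column(Integer, primary_key=True)
    created_at = Column(DateTime, server_default=func.now())
    input_img = Column(Text)           # base64-encoded input image (string)
    raw_hm_img = Column(Text)          # base64-encoded raw heatmap image (string)
    overlaid_img = Column(Text)        # base64-encoded overlaid image (string)
    prediction = Column(String)        # predicted class label (assumed column)
    uae_feats = Column(ARRAY(Float))   # embedding features for drift detection
    bbsd_feats = Column(ARRAY(Float))  # embedding features for drift detection
```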
- If you face the `ImportError: /lib/aarch64-linux-gnu/libGLdispatch.so.0: cannot allocate memory in static TLS block` error, try `export LD_PRELOAD=/lib/aarch64-linux-gnu/libGLdispatch.so.0`, then rerun your script.