This project aims to manage the lifecycle of a Fonduer-based application. Roughly, the lifecycle of a Fonduer-based app has three phases: development, training, and serving.
Phase | Framework / Interface |
---|---|
Development | Jupyter Notebook / Web GUI |
Training | MLflow Project / CLI |
Serving | MLflow Model / REST API |
In the development phase, a developer writes Python code that defines a parser, mention/candidate extractors, labeling functions, and a classifier. Once these are defined, a model can be trained on a training document set. The trained model is then deployed and serves requests to extract knowledge from new documents.
Jupyter Notebook works well for development but is not always suitable for training and serving. This project uses MLflow in the training phase for reproducibility (of training) and in the serving phase for ease of packaging (of a trained model).
Contributions to the Fonduer project include:
- A definition of a Fonduer model: what it includes, and which parts are common to all apps and which differ between apps.
- A custom MLflow model for Fonduer, which can be used to package a trained Fonduer model, deploy it, and serve it.

Prerequisites:
- MLflow (v1.1.0 or higher)
- Anaconda or Miniconda
- Docker (optional)
fonduer_model.py defines FonduerModel, a custom MLflow model for Fonduer (see the MLflow documentation for details).
A developer is supposed to create a class, say MyFonduerModel, that inherits from FonduerModel and implements _classify(self, doc: Document) -> DataFrame.
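
The subclassing pattern described above can be sketched roughly as follows. This is an illustration only, written so that it runs without Fonduer or pandas installed: Document and FonduerModel here are simplified stand-ins (the real ones come from Fonduer and fonduer_model.py), rows are plain dicts rather than a DataFrame, the predict driver is hypothetical, and the regex rule is a toy placeholder for a trained classifier.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Document:
    """Stand-in for Fonduer's Document (assumption: just a name and text)."""
    name: str
    text: str


class FonduerModel(ABC):
    """Stand-in for the FonduerModel base class in fonduer_model.py."""

    @abstractmethod
    def _classify(self, doc: Document) -> List[Dict[str, str]]:
        """Return one row per relation extracted from the document."""

    def predict(self, docs: List[Document]) -> List[Dict[str, str]]:
        # Hypothetical driver: run _classify on each document and pool the rows.
        rows: List[Dict[str, str]] = []
        for doc in docs:
            rows.extend(self._classify(doc))
        return rows


class MyFonduerModel(FonduerModel):
    """App-specific model: only _classify has to be written."""

    # Toy rule standing in for a trained classifier (not the project's logic).
    _PATTERN = re.compile(
        r"(?P<name>[A-Z][\w. ]+?) was born in (?P<place>[A-Z]\w+)"
    )

    def _classify(self, doc: Document) -> List[Dict[str, str]]:
        return [
            {"Presidentname": m.group("name"), "Placeofbirth": m.group("place")}
            for m in self._PATTERN.finditer(doc.text)
        ]
```

The point of the design is that everything common to all apps lives in the base class, while an app supplies only its own _classify.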
A developer is also supposed to create fonduer_subclasses.py and fonduer_lfs.py, which define the mention/candidate subclasses and the labeling functions, respectively.
Download data.
$ ./download_data.sh
Deploy a PostgreSQL server if you don't have one.
$ docker run --name postgres -e POSTGRES_USER=`whoami` -d -p 5432:5432 postgres
Create a database.
$ docker exec postgres createdb -U `whoami` pob_presidents
Train a model.
$ mlflow run ./ -P conn_string=postgresql://localhost:5432/pob_presidents
A trained Fonduer model will be saved at ./fonduer_model with the following contents.
$ tree fonduer_model
fonduer_model
├── MLmodel
├── code
│ ├── fonduer_model.py
│ ├── fonduer_subclasses.py
│ └── my_fonduer_model.py
├── conda.yaml
└── model.pkl
This fonduer_model folder conforms to the MLflow Model format, so it is portable and can be deployed anywhere.
Note that the trained model can also be found under ./mlruns/<experiment-id>/<run-id>/artifacts.
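
Before deploying, it can be handy to confirm that a saved folder has the layout shown in the directory listing above. The following is a minimal sketch using only the standard library; the helper name missing_entries is hypothetical, and the list of expected entries is taken from that listing rather than from any MLflow API.

```python
from pathlib import Path
from typing import List

# Top-level entries taken from the directory listing above.
EXPECTED_ENTRIES = ["MLmodel", "conda.yaml", "model.pkl", "code"]


def missing_entries(model_dir: str) -> List[str]:
    """Return the expected top-level entries that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling missing_entries("./fonduer_model") after training returns an empty list when all four entries are present.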
There are several ways to deploy an MLflow-compatible model (see the MLflow documentation for details). One of them is shown here.
$ mlflow models serve -m fonduer_model -w 1
or alternatively,
$ mlflow models serve -m runs:/<run-id>/fonduer_model -w 1
Send the following request to the API endpoint (http://127.0.0.1:5000/invocations in this case):
$ curl -X POST -H "Content-Type:application/json; format=pandas-split" \
--data '{"columns":["html_path"], "data":["data/new/Al_Gore.html"]}' \
http://127.0.0.1:5000/invocations
You will get a response like the one below:
[
  {
    "Presidentname": "Al Gore",
    "Placeofbirth": "Washington",
    "html_path": "data/new/Al_Gore.html"
  }
]
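
The same request can also be issued from Python. The sketch below mirrors the curl command above (same endpoint, same Content-Type header, same payload shape) using only the standard library; the function names build_request and extract are hypothetical, and the sketch assumes a server started as shown above is listening on 127.0.0.1:5000.

```python
import json
import urllib.request
from typing import List

# Endpoint from the example above.
ENDPOINT = "http://127.0.0.1:5000/invocations"


def build_request(html_paths: List[str], url: str = ENDPOINT) -> urllib.request.Request:
    """Build the POST request; nothing is sent until urlopen is called."""
    # Same payload shape as the curl example: one column, one path per entry.
    payload = {"columns": ["html_path"], "data": html_paths}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; format=pandas-split"},
        method="POST",
    )


def extract(html_paths: List[str], url: str = ENDPOINT) -> list:
    """Send the request to a running server and decode the JSON response."""
    with urllib.request.urlopen(build_request(html_paths, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, extract(["data/new/Al_Gore.html"]) sends the request from the curl example and returns the decoded JSON response.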
Deploying with Docker requires MLflow v1.8.0 or higher (mlflow/mlflow#2691, mlflow/mlflow#2699).
Build a Docker image
$ mlflow models build-docker -m fonduer_model -n fonduer_model
Deploy
$ docker run -p 5000:8080 -v "$(pwd)"/data:/opt/mlflow/data fonduer_model
Most of the initial code was derived from the wiki tutorial in fonduer-tutorials. The Jupyter Notebook was converted to a Python script as follows:
$ jupyter nbconvert --to script some.ipynb
$ sed -i -e "s/get_ipython().run_line_magic('matplotlib', 'inline')/import matplotlib\nmatplotlib.use('Agg')/" some.py
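
For environments where sed is unavailable or behaves differently, the same substitution can be sketched in Python. This is only an approximation of the sed command above: it matches the magic line literally (sed treats the dots as wildcards), and the function names use_agg_backend and patch_script are hypothetical.

```python
from pathlib import Path

# The notebook-only line that nbconvert emits, and its headless replacement,
# both copied from the sed command above.
MAGIC_LINE = "get_ipython().run_line_magic('matplotlib', 'inline')"
REPLACEMENT = "import matplotlib\nmatplotlib.use('Agg')"


def use_agg_backend(source: str) -> str:
    """Swap the matplotlib inline magic for the non-interactive Agg backend."""
    return source.replace(MAGIC_LINE, REPLACEMENT)


def patch_script(path: str) -> None:
    """Rewrite a converted script in place, like sed -i."""
    script = Path(path)
    script.write_text(use_agg_backend(script.read_text()))
```

For example, patch_script("some.py") would be run after the nbconvert step.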