In this NLP tip we'll have a look at spaCy 3, which introduces a lot of cool new stuff 🔥.
This tip is geared more towards production usage; we're going to showcase two of our favourite new features:
- transformer-based pipelines: these let you plug a pre-trained (Hugging Face) transformer into the pipeline, next to the components that are already available.
- wrapping pipelines in a spaCy project: one of the great new features to streamline your end-to-end workflows. It handles everything from start to finish: downloading and processing your data files, training, evaluating and visualizing your trained models, and packaging your best model for future use.
Before we dive in, a quick word on the target audience: we assume you already have a basic understanding of how spaCy works. If you're just getting started, have a look at the spaCy 101 guide to brush up on some NLP basics and the way spaCy is structured.
Since the pre-trained German transformer pipeline no longer contains an NER component, we've created a spaCy project that can be used to train a German NER model on a combination of two datasets (GermEval and WikiNER).
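You can verify this yourself; a minimal check, assuming the `de_dep_news_trf` pipeline is installed:

```python
import spacy

# install first with: python -m spacy download de_dep_news_trf
nlp = spacy.load("de_dep_news_trf")

# the pipeline ships without an 'ner' component
print(nlp.pipe_names)
# e.g. ['transformer', 'tagger', 'morphologizer', 'parser', ...]
```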
Since we're only going to use training components that are already available in spaCy, we don't need to write any actual Python code to run the training! All we need to do is make sure the data is in the correct format (using the `data_*.py` scripts) and run the `spacy train` CLI command with the desired config.
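A direct invocation could look like this (the corpus paths here are illustrative, and in this project the call is wrapped up in the `train` command for you):

```bash
# train a pipeline from a config; corpus paths are illustrative
python -m spacy train configs/cpu_acc.cfg \
  --output training/ \
  --paths.train corpus/train.spacy \
  --paths.dev corpus/dev.spacy
```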
We've provided three different config files:
- `cpu_eff.cfg`: Runs fast on CPU at the cost of lower accuracy.
- `cpu_acc.cfg`: Runs slower on CPU but has better accuracy.
- `gpu_trf.cfg`: Runs on GPU (or very slowly on CPU) and has the best accuracy.
To switch between them, all you need to do is change the `config` variable in the `project.yml` file and update the `gpu` variable to `-1` for CPU or `0` for GPU. If you're running on GPU, also make sure to install the correct extra dependency.
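For reference, the relevant section of the `project.yml` might look something like this (the `config` and `gpu` variable names come from the project; the exact layout is a sketch):

```yaml
vars:
  # one of: cpu_eff.cfg, cpu_acc.cfg, gpu_trf.cfg
  config: "cpu_acc.cfg"
  # -1 for CPU, 0 for (the first) GPU
  gpu: -1
```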
Finally, we've also provided two commands (implemented in the `visualize_*.py` scripts) to visualize the input data and the trained model using displaCy and Streamlit.
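As a rough idea of what such a script can look like, here's a minimal sketch built on the `spacy-streamlit` helper package (the model path is hypothetical, and the actual `visualize_*.py` scripts may differ):

```python
# visualize_model.py - launch with: streamlit run visualize_model.py
import spacy_streamlit

# hypothetical path to the trained pipeline produced by `spacy train`
MODEL_PATH = "training/model-best"

spacy_streamlit.visualize(
    [MODEL_PATH],
    default_text="Angela Merkel besuchte gestern Berlin.",
    visualizers=["ner"],
)
```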
Running everything end to end is pretty easy:
- Run `spacy project assets`. This will fetch the necessary assets (the WikiNER and GermEval datasets) from the locations provided in the project definition.
- Run `spacy project run all`. This will run all the steps defined in the `all` workflow (i.e. corpus, train, evaluate): it prepares and combines the datasets, runs the training using the config you selected, and evaluates the trained model on the test set.
- Run `spacy project run visualize-model`. This will launch a Streamlit app to visualize the model's outputs on custom input you provide.
- Run `spacy project run package`. This will package your model so it can be installed and reused in the future.
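If you prefer to copy-paste, the whole sequence is just four commands (all taken from the project definition):

```bash
# fetch the raw datasets listed under 'assets' in project.yml
spacy project assets

# run the 'all' workflow: corpus -> train -> evaluate
spacy project run all

# inspect the trained model interactively in the browser
spacy project run visualize-model

# build an installable package from the trained model
spacy project run package
```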
Make sure to check out the excellent documentation if you want to learn more about spaCy projects and the `project.yml` structure, or about the structure of the training configuration files.
The following part of the README explains which commands are available in the spaCy project and how you can run them. By the way, the rest of this README was automatically generated by spaCy, based on the `project.yml` 😲!
This project uses data from the GermEval and WikiNER datasets to train a German NER model.
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.
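To give a feel for that structure, here's an abridged, purely illustrative `project.yml` skeleton (not this project's exact file):

```yaml
title: "German NER"
vars:
  config: "cpu_acc.cfg"
  gpu: -1
assets:
  - dest: "assets/germaner/germaner_train.tsv"
    url: "..."  # source URL as listed in the assets table below
commands:
  - name: "train"
    help: "Train the full pipeline"
    script:
      - "python -m spacy train configs/${vars.config} --output training/ --gpu-id ${vars.gpu}"
    deps:
      - "corpus/train.spacy"
    outputs:
      - "training/model-best"
workflows:
  all:
    - corpus
    - train
    - evaluate
```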
The following commands are defined by the project. They can be executed using `spacy project run [name]`. Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `corpus` | Convert the data to spaCy's format |
| `train` | Train the full pipeline |
| `evaluate` | Evaluate on the test data and save the metrics |
| `visualize-model` | Visualize the model's output interactively using Streamlit |
| `visualize-data` | Explore the annotated data in an interactive Streamlit app |
| `package` | Package the trained model so it can be installed |
| `clean` | Remove intermediate files |
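For example, after tweaking a config you can re-run just the training step; spaCy skips it when its inputs are unchanged, unless you force it:

```bash
spacy project run train          # re-runs only if inputs changed
spacy project run train --force  # re-run regardless
```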
The following workflows are defined by the project. They can be executed using `spacy project run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
| Workflow | Steps |
| --- | --- |
| `all` | `corpus` → `train` → `evaluate` |
The following assets are defined by the project. They can be fetched by running `spacy project assets` in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets/wikiner/aij-wikiner-de-wp3.bz2` | URL | |
| `assets/germaner/germaner_train.tsv` | URL | |
| `assets/germaner/germaner_dev.tsv` | URL | |
| `assets/germaner/germaner_test.tsv` | URL | |