
spaCy 3.0

In this NLP tip we'll have a look at spaCy 3, which introduces a lot of cool new stuff 🔥.

In this tip, which is geared towards production usage, we're going to showcase two of our favourite new features:

  1. Transformer-based pipelines: alongside the already available pipeline components, you can now integrate a pre-trained (Hugging Face) transformer.
  2. Wrapping pipelines in a spaCy project: one of the great new features for streamlining your end-to-end workflows. It handles everything from start to finish: downloading and processing your data files, training, evaluating and visualizing your trained models, and packaging your best model for future use.

☝️ Before getting started

Before we dive in, first a quick word on the target audience. We assume that you already have a basic understanding of how spaCy works. If you're just getting started, have a look at the spaCy 101 to brush up on some NLP basics and the way spaCy is structured.

🏋️ Training a German NER model

Since the pre-trained German transformer pipeline no longer contains a NER component, we've created a spaCy project that can be used to train a German NER model on a combination of two datasets (germeval and wikiner).

Since we're only going to use training components that are already available in spaCy, we don't need to write actual Python code to run the training! All we need to do is make sure the data is in the correct format (using the data_*.py scripts) and run the spaCy train CLI command with the desired config.
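The "correct format" here is spaCy's binary `.spacy` format, produced from `Doc` objects via a `DocBin`. As a minimal sketch of what the `data_*.py` scripts boil down to (the sentence and entity annotations below are made up for illustration, not taken from the actual datasets):

```python
import spacy
from spacy.tokens import DocBin, Span

# Build a tiny annotated corpus in spaCy's binary training format.
nlp = spacy.blank("de")
doc = nlp("Angela Merkel besuchte Berlin.")
# Mark "Angela Merkel" as PER and "Berlin" as LOC (token indices).
doc.ents = [Span(doc, 0, 2, label="PER"), Span(doc, 3, 4, label="LOC")]

db = DocBin()
db.add(doc)
db.to_disk("train.spacy")  # spacy train reads this via the [paths.train] setting
```

`spacy train` then points at files like this for its training and development corpora.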
We've provided three different config files:

  1. cpu_eff.cfg: Runs fast on CPU at the cost of lower accuracy.
  2. cpu_acc.cfg: Runs slower on CPU but achieves better accuracy.
  3. gpu_trf.cfg: Runs on GPU (or very slowly on CPU) and has the best accuracy.
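The transformer-backed config wires a Hugging Face model into the pipeline directly from the training config, with no Python code. A sketch of the relevant section (the model name below is an assumption for illustration; the actual gpu_trf.cfg may use a different checkpoint):

```ini
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
# Assumed checkpoint for illustration -- any Hugging Face model name works here.
name = "bert-base-german-cased"
```

Downstream components such as the NER listener then draw their token representations from this shared transformer.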

In order to switch between them, all you need to do is change the config variable in the project.yml file and update the gpu variable to -1 for CPU or 0 for GPU. If you're running on GPU, also make sure to install the correct extra dependency.
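Concretely, the switch happens in the vars section of project.yml, which might look something like this (a sketch based on the variable names described above; the exact layout of the actual file may differ):

```yaml
vars:
  config: "gpu_trf.cfg"   # or cpu_eff.cfg / cpu_acc.cfg
  gpu: 0                  # 0 = first GPU, -1 = CPU
```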
Finally, we've also provided two commands (implemented in the visualize_*.py scripts) to visualize the input data and the trained model using displaCy and Streamlit.
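The core of such a visualization is spaCy's displacy.render, which turns a Doc into HTML that Streamlit can embed. A minimal, model-free sketch (hand-set entities, not the actual visualize_*.py code):

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# No trained model needed for a quick sketch: annotate entities by hand.
nlp = spacy.blank("de")
doc = nlp("Angela Merkel besuchte Berlin.")
doc.ents = [Span(doc, 0, 2, label="PER"), Span(doc, 3, 4, label="LOC")]

# render() returns an HTML string; in the project this would be embedded
# in a Streamlit app via st.components or spacy-streamlit.
html = displacy.render(doc, style="ent", jupyter=False)
```

With a trained pipeline, you'd simply replace the hand-set entities with `doc = nlp(user_input)`.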

⚙️ Running the project end-to-end

Running everything end to end is pretty easy:

  1. Run spacy project assets. This will fetch the necessary assets (wikiner and germeval datasets) from the locations provided in the project definition.
  2. Run spacy project run all. This will run all the steps defined in the all workflow (i.e. corpus, train, evaluate): it prepares and combines the datasets, runs the training using the config you selected, and evaluates the trained model on the test set.
  3. Run spacy project run visualize-model. This will launch a streamlit app to visualize the model outputs on custom input you provide.
  4. Run spacy project run package. This will package your model so it can be installed and reused in the future.

Make sure to check out the excellent documentation if you want to learn more about spaCy projects and the project.yml structure or the structure of the training configuration files.
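For a feel of that project.yml structure, a command definition looks roughly like this (a hedged sketch: the paths and dependency list are assumptions for illustration, not copied from the actual project):

```yaml
commands:
  - name: "train"
    help: "Train the full pipeline"
    script:
      - "python -m spacy train configs/${vars.config} --output training/ --gpu-id ${vars.gpu}"
    deps:
      - "corpus/train.spacy"
    outputs:
      - "training/model-best"
```

The deps/outputs declarations are what let spaCy skip commands whose inputs haven't changed.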

The following part of the README explains which commands are available in the spaCy project and how you can run them. By the way, the rest of this README was automatically generated by spaCy from the project.yml 😲!

spaCy Project: German Named Entity Recognition

This project uses data from the germeval and wikiner datasets to train a German NER model.

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `corpus` | Convert the data to spaCy's format |
| `train` | Train the full pipeline |
| `evaluate` | Evaluate on the test data and save the metrics |
| `visualize-model` | Visualize the model's output interactively using Streamlit |
| `visualize-data` | Explore the annotated data in an interactive Streamlit app |
| `package` | Package the trained model so it can be installed |
| `clean` | Remove intermediate files |

⏭ Workflows

The following workflows are defined by the project. They can be executed using spacy project run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| `all` | `corpus` → `train` → `evaluate` |

🗂 Assets

The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/wikiner/aij-wikiner-de-wp3.bz2` | URL | |
| `assets/germaner/germaner_train.tsv` | URL | |
| `assets/germaner/germaner_dev.tsv` | URL | |
| `assets/germaner/germaner_test.tsv` | URL | |