This part of the tutorial shows how you can train your own biomedical named entity recognition models using state-of-the-art word embeddings.
For this tutorial, we assume that you're familiar with the base types of Flair and with how word embeddings and Flair embeddings work. You should also know how to load a corpus.
Here is example code for a biomedical NER model trained on the NCBI_DISEASE corpus, using word embeddings and Flair embeddings based on biomedical abstracts from PubMed and full texts from PMC:
from flair.datasets import NCBI_DISEASE
# 1. get the corpus
corpus = NCBI_DISEASE()
print(corpus)
# 2. make the tag dictionary from the corpus
tag_dictionary = corpus.make_label_dictionary(label_type="ner", add_unk=False)
# 3. initialize embeddings
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
embedding_types = [
# word embeddings trained on PubMed and PMC
WordEmbeddings("pubmed"),
# flair embeddings trained on PubMed and PMC
FlairEmbeddings("pubmed-forward"),
FlairEmbeddings("pubmed-backward"),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
# 4. initialize sequence tagger
from flair.models import SequenceTagger
tagger: SequenceTagger = SequenceTagger(
hidden_size=256,
embeddings=embeddings,
tag_dictionary=tag_dictionary,
tag_type="ner",
use_crf=True,
locked_dropout=0.5
)
# 5. initialize trainer
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train(
base_path="taggers/ncbi-disease",
train_with_dev=False,
max_epochs=200,
learning_rate=0.1,
mini_batch_size=32
)
Once the model is trained, you can use it to predict tags for new sentences by calling the model's predict method.
# load the model you trained
from flair.models import SequenceTagger
model = SequenceTagger.load("taggers/ncbi-disease/best-model.pt")
# create example sentence
from flair.data import Sentence
sentence = Sentence("Women who smoke 20 cigarettes a day are four times more likely to develop breast cancer.")
# predict tags and print
model.predict(sentence)
print(sentence.to_tagged_string())
If the model works well, it will correctly tag "breast cancer" as a disease in this example:
Women who smoke 20 cigarettes a day are four times more likely to develop breast <B-Disease> cancer <E-Disease> .
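If you need the predicted entities as data rather than as an annotated string, the tagged output can be parsed back into spans. The following is a minimal sketch that assumes the output format shown above, where each tag follows its token and uses the B/I/E/S prefixes. The helper name `extract_entities` is hypothetical and not part of Flair, and the string produced by to_tagged_string may look different in other Flair versions.

```python
import re
from typing import List, Tuple

# Matches tags such as <B-Disease>, <I-Disease>, <E-Disease>, <S-Disease>.
TAG_PATTERN = re.compile(r"^<([BIES])-(.+)>$")


def extract_entities(tagged: str) -> List[Tuple[str, str]]:
    """Collect (entity text, label) pairs from a BIOES-tagged string.

    Hypothetical helper: assumes whitespace-separated tokens, with each
    tag placed directly after the token it annotates.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    previous_token = None
    for piece in tagged.split():
        match = TAG_PATTERN.match(piece)
        if match is None:
            # An ordinary token; remember it in case a tag follows.
            previous_token = piece
            continue
        if previous_token is None:
            # A tag without a preceding token; skip it.
            continue
        prefix, label = match.groups()
        if prefix in ("B", "S"):
            # B and S both start a new entity.
            current_tokens = [previous_token]
        else:
            # I and E continue the current entity.
            current_tokens.append(previous_token)
        if prefix in ("E", "S"):
            # E and S both close the entity.
            entities.append((" ".join(current_tokens), label))
            current_tokens = []
        previous_token = None
    return entities


example = (
    "Women who smoke 20 cigarettes a day are four times more likely "
    "to develop breast <B-Disease> cancer <E-Disease> ."
)
print(extract_entities(example))
```

For the example sentence, this yields a single pair containing the text "breast cancer" and the label "Disease".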
Besides training a model from scratch, you can also fine-tune the HunFlair models (or any other pre-trained model) on your target domain or corpus.
This can be advantageous because the pre-trained models were trained on a much broader collection of data,
which may allow better and faster adaptation to the target domain. In the following example,
we fine-tune the hunflair-disease model on the NCBI_DISEASE corpus:
# 1. load your target corpus
from flair.datasets import NCBI_DISEASE
corpus = NCBI_DISEASE()
# 2. load the pre-trained sequence tagger
from flair.models import SequenceTagger
tagger: SequenceTagger = SequenceTagger.load("hunflair-disease")
# 3. initialize trainer
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
# 4. fine-tune on the target corpus
trainer.train(
base_path="taggers/hunflair-disease-finetuned-ncbi",
train_with_dev=False,
max_epochs=200,
learning_rate=0.1,
mini_batch_size=32
)
HunFlair consists of distinct models for the entity types cell line, chemical, disease, gene/protein and species. For each entity type, multiple corpora are combined to train the corresponding model. The following code example illustrates the training process of HunFlair for cell lines:
# 1. get all corpora for a specific entity type
from flair.datasets import HUNER_CELL_LINE
from flair.models import SequenceTagger
corpus = HUNER_CELL_LINE()
# 2. initialize embeddings
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
embedding_types = [
WordEmbeddings("pubmed"),
FlairEmbeddings("pubmed-forward"),
FlairEmbeddings("pubmed-backward"),
]
embeddings = StackedEmbeddings(embeddings=embedding_types)
# 3. initialize sequence tagger
tag_dictionary = corpus.make_label_dictionary(label_type="ner", add_unk=False)
tagger = SequenceTagger(
hidden_size=256,
embeddings=embeddings,
tag_dictionary=tag_dictionary,
tag_type="ner",
use_crf=True,
locked_dropout=0.5
)
# 4. train the model
from flair.trainers import ModelTrainer
trainer = ModelTrainer(tagger, corpus)
trainer.train(
base_path="taggers/hunflair-cell-line",
train_with_dev=False,
max_epochs=200,
learning_rate=0.1,
mini_batch_size=32
)
Analogously, distinct models can be trained for chemicals, diseases, genes/proteins and species using HUNER_CHEMICAL, HUNER_DISEASE, HUNER_GENE and HUNER_SPECIES, respectively.
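To train all five models with the same script, it can help to keep the corpus class and the output directory for each entity type in one place. The sketch below is one possible way to organise this. The names `HUNER_CORPORA`, `base_path_for` and `load_corpus` are hypothetical, the output paths simply follow the taggers/hunflair-cell-line pattern used above, and the class spellings (in particular the singular HUNER_CHEMICAL) are assumed to match what flair.datasets exports in your installed version.

```python
from typing import Dict

# Entity type -> corpus class name in flair.datasets.
# Assumption: these spellings match the installed Flair version.
HUNER_CORPORA: Dict[str, str] = {
    "cell-line": "HUNER_CELL_LINE",
    "chemical": "HUNER_CHEMICAL",
    "disease": "HUNER_DISEASE",
    "gene": "HUNER_GENE",
    "species": "HUNER_SPECIES",
}


def base_path_for(entity_type: str) -> str:
    """Return the output directory for one entity type (naming is an assumption)."""
    return f"taggers/hunflair-{entity_type}"


def load_corpus(entity_type: str):
    """Instantiate the multi-corpus for one entity type.

    Flair is imported lazily so the mapping can be inspected without it.
    Raises AttributeError if the class name does not exist in flair.datasets.
    """
    import flair.datasets

    corpus_class = getattr(flair.datasets, HUNER_CORPORA[entity_type])
    return corpus_class()


for entity_type in HUNER_CORPORA:
    print(entity_type, "->", HUNER_CORPORA[entity_type], "->", base_path_for(entity_type))
```

The corpus returned by `load_corpus` can then be passed through the same embedding, tagger and trainer steps as in the cell line example, with `base_path_for(entity_type)` as the base_path argument.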