
NER Transfer learning #351

Merged (9 commits, Jan 6, 2021)

Conversation

@gawy (Contributor) commented Jun 15, 2020

Description

With the current code it is possible to train a NER model end to end. But in cases where data sets are limited and there is a need to train NER with custom classes, transfer learning may come in very handy, as it did in my case.

I have patched the current Stanza code to allow for that, with a bit of manipulation on the model classifier.

Summary of modifications:

  1. 2 flags were added to ner_tagger.py
  2. a minor update was made to the DataLoader and NERTagger classes - only the necessary object properties are passed to constructors instead of the full objects themselves, which simplifies using those objects in other contexts

Approach to inserting a new classifier
My assumption for the TL process was the following:

Whoever uses it will probably have a good enough background to mess with the network architecture. So my decision was to make all the necessary network modifications outside of the Stanza code base. Maybe someone else would like to use several FC layers for the classifier. This means that within the Stanza code it is only required to load a model with the modified architecture and proceed with the normal training process.

Here is an example of the code that I used to update the model classifier and define new classes for the NER model. Potentially this could be included somewhere in the documentation or examples within Stanza.
https://gist.github.com/gawy/2fec736e6278db6e6a083c26d3ec745b
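Conceptually, the classifier swap described above looks like the following self-contained sketch. Nested lists stand in for tensors so it runs anywhere, and the parameter names (`word_emb`, `tag_clf`) and tag sets are illustrative stand-ins based on this PR's description, not the exact Stanza internals:

```python
import random

# Toy stand-in for a saved checkpoint's state dict: parameter name -> matrix.
# The real flow is the same: load the trained model, resize only the
# classifier entries for the new tag set, save, then train with --finetune.
old_tags = ['O', 'B-PER', 'I-PER', 'E-PER', 'S-PER']
new_tags = ['O', 'B-INT_REF', 'I-INT_REF', 'E-INT_REF',
            'B-EXT_REF', 'I-EXT_REF', 'E-EXT_REF']
hidden_dim = 4

state = {
    'word_emb.weight': [[0.0] * hidden_dim for _ in range(100)],
    'tag_clf.weight': [[0.1] * hidden_dim for _ in range(len(old_tags))],
    'tag_clf.bias': [0.0] * len(old_tags),
}

# Replace the classifier rows with freshly initialized ones, one per new tag;
# every other parameter keeps its trained values untouched.
state['tag_clf.weight'] = [[random.gauss(0.0, 0.02) for _ in range(hidden_dim)]
                           for _ in new_tags]
state['tag_clf.bias'] = [0.0] * len(new_tags)

print(len(state['tag_clf.weight']))  # 7: one output row per new tag
```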

Example usage:
scripts/run_ner.sh Ukrainian-languk --finetune --train_classifier_only

Flags and reasoning behind them

  • finetune - tells ner_tagger to load an existing model from file instead of creating a new model from scratch for training. Potentially this has a second use case in fine-tuning the model - hence the name.
  • train_classifier_only - ner_tagger will stop gradients from propagating to all layers above the classifier (the code disables gradients for all layers except those with names containing ['tag_clf', 'crit'])
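The freezing rule behind --train_classifier_only can be sketched as a name test: a parameter stays trainable only if its name contains one of the listed substrings. The parameter names below are illustrative; in the tagger the same test would be applied over the model's named parameters, setting requires_grad = False on the frozen ones:

```python
# Substrings that mark the layers left trainable, per the PR description.
TRAINABLE = ('tag_clf', 'crit')

def is_trainable(param_name):
    """Return True if this parameter should keep receiving gradients."""
    return any(key in param_name for key in TRAINABLE)

# Illustrative parameter names; a real model exposes them via
# model.named_parameters().
names = ['word_emb.weight', 'charlstm.weight_ih_l0',
         'tag_clf.weight', 'tag_clf.bias', 'crit._transitions']
frozen = [n for n in names if not is_trainable(n)]
print(frozen)  # ['word_emb.weight', 'charlstm.weight_ih_l0']
```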

Experimental results

I started with my own trained NER model on the 4 standard classes (Ukrainian-languk) with an F1 score around 84. Any other language model can equally be used in the same way.

The model was modified to have a new classifier with 2 new classes and trained on a data set that had roughly 200 and 150 examples of each class.

The initial NER model had an F1 score of about 84. The newly trained model showed decent results:

Prec.  Rec.   F1
81.40  77.78  79.55

Manual inspection in my case also showed nice results - something good enough to be used in practice and further improved.

Fixes Issues

none as far as I can see

Unit test coverage

Existing NER unit tests run successfully. No additional tests were created.
NER training was tested in end-to-end mode as well as in transfer learning mode.

Known breaking changes/behaviors

none; this just adds new features

@gawy gawy marked this pull request as draft June 15, 2020 14:27
@gawy gawy marked this pull request as ready for review June 15, 2020 16:08
@yuhui-zh15 (Member) commented:

Hi @gawy, thank you for your interest in contributing to Stanza. The code generally looks good to me! I've changed back some data structures to ensure model backward compatibility.

Questions about some details:

  1. How do you solve it when model trains on a dataset which contains different NER labels? I believe building a new TagVocab and modifying the model architecture are necessary. Can you add the related code to your code?

  2. Can you make it clearer by filling the following information?

  • Performance when training from scratch:
  • Performance when finetuning from existing models (allow to update all parameters):
  • Performance when finetuning from existing models (only allow to update final layer):

@gawy (Contributor, Author) commented Jun 22, 2020

@yuhui-zh15 thank you for your feedback and happy to help

I'll post answers to your questions in several replies:

Q1: How do you solve it when model trains on a dataset which contains different NER labels? I believe building a new TagVocab and modifying the model architecture are necessary. Can you add the related code to your code?

Answer:
That's exactly the case, and the purpose of the whole thing in my situation:
I needed a model that would produce completely different NER labeling compared to stock.

You can look at the code I used to modify the model structure (classifier) and build the new label set in the gist here: https://gist.github.com/gawy/2fec736e6278db6e6a083c26d3ec745b

As I mentioned in the original description, I loaded the existing model, modified the classifier for the new label set, and saved it to a file. This way I could start training with minimal changes in the Stanza code.

The reason why I made these modifications outside of the Stanza code base, instead of somehow integrating the whole thing, was based on my assumptions about how TL could be used by other people.
There are a couple of ways TL can be used (as I see it):

  1. a simple case where the classifier simply has a different tag set to be trained on
  2. a more complicated case where someone might want to change the classifier a bit more radically: from 1 FC layer (as it is now) to, let's say, 2 FC layers - which might be handy with a large number of tags.

In case #1, the most user-friendly way to implement TL would be to derive the tag set as well as the configuration of the classifier layer from a data set (the way it is done for a new model) - more modifications in Stanza would be required (mainly in the way the Vocabulary is initialized, similar to how init_vocab in data.py functions). As I'm not sure how popular this feature will be, with the sample code people can make whatever model and tag-set modifications they want and proceed to training.
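Deriving the tag set from a data set, as case #1 suggests, can be sketched as a simple scan over (token, BIOES-tag) pairs. The sample pairs below are made up for illustration; the real code would read them from the training file the way init_vocab in data.py does:

```python
# Hypothetical sample of a BIOES-tagged training set as (token, tag) pairs.
sample = [('see', 'O'), ('section', 'B-INT_REF'), ('3.2', 'E-INT_REF'),
          ('of', 'O'), ('RFC', 'B-EXT_REF'), ('2616', 'E-EXT_REF')]

# Collect the distinct tags; their count gives the new classifier's
# output dimension, and the sorted list seeds the new tag vocabulary.
tags = sorted({tag for _, tag in sample})
print(tags)      # ['B-EXT_REF', 'B-INT_REF', 'E-EXT_REF', 'E-INT_REF', 'O']
print(len(tags)) # 5: classifier output dimension for this sample
```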

I'll post the answer to Q2 below later.

@gawy (Contributor, Author) commented Jun 22, 2020

Training data set
2 custom tags (dimension of classifier = 13): INT_REF, EXT_REF

train - 1567 examples
dev - 641 examples
test - 1813 examples

Detailed sample data:
dev: Counter({'O': 16882, 'I-EXT_REF': 138, 'I-INT_REF': 107, 'B-INT_REF': 28, 'E-INT_REF': 28, 'B-EXT_REF': 17, 'E-EXT_REF': 17})
test: Counter({'O': 44615, 'I-INT_REF': 361, 'I-EXT_REF': 283, 'B-INT_REF': 114, 'E-INT_REF': 114, 'B-EXT_REF': 40, 'E-EXT_REF': 40})
train: Counter({'O': 40515, 'I-EXT_REF': 555, 'I-INT_REF': 253, 'B-INT_REF': 82, 'E-INT_REF': 82, 'B-EXT_REF': 66, 'E-EXT_REF': 66})

Device: CPU (MacBook Pro with Intel i5)
GPU: Google Colab Tesla P100 16GB

Q2.1: Performance when training from scratch:
Q2.2: Performance when finetuning from existing models (allow to update all parameters):
Q2.3: Performance when finetuning from existing models (only allow to update final layer):

mode                        Prec   Rec    F1     Time (cpu)    Time (gpu)
from scratch (Q2.1)         95.90  75.97  84.78  --            60 min
finetune end-to-end (Q2.2)  91.54  77.27  83.80  over 2 hours  53 min
final layer only (Q2.3)     86.99  69.48  77.26  ~ 20 min **   --

** - could have been slightly longer, as I initially stopped it early when the training results stopped improving

Overall, end-to-end training shows much better results, but the training time is dramatically different. Training just the classifier allows running experiments much faster while data sets are still small.

stale bot commented Dec 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 29, 2020

stale bot commented Jan 5, 2021

This issue has been automatically closed due to inactivity.

@stale stale bot closed this Jan 5, 2021
@AngledLuffa AngledLuffa reopened this Jan 5, 2021
@stale stale bot removed the stale label Jan 5, 2021
@AngledLuffa (Collaborator) commented:

I think if this has already been inspected once we should hopefully be able to merge it, right?

@yuhui-zh15 (Member) left a comment:


Conflicts solved and should be able to merge now.

@yuhui-zh15 yuhui-zh15 merged commit 961c8c0 into stanfordnlp:dev Jan 6, 2021
@AngledLuffa AngledLuffa mentioned this pull request Jan 6, 2021
3 participants