Without line-by-line annotations, AutoNER trains named entity taggers with distant supervision.
Details about AutoNER can be accessed at: https://arxiv.org/abs/1809.03599
Method | Precision | Recall | F1 |
---|---|---|---|
Supervised Benchmark | 88.84 | 85.16 | 86.96 |
Dictionary Match | 93.93 | 58.35 | 71.98 |
Fuzzy-LSTM-CRF | 88.27 | 76.75 | 82.11 |
AutoNER | 88.96 | 81.00 | 84.80 |
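
The F1 column can be sanity-checked against the precision and recall columns. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall (this README does not state the formula), and the helper name `f1_score` is only illustrative. Because the table values are already rounded to two decimals, the recomputed number may differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "Supervised Benchmark": (88.84, 85.16, 86.96),
    "Dictionary Match": (93.93, 58.35, 71.98),
    "Fuzzy-LSTM-CRF": (88.27, 76.75, 82.11),
    "AutoNER": (88.96, 81.00, 84.80),
}

for method, (p, r, reported) in rows.items():
    print(f"{method}: recomputed {f1_score(p, r):.2f}, reported {reported:.2f}")
```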
- Tokenized Raw Texts
  - Example: `data/BC5CDR/raw_text.txt`
    - One token per line.
    - An empty line means the end of a sentence.
- Two Dictionaries
  - Core Dictionary w/ Type Info
    - Example: `data/BC5CDR/dict_core.txt`
      - Two columns (i.e., Type, Tokenized Surface) per line.
      - Tab separated.
    - How to obtain?
      - From domain-specific dictionaries.
  - Full Dictionary w/o Type Info
    - Example: `data/BC5CDR/dict_full.txt`
      - One tokenized high-quality phrase per line.
    - How to obtain?
      - From domain-specific dictionaries.
      - By applying the high-quality phrase mining tool to a domain-specific corpus.
- Pre-trained word embeddings
  - Train your own or download from the web.
  - The example run uses `embedding/bio_embedding.txt`, which can be downloaded from our group's server. For example, `curl http://dmserv4.cs.illinois.edu/bio_embedding.txt -o embedding/bio_embedding.txt`. Since the embedding encoding step consumes quite a lot of memory, we also provide the encoded file in `autoner_train.sh`.
- [Optional] Development & Test Sets.
  - Example: `data/BC5CDR/truth_dev.ck` and `data/BC5CDR/truth_test.ck`
    - Three columns (i.e., token, `Tie or Break` label, entity type).
    - `I` is `Break`.
    - `O` is `Tie`.
    - Two special tokens `<s>` and `<eof>` mean the start and end of the sentence.
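
The file formats listed above can be read with a few lines of Python. This is a minimal sketch and not the loader used by AutoNER itself: the function names (`read_raw_text`, `read_core_dict`, `read_ck`) are hypothetical, the sample lines are made up for illustration, and the `.ck` reader assumes whitespace-separated columns, which the description above does not specify.

```python
from typing import Iterable, List, Tuple


def read_raw_text(lines: Iterable[str]) -> List[List[str]]:
    """Group one-token-per-line input into sentences; an empty line ends a sentence."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            sentences.append(current)
            current = []
    if current:  # last sentence may not be followed by an empty line
        sentences.append(current)
    return sentences


def read_core_dict(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse tab-separated (Type, Tokenized Surface) pairs, one per line."""
    entries: List[Tuple[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        entity_type, surface = line.split("\t", 1)
        entries.append((entity_type, surface))
    return entries


def read_ck(lines: Iterable[str]) -> List[List[Tuple[str, str, str]]]:
    """Parse (token, Tie-or-Break label, entity type) rows into sentences.

    Assumption: columns are whitespace-separated. Rows whose token is
    <s> or <eof> are treated only as sentence boundaries.
    """
    sentences: List[List[Tuple[str, str, str]]] = []
    current: List[Tuple[str, str, str]] = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] in ("<s>", "<eof>"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, label, entity_type = parts  # raises ValueError if not 3 columns
        current.append((token, label, entity_type))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample lines, only to illustrate the three formats.
raw_demo = ["Aspirin", "reduces", "pain", ".", "", "It", "works", "."]
dict_demo = ["Chemical\taspirin", "Disease\theart failure"]
ck_demo = ["<s> O None", "Aspirin I Chemical", "works O None", "<eof> I None"]

print(read_raw_text(raw_demo))
print(read_core_dict(dict_demo))
print(read_ck(ck_demo))
```

For real data, an open file handle can be passed directly, for example `read_raw_text(open("data/BC5CDR/raw_text.txt", encoding="utf-8"))`.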
The dependencies for this project are listed below:
numpy==1.13.1
tqdm
torch-scope
pytorch==0.4.1
To train an AutoNER model, please run
./autoner_train.sh
To apply the trained AutoNER model, please run
./autoner_test.sh
You can specify the parameters in the bash files. The variable names are self-explanatory.
Please cite the following two papers if you are using our tool. Thanks!
- Jingbo Shang*, Liyuan Liu*, Xiaotao Gu, Xiang Ren, Teng Ren and Jiawei Han, "Learning Named Entity Tagger using Domain-Specific Dictionary", in Proc. of 2018 Conf. on Empirical Methods in Natural Language Processing (EMNLP'18), Brussels, Belgium, Oct. 2018. (* Equal Contribution)
- Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, Jiawei Han, "Automated Phrase Mining from Massive Text Corpora", accepted by IEEE Transactions on Knowledge and Data Engineering, Feb. 2018.
@inproceedings{shang2018learning,
title = {Learning Named Entity Tagger using Domain-Specific Dictionary},
author = {Shang, Jingbo and Liu, Liyuan and Ren, Xiang and Gu, Xiaotao and Ren, Teng and Han, Jiawei},
booktitle = {EMNLP},
year = 2018,
}
@article{shang2018automated,
title = {Automated phrase mining from massive text corpora},
author = {Shang, Jingbo and Liu, Jialu and Jiang, Meng and Ren, Xiang and Voss, Clare R and Han, Jiawei},
journal = {IEEE Transactions on Knowledge and Data Engineering},
year = {2018},
publisher = {IEEE}
}