AutoNER

Without line-by-line annotations, AutoNER trains named entity taggers using distant supervision.

Details about AutoNER can be accessed at: https://arxiv.org/abs/1809.03599

Model Notes
Benchmarks
Training
Citation

Model Notes

Benchmarks

Method	Precision	Recall	F1
Supervised Benchmark	88.84	85.16	86.96
Dictionary Match	93.93	58.35	71.98
Fuzzy-LSTM-CRF	88.27	76.75	82.11
AutoNER	88.96	81.00	84.80
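
As a quick sanity check on the table, the F1 column can be recomputed from the Precision and Recall columns. This is a minimal sketch that assumes F1 is the usual harmonic mean of precision and recall; the helper name `f1_score` is illustrative and not part of this repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "Supervised Benchmark": (88.84, 85.16, 86.96),
    "Dictionary Match": (93.93, 58.35, 71.98),
    "Fuzzy-LSTM-CRF": (88.27, 76.75, 82.11),
    "AutoNER": (88.96, 81.00, 84.80),
}

for method, (p, r, f1) in reported.items():
    print(f"{method}: reported {f1:.2f}, recomputed {f1_score(p, r):.2f}")
```

Because the table lists rounded precision and recall, a recomputed value can differ from the reported F1 in the last digit.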

Training

Required Inputs

Tokenized Raw Texts
- Example: data/BC5CDR/raw_text.txt
  - One token per line.
  - An empty line means the end of a sentence.
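
The raw text format described above (one token per line, an empty line ending a sentence) can be read with a few lines of Python. This is a minimal sketch, not the loader used by this repository; the function name `read_sentences` and the sample tokens are illustrative assumptions.

```python
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group one-token-per-line input into sentences split on empty lines."""
    sentence: List[str] = []
    for line in lines:
        token = line.strip()
        if token:
            sentence.append(token)
        elif sentence:
            # An empty line closes the current sentence.
            yield sentence
            sentence = []
    if sentence:
        # The input may not end with an empty line.
        yield sentence


# Made-up sample in the same layout as data/BC5CDR/raw_text.txt.
sample = "Naloxone\nreverses\nhypotension\n.\n\nIt\nworks\n.\n"
sentences = list(read_sentences(sample.splitlines()))
print(sentences)
```

To read a real file, pass an open file object instead of the sample lines, for example `read_sentences(open("data/BC5CDR/raw_text.txt", encoding="utf-8"))`.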
Two Dictionaries
- Core Dictionary w/ Type Info
  - Example: data/BC5CDR/dict_core.txt
    - Two columns (i.e., Type, Tokenized Surface) per line.
    - Tab separated.
  - How to obtain?
    - From domain-specific dictionaries.
- Full Dictionary w/o Type Info
  - Example: data/BC5CDR/dict_full.txt
    - One tokenized high-quality phrase per line.
  - How to obtain?
    - From domain-specific dictionaries.
    - By applying a high-quality phrase mining tool to a domain-specific corpus.
      - AutoPhrase
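
The two dictionary layouts described above can be loaded as follows. This is a minimal sketch under stated assumptions: the function names `load_core_dict` and `load_full_dict` are hypothetical, the sample entries are made up, and keeping a set of types per phrase is a design choice for the case where a surface form is listed under more than one type.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Phrase = Tuple[str, ...]


def load_core_dict(lines: Iterable[str]) -> Dict[Phrase, Set[str]]:
    """Parse 'Type<TAB>tokenized surface' lines into phrase -> set of types."""
    core: Dict[Phrase, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines
        entity_type, surface = line.split("\t", 1)
        core[tuple(surface.split())].add(entity_type)
    return dict(core)


def load_full_dict(lines: Iterable[str]) -> Set[Phrase]:
    """Parse one tokenized phrase per line into a set of token tuples."""
    return {tuple(line.split()) for line in lines if line.strip()}


# Made-up entries in the same layout as dict_core.txt and dict_full.txt.
core = load_core_dict(["Chemical\tnaloxone", "Disease\theart failure"])
full = load_full_dict(["heart failure", "blood pressure"])
print(core)
print(full)
```

Storing phrases as token tuples keeps the tokenization from the dictionary files intact, so they can be compared directly against token sequences from the raw text.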
Pre-trained word embeddings
- Train your own or download from the web.
- The example run uses embedding/bio_embedding.txt, which can be downloaded from our group's server. For example, curl http://dmserv4.cs.illinois.edu/bio_embedding.txt -o embedding/bio_embedding.txt. Since the embedding encoding step consumes a lot of memory, we also provide the encoded file in autoner_train.sh.
[Optional] Development & Test Sets.
- Example: data/BC5CDR/truth_dev.ck and data/BC5CDR/truth_test.ck
  - Three columns (i.e., token, Tie or Break label, entity type).
  - I is Break.
  - O is Tie.
  - Two special tokens <s> and <eof> mark the start and end of a sentence.
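
The three-column .ck layout described above can be parsed as sketched below. Several details are assumptions rather than facts from this README: columns are taken to be whitespace-separated, the names `CkToken` and `read_ck` are hypothetical, and the sample rows (including the `None` type) are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class CkToken(NamedTuple):
    token: str
    is_break: bool  # True for label "I" (Break), False for label "O" (Tie)
    entity_type: str


def read_ck(lines: Iterable[str]) -> List[List[CkToken]]:
    """Split rows into sentences delimited by the <s> and <eof> tokens."""
    sentences: List[List[CkToken]] = []
    current: List[CkToken] = []
    for line in lines:
        parts = line.split()  # assumption: whitespace-separated columns
        if len(parts) != 3:
            continue  # skip empty or malformed lines
        token, label, entity_type = parts
        if label not in ("I", "O"):
            raise ValueError(f"unexpected Tie/Break label: {label!r}")
        if token == "<s>":
            current = []  # start of a new sentence
        current.append(CkToken(token, label == "I", entity_type))
        if token == "<eof>":
            sentences.append(current)  # end of the sentence
            current = []
    return sentences


# Made-up rows in the token / label / type layout.
sample_rows = [
    "<s> O None",
    "Naloxone I Chemical",
    "works I None",
    "<eof> I None",
]
parsed = read_ck(sample_rows)
print(parsed)
```

The special tokens are kept inside each parsed sentence so the rows map one-to-one onto the file contents.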

Dependencies

The dependencies for this project are listed below:

numpy==1.13.1
tqdm
torch-scope
pytorch==0.4.1

Command

To train an AutoNER model, please run

./autoner_train.sh

To apply the trained AutoNER model, please run

./autoner_test.sh

You can specify the parameters in the bash files. The variable names are self-explanatory.

Citation

Please cite the following two papers if you are using our tool. Thanks!

Jingbo Shang*, Liyuan Liu*, Xiaotao Gu, Xiang Ren, Teng Ren and Jiawei Han, "Learning Named Entity Tagger using Domain-Specific Dictionary", in Proc. of 2018 Conf. on Empirical Methods in Natural Language Processing (EMNLP'18), Brussels, Belgium, Oct. 2018. (* Equal Contribution)
Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, Jiawei Han, "Automated Phrase Mining from Massive Text Corpora", accepted by IEEE Transactions on Knowledge and Data Engineering, Feb. 2018.

@inproceedings{shang2018learning,
  title = {Learning Named Entity Tagger using Domain-Specific Dictionary}, 
  author = {Shang, Jingbo and Liu, Liyuan and Ren, Xiang and Gu, Xiaotao and Ren, Teng and Han, Jiawei}, 
  booktitle = {EMNLP}, 
  year = 2018, 
}

@article{shang2018automated,
  title = {Automated phrase mining from massive text corpora},
  author = {Shang, Jingbo and Liu, Jialu and Jiang, Meng and Ren, Xiang and Voss, Clare R and Han, Jiawei},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  year = {2018},
  publisher = {IEEE}
}
