This is the code used in the paper:
"Improving Hypernymy Detection with an Integrated Path-based and Distributional Method"
Vered Shwartz, Yoav Goldberg and Ido Dagan. ACL 2016.
It is used to classify hypernymy relations between term pairs, using distributional information on each term, and path-based information, encoded using an LSTM.
Changes in the current version:
- Using dynet instead of pycnn (thanks @srajana!)
- Automating corpus processing with a single bash script, which is more time- and memory-efficient
- Bug fix: too many paths in parse_wikipedia (see issue #2)
To reproduce the results reported in the paper, please use V1. The current version achieves similar results; the integrated model's performance on the randomly split dataset is: Precision: 0.918, Recall: 0.907, F1: 0.912.
Consider using our new project, LexNET! It supports classification of multiple semantic relations, and contains several model enhancements and detailed documentation.
Prerequisites:
- Python
- dynet (the code was ported from pycnn; see the changes above)
- Pre-trained word embeddings in txt format (e.g. GloVe)
- A Wikipedia dump, for creating the corpus resource
Quick Start:
The repository contains the following directories:
- common - the knowledge resource class, which is used by other models to save the path data from the corpus.
- corpus - code for parsing the corpus and extracting paths, including the generalizations made for the baseline method.
- dataset - code for creating the dataset used in the paper, and the dataset itself (a sketch of the expected file format follows this list).
- train - code for training and testing both variants of our model (path-based and integrated).
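The dataset files are plain tab-separated text. A minimal sketch of what a few lines might look like (the terms and the exact label strings here are assumptions for illustration; each line holds a term pair and a label indicating whether the second term is a hypernym of the first):

```
cat	animal	True
cat	dog	False
```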
To create a processed corpus, download a Wikipedia dump, and run:
```bash
bash create_resource_from_corpus.sh [wiki_dump_file] [resource_prefix]
```
where `resource_prefix` is the file path and prefix of the corpus files, e.g. `corpus/wiki`, such that the directory `corpus` will eventually contain the `wiki_*.db` files created by this script.
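For example, a hypothetical run on an English Wikipedia dump (the dump file name below is illustrative):

```bash
# Build the corpus resource under corpus/ with the prefix "wiki"
# (the dump file name is an assumption for illustration):
bash create_resource_from_corpus.sh enwiki-latest-pages-articles.xml corpus/wiki
```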
To train the integrated model, run:
```bash
train_integrated.py [resource_prefix] [dataset_prefix] [model_prefix_file] [embeddings_file] [alpha] [word_dropout_rate]
```
Where:
- `resource_prefix` is the file path and prefix of the corpus files, e.g. `corpus/wiki`, such that the directory `corpus` contains the `wiki_*.db` files created by `create_resource_from_corpus.sh`.
- `dataset_prefix` is the file path of the dataset files, e.g. `dataset/rnd`, such that this directory contains 3 files: `train.tsv`, `test.tsv` and `val.tsv`.
- `model_prefix_file` is the output directory and prefix for the model files. The model is saved in 3 files: `.model`, `.params` and `.dict`. In addition, the test set predictions are saved in `.predictions`, and the prominent paths are saved to `.paths`.
- `embeddings_file` is the pre-trained word embeddings file, in txt format (i.e., every line consists of the word, followed by a space, and its vector; see GloVe for an example).
- `alpha` is the learning rate (default=0.001).
- `word_dropout_rate` is the word dropout rate.
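Putting it together, a hypothetical training run might look like the following (all paths and hyperparameter values here are illustrative, GloVe is just one possible choice of embeddings, and the script is assumed to be invoked from the repository root):

```bash
# Train the integrated model (all paths and values below are illustrative):
python train/train_integrated.py corpus/wiki dataset/rnd models/rnd_integrated glove.6B.50d.txt 0.001 0.3
```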
Similarly, you can train the path-based model with `train_path_based.py`, or test either pre-trained model using `test_integrated.py` and `test_path_based.py`, respectively.
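For instance (again with illustrative paths; the test scripts are assumed to take the same resource, dataset, model and embeddings arguments as training):

```bash
# Train the path-based variant, then evaluate a saved integrated model.
# Paths are illustrative; the test script's argument layout is an assumption.
python train/train_path_based.py corpus/wiki dataset/rnd models/rnd_path glove.6B.50d.txt 0.001 0.3
python train/test_integrated.py corpus/wiki dataset/rnd models/rnd_integrated glove.6B.50d.txt
```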