GitHub - CoventryResearch/mimlre-ext: Our extensions to the stanford slot-filling system

forked from ajaynagesh/mimlre-ext

Ajay: 08/10/2013
-----

Some changes to the original Stanford slot-filling system. Also adding the code to git for easy access and collaboration.

Files/Dirs needed but not added to the repository
-----
lib
resources
small_dataset

Ajay: 11/11/2013 -- simple change to test git and github setup on Mumbai machine

INTRODUCTION

This release contains all source code necessary to replicate our
results from the paper on multi-instance multi-label learning for relation
extraction at EMNLP 2012 (see References).
Additionally, this package contains all the source code for our KBP
slot-filling system (but not all the data necessary to replicate our KBP
2011 results - see the Usage section for details).

Please note that this is research code. It is slow, not optimized, not very
clean, and not well documented. Additionally, it contains several other
experiments we tried for the TAC-KBP shared task, which are probably not
relevant for most people.

Nevertheless, the models we proposed in the EMNLP paper, which is what I think
most people will care about, are implemented relatively cleanly and are
isolated in the following classes:
  The MIML-RE model:
    edu.stanford.nlp.kbp.slotfilling.classify.JointBayesRelationExtractor
  The Mintz++ model:
    Same class as the above, but instantiated with the parameter onlyLocal = true
  Our implementation of the Hoffmann model:
    edu.stanford.nlp.kbp.slotfilling.classify.HoffmannExtractor
  The classifier we used at KBP 2011:
    edu.stanford.nlp.kbp.slotfilling.classify.OneVsAllRelationExtractor

The main entry points for the experiments are:
  For the experiments with the Riedel dataset:
    edu.stanford.nlp.kbp.slotfilling.MultiR
  For the KBP system:
    edu.stanford.nlp.kbp.slotfilling.KBPTrainer
See the Usage section for the actual command lines.

AUTHORS

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, Sonal Gupta, John Bauer,
David McClosky, Angel X. Chang, Valentin I. Spitkovsky, and
Christopher D. Manning

REFERENCES

If you are interested in the model we published at EMNLP 2012,
please cite this paper:

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, Christopher D. Manning.
Multi-instance Multi-label Learning for Relation Extraction.
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning
(EMNLP-CoNLL), 2012.

If you are interested in our KBP system, please cite this paper:

Mihai Surdeanu, Sonal Gupta, John Bauer, David McClosky, Angel X. Chang,
Valentin I. Spitkovsky, Christopher D. Manning.
Stanford's Distantly-Supervised Slot-Filling System.
Proceedings of the TAC-KBP 2011 Workshop, 2011.

ACKNOWLEDGMENTS

We gratefully thank Raphael Hoffmann and Sebastian Riedel for sharing their
data and code and for the many helpful discussions. This release includes the
data generated by Sebastian Riedel and re-packaged by Raphael Hoffmann
(available only under certain conditions - ask mihais AT stanford DOT edu for
details). One of our models
(edu.stanford.nlp.kbp.slotfilling.classify.HoffmannExtractor) replicates
Raphael Hoffmann's best model from his ACL 2011 paper.

We thank the organizers of the TAC-KBP shared tasks for all their effort.

We thank SRI (and in particular Lynn Voss) for being very responsive to
Mihai's annoying questions and publicly releasing their gazetteer
(faust-gazetteer).

LICENSING

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program (see the file LICENSE.txt); if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

For the licenses of the jar files included in the lib/ directory, please
see the file LIBRARY_LICENSES.txt.

USAGE

If you are interested in the experiments on the Riedel dataset (top part of
Figure 4 in the EMNLP 2012 paper), please note that the corpus is not included
in this distribution due to licensing reasons. Please contact Mihai Surdeanu
(mihais AT stanford DOT edu) for details.

This release includes all relevant source code in the src/ directory and the
corresponding Java classes in the classes/ directory. The sources were compiled
with Java 1.6. If, for any reason, you decide to recompile, just type "ant all".

To replicate the experiments in our EMNLP paper, please follow the instructions
below.

------------------------------------------------------------------------------
For the experiments with the Riedel dataset (top part of Figure 4 in the paper),
use the following commands:

To generate the Mintz++ curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.MultiR -props \
    config/multir/multir_mintz.properties
To generate the MIML-RE curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.MultiR -props \
    config/multir/multir_mimlre.properties
To generate the MIML-RE At-Least-One curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.MultiR -props \
    config/multir/multir_mimlre_atleastone.properties

The runtime for these models ranges from approximately 30 minutes for the
Mintz++ model to approximately 6 hours for the MIML-RE models. At the end of
each run, the system prints the precision, recall and F1 scores for the last
point in the P/R curve, and the location of the file with the data for the
entire P/R curve. For example, the last two lines in the output for the
MIML-RE run are:

P 0.2803560076287349 R 0.22615384615384615 F1 0.2503548112404201
P/R curve values saved in file corpora/multir/multir_JOINT_BAYES_T5_E15_NF5_Fall_M1_Istable_Ytrue.curve

In the .curve file, the precision values are in column 3, the recall values
are in column 5, and the corresponding F1 scores in column 7.
Additionally, each run saves the models learned (after each epoch where
applicable) in the corpora/multir directory, in files with the same prefix as
the .curve file but with the .ser extension. Because of this, any subsequent
run of the above commands will be much faster (a couple of minutes).
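
The column layout described above can be read with a few lines of Java. The
following is only a minimal sketch, not part of this distribution: the class
and method names (CurveReader, Point, parse, bestF1) are hypothetical, and it
assumes that the fields in a .curve file are separated by whitespace and that
"column 3/5/7" is counted from 1. It needs Java 8 or later. Check both
assumptions against an actual .curve file before relying on it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Sketch: read precision/recall/F1 triples from the lines of a .curve file. */
final class CurveReader {

    /** One point on the precision/recall curve. */
    static final class Point {
        final double precision, recall, f1;

        Point(double precision, double recall, double f1) {
            this.precision = precision;
            this.recall = recall;
            this.f1 = f1;
        }
    }

    /**
     * Parses curve lines. ASSUMPTION: fields are whitespace-separated and the
     * columns are 1-based, so columns 3, 5 and 7 are array indices 2, 4 and 6.
     * Lines with fewer than 7 fields are skipped.
     */
    static List<Point> parse(List<String> lines) {
        List<Point> points = new ArrayList<Point>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 7) {
                continue;
            }
            points.add(new Point(Double.parseDouble(fields[2]),
                                 Double.parseDouble(fields[4]),
                                 Double.parseDouble(fields[6])));
        }
        return points;
    }

    /** Returns the point with the highest F1, or null if the list is empty. */
    static Point bestF1(List<Point> points) {
        Point best = null;
        for (Point p : points) {
            if (best == null || p.f1 > best.f1) {
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the .curve file printed at the end of a run.
        Point best = bestF1(parse(Files.readAllLines(Paths.get(args[0]))));
        if (best != null) {
            System.out.println("P " + best.precision + " R " + best.recall
                    + " F1 " + best.f1);
        }
    }
}
```

Run it with the path of the .curve file as the only argument. It prints the
point with the highest F1, which is not necessarily the last point of the
curve that the system itself reports.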

------------------------------------------------------------------------------
For the experiments with the KBP dataset (bottom part of Figure 4 in the EMNLP
2012 paper), please follow the instructions below.

To generate the Hoffmann curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.KBPTrainer -props \
    config/kbp/kbp_hoffmann.properties
To generate the Mintz++ curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.KBPTrainer -props \
    config/kbp/kbp_mintz.properties
To generate the MIML-RE curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.KBPTrainer -props \
    config/kbp/kbp_mimlre.properties
To generate the MIML-RE At-Least-One curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.KBPTrainer -props \
    config/kbp/kbp_mimlre_atleastone.properties

The runtime for these models ranges from approximately 3 hours for the
Mintz++ model and our implementation of the Hoffmann model to 20 hours for
the MIML-RE model.
At the end of each run, the system prints the KBP score for the last point
in the P/R curve, and the location of the file with the data for the entire
P/R curve. For example, the last lines in the output for the MIML-RE run are:

2010 scores:
Recall: 177 / 576 = 0.30729166
Precision: 177 / 728 = 0.24313186
F1: 0.2714724
Jun 9, 2012 4:59:44 PM edu.stanford.nlp.kbp.slotfilling.common.Log severe
SEVERE: P/R curve data generated in file: corpora/kbp/mimlre.curve
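
The scores in this output follow directly from the three counts it prints. The
sketch below recomputes them. It is an illustration only: the class
PrfFromCounts is hypothetical, and it assumes the usual reading of the log,
namely that 177 is the number of correct answers, 576 the number of gold
answers, and 728 the number of answers the system returned.

```java
import java.util.Locale;

/** Sketch: how the printed scores follow from the raw counts. */
final class PrfFromCounts {

    /** Fraction of returned answers that are correct. */
    static double precision(int correct, int predicted) {
        return predicted == 0 ? 0.0 : (double) correct / predicted;
    }

    /** Fraction of gold answers that were found. */
    static double recall(int correct, int gold) {
        return gold == 0 ? 0.0 : (double) correct / gold;
    }

    /** F1 is the harmonic mean of precision and recall. */
    static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Counts copied from the sample output above (assumed meaning).
        int correct = 177, gold = 576, predicted = 728;
        double p = precision(correct, predicted);
        double r = recall(correct, gold);
        System.out.printf(Locale.ROOT, "Recall: %.4f%nPrecision: %.4f%nF1: %.4f%n",
                r, p, f1(p, r));
    }
}
```

The same f1 function also reproduces the F1 value printed for the MultiR run
from its P and R values.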

The format of the .curve file is the same as above. As with the MultiR
runs, the KBP runs save the models learned (after each epoch where applicable)
in the corpora/kbp directory, in files with the prefix given by the
serializedRelationExtractorPath property and with the .ser extension. Because
of this, any subsequent run of the above commands will be much faster (less
than 10 minutes). Note that these runs generate a few additional files, with
names starting with the prefix given by the value of the kbp.runid property.
These files serve only debug purposes and can be safely removed.
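
To see where a given run will write its models and debug files, the two
properties named above can be looked up in the configuration file before
starting the run. The sketch below is a hypothetical helper (PropsPeek is not
part of this distribution) and assumes that the files under config/ use the
standard java.util.Properties syntax, which has not been verified here.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

/** Sketch: look up the output-related properties of a run configuration. */
final class PropsPeek {

    /** Loads properties. ASSUMPTION: standard java.util.Properties syntax. */
    static Properties load(Reader in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    /** Summarizes the model path prefix and the prefix of the debug files. */
    static String describe(Properties props) {
        return "models: "
                + props.getProperty("serializedRelationExtractorPath", "(not set)")
                + ", debug prefix: "
                + props.getProperty("kbp.runid", "(not set)");
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a configuration file, e.g. config/kbp/kbp_mimlre.properties
        try (Reader in = new FileReader(args[0])) {
            System.out.println(describe(load(in)));
        }
    }
}
```

If a property is reported as "(not set)", the system may be using a default
defined in the source code, so treat the output as a hint rather than as the
final answer.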

------------------------------------------------------------------------------
This package is insufficient to replicate our KBP 2011 results. The release
includes our entire KBP source code but not all the data necessary for the
experiments. For example, for our EMNLP 2012 experiments we fetched a maximum
of 50 sentences per entity from Wikipedia and the official KBP corpus. For the
KBP experiments we fetched up to 500 sentences per entity and, additionally,
we used data from web snippets. To repeat these experiments, you would need
access to all our indices, which are very large (hundreds of GB). If you are
seriously interested in this project, please contact me (mihais AT stanford
DOT edu) directly, and we will arrange the transfer of data.