Extract lexicalized tree-adjoining grammar from a treebank
This project extracts Tree Adjoining Grammars (TAG) with aligned semantics from the KBGen corpus.
- Leiningen 2.0.0: https://github.com/technomancy/leiningen. Note: we also provide an executable jar file, so Leiningen is not required unless you want to compile the code or use the REPL.
  * Included in utilities/ (lein for UNIXes, lein.bat for Windows)
- Python 2.7: http://www.python.org/download/releases/2.7/
- Java 7.10: http://www.java.com/en/download/index.jsp
- NLTK 2.0.4: http://nltk.org/. NLTK is a leading platform for building Python programs to work with human language data.
  * Included in utilities/, referenced automatically by run.sh
  * Earlier versions might not work!
- Stanford parser 2.0.4: http://nlp.stanford.edu/software/lex-parser.shtml
  * Included in utilities/, used by the parse.sh script.
To reproduce our current results, you can either simply run
bin/run.sh
or follow the pipeline described below:
1. Handle the conjunctions that occur in the syntactic trees (coordination aggregation).
2. Parse the sentences using the Stanford parser. We use the unlexicalized parser with head information in the output.
3. Normalize the syntactic trees obtained in step 2.
4. Extract the TAG from the output of step 3.
5. Assign semantics to the output of step 4.
To do the coordination aggregation, run
java -jar bin/aggregation-0.1.1-SNAPSHOT-standalone.jar \
input/triples/ output/aggregated/
To parse the corpus using the Stanford parser, run
bin/parse.sh input/sentences/ output/parsed/
To normalize the syntactic tree, run
java -jar bin/grook-0.1.0-SNAPSHOT-standalone.jar \
output/parsed/ output/fixed/
To extract the TAG with aligned semantics, run
PYTHONPATH="utilities/nltk-2.0.4/:$PYTHONPATH" python2 bin/extract/extractor.py \
output/fixed/ input/alignments/ output/final.gram \
--verbose output/grammar-verbose/
For more details, check the help by running
python2 extractor.py -h
which prints
usage: extractor.py [-h] [--verbose VERBOSE] corpus alignment [outfile]
positional arguments:
  corpus             corpus path which should be a directroy
  alignment          alignment path which should be a directory
  outfile            outputfile for extracted grammar
optional arguments:
  -h, --help         show this help message and exit
  --verbose VERBOSE  output raw gammar extracted for each sentence. This
                     parameter should be a directory
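The pipeline commands above can also be chained from a small driver script. The following is only a sketch: the function names `build_pipeline` and `run_pipeline` are hypothetical, the paths and jar names are copied from the commands in this README, and it is not necessarily what bin/run.sh does internally. By default it only prints the commands (dry run).

```python
import os
import subprocess


def build_pipeline():
    """Return the pipeline as a list of (extra_env, argv) pairs, in order.

    Paths and jar names are copied from the README commands; adjust them
    if your checkout is laid out differently.
    """
    return [
        # Step 1: coordination aggregation.
        ({}, ["java", "-jar", "bin/aggregation-0.1.1-SNAPSHOT-standalone.jar",
              "input/triples/", "output/aggregated/"]),
        # Step 2: parse the sentences with the Stanford parser.
        ({}, ["bin/parse.sh", "input/sentences/", "output/parsed/"]),
        # Step 3: normalize the parsed trees.
        ({}, ["java", "-jar", "bin/grook-0.1.0-SNAPSHOT-standalone.jar",
              "output/parsed/", "output/fixed/"]),
        # Steps 4 and 5: extract the TAG and align semantics.
        ({"PYTHONPATH": "utilities/nltk-2.0.4/"},
         ["python2", "bin/extract/extractor.py", "output/fixed/",
          "input/alignments/", "output/final.gram",
          "--verbose", "output/grammar-verbose/"]),
    ]


def run_pipeline(dry_run=True):
    """Print each command; unless dry_run is set, also run it.

    subprocess.check_call raises on a non-zero exit status, so the
    pipeline stops at the first failing step.
    """
    for extra_env, argv in build_pipeline():
        print(" ".join(argv))
        if dry_run:
            continue
        env = dict(os.environ)
        for key, value in extra_env.items():
            # Prepend to any existing value, similar to the shell command.
            old = env.get(key)
            env[key] = value + (os.pathsep + old if old else "")
        subprocess.check_call(argv, env=env)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Call `run_pipeline(dry_run=False)` from the repository root to actually execute the steps.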
We also provide a small tool to help you visualize the TAG extracted in step 4 or step 5. To see its options, run
python2 grammarviewer.py -h
usage: grammarviewer.py [-h] [filename]
Draw the tree according to grammar file
positional arguments:
filename The name of grammar file, stdin will be used if left open
optional arguments:
-h, --help show this help message and exit
As a by-product, our package provides an s-expression parser for Python. You may want to use it to reconstruct an NLTK ParentedTree from the plain-text representation of a TAG.
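To illustrate what parsing an s-expression involves, here is a minimal, self-contained sketch. It is not the parser shipped with this package, whose API is not described here. The function name `parse_sexpr` is hypothetical, and the bracketed input string is an illustrative example, not actual output of the extractor.

```python
import re

# A token is an opening parenthesis, a closing parenthesis, or a run of
# characters that contains neither whitespace nor parentheses.
_TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_sexpr(text):
    """Parse a single s-expression into nested lists of strings."""
    stack = [[]]  # stack[0] collects the top-level expressions
    for token in _TOKEN.findall(text):
        if token == "(":
            stack.append([])  # open a new nested list
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            node = stack.pop()
            stack[-1].append(node)  # attach the finished list to its parent
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("missing ')'")
    if len(stack[0]) != 1:
        raise ValueError("expected exactly one top-level expression")
    return stack[0][0]


if __name__ == "__main__":
    print(parse_sexpr("(S (NP John) (VP runs))"))
```

In the nested-list result, the first element of each list is the node label and the remaining elements are its children, which is the shape needed to rebuild a tree object by hand.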
- ./bin contains all runnable programs and scripts
- ./src contains all the source code
- ./output contains the intermediate results generated by the programs.
- ./input contains the original corpus and the annotated data
- ./input/alignment contains our annotation result
- ./input/heads-fixed
- ./input/aggregation
- ./report contains our report