UCCA preprocessing
Matthias Lindemann edited this page Jul 23, 2019 · 5 revisions
mkdir -p data_ucca/companion_tokens
python ucca/get_companion_tokenization.py data_ucca/companion/ucca/ data_ucca/companion_tokens/
mkdir -p data_ucca/alto_corpus
python ucca/convert_training_into_alto_corpus.py data_ucca/ucca/ data_ucca/companion_tokens/ data_ucca/alto_corpus/
This takes a few minutes and writes two files in Alto format, training.txt and dev.txt, to the alto_corpus directory. It also produces two MRP files that correspond to the same split.
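As a quick sanity check on the split, one could verify that no graph ends up in both MRP files. The following is a minimal sketch, assuming that an MRP file stores one JSON graph per line and that each graph has an `id` field. The helper names `read_mrp_ids` and `split_overlap` are made up for illustration and are not part of the repository.

```python
import json
from pathlib import Path


def read_mrp_ids(path):
    """Return the graph ids of an MRP file (assumed: one JSON graph per line)."""
    ids = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            ids.append(json.loads(line)["id"])
    return ids


def split_overlap(train_path, dev_path):
    """Return the set of graph ids that occur in both files (ideally empty)."""
    return set(read_mrp_ids(train_path)) & set(read_mrp_ids(dev_path))
```

Calling `split_overlap` on the two MRP files written by the conversion script should return an empty set if the split is clean. This page does not name those files, so the paths have to be filled in by hand.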
Running CreateCorpusParallel from am-tools then produces an AM-CoNLL file, which serves as input to the am-parser model.
mkdir -p data_ucca/amconll_corpus
java -cp ../am-tools/build/libs/am-tools-all.jar de.saar.coli.amrtagging.formalisms.ucca.tools.CreateCorpusParallel -c data_ucca/alto_corpus/training.txt -o data_ucca/amconll_corpus -p training --companion data_ucca/companion/all_ucca.conllu
This will take about an hour. Feel free to specify a timeout to speed this up.
The outcome will be two files in the amconll_corpus subdirectory: one AM-CoNLL file and one file with supertags.
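To see how many sentences made it into the AM-CoNLL file (for instance after using a timeout), one could count its sentence blocks. This is a rough sketch, assuming the usual CoNLL-style layout in which sentences are separated by blank lines and lines starting with `#` are comments. The function name `count_conll_sentences` is hypothetical.

```python
def count_conll_sentences(path):
    """Count sentence blocks in a CoNLL-style file.

    Assumes blank lines separate sentences and '#' lines are comments.
    """
    count = 0
    in_sentence = False
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped:
                in_sentence = False  # a blank line ends the current block
            elif not stripped.startswith("#") and not in_sentence:
                in_sentence = True  # first token line of a new block
                count += 1
    return count
```

For example, `count_conll_sentences("data_ucca/amconll_corpus/training.amconll")` can be compared with the number of graphs in the training data to estimate how many sentences were skipped.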
To find out how accurately the contractions can be reversed, continue as follows:
java -Xmx8G -cp ../am-tools/build/libs/am-tools-all.jar de.saar.coli.amrtagging.mrp.tools.EvaluateMRP --corpus data_ucca/amconll_corpus/training.amconll --out data_ucca/training.mrp
python ucca/decompress_mrp.py data_ucca/training.mrp data_ucca/training_uncontracted.mrp
python ucca/remove_labels.py data_ucca/training_uncontracted.mrp data_ucca/training_uncontracted_no_labels.mrp
python ../mtool/main.py --read mrp --score mrp --gold data_ucca/ucca/ewt.mrp data_ucca/training_uncontracted_no_labels.mrp