This repository accompanies the COLING 2020 paper Comparison by Conversion: Reverse-Engineering UCCA from Syntax and Lexical Semantics:
@inproceedings{hershcovich-etal-2020-comparison,
title = "Comparison by Conversion: Reverse-Engineering {UCCA} from Syntax and Lexical Semantics",
author = "Hershcovich, Daniel and
Schneider, Nathan and
Dvir, Dotan and
Prange, Jakob and
de Lhoneux, Miryam and
Abend, Omri",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.coling-main.264",
pages = "2947--2966",
abstract = "Building robust natural language understanding systems will require a clear characterization of whether and how various linguistic meaning representations complement each other. To perform a systematic comparative analysis, we evaluate the mapping between meaning representations from different frameworks using two complementary methods: (i) a rule-based converter, and (ii) a supervised delexicalized parser that parses to one framework using only information from the other as features. We apply these methods to convert the STREUSLE corpus (with syntactic and lexical semantic annotations) to UCCA (a graph-structured full-sentence meaning representation). Both methods yield surprisingly accurate target representations, close to fully supervised UCCA parser quality{---}indicating that UCCA annotations are partially redundant with STREUSLE annotations. Despite this substantial convergence between frameworks, we find several important areas of divergence.",
}
The code is based on the HIT-SCIR parser repository, which accompanies the paper HIT-SCIR at MRP 2019: A Unified Pipeline for Meaning Representation Parsing via Efficient Training and Effective Encoding from the CoNLL 2019 Shared Task on Cross-Framework Meaning Representation Parsing (MRP), and provides code to train models and pre/post-process the MRP dataset.
Changes from the original implementation are:
- Deletion of non-UCCA parsing code, for simplicity. The original code also targeted DM, PSD, EDS and AMR.
- Addition of scripts for interoperability with the UCCA XML format, under bash/mrp2xml.sh. The original code only supports the MRP format.
- Support for additional features in the input to [the parser model](modules/transition_parser_ucca.py).
- Modification of the preprocessing scripts and data reader, so that preprocessing now parses the CoNLL-U format and saves the attributes in a dict rather than as a CoNLL-U string. The data reader therefore does not need the conllu library and can simply read the attributes from the companion field.
- Fix to read edge attributes from the MRP data rather than edge properties (following the renaming of this element in the MRP format).
- Experiments with various settings, differing by input features (listed in the paper), under config/.
See REPLICATING.md for instructions on replicating the experiments reported in the paper.
Requirements:
- Python 3.6
- AllenNLP 0.9.0
The full MRP training data is available at mrp-data. Specifically, we use the publicly available UCCA data in MRP format.
After creating a Conda environment or a virtualenv, run
pip install -r requirements.txt
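For example, a minimal Conda-based setup could look like the following; the environment name ucca-mrp is arbitrary, and any Python 3.6 environment works:
conda create -n ucca-mrp python=3.6
conda activate ucca-mrp
pip install -r requirements.txt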
The parser uses BERT Large. To get the BERT checkpoints, run
cd bert/
make
To get the data, augment it with the companion data, and split it into training/validation/evaluation sets, run
cd data/
make split
For evaluation data given only as input text in MRP format, you need to convert the companion data to CoNLL-U format:
python3 toolkit/preprocess_eval.py \
udpipe.mrp \
input.mrp \
--outdir /path/to/output
The parser is built on AllenNLP; the training command looks like this:
CUDA_VISIBLE_DEVICES=${gpu_id} \
TRAIN_PATH=${train_set} \
DEV_PATH=${dev_set} \
BERT_PATH=${bert_path} \
WORD_DIM=${bert_output_dim} \
LOWER_CASE=${whether_bert_is_uncased} \
BATCH_SIZE=${batch_size} \
allennlp train \
-s ${model_save_path} \
--include-package utils \
--include-package modules \
--file-friendly-logging \
${config_file}
Refer to bash/train.sh for more detailed examples.
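For illustration only, a filled-in training invocation might look like the following. The data paths, BERT checkpoint directory, batch size, model save path and config file name are placeholders rather than fixed names from this repository, so substitute the actual files under data/, bert/ and config/ (WORD_DIM=1024 matches the hidden size of BERT Large, and LOWER_CASE should reflect whether the chosen checkpoint is uncased):
CUDA_VISIBLE_DEVICES=0 \
TRAIN_PATH=data/train.mrp \
DEV_PATH=data/dev.mrp \
BERT_PATH=bert/cased_L-24_H-1024_A-16 \
WORD_DIM=1024 \
LOWER_CASE=false \
BATCH_SIZE=8 \
allennlp train \
    -s checkpoints/ucca_bert \
    --include-package utils \
    --include-package modules \
    --file-friendly-logging \
    config/ucca_bert.jsonnet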
The prediction command looks like this:
CUDA_VISIBLE_DEVICES=${gpu_id} \
allennlp predict \
--cuda-device 0 \
--output-file ${output_path} \
--predictor ${predictor_class} \
--include-package utils \
--include-package modules \
--batch-size ${batch_size} \
--silent \
${model_save_path} \
${test_set}
More examples can be found in bash/predict.sh.
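For illustration only, assuming a model trained as above under checkpoints/ucca_bert and a test set at data/test.mrp (both placeholders), a prediction run could look like this; the predictor class name below is an assumption, so check bash/predict.sh for the exact value used in the experiments:
CUDA_VISIBLE_DEVICES=0 \
allennlp predict \
    --cuda-device 0 \
    --output-file output/test_pred.mrp \
    --predictor transition_predictor_ucca \
    --include-package utils \
    --include-package modules \
    --batch-size 8 \
    --silent \
    checkpoints/ucca_bert \
    data/test.mrp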
- bash/: command pipelines and examples
- config/: Jsonnet config files
- metrics/: metrics used in training and evaluation
- modules/: implementations of modules
- toolkit/: external libraries and dataset tools
- utils/: code for input/output and pre/post-processing
We thank the developers of the HIT-SCIR parser.