The parse_xfm model
The amrlib parse_xfm module is designed to train and then run inference on sequence-to-sequence style transformer models that are fine-tuned to convert an English sentence to an AMR graph. There are several released models that use this same code module, including parse_xfm_bart_large, parse_xfm_bart_base and parse_t5. See amrlib-models for links to download the trained models.
There is no technical paper on the models other than this wiki. For specifics on how to use the code see ReadTheDocs/training.
Sequence-to-sequence (aka encoder-decoder) transformer models are often used for translating from one language to another. Here we can take advantage of them to translate from an English sentence to a text string version of an AMR graph. There is very little that is AMR specific in the process. We are simply taking a pretrained transformer and fine-tuning it to translate from English to AMR, the same way you would fine-tune one to translate from English to German.
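To make the idea concrete, here is a minimal fine-tuning sketch, not amrlib's actual training script: the source text is the English sentence and the target text is the serialized graph, set up the same way a translation task would be. The checkpoint name, hyperparameters and the toy data pair are placeholders.
```python
# A minimal fine-tuning sketch, not amrlib's actual training script.
# The checkpoint name, hyperparameters and toy data pair below are placeholders;
# real training uses sentence/graph pairs from the AMR corpus.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = 'facebook/bart-large'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def encode(sentence, graph_string):
    # Source side is the English sentence, target side is the serialized graph,
    # exactly as you would set up an English-to-German translation task.
    enc = tokenizer(sentence, truncation=True, max_length=1024)
    enc['labels'] = tokenizer(graph_string, truncation=True,
                              max_length=1024)['input_ids']
    return enc

# Toy example pair; in practice these come from the AMR training corpus.
pairs = [('The boy wants to go.',
          '(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))')]
train_data = [encode(s, g) for s, g in pairs]

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir='parse_xfm_out',
                                  num_train_epochs=8,
                                  per_device_train_batch_size=8),
    train_dataset=train_data,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()
```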
The only thing that is specific to AMR is the way the graphs are converted to a text string. The AMR corpus itself has the graphs represented as strings and, in theory, this format could be used directly. However, it turns out that the corpus format is a little complicated and simplifying it a bit leads to better outcomes. The format amrlib uses is very close to the original but removes extra spaces and linefeeds and changes the way enumerations are represented. When serializing the graphs it does a depth-first recursion which, I believe, is the way they are serialized for the corpus. If you want to know more, feel free to look at the code at penman_serializer.py.
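As an illustration only (the real logic lives in penman_serializer.py), the whitespace part of the simplification amounts to something like this:
```python
# Illustrative only: collapse the multi-line corpus format into a single-line
# string with runs of whitespace squeezed to single spaces. The enumeration
# handling and depth-first serialization in penman_serializer.py are not shown.
import re

corpus_graph = """(w / want-01
      :ARG0 (b / boy)
      :ARG1 (g / go-02
               :ARG0 b))"""

def flatten(graph_str):
    return re.sub(r'\s+', ' ', graph_str).strip()

print(flatten(corpus_graph))
# (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))
```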
The training and inference code takes care of converting into and out of this format so the fact that it's happening is invisible to the user.
These fine-tuned transformer models tend to produce some small errors in their output, such as missing or extra parentheses. The deserialization code attempts to reconstruct the graphs as best it can and will usually be able to correct for small errors. If the errors are too large and it is unable to properly construct a graph, it will return None for that output. Setting num_beams higher than 1 at inference can fix this issue, as it gives the code a few different output possibilities to work with.
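For example, a typical inference call looks something like the following; the exact keyword handling is described in ReadTheDocs/training, so treat the parameter values here as illustrative:
```python
import amrlib

# Load the released parse model (see amrlib-models for downloads); keyword
# arguments such as num_beams here are illustrative values.
stog = amrlib.load_stog_model(num_beams=4)

graphs = stog.parse_sents(['The boy wants to go.'])
for graph in graphs:
    # Each entry is a penman-style graph string, or None if the model output
    # could not be reconstructed into a valid graph.
    print(graph)
```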
The parse_xfm module is set up to handle any Huggingface/transformers pretrained model that loads with AutoModelForSeq2SeqLM.from_pretrained(). There are a number of these types of models on the Huggingface models site, but bart and t5 are the most commonly used, and bart-large has proven to be the best for the parse task. In my experiments I have tried several other models, including t5-large (which has about 3X the params of bart-large), and have not found any that beat the bart-large model.
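In other words, any checkpoint the auto classes can load is a candidate starting point; for instance (the model names here are just examples from the Huggingface hub):
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Any seq2seq checkpoint that loads this way can, in principle, be fine-tuned
# with the same code; only the name changes.
for name in ('facebook/bart-large', 't5-large'):
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    print(name, f'{model.num_parameters():,} parameters')
```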
Generally people use a pretrained model supplied by a company like Facebook (bart) or Google (t5) and then fine-tune it. Theoretically you could pretrain a model from scratch yourself. However, the GPU resources required to do this are considerable, so most people rely on the pretrained models to start with.
Recently, decoder-only models such as GPT-X, Chat-GPT, Llama, etc. have become very popular. The parse_xfm code is not set up to fine-tune these types of models. However, I have tried fine-tuning several of them with different code, and the results are generally worse than a bart model. It appears that the decoder-only style models are just not as good at language translation tasks. Admittedly, the only way I can fine-tune a 7B size LM is to use something like QLoRA, and this can cause some quality degradation. Maybe in the future one of these models will show improved scores, but for now bart-large seems to be the best pretrained model available for this task.
This is a big topic that could fill a research paper itself. Here are just a few quick notes on it.
The majority of recent papers I see on new parse models tend to be transformer based, and bart-large is the predominant pretrained model used. Models such as SPRING and AMRBart have a similar parsing process to the one described here but bring various enhancements to try to improve parse scores. Things that people have used to try to improve scoring include various types of pretraining, data augmentation with "silver" data and enhancements to the base model itself. At this time, most of the bart-large based models that employ these types of methods still score very close to the basic implementation here in amrlib.
Older models such as parse-gsii often use a transition-based parsing method that is not (fully) transformer based. IBM appears to have a recent model that is based on transition parsing.
The AMR-3 corpus that these models are trained on consists mostly of shorter, single-sentence entries. There are some "multi-sentence" graphs in the training corpus, but these generally cover only a few sentences, not a full paragraph of text. The models work best one sentence at a time. The bart-large model itself is limited to 1024 tokens, so there is a hard limit to the size of output graph you can get.
If you want paragraph-sized AMR graphs, you'll have to look into something like DocAMR. amrlib doesn't directly support this. IBM's github page may have code for this.
There are no amrlib models trained on AMR-2. AMR-3 contains all of AMR-2 plus corrections and additional graphs, so there isn't much reason to train on the older set. When you are looking at parsing scores in the literature, be sure to check which set is being used. AMR-3 has a more challenging test set and tends to score a point or so lower than the older AMR-2 test set.
The process to generate text from an AMR graph is basically the same as described above. The only difference is that instead of translating from an English sentence to an AMR graph string, you're doing the reverse. amrlib has the generate_xfm code to do this, along with a separately released model.
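Usage mirrors the parsing direction; a short sketch, assuming the released generate_xfm model is installed (see the amrlib docs for the exact return values):
```python
import amrlib

# Load the released graph-to-sentence model (see amrlib-models for downloads).
gtos = amrlib.load_gtos_model()

graph = '(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))'
sents, _ = gtos.generate([graph])
print(sents[0])  # something like: "The boy wants to go."
```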
With transformer models, you can generally train them on multiple tasks, and fine-tuning a single model to do both English to AMR and AMR to English is completely feasible. In amrlib I chose to keep the models separate for simplicity, but others have released models that do both.