coli-saar/data-augmentation-compgen

Simple and effective data augmentation for compositional generalization

This repository contains the code for the paper Simple and effective data augmentation for compositional generalization.

Environment

The code is tested with Python 3.8.16. To install the required packages, run:

pip install -r requirements.txt

To evaluate execution accuracy on GeoQuery, additionally install SWI-Prolog following its installation instructions, and set val_geo_acc to true in the configuration file (e.g. configs/geoquery/baseline/T5.jsonnet for the GeoQuery dataset).

Datasets

We used the COGS, CFQ, GeoQuery, and SCAN datasets in our experiments. For GeoQuery, we used the preprocessed data from https://github.com/namednil/f-then-r. To use the provided scripts, download the datasets and place them in the data directory as shown below:

data
├── cogs
│   ├── train.tsv
│   ├── dev.tsv
│   ├── test.tsv
│   ├── gen.tsv
├── cfq
│   ├── cfq1.1.tar.gz
├── geoquery
│   ├── data.zip
└── scan
    └── SCAN
        ├── add_prim_split
        │   ├── tasks_train_addprim_turn_left.txt
        │   └── tasks_test_addprim_turn_left.txt
        └── length_split
            ├── tasks_train_length.txt
            └── tasks_test_length.txt
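
Before running the preprocessing script, it can help to confirm that the files are where the tree above says they should be. The following is a minimal sketch of such a check; the helper name missing_files is hypothetical and is not part of this repository, and the file list is simply copied from the tree above.

```python
from pathlib import Path

# Expected files, copied from the directory tree above (paths relative to data/).
EXPECTED_FILES = [
    "cogs/train.tsv",
    "cogs/dev.tsv",
    "cogs/test.tsv",
    "cogs/gen.tsv",
    "cfq/cfq1.1.tar.gz",
    "geoquery/data.zip",
    "scan/SCAN/add_prim_split/tasks_train_addprim_turn_left.txt",
    "scan/SCAN/add_prim_split/tasks_test_addprim_turn_left.txt",
    "scan/SCAN/length_split/tasks_train_length.txt",
    "scan/SCAN/length_split/tasks_test_length.txt",
]


def missing_files(data_dir="data"):
    """Return the expected files that are absent under data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  data/" + rel)
    else:
        print("All expected dataset files are present.")
```

Run it from the repository root; an empty result means the layout matches the tree.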

Preprocessing

To preprocess the datasets, run:

./preprocess.sh

This will create the preprocessed data in the data directory.

Data augmentation

To augment the data, run:

./prepare_augmentation.sh $dataset $augmentation_distribution

where $dataset is one of cfq, geoquery, or scan and $augmentation_distribution is one of uniform, train, or test. For COGS, we use the provided grammar from the original paper to sample novel meaning representations.

The augmentation script consists of three steps:

  1. Estimate the PCFG from the given dataset and sample meaning representations from it.
  2. Use the existing backtranslation model to translate samples to natural language.
  3. Postprocess the output file and place it in the proper directories under data/$dataset/pcfg/.
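
To make step 1 concrete, here is a toy sketch of estimating a PCFG from a handful of meaning-representation trees and sampling from it. It is not the code used by prepare_augmentation.sh: the trees, nonterminal names (Q, E), and function names are invented for illustration. It also assumes that "uniform" means giving all rules with the same left-hand side equal probability, while "train" means relative-frequency estimates from the training trees; the actual script may define these differently.

```python
import random
from collections import Counter, defaultdict

# Toy trees as (nonterminal, children); children are terminals (str) or subtrees.
# These examples are invented for illustration, not taken from any dataset.
TOY_TREES = [
    ("Q", ["answer", "(", ("E", ["state"]), ")"]),
    ("Q", ["answer", "(", ("E", ["largest", "(", ("E", ["city"]), ")"]), ")"]),
    ("Q", ["answer", "(", ("E", ["next_to", "(", ("E", ["state"]), ")"]), ")"]),
]


def count_rules(tree, counts):
    """Record the rule used at the root of `tree`, then recurse into subtrees."""
    lhs, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[lhs][rhs] += 1
    for child in children:
        if isinstance(child, tuple):
            count_rules(child, counts)


def estimate_pcfg(trees, uniform=False):
    """Map each nonterminal to {rhs: probability}.

    uniform=False: relative-frequency estimate from the given trees.
    uniform=True: equal probability for every observed rule of a nonterminal
    (an assumed reading of the "uniform" option).
    """
    counts = defaultdict(Counter)
    for tree in trees:
        count_rules(tree, counts)
    pcfg = {}
    for lhs, rhs_counts in counts.items():
        total = sum(rhs_counts.values())
        pcfg[lhs] = {
            rhs: (1.0 / len(rhs_counts) if uniform else n / total)
            for rhs, n in rhs_counts.items()
        }
    return pcfg


def sample_tokens(pcfg, symbol, rng, depth=0, max_depth=30):
    """Expand `symbol` top-down and return the sampled terminal tokens."""
    if symbol not in pcfg:  # anything without rules is treated as a terminal
        return [symbol]
    if depth >= max_depth:
        raise ValueError("derivation exceeded max_depth")
    rules = list(pcfg[symbol].items())
    rhs = rng.choices([r for r, _ in rules], weights=[p for _, p in rules])[0]
    tokens = []
    for child in rhs:
        tokens.extend(sample_tokens(pcfg, child, rng, depth + 1, max_depth))
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    for uniform in (False, True):
        pcfg = estimate_pcfg(TOY_TREES, uniform=uniform)
        print("uniform" if uniform else "train", " ".join(sample_tokens(pcfg, "Q", rng)))
```

Under these assumptions, sampling from the test distribution would amount to estimating the rule probabilities from test-set meaning representations instead of the training ones.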

Training

To train the model, run:

./train.sh $config $seed

where $config is one of the configuration files in the configs directory and $seed is the random seed. This will create a directory under model_archives with the model checkpoints and logs.
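
When results are averaged over several random seeds, the training command can be launched once per seed. The sketch below only assumes the ./train.sh $config $seed interface described above; the helper names and the seed values 0, 1, 2 are arbitrary placeholders, and the script prints the commands instead of executing them unless dry_run is set to False.

```python
import subprocess


def train_commands(config, seeds):
    """Build one `./train.sh $config $seed` command per seed."""
    return [["./train.sh", config, str(seed)] for seed in seeds]


def run_all(config, seeds, dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    for cmd in train_commands(config, seeds):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop as soon as one training run fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # The seed values are placeholders, not the seeds used in the paper.
    run_all("configs/geoquery/baseline/T5.jsonnet", seeds=[0, 1, 2])
```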

Evaluation

To evaluate the model, run:

./evaluate.sh $archive_path $data_path

where $archive_path is the path to the model archive directory and $data_path is the path to the test data.

Citation

@inproceedings{yao-koller-2024-simple,
    title = "Simple and effective data augmentation for compositional generalization",
    author = "Yao, Yuekun  and
      Koller, Alexander",
    editor = "Duh, Kevin  and
      Gomez, Helena  and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.25",
    pages = "434--449",
    abstract = "Compositional generalization, the ability to predict complex meanings from training on simpler sentences, poses challenges for powerful pretrained seq2seq models. In this paper, we show that data augmentation methods that sample MRs and backtranslate them can be effective for compositional generalization, but only if we sample from the right distribution. Remarkably, sampling from a uniform distribution performs almost as well as sampling from the test distribution, and greatly outperforms earlier methods that sampled from the training distribution. We further conduct experiments to investigate the reason why this happens and where the benefit of such data augmentation methods come from.",
}
