This repo contains the code for our paper "Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases". In particular, it contains code to train various models that are debiased, meaning they are trained to avoid using particular strategies that are known to work well on the training data, but do not generalize to out-of-domain or adversarial settings.
Code for our VQA experiments is in a separate repo.
Details and links to the TriviaQA-CP dataset we constructed are in the triviaqa_cp folder.
This repo contains code to run our debiasing methods on four test cases:
- MNLI modified to contain a synthetic bias
- MNLI with HANS as the test set
- SQuAD with Adversarial SQuAD as the test set
- The TriviaQA-CP datasets, which we construct from TriviaQA
Our implementation exists in the debias folder, and uses tensorflow 1.13.1.
The MNLI task has an alternative BERT implementation using pytorch in debias/bert.
The actual implementation of the ensemble loss functions can be found in three places:
- `debias/modules/clf_debias_loss_functions.py` for tensorflow classification models
- `debias/modules/qa_debias_loss_functions.py` for tensorflow QA models
- `debias/bert/clf_debias_loss_functions.py` for pytorch classification models
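As a quick reference for what these files compute, here is a minimal pytorch-style sketch of the Bias Product and Learned-Mixin (+H) losses. It follows the equations in the paper, but the function and argument names (`main_logits`, `bias_log_probs`, etc.) are illustrative, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def bias_product_loss(main_logits, bias_log_probs, labels):
    # Ensemble by summing log-probabilities (i.e., multiplying the two
    # distributions), then train with cross-entropy on the combination.
    main_log_probs = F.log_softmax(main_logits, dim=1)
    combined = F.log_softmax(main_log_probs + bias_log_probs, dim=1)
    return F.nll_loss(combined, labels)

def learned_mixin_loss(main_logits, bias_log_probs, labels,
                       g_logit, entropy_weight=0.0):
    # Like bias product, but a learned scalar g(x) >= 0 (one per example)
    # controls how much the bias-only prediction is trusted.
    g = F.softplus(g_logit)                      # [batch, 1]
    scaled_bias = g * bias_log_probs             # [batch, n_classes]
    main_log_probs = F.log_softmax(main_logits, dim=1)
    combined = F.log_softmax(main_log_probs + scaled_bias, dim=1)
    loss = F.nll_loss(combined, labels)
    if entropy_weight > 0:
        # The +H variant adds an entropy penalty on the scaled bias
        # distribution, which keeps g from collapsing to zero.
        bias_dist = F.softmax(scaled_bias, dim=1)
        entropy = -(bias_dist * torch.log(bias_dist + 1e-12)).sum(1).mean()
        loss = loss + entropy_weight * entropy
    return loss
```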
We require python>=3.6 and tensorflow 1.13.1. Additional requirements are in `requirements.txt`.
To install, make sure tensorflow 1.13.1 is installed, then run:

```
pip3 install -r requirements.txt
```
The bert implementation additionally requires pytorch 1.1.0 and the hugging-face pytorch-pretrained-bert package.
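For example, something like the following should work (an illustrative install line, not a tested recipe; the right torch build depends on your CUDA setup):

```
pip3 install torch==1.1.0 pytorch-pretrained-bert
```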
Scripts will automatically download any data they need. See config.py for the download locations; by default everything will be downloaded to ./data. Be patient the first time models are run, since some of the downloads can take a while.

To download everything beforehand, run:

```
python debias/download_all.py
```

All the data takes about 2.1 GB.
On a fresh machine with Ubuntu 18.04, I got the tensorflow code running by installing Cuda 10.0 and Cudnn 7.6.2, then running:

```
sudo apt install python3-pip
pip3 install tensorflow-gpu==1.13.1
pip3 install -r requirements.txt
```
Each task has a corresponding script in debias/experiments/train_debiased_*. Scripts take command line options to specify which debiasing method to use, and sometimes additional options to specify different variations of the task.
For example, to train a model on SQuAD with the TFIDF Filtered bias:

```
python debias/experiments/train_debiased_squad.py --bias tfidf_filtered --output_file /path/to/output
```
Or to train a model on the TriviaQA-CP Location dataset with the Reweight method:
```
python debias/experiments/train_debiased_triviaqa_cp.py --dataset location --output_dir /path/to/output --mode reweight
```
See the command line options (e.g., `python debias/experiments/train_debiased_squad.py --help`) for additional options.
Models are automatically evaluated after training, but can be re-evaluated using the evaluation scripts in debias/experiments/eval_*.
The BERT model for HANS has its own script; it does not require tensorflow to be installed, but does require pytorch 1.1 and pytorch-pretrained-bert. It can be run with:

```
python debias/bert/train_bert.py --do_train --do_eval --output_dir /path/to/output
```
I highly recommend using the --fp16 flag as well if you have apex installed; it's about 2x faster.
Results should match the numbers in our paper, although please note that individual runs have a moderately high variance.
Our SQuAD and TriviaQA models require data to be pre-processed using CoreNLP, and most experiments require pre-training a bias-only model. To help ensure our main experiments are easy to reproduce, the experiment scripts download cached pre-processed data instead of re-building it, so they do NOT require completing these steps yourself.
The bias-only model for MNLI can be trained with:

```
python debias/preprocessing/build_mnli_bias_only.py /path/to/output/dir
```
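To show how these bias-only predictions are consumed downstream, here is a minimal pytorch-style sketch of the Reweight method, which scales each example's loss by one minus the probability the bias-only model assigns to the correct label. The names and the use of a plain mean are illustrative assumptions, not the repo's exact code:

```python
import torch.nn.functional as F

def reweight_loss(main_logits, bias_probs, labels):
    # Per-example cross-entropy for the main model.
    per_example = F.cross_entropy(main_logits, labels, reduction="none")
    # weight_i = 1 - p_bias(correct label), so examples the bias-only
    # model already gets right contribute little to the gradient.
    weights = 1.0 - bias_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    return (per_example * weights).mean()
```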
The CoreNLP annotated data can be built by starting a CoreNLP server (we used release 2018-10-05, v3.9.2). For example, run from inside the corenlp directory:

```
java -mx8g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -threads 8 -quiet
```
and then running:

```
python debias/preprocessing/build_annotated_squad.py /path/to/source /path/to/output.pkl --port 9000
python debias/preprocessing/build_annotated_triviaqa.py /path/to/source /path/to/output.pkl --port 9000
```
These scripts can be slow, but support multiprocessing with the --n_processes flag (in which case the CoreNLP server should be given multiple threads as well, as in the example).
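If you want to sanity-check the annotations, you can also query a running server directly over its HTTP API. This is a generic CoreNLP server example (the annotator list is an assumption based on the tags mentioned in the note below: tokens, POS, and NER), not code from this repo:

```python
import json
import requests

def annotate(text, port=9000):
    # POST raw text to the CoreNLP server; the 'properties' parameter
    # selects the annotators and the output format.
    props = {"annotators": "tokenize,ssplit,pos,lemma,ner",
             "outputFormat": "json"}
    resp = requests.post(
        "http://localhost:%d" % port,
        params={"properties": json.dumps(props)},
        data=text.encode("utf-8"),
    )
    resp.raise_for_status()
    return resp.json()

doc = annotate("Sam visited Seattle in 2019.")
for sentence in doc["sentences"]:
    for token in sentence["tokens"]:
        print(token["word"], token["pos"], token["ner"])
```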
Note that, for unknown reasons, I have seen very minor discrepancies between the output of these scripts and our cached NER tags (overall about 0.5% of tokens). The POS tags and tokens match the cached data exactly. I don't expect this to significantly alter results.
The SQuAD bias-only model can be trained by:

```
python debias/preprocessing/train_squad_bias.py path/to/model_dir path/to/prediction/dir
```
I am still working on uploading the TriviaQA-CP bias-only model.
Below we present the results on HANS, with the addition of the max, min, and standard deviation across our 8 runs.
For BERT:

| Debiasing Method | Mean | Std | Min | Max |
|---|---|---|---|---|
| None | 62.40 | 2.35 | 57.97 | 65.98 |
| Reweight | 69.19 | 3.54 | 62.53 | 74.52 |
| Bias Product | 67.92 | 3.71 | 60.63 | 71.51 |
| Learned-Mixin | 64.00 | 3.03 | 57.49 | 68.01 |
| Learned-Mixin +H | 66.15 | 2.57 | 60.59 | 68.55 |
For the Recurrent Model:

| Debiasing Method | Mean | Std | Min | Max |
|---|---|---|---|---|
| None | 50.58 | 0.39 | 49.81 | 51.05 |
| Reweight | 52.85 | 0.69 | 51.58 | 53.88 |
| Bias Product | 53.69 | 1.07 | 52.02 | 55.63 |
| Learned-Mixin | 51.65 | 0.58 | 50.60 | 52.25 |
| Learned-Mixin +H | 53.35 | 1.04 | 51.97 | 54.82 |
If you use this work, please cite:
"Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases". Christopher Clark, Mark Yatskar, Luke Zettlemoyer. In EMNLP 2019.