This repository contains code for the paper *How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering*.
Our code is mainly based on [T5](https://github.com/google-research/text-to-text-transfer-transformer) and [mesh-tensorflow](https://github.com/tensorflow/mesh), and runs on TPUs. Please follow the original T5 repository to properly set up TPUs.
To install the required packages, download T5 (version 0.6.4) and mesh-tensorflow (version 0.1.16) and copy their source files into the `t5` and `mesh_tensorflow` folders. Do not overwrite files that already exist in these folders; those are the files we modified for calibration purposes.
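For example, one way to merge the pinned versions in without clobbering the modified files is sketched below (the source distributions come from PyPI; the directory layout inside the archives is an assumption, so adjust the paths if they differ on your system):

```bash
# Fetch the source distributions for the pinned versions (assumed archive layout).
pip download --no-deps --no-binary :all: t5==0.6.4 mesh-tensorflow==0.1.16
tar xzf t5-0.6.4.tar.gz
tar xzf mesh-tensorflow-0.1.16.tar.gz
# cp -n never overwrites existing files, so the calibration-specific
# files already in this repository are kept intact.
cp -rn t5-0.6.4/t5/* t5/
cp -rn mesh-tensorflow-0.1.16/mesh_tensorflow/* mesh_tensorflow/
```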
Run the following command to fine-tune the UnifiedQA models with the `softmax` or `margin` objective function. `$tpu` specifies the name of the TPU, `$model_output` specifies the output location where the fine-tuned model is saved, and `$objective` specifies the objective function to use.
```bash
./finetune.sh $tpu 3B $model_output $objective uq_clean_train_ol_mix train mc
```
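A concrete invocation might look like this (the TPU name and GCS bucket are placeholders, and we assume the objective is passed literally as `softmax`, one of the two objectives named above):

```bash
# Fine-tune the 3B UnifiedQA model with the softmax objective (hypothetical paths).
./finetune.sh my-tpu 3B gs://my-bucket/models/unifiedqa_3B_softmax softmax uq_clean_train_ol_mix train mc
```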
Run the following command to evaluate the probabilities of candidate answers. `$score_output` specifies the location where the output is saved, and `1103000` specifies the checkpoint to use.
```bash
./score.sh $tpu $score_output $model_output 1103000 uq_clean_test dev
```
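For example, scoring the dev split with the model fine-tuned in the previous step (the paths are again placeholders):

```bash
# Score candidate answers with the checkpoint at step 1103000 (hypothetical paths).
./score.sh my-tpu gs://my-bucket/scores/uq_clean_test_dev gs://my-bucket/models/unifiedqa_3B_softmax 1103000 uq_clean_test dev
```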
Run the following command to compute the expected calibration error (ECE) metric given the probabilities of candidate answers.
```bash
python cal.py --mix uq_clean_test --split dev --score $score_output
```
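ECE bins predictions by confidence and reports the weighted average of the absolute gap between accuracy and mean confidence within each bin, so lower values indicate better calibration. Continuing the placeholder paths from the scoring step above:

```bash
# Compute ECE over the scores produced in the previous step (hypothetical path).
python cal.py --mix uq_clean_test --split dev --score gs://my-bucket/scores/uq_clean_test_dev
```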