Code for Uncertainty Quantification with Pre-trained Language Models: An Empirical Analysis (EMNLP 2022 Findings).
Dependencies:
- PyTorch = 1.10.1
- Bayesian-Torch = 0.1
- HuggingFace Transformers = 4.11.1
Our empirical analysis consists of the following three NLP (natural language processing) classification tasks:
task_id | Task | In-Domain Dataset | Out-Of-Domain Dataset |
---|---|---|---|
Task1 | Sentiment Analysis | IMDb | Yelp |
Task2 | Natural Language Inference | MNLI | SNLI |
Task3 | Commonsense Reasoning | SWAG | HellaSWAG |
You can download our input data here and unzip it to the current directory.
Then the corresponding data splits of each task are stored in Data/{task_id}/Original:
- train.pkl, dev.pkl, and test_in.pkl come from the in-domain dataset.
- test_out.pkl comes from the out-of-domain dataset.
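
The data splits are standard Python pickle files. Below is a minimal sketch for a quick inspection; the structure of the unpickled object (e.g., a list of examples or a DataFrame) is an assumption and should be verified after loading:

```python
import pickle

# Load the in-domain training split for Task1 (Sentiment Analysis).
# NOTE: the exact structure of the unpickled object is an assumption --
# inspect it after loading to see how examples are stored.
with open("Data/Task1/Original/train.pkl", "rb") as f:
    train_split = pickle.load(f)

print(type(train_split))
if hasattr(train_split, "__len__"):
    print(f"number of entries: {len(train_split)}")
```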
Specify the target `model_name` and `task_id` in Code/run.sh:
- `model_name` is specified in the format of `{PLM}_{size}-{loss}`. For example, `bert_base-ce` denotes a BERT-base pipeline trained with the cross-entropy loss.
  - `{PLM}` (Pre-trained Language Model) can be chosen from `bert`, `xlnet`, `electra`, `roberta`, and `deberta`.
  - `{size}` can be chosen from `base` and `large`.
  - `{loss}` can be chosen from `br` (Brier loss), `fl` (focal loss), `ce` (cross-entropy), `ls` (label smoothing), and `mm` (maximum mean calibration error).
- `task_id` can be chosen from `Task1` (Sentiment Analysis), `Task2` (Natural Language Inference), and `Task3` (Commonsense Reasoning).
Other hyperparameters (e.g., learning rate, batch size, and number of training epochs) are defined in Code/info.py.
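
Purely as a hypothetical illustration of the kind of settings kept there (the variable names and values below are assumptions, not the actual contents of Code/info.py):

```python
# Hypothetical sketch only -- the real hyperparameters live in Code/info.py
# and may use different names, values, and per-task/per-model granularity.
HYPERPARAMS = {
    "learning_rate": 2e-5,    # typical fine-tuning learning rate for PLMs
    "batch_size": 16,
    "num_train_epochs": 3,
}
```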
Use the command `bash Code/run.sh` to run one sweep of experiments:
- Transform the original data input in Data/{task_id}/Original to the model-specific data input in Data/{task_id}/{model_name}.
- Train six deterministic (version=`det`) PLM-based pipelines (used for `Vanilla`, `Temp Scaling` (temperature scaling), `MC Dropout` (Monte Carlo dropout), and `Ensemble`), stored in Result/{task_id}/{model_name}.
- Train six stochastic (version=`sto`) PLM-based pipelines (used for `LL SVI` (last-layer stochastic variational inference)), stored in Result/{task_id}/{model_name}.
- Test the above pipelines with five kinds of uncertainty quantifiers (`Vanilla`, `Temp Scaling`, `MC Dropout`, `Ensemble`, and `LL SVI`) under two domain settings (`test_in` and `test_out`) based on four metrics (`ERR` (prediction error), `ECE` (expected calibration error), `RPP` (reversed pair proportion), and `FAR95` (false alarm rate at 95% recall)).
- The evaluation of each (uncertainty quantifier, domain setting, metric) combination consists of six trials, and the results are stored in Result/{task_id}/{model_name}/result_score.pkl.
- The ground-truth labels and raw probability outputs are stored in Result/{task_id}/{model_name}/result_prob.pkl.
- All the training and testing stdouts are stored in Result/{task_id}/{model_name}/.
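
To look at the aggregated scores of one sweep, result_score.pkl can be loaded with pickle. A minimal sketch follows; the internal layout of the stored object (e.g., a dict keyed by (quantifier, domain, metric)) is an assumption and should be checked by printing the keys:

```python
import pickle

# Inspect the aggregated scores of one (task_id, model_name) sweep.
# NOTE: the internal layout of result_score.pkl is an assumption --
# print the type/keys first to see how the six-trial results are organized.
with open("Result/Task1/bert_base-ce/result_score.pkl", "rb") as f:
    scores = pickle.load(f)

if isinstance(scores, dict):
    for key in list(scores)[:5]:
        print(key, scores[key])
else:
    print(type(scores))
```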
We store our empirical observations in results.pkl. You can download this dictionary here.
- The key is in the format of `({task}, {model}, {quantifier}, {domain}, {metric})`.
  - `{task}` can be chosen from `Sentiment Analysis`, `Natural Language Inference`, and `Commonsense Reasoning`.
  - `{model}` can be chosen from `bert_base-br`, `bert_base-ce`, `bert_base-fl`, `bert_base-ls`, `bert_base-mm`, `bert_large-ce`, `deberta_base-ce`, `deberta_large-ce`, `electra_base-ce`, `electra_large-ce`, `roberta_base-ce`, `roberta_large-ce`, `xlnet_base-ce`, and `xlnet_large-ce`.
  - `{quantifier}` can be chosen from `Vanilla`, `Temp Scaling`, `MC Dropout`, `Ensemble`, and `LL SVI`.
  - `{domain}` can be chosen from `test_in` and `test_out`.
  - `{metric}` can be chosen from `ERR`, `ECE`, `RPP`, and `FAR95`. Note that `FAR95` only works with the domain setting of `test_out`.
- The value is in the format of `(mean, standard error)`, calculated based on six trials with different seeds.
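
A minimal sketch for querying this dictionary, assuming results.pkl is a plain Python pickle of the mapping described above:

```python
import pickle

# Load the released dictionary of empirical observations.
with open("results.pkl", "rb") as f:
    results = pickle.load(f)

# Keys follow ({task}, {model}, {quantifier}, {domain}, {metric});
# values are (mean, standard error) over six trials.
key = ("Sentiment Analysis", "bert_base-ce", "Temp Scaling", "test_out", "ECE")
mean, stderr = results[key]
print(f"{key}: {mean:.4f} +/- {stderr:.4f}")
```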
@inproceedings{xiao2022uncertainty,
title={Uncertainty Quantification with Pre-trained Language Models: An Empirical Analysis},
author={Xiao, Yuxin and Liang, Paul Pu and Bhatt, Umang and Neiswanger, Willie and Salakhutdinov, Ruslan and Morency, Louis-Philippe},
booktitle={Findings of EMNLP},
year={2022}
}