We collect 14 publicly available datasets as out-of-distribution (OOD) test data and evaluate popularly used models on 8 classic NLP tasks. Our findings confirm that OOD accuracy in NLP tasks deserves more attention: we observe a significant performance decay relative to in-distribution (ID) accuracy in all settings.
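As a rough illustration of the ID versus OOD comparison described above, the sketch below computes accuracy on two toy prediction sets and the relative drop between them. The function names `accuracy` and `performance_decay`, the toy labels, and the choice to report the drop as a fraction of ID accuracy are all assumptions made for this example; the paper's own evaluation scripts and metrics may differ.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def performance_decay(id_acc: float, ood_acc: float) -> float:
    """Relative drop from ID to OOD accuracy (hypothetical definition)."""
    if id_acc <= 0:
        raise ValueError("ID accuracy must be positive")
    return (id_acc - ood_acc) / id_acc


# Toy data for illustration only; these are not results from the benchmark.
id_acc = accuracy([1, 0, 1, 1], [1, 0, 1, 0])
ood_acc = accuracy([1, 1, 0, 0], [1, 0, 1, 0])
print(f"ID accuracy:  {id_acc:.2f}")
print(f"OOD accuracy: {ood_acc:.2f}")
print(f"Relative decay: {performance_decay(id_acc, ood_acc):.2%}")
```

In practice the prediction lists would come from a model fine-tuned on the ID training set and evaluated once on the ID test set and once on each OOD test set for the same task.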
Please check out these examples from Hugging Face Transformers to fine-tune your custom models.
The data for all OOD tests can be found here.
Shuibai Zhang (Code work and Experiments Implementation); Linyi Yang (Guidance and Experiments Design); Wei Zhou (Website Implementation)
If you find this work helpful for your research, please consider citing the paper as follows.
@article{yang2022glue,
title={GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective},
author={Yang, Linyi and Zhang, Shuibai and Qin, Libo and Li, Yafu and Wang, Yidong and Liu, Hanmeng and Wang, Jindong and Xie, Xing and Zhang, Yue},
journal={arXiv preprint arXiv:2211.08073},
year={2022}
}