Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness

Description

This repo contains a diagnostic evaluation benchmark for the robustness of text-to-SQL models. It comprises 17 perturbation test sets that measure model robustness from different angles. It is released along with our ICLR 2023 paper, Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness; see the paper for details.

The dataset is built from the development set of the Spider dataset; our changes are meant to supplement the work done by Spider.

Preprocessing

First, unzip the data using the following command.

mkdir data
tar -xvf data.tar.gz -C data

Then run data_preprocess.py to copy the pre-perturbation databases and tables from the original Spider development set.

python data_preprocess.py

To Use

Each folder contains one perturbation test set. There are 3 DB perturbation test sets (starting with DB_), 9 NLQ perturbation test sets (starting with NLQ_), and 5 SQL perturbation test sets (starting with SQL_). Each test set contains parallel pre-perturbation and post-perturbation test data.

DB_*: data with DB perturbation. Each folder contains two database folders, two table files, and two question files, corresponding to the pre-perturbation and post-perturbation data.
NLQ_*: data with NLQ perturbation. Each folder contains a single database folder, a single table file, and two question files (one for pre-perturbation and the other for post-perturbation).
SQL_*: data with SQL perturbation. Each folder contains a single database folder, a single table file, and two question files (one for pre-perturbation and the other for post-perturbation).
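
The prefix convention above makes it easy to group the test sets by perturbation type. The following is a minimal sketch, not part of this repo; the folder names in the example are hypothetical placeholders rather than the actual test set names.

```python
from collections import defaultdict

# The three prefixes described above.
PREFIXES = ("DB_", "NLQ_", "SQL_")


def group_by_perturbation_type(folder_names):
    """Group folder names by their DB_/NLQ_/SQL_ prefix.

    Names without one of the three prefixes are ignored.
    """
    groups = defaultdict(list)
    for name in sorted(folder_names):
        for prefix in PREFIXES:
            if name.startswith(prefix):
                groups[prefix.rstrip("_")].append(name)
                break
    return dict(groups)


# Hypothetical folder names, for illustration only.
example = ["DB_example_a", "NLQ_example_b", "SQL_example_c",
           "NLQ_example_d", "Spider-dev"]
grouped = group_by_perturbation_type(example)
print(grouped)
```

In practice the list of names could come from listing the unpacked data directory, for example with os.listdir("data").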

First, run the model on the Spider dev set to get the predicted SQL queries and put them in predictions/Spider-dev/[model_name]/pred.sql. Then, run the model on each post-perturbation set and put the predicted SQL queries in predictions/[perturbation_name]/[model_name]/pred.sql.
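
As an illustration of the prediction layout described above, here is a small sketch that builds the expected path and writes a prediction file. The helper names are hypothetical, and writing one query per line is an assumption about the format the evaluation expects; check the evaluation script you use.

```python
from pathlib import Path


def prediction_path(test_set, model_name, root="predictions"):
    """Return predictions/[test_set]/[model_name]/pred.sql as a Path."""
    return Path(root) / test_set / model_name / "pred.sql"


def write_predictions(test_set, model_name, sql_queries, root="predictions"):
    """Write one predicted SQL query per line (assumed format)."""
    path = prediction_path(test_set, model_name, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Replace newlines inside a query so each query stays on a single line.
    lines = [query.replace("\n", " ").strip() for query in sql_queries]
    path.write_text("\n".join(lines) + "\n")
    return path


# "my_model" is a placeholder model name.
print(prediction_path("Spider-dev", "my_model").as_posix())
```

Calling write_predictions once for Spider-dev and once for each post-perturbation set would produce the directory tree that the evaluation step reads.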

To Evaluate a Model

Run copy_pre_perturbation_predictions.py to copy the SQL predictions for Spider-dev to all pre-perturbation sets. Then evaluate the model on each pre-perturbation and post-perturbation set using the test-suite evaluation.

python copy_pre_perturbation_predictions.py --model [model_name]

Leaderboard

Pre-perturbation and post-perturbation accuracy in terms of execution (EX)

The tables below report the EX accuracy of models on pre-perturbation and post-perturbation data. We report the macro-average results over the perturbation test sets in the DB, NLQ, and SQL groups. x-y denotes the accuracy on pre-perturbation data (x) and post-perturbation data (y).
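
To make the macro average concrete, the sketch below averages per-set accuracies with equal weight per test set, and combines the three group averages using the set counts (3 DB, 9 NLQ, 5 SQL). This is an illustration, not the authors' evaluation code. The Picard group averages come from the leaderboard; because they are rounded, the recomputed overall value is only an approximate consistency check.

```python
# Number of test sets per perturbation group, as stated in "To Use".
SET_COUNTS = {"DB": 3, "NLQ": 9, "SQL": 5}


def macro_average(accuracies):
    """Unweighted mean over test sets: every set counts equally."""
    accuracies = list(accuracies)
    if not accuracies:
        raise ValueError("need at least one test set")
    return sum(accuracies) / len(accuracies)


def overall_from_groups(group_averages, set_counts=SET_COUNTS):
    """Combine per-group macro averages into an average over all sets."""
    total_sets = sum(set_counts.values())
    weighted = sum(group_averages[g] * set_counts[g] for g in set_counts)
    return weighted / total_sets


# Picard group averages (pre- and post-perturbation) from the leaderboard.
picard_pre = {"DB": 78.9, "NLQ": 76.0, "SQL": 76.3}
picard_post = {"DB": 55.0, "NLQ": 65.0, "SQL": 74.0}
print(round(overall_from_groups(picard_pre), 1))   # 76.6
print(round(overall_from_groups(picard_post), 1))  # 65.9
```

For the Picard row, the recomputed values agree with the reported "Average of all test sets" column, which suggests that column is a macro average over all 17 test sets.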

Evaluation of Finetuned Models

Model	Average of DB perturbation test sets	Average of NLQ perturbation test sets	Average of SQL perturbation test sets	Average of all test sets
Picard	78.9-55.0	76.0-65.0	76.3-74.0	76.6-65.9
SmBoP	74.7-50.0	76.6-58.1	74.7-72.2	75.7-60.8
T5-3B LK	73.5-47.0	70.4-58.9	71.7-69.6	71.3-59.9
T5-3B	69.5-42.9	68.2-54.9	70.9-69.5	69.2-57.1
T5-large	64.0-36.7	63.6-50.9	65.6-64.7	64.2-54.2
RatSQL	70.8-33.9	70.2-50.7	68.8-62.4	69.9-51.5
T5-base	51.1-22.8	50.0-32.6	56.9-51.8	54.3-40.6

Evaluation of In-context Learning Methods

Model	Average of DB perturbation test sets	Average of NLQ perturbation test sets	Average of SQL perturbation test sets	Average of all test sets
Codex	72.6-60.7	75.3-60.8	74.6-73.1	74.6-64.4
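
Since each cell uses the x-y format, the absolute accuracy drop under perturbation can be read off by parsing the cell. The following is a minimal sketch with hypothetical helper names; the three cells are copied from the "Average of all test sets" column above.

```python
def parse_pre_post(cell):
    """Split an 'x-y' cell into (pre, post) accuracies."""
    pre_text, post_text = cell.split("-", 1)
    return float(pre_text), float(post_text)


def absolute_drop(cell):
    """Pre-perturbation minus post-perturbation accuracy, in points."""
    pre, post = parse_pre_post(cell)
    return round(pre - post, 1)


# "Average of all test sets" cells taken from the leaderboard.
overall = {"Picard": "76.6-65.9", "SmBoP": "75.7-60.8", "Codex": "74.6-64.4"}
for model, cell in overall.items():
    print(model, absolute_drop(cell))
# Picard 10.7
# SmBoP 14.9
# Codex 10.2
```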

Citation and Contact

If you use the dataset in your work, please cite our paper and the Spider paper.

@article{chang2023dr,
  title={Dr. Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness},
  author={Chang, Shuaichen and Wang, Jun and Dong, Mingwen and Pan, Lin and Zhu, Henghui and Li, Alexander Hanbo and Lan, Wuwei and Zhang, Sheng and Jiang, Jiarong and Lilien, Joseph and others},
  journal={arXiv preprint arXiv:2301.08881},
  year={2023}
}

@inproceedings{yu2018spider,
  title={Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
  author={Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages={3911--3921},
  year={2018}
}

Please contact Shuaichen Chang (chang.1692[at]osu.edu) for questions and suggestions.

Acknowledgement

We thank the authors of Spider for allowing us to redistribute the data in the Spider development set.
