HAGRID (Human-in-the-loop Attributable Generative Retrieval for Information-seeking Dataset) is a dataset for generative information-seeking scenarios. It is constructed on top of MIRACL 🌍🙌🌏, an information retrieval dataset that consists of queries along with a set of manually labelled relevant passages (quotes).
We collect attributed explanations for each question by prompting GPT-3.5 with the given relevant passages. The explanations follow an in-context citation style, similar to scientific articles, in which citations reference the supporting quotes. We then ask human annotators to judge the explanations on two criteria:
- Informativeness: whether they provide a direct answer to the question.
- Attributability: whether they are attributable to the source passages.
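To make the citation style concrete, here is a minimal sketch of how citation markers in an explanation could be mapped back to their supporting quotes. It assumes markers are bracketed integers such as `[1]`, numbered from 1 in quote order; that marker format, the helper name `cited_quotes`, and the sample record are illustrative assumptions, not part of the HAGRID release.

```python
import re

# Assumption: citations appear as bracketed integers such as "[1]" or "[2]",
# numbered from 1 in the order the quotes are listed.
CITATION = re.compile(r"\[(\d+)\]")


def cited_quotes(explanation, quotes):
    """Return (number, quote) pairs for each distinct marker, in order of first use."""
    seen = []
    for match in CITATION.finditer(explanation):
        number = int(match.group(1))
        # Skip markers that point outside the quote list and avoid duplicates.
        if 1 <= number <= len(quotes) and number not in seen:
            seen.append(number)
    return [(number, quotes[number - 1]) for number in seen]


# Made-up record for illustration only; not taken from HAGRID.
quotes = [
    "The Nile flows north through eleven countries.",
    "The Nile is about 6,650 km long.",
]
explanation = "The Nile is roughly 6,650 km long [2] and flows north [1][2]."

for number, text in cited_quotes(explanation, quotes):
    print(number, text)
```

An explanation with no markers yields an empty list, which may be a quick way to flag candidates that cannot be checked for attributability at all.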
HAGRID is hosted on Hugging Face 🤗 as `miracl/hagrid` and can be loaded with the `datasets` library:
```python
import datasets

hagrid = datasets.load_dataset("miracl/hagrid", split="train")
print(hagrid[0])
```
| Split | #Q | #A | #Informativeness | #Attributability |
|---|---|---|---|---|
| Train | 1,922 | 3,214 | 3,214 | 754 |
| Dev | 716 | 1,318 | 1,157 | 826 |
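The counts in the table can be turned into a few derived ratios. The sketch below assumes that `#Q` and `#A` count questions and answers, and that the last two columns count answers that received each kind of human judgment; that reading of the columns and the helper name `summarize` are assumptions, not something this README states.

```python
# Counts copied from the statistics table in this README.
stats = {
    "Train": {"questions": 1922, "answers": 3214,
              "informativeness": 3214, "attributability": 754},
    "Dev": {"questions": 716, "answers": 1318,
            "informativeness": 1157, "attributability": 826},
}


def summarize(row):
    """Compute answers per question and the share of answers with each judgment."""
    return {
        "answers_per_question": row["answers"] / row["questions"],
        "informativeness_coverage": row["informativeness"] / row["answers"],
        "attributability_coverage": row["attributability"] / row["answers"],
    }


for split, row in stats.items():
    summary = summarize(row)
    print(split, {name: round(value, 3) for name, value in summary.items()})
```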
We are planning to release baseline models soon! Stay tuned!
If you have any questions, feel free to email us (project.miracl [at] gmail.com) or open a GitHub issue in this repository.
This work is licensed under the Apache License 2.0. See LICENSE for details.
If you find this dataset and repository helpful, please cite HAGRID as follows:
```bibtex
@article{hagrid,
  title={{HAGRID}: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution},
  author={Ehsan Kamalloo and Aref Jafari and Xinyu Zhang and Nandan Thakur and Jimmy Lin},
  year={2023},
  journal={arXiv:2307.16883},
}
```