-
Notifications
You must be signed in to change notification settings - Fork 192
Load your custom dataset
Nandan Thakur edited this page Jun 29, 2022
·
1 revision
You can load a custom preprocessed dataset in the following way:
from beir.datasets.data_loader import GenericDataLoader
corpus_path = "your_corpus_file.jsonl"
query_path = "your_query_file.jsonl"
qrels_path = "your_qrels_file.tsv"
corpus, queries, qrels = GenericDataLoader(
corpus_file=corpus_path,
query_file=query_path,
qrels_file=qrels_path).load_custom()
Make sure that the dataset is in the following format:
- corpus file: a .jsonl file (jsonlines) that contains a list of dictionaries, each with three fields
_id
with unique document identifier,title
with document title (optional) andtext
with document paragraph or passage. For example:{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
- queries file: a .jsonl file (jsonlines) that contains a list of dictionaries, each with two fields
_id
with unique query identifier andtext
with query text. For example:{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
- qrels file: a .tsv file (tab-seperated) that contains three columns, i.e. the query-id, corpus-id and score in this order. Keep 1st row as header. For example:
q1 doc1 1
You can also skip the dataset loading part and provide directly corpus, queries and qrels in the following way:
corpus = {
"doc1" : {
"title": "Albert Einstein",
"text": "Albert Einstein was a German-born theoretical physicist. who developed the theory of relativity, \
one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \
its influence on the philosophy of science. He is best known to the general public for his mass–energy \
equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \
Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \
of the photoelectric effect', a pivotal step in the development of quantum theory."
},
"doc2" : {
"title": "", # Keep title an empty string if not present
"text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \
malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made\
with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
},
}
queries = {
"q1" : "Who developed the mass-energy equivalence formula?",
"q2" : "Which beer is brewed with a large proportion of wheat?"
}
qrels = {
"q1" : {"doc1": 1},
"q2" : {"doc2": 1},
}