Molweni: A Challenge Multiparty Dialogues-based Machine Reading Comprehension Dataset with Discourse Structure
We present the Molweni dataset, a machine reading comprehension (MRC) dataset built over multiparty dialogues. Molweni also uniquely contributes discourse dependency annotations for its multiparty dialogues. The following shows a multiparty dialogue with four speakers and seven utterances. Because the dialogues are drawn from real chat logs, utterances in our corpus may contain grammatical mistakes. There are two versions of the Molweni dataset, used respectively for machine reading comprehension (MRC) and discourse parsing (DP). Both versions come from the same 10,000 dialogues, but the data partitions differ.
Speaker | Utterance |
---|---|
nbx909 | how do i find the address of a usb device ? |
Likwidoxigen | try taking it out to dinner and do a little wine and dine and it shoudl tell ya |
Likwidoxigen | what sort of device ? |
babo | ca n’t i just copy over the os and leave the data files untouched ? |
nbx909 | only if you do an upgrade |
Nuked | should i just restart x after installing |
Likwidoxigen | i ’d do a full restart so that it re-loads the modules |
For machine reading comprehension (MRC), our corpus contains two types of questions: answerable and unanswerable. Most questions begin with Why, What, Who, Where, When, or How; only a small portion begin with other words such as Do, Which, or Whose. Example questions over the dialogue above (a scoring sketch follows the examples):
- Q1: Why does Likwidoxigen do a full restart?
- A1: it re-loads the modules
- Q2: What does nbx909 want to do?
- A2: find the address of a usb device
- Q3: How to restart network?
- A3: NA.
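Scoring follows the SQuAD 2.0 convention: for an unanswerable question (such as Q3 above), a system is credited only when it predicts no answer. Below is a minimal sketch of token-level F1 in that style; the function is illustrative and is not the official evaluation script.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 in the SQuAD 2.0 style.

    Unanswerable questions are represented by an empty gold string;
    the prediction is correct only if it is also empty.
    """
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    if not pred_tokens or not gold_tokens:
        # For NA questions, both sides must be empty to score 1.0.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Q1 above: an exact span match scores 1.0
print(token_f1("it re-loads the modules", "it re-loads the modules"))
# Q3 above is unanswerable, so the gold answer is empty
print(token_f1("", ""))
```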
Our dataset derives from the Ubuntu Chat Corpus (Lowe et al., 2015), a large-scale multiparty dialogue corpus. We keep dialogues with 8-15 utterances and 2-9 speakers, and, to simplify the task, we discard dialogues containing long sentences (more than 20 words). We select 10,000 dialogues with 88,303 utterances from the Ubuntu dataset, annotated with 78,245 discourse relations and 30,066 questions for machine reading comprehension. The average number of speakers per dialogue is 3.51, and the average and maximum dialogue lengths are 8.82 and 14 utterances, respectively. There are 25,779 answerable questions and 4,287 unanswerable questions.
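The selection criteria above can be summarized as a small predicate. This is only our reading of the filtering rules, not the released preprocessing code:

```python
def keep_dialogue(utterances, speakers):
    """Keep a dialogue only if it has 8-15 utterances, 2-9 distinct
    speakers, and no utterance longer than 20 words."""
    if not (8 <= len(utterances) <= 15):
        return False
    if not (2 <= len(set(speakers)) <= 9):
        return False
    return all(len(u.split()) <= 20 for u in utterances)
```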
Metric | Number |
---|---|
Avg./Max. of speakers per dialogue | 3.51 / 9 |
Avg./Max. question length (in tokens) | 5.91 / 19 |
Avg./Max. answer length (in tokens) | 4.08 / 19 |
Avg./Max. dialogue length (in tokens) | 104.4 / 208 |
Avg./Max. dialogue length (in utterances) | 8.82 / 14 |
Avg./Max. utterance length (in tokens) | 10.8 / 19 |
Vocabulary size | 24,615 |
Answerable questions | 25,779 |
Unanswerable questions | 4,287 |
- Latest release: v1.0
The overview of Molweni for MRC (with discourse structure):
Dataset | Dialogues | Utterances | Questions |
---|---|---|---|
TRAIN | 8,771 | 77,374 | 24,682 |
DEV | 883 | 7,823 | 2,513 |
TEST | 100 | 845 | 2,871 |
The overview of Molweni for discourse parsing:
Dataset | Dialogues | Utterances | Relations |
---|---|---|---|
TRAIN | 9,000 | 79,487 | 70,454 |
DEV | 500 | 4,386 | 3,880 |
TEST | 500 | 4,430 | 3,911 |
- Molweni: A Challenge Multiparty Dialogues-based Machine Reading Comprehension Dataset with Discourse Structure. Jiaqi Li, Ming Liu, Min-Yen Kan, Zihao Zheng, Zekun Wang, Wenqiang Lei, Ting Liu, Bing Qin. Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).
-
@inproceedings{li-etal-2020-molweni,
title = "Molweni: A Challenge Multiparty Dialogues-based Machine Reading Comprehension Dataset with Discourse Structure",
author = "Li, Jiaqi and
Liu, Ming and
Kan, Min-Yen and
Zheng, Zihao and
Wang, Zekun and
Lei, Wenqiang and
Liu, Ting and
Qin, Bing",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.coling-main.238",
pages = "2642--2652",
abstract = "Research into the area of multiparty dialog has grown considerably over recent years. We present the Molweni dataset, a machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni{'}s source samples from the Ubuntu Chat Corpus, including 10,000 dialogs comprising 88,303 utterances. We annotate 30,066 questions on this corpus, including both answerable and unanswerable questions. Molweni also uniquely contributes discourse dependency annotations in a modified Segmented Discourse Representation Theory (SDRT; Asher et al., 2016) style for all of its multiparty dialogs, contributing large-scale (78,245 annotated discourse relations) data to bear on the task of multiparty dialog discourse parsing. Our experiments show that Molweni is a challenging dataset for current MRC models: BERT-wwm, a current, strong SQuAD 2.0 performer, achieves only 67.7{%} F1 on Molweni{'}s questions, a 20+{%} significant drop as compared against its SQuAD 2.0 performance.",
}
All the data are stored in the "data" field, which has two keys. The first is "title", which indicates the split (train, dev, or test). The second is "dialogues", which is a list containing each dialogue's information.
{
"data": {
"title": "test",
"dialogues": [
{
"edus": [
{
"text": "that i use between linux and windows",
"speaker": "JuJuBee_"
},
{
"text": "did you mount it with fstab ? give us a pastebin of the fstab that is probably it eh.EMOJI",
"speaker": "nit-wit"
},
{
"text": "it 's treated as a mount mask",
"speaker": "ikonia"
}
],
"context": "jujubee_: that i use between linux and windows nit-wit: did you mount it with fstab ? give us a pastebin of the fstab that is probably it eh.emoji ikonia: it 's treated as a mount mask",
"qas": [
{
"question": "Where does JuJuBee_ use ?",
"id": "3f45971d1b432a674e1098bb80ebe70c",
"answers": [
{
"text": "between linux and windows",
"answer_start": 21
}
],
"is_impossible": false
}
],
"relations": [
{
"y": 1,
"x": 0,
"type": "Clarification_question"
},
{
"y": 2,
"x": 0,
"type": "Comment"
}
]
}
]
}
}
The keys of each entry in the "dialogues" list are as follows. The "edus" field is a list of utterances, each containing the speaker and the text. The "context" field is the dialogue as a single string. The "qas" field contains the questions and their answers. The "relations" field is a list of discourse relations between pairs of utterances, each labeled with its relation type.
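Given this schema, a split file can be read with the standard library alone. The file name below is a placeholder for whichever split you downloaded; everything else follows the fields shown above. A minimal sketch:

```python
import json

# Placeholder path; use the actual train/dev/test file from the release.
with open("test.json", encoding="utf-8") as f:
    data = json.load(f)["data"]

print("split:", data["title"])
for dialogue in data["dialogues"]:
    context = dialogue["context"]
    # Utterances with their speakers.
    for edu in dialogue["edus"]:
        print(edu["speaker"], ":", edu["text"])
    # MRC questions; answer spans index into the context string.
    for qa in dialogue["qas"]:
        if qa["is_impossible"]:
            print(qa["question"], "-> NA")
        else:
            ans = qa["answers"][0]
            start = ans["answer_start"]
            span = context[start:start + len(ans["text"])]
            print(qa["question"], "->", span)
    # Discourse dependencies: x and y are utterance indices into "edus".
    for rel in dialogue["relations"]:
        print(rel["x"], "->", rel["y"], rel["type"])
```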
Thanks to Prof. Min-Yen Kan and Dr. Wenqiang Lei of the National University of Singapore. Thanks to Zihao Zheng, Zekun Wang, Yibo Sun, Tianwen Jiang, Daxing Zhang, Rongtian Bian, Heng Zhang, Zhenyu Hu and other students of the Harbin Institute of Technology for their support in annotating the dataset. Thanks to Wenpeng Hu, a Ph.D. student at Peking University, for providing preprocessed Ubuntu data.