add camrest dataset in unified data format #37

Merged
merged 1 commit into from
Mar 15, 2022
69 changes: 53 additions & 16 deletions data/unified_datasets/camrest/README.md
@@ -1,24 +1,61 @@
# README
# Dataset Card for Camrest

## Features
- **Repository:** https://www.repository.cam.ac.uk/handle/1810/260970
- **Paper:** https://aclanthology.org/D16-1233/
- **Leaderboard:** None
- **Who transforms the dataset:** Qi Zhu (zhuq96 at gmail dot com)

- Annotations: dialogue act, character-level span for non-categorical slots.
### Dataset Summary

Statistics:
Cambridge restaurant dialogue domain dataset collected for developing neural-network-based dialogue systems. Two papers were published based on this dataset: (1) A Network-based End-to-End Trainable Task-oriented Dialogue System and (2) Conditional Generation and Snapshot Learning in Neural Dialogue Systems. The dataset was collected with a Wizard-of-Oz setup on Amazon MTurk. Each dialogue contains a goal label and several exchanges between a customer and the system. Each user turn is labelled with a set of slot-value pairs that give a coarse representation of the dialogue state (the `slu` field). There are 676 dialogues in total; most are finished, but some are not.

| | \# dialogues | \# utterances | avg. turns | avg. tokens | \# domains |
| ----- | ------------ | ------------- | ---------- | ----------- | ---------- |
| train | 406 | 2936 | 7.23 | 11.36 | 1 |
| dev | 135 | 941 | 6.97 | 11.99 | 1 |
| test | 135 | 935 | 6.93 | 11.87 | 1 |
- **How to get the transformed data from original data:**
- Run `python preprocess.py` in the current directory. This requires the original data in `../../camrest/`.
- **Main changes of the transformation:**
- Add dialogue act annotation according to the state change. This step was done by ConvLab-2, and we use its processed dialogue acts here.
- Rename `pricerange` to `price range`.
- Add character-level span annotation for non-categorical slots.
- **Annotations:**
- user goal, dialogue acts, state.
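
The slot rename and the character-level span annotation listed above can be sketched as follows. This is a minimal illustration, not the contents of `preprocess.py`: the helper names `rename_slot` and `find_span`, the case-insensitive matching, and the example utterance are all assumptions made for the sketch.

```python
import re

# Hypothetical mapping; only the rename named in this README is included.
SLOT_RENAMES = {"pricerange": "price range"}


def rename_slot(slot):
    """Return the unified slot name, leaving unknown slots unchanged."""
    return SLOT_RENAMES.get(slot, slot)


def find_span(utterance, value):
    """Return (start, end) character offsets of `value` in `utterance`.

    Matching is case-insensitive and `end` is exclusive, so
    utterance[start:end] recovers the matched text. Returns None when
    the value does not occur in the utterance.
    """
    match = re.search(re.escape(value), utterance, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.start(), match.end()


# Illustrative utterance and value, not taken from the dataset.
utterance = "I am looking for an Italian restaurant."
span = find_span(utterance, "italian")
print(rename_slot("pricerange"), span)
```

A value that cannot be located in the utterance yields `None`, which is one plausible reason a span-coverage figure could fall below 100.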

## Main changes
### Supported Tasks and Leaderboards

- The domain is set to **restaurant**.
- Some rare pairs are ignored.
- 3 values are not found in the original utterances.
- **dontcare** values in non-categorical slots are calculated in `evaluate.py`, so `da_match` in evaluation is lower than the actual number.
NLU, DST, Policy, NLG, E2E, User simulator

## Original data
### Languages

The camrest data used in ConvLab-2, included under the `data/` path.
English

### Data Splits

| split | dialogues | utterances | avg_utt | avg_tokens | avg_domains | cat slot match(state) | cat slot match(goal) | cat slot match(dialogue act) | non-cat slot span(dialogue act) |
| ---------- | --------- | ---------- | ------- | ---------- | ----------- | --------------------- | -------------------- | ---------------------------- | ------------------------------- |
| train | 406 | 3342 | 8.23 | 10.6 | 1 | 100 | 100 | 100 | 99.83 |
| validation | 135 | 1076 | 7.97 | 11.26 | 1 | 100 | 100 | 100 | 100 |
| test | 135 | 1070 | 7.93 | 11.01 | 1 | 100 | 100 | 100 | 100 |
| all | 676 | 5488 | 8.12 | 10.81 | 1 | 100 | 100 | 100 | 99.9 |
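
As a quick consistency check on the table above, the sketch below assumes that `avg_utt` is the number of utterances divided by the number of dialogues (the README does not define the column) and that the `all` row is the sum of the three splits. The numbers are copied from the table.

```python
# (dialogues, utterances, reported avg_utt) copied from the table above.
splits = {
    "train": (406, 3342, 8.23),
    "validation": (135, 1076, 7.97),
    "test": (135, 1070, 7.93),
}
all_row = (676, 5488, 8.12)

# Assumption: avg_utt = utterances / dialogues, rounded to two decimals.
computed = {name: round(u / d, 2) for name, (d, u, _) in splits.items()}

total_dialogues = sum(d for d, _, _ in splits.values())
total_utterances = sum(u for _, u, _ in splits.values())
computed_all = round(total_utterances / total_dialogues, 2)

for name, (_, _, reported) in splits.items():
    print(name, computed[name], reported)
print("all", total_dialogues, total_utterances, computed_all)
```

Under that assumption the computed averages agree with the reported ones for every row.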

1 domain: ['restaurant']
- **cat slot match**: how many values of categorical slots are in the possible values of ontology in percentage.
- **non-cat slot span**: how many values of non-categorical slots have span annotation in percentage.
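
The two percentages defined above could be computed along these lines. This is a sketch under assumed data shapes, not the project's own statistics code: the function names, the `(slot, value)` tuples, the `possible_values` dictionary, and the `start`/`end` keys are hypothetical.

```python
def percentage(hits, total):
    """Return hits/total as a percentage rounded to two decimals (0.0 if empty)."""
    return round(100.0 * hits / total, 2) if total else 0.0


def cat_slot_match(slot_values, possible_values):
    """Share of categorical (slot, value) pairs whose value is in the ontology.

    slot_values: list of (slot, value) tuples.
    possible_values: dict mapping slot name to a set of allowed values.
    """
    hits = sum(
        1 for slot, value in slot_values
        if value in possible_values.get(slot, set())
    )
    return percentage(hits, len(slot_values))


def non_cat_slot_span(annotations):
    """Share of non-categorical slot values that carry a span annotation.

    annotations: list of dicts; a value counts as annotated when the dict
    has both a "start" and an "end" key (an assumed representation).
    """
    hits = sum(1 for a in annotations if "start" in a and "end" in a)
    return percentage(hits, len(annotations))


# Illustrative inputs, not taken from the dataset.
ontology = {"area": {"north", "south", "east", "west", "centre"}}
print(cat_slot_match([("area", "north"), ("area", "moon")], ontology))
print(non_cat_slot_span([
    {"value": "italian", "start": 20, "end": 27},
    {"value": "thai"},
]))
```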

### Citation

```
@inproceedings{wen-etal-2016-conditional,
title = "Conditional Generation and Snapshot Learning in Neural Dialogue Systems",
author = "Wen, Tsung-Hsien and Ga{\v{s}}i{\'c}, Milica and Mrk{\v{s}}i{\'c}, Nikola and Rojas-Barahona, Lina M. and Su, Pei-Hao and Ultes, Stefan and Vandyke, David and Young, Steve",
booktitle = "Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2016",
address = "Austin, Texas",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D16-1233",
doi = "10.18653/v1/D16-1233",
pages = "2153--2162",
}
```

### Licensing Information

[**CC BY 4.0**](https://creativecommons.org/licenses/by/4.0/)
Binary file modified data/unified_datasets/camrest/data.zip
Binary file not shown.