This repository contains the three WikiReading datasets as used and described in *WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia* (Hewlett et al., ACL 2016), which introduced the English WikiReading dataset, and *Byte-level Machine Reading across Morphologically Varied Languages* (Kenter et al., AAAI-18), which introduced the Turkish and Russian datasets.
Run `get_data.sh` to download the English WikiReading dataset, and run `get_ru_data.sh` and `get_tr_data.sh` to download the Russian and Turkish versions of the WikiReading data, respectively.
If you use the data or the results reported in the papers, please cite them:
@inproceedings{hewlett2016wikireading,
title = {{WIKIREADING}: A Novel Large-scale Language Understanding Task over {Wikipedia}},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016)},
author = {Daniel Hewlett and Alexandre Lacoste and Llion Jones and Illia Polosukhin and Andrew Fandrianto and Jay Han and Matthew Kelcey and David Berthelot},
year = {2016}
}
and
@inproceedings{byte-level2018kenter,
title={Byte-level Machine Reading across Morphologically Varied Languages},
author={Tom Kenter and Llion Jones and Daniel Hewlett},
booktitle={Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)},
year={2018}
}
The train, validation, and test datasets are available in both TFRecord and streamed JSON (one JSON object per line) formats; they are 45GB, 5GB, and 3GB respectively. Each dataset is split into shards to help speed up training and testing by parallelizing reads; for example, test.tar.gz contains 15 files whose union is the whole test set. Any shard can be opened with a TFRecordReader, or line by line with your favorite JSON reader. If disk space is limited, you can download a single sample TFRecord or JSON shard of the validation set (1/15th of it) to play around with.
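For instance, a shard can be inspected as below (a minimal sketch, assuming TensorFlow 2.x and already-extracted archives; the shard filenames are hypothetical placeholders, and the feature names are those listed in the tables further down):

```python
import json

import tensorflow as tf

# Inspect a JSON shard: one JSON object per line.
# "validation.json-00000-of-00015" is a placeholder filename.
with open("validation.json-00000-of-00015") as f:
    first = json.loads(next(f))
    print(first["question_string_sequence"])

# Inspect a TFRecord shard: each record is a serialized tf.train.Example.
# "validation-00000-of-00015" is a placeholder filename.
for record in tf.data.TFRecordDataset("validation-00000-of-00015").take(1):
    example = tf.train.Example.FromString(record.numpy())
    print(sorted(example.features.feature.keys()))
```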
file | size | description |
---|---|---|
train | 16,039,400 examples | [TFRecord](https://storage.googleapis.com/wikireading/train.tar.gz), [JSON](https://storage.googleapis.com/wikireading/train.json.tar.gz) |
validation | 1,886,798 examples | [TFRecord](https://storage.googleapis.com/wikireading/validation.tar.gz), [JSON](https://storage.googleapis.com/wikireading/validation.json.tar.gz) |
test | 941,280 examples | [TFRecord](https://storage.googleapis.com/wikireading/test.tar.gz), [JSON](https://storage.googleapis.com/wikireading/test.json.tar.gz) |
document.vocab | 176,978 tokens | vocabulary for tokens from Wikipedia documents |
answer.vocab | 10,876 tokens | vocabulary for tokens from answers |
raw_answer.vocab | 1,359,244 tokens | vocabulary for whole answers as they appear in Wikidata |
type.vocab | 80 tokens | vocabulary for part-of-speech tags |
character.vocab | 12,486 tokens | vocabulary for all characters that appear in the string sequences |
The Russian WikiReading dataset:

file | size | description |
---|---|---|
train | 4,259,667 examples | [TFRecord](https://storage.googleapis.com/wikireading/ru/train.tar.gz), [JSON](https://storage.googleapis.com/wikireading/ru/train.json.tar.gz) |
validation | 531,412 examples | [TFRecord](https://storage.googleapis.com/wikireading/ru/valid.tar.gz), [JSON](https://storage.googleapis.com/wikireading/ru/valid.json.tar.gz) |
test | 533,026 examples | [TFRecord](https://storage.googleapis.com/wikireading/ru/test.tar.gz), [JSON](https://storage.googleapis.com/wikireading/ru/test.json.tar.gz) |
document.vocab | 965,157 tokens | vocabulary for tokens from Wikipedia documents |
answer.vocab | 57,952 tokens | vocabulary for tokens from answers |
type.vocab | 56 tokens | vocabulary for part-of-speech tags |
character.vocab | 12,205 tokens | vocabulary for all characters that appear in the string sequences |
The Turkish WikiReading dataset:

file | size | description |
---|---|---|
train | 654,705 examples | [TFRecord](https://storage.googleapis.com/wikireading/tr/train.tar.gz), [JSON](https://storage.googleapis.com/wikireading/tr/train.json.tar.gz) |
validation | 81,622 examples | [TFRecord](https://storage.googleapis.com/wikireading/tr/valid.tar.gz), [JSON](https://storage.googleapis.com/wikireading/tr/valid.json.tar.gz) |
test | 82,643 examples | [TFRecord](https://storage.googleapis.com/wikireading/tr/test.tar.gz), [JSON](https://storage.googleapis.com/wikireading/tr/test.json.tar.gz) |
document.vocab | 215,294 tokens | vocabulary for tokens from Wikipedia documents |
answer.vocab | 11,123 tokens | vocabulary for tokens from answers |
type.vocab | 10 tokens | vocabulary for part-of-speech tags |
character.vocab | 6,638 tokens | vocabulary for all characters that appear in the string sequences |
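The `*.vocab` files above pair sequence IDs with tokens. Here is a minimal loader sketch, assuming the common one-token-per-line layout in which the line number is the ID; that layout is an assumption, not documented here, so verify it against the downloaded files:

```python
def load_vocab(path):
    """Map integer IDs to tokens, assuming one token per line (ID = line number)."""
    with open(path, encoding="utf-8") as f:
        return {i: line.rstrip("\n") for i, line in enumerate(f)}

id_to_token = load_vocab("document.vocab")
token_to_id = {tok: i for i, tok in id_to_token.items()}
```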
Each instance contains the following features (some features may be empty).

feature name | description |
---|---|
`answer_breaks` | Indices into `answer_ids` and `answer_string_sequence`, used to delimit multiple answers to a question, e.g. a list answer. |
`answer_ids` | `answer.vocab` ID sequence for the words in the answer. |
`answer_location` | Word indices into the document where any one token of the answer was found. |
`answer_sequence` | `document.vocab` ID sequence for the words in the answer. |
`answer_string_sequence` | String sequence for the words in the answer. |
`break_levels` | One integer in [0, 4] indicating a break level for each word in the document: 0 = no separation between tokens, 1 = tokens separated by space, 2 = tokens separated by line break, 3 = tokens separated by sentence break, 4 = tokens separated by paragraph break. (See the sketch after this table.) |
`document_sequence` | `document.vocab` ID sequence for the words in the document. |
`full_match_answer_location` | Word indices into the document where all contiguous tokens of the answer were found. |
`paragraph_breaks` | Word indices into the document indicating a paragraph boundary. |
`question_sequence` | `document.vocab` ID sequence for the words in the question. |
`question_string_sequence` | String sequence for the words in the question. |
`raw_answer_ids` | `raw_answer.vocab` ID for the answer. |
`raw_answers` | A string containing the raw answer. |
`sentence_breaks` | Word indices into the document indicating a sentence boundary. |
`string_sequence` | String sequence for the words in the document; see `character.vocab` for character IDs. |
`type_sequence` | `type.vocab` ID sequence for tags (POS, type, etc.) in the document. |
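To illustrate how these features fit together, here is a sketch that rebuilds document text from `string_sequence` and `break_levels`, and splits `answer_string_sequence` into individual answers at `answer_breaks`. The helper names are hypothetical; it assumes `break_levels[i]` gives the separator before token `i` and that each `answer_breaks` entry is the start index of a new answer, so check both assumptions against real examples:

```python
# Separator for each break level, per the table above.
SEPARATORS = {0: "", 1: " ", 2: "\n", 3: " ", 4: "\n\n"}

def detokenize(tokens, break_levels):
    """Rebuild running text, assuming break_levels[i] is the break before tokens[i]."""
    pieces = []
    for token, level in zip(tokens, break_levels):
        pieces.append(SEPARATORS[level])
        pieces.append(token)
    return "".join(pieces).strip()

def split_answers(answer_strings, answer_breaks):
    """Split the flat answer token list into one list per answer,
    assuming each break index is the start of a new answer."""
    bounds = [0] + list(answer_breaks) + [len(answer_strings)]
    return [answer_strings[a:b] for a, b in zip(bounds, bounds[1:])]

# Hand-made toy record in the shape described above, for illustration only.
example = {
    "string_sequence": ["Paris", "is", "the", "capital", "of", "France", "."],
    "break_levels": [0, 1, 1, 1, 1, 1, 0],
    "answer_string_sequence": ["Paris", "France"],
    "answer_breaks": [1],
}

print(detokenize(example["string_sequence"], example["break_levels"]))
# -> Paris is the capital of France.
print(split_answers(example["answer_string_sequence"], example["answer_breaks"]))
# -> [['Paris'], ['France']]
```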