Add open book corpus #856

Merged
merged 4 commits into huggingface:master on Nov 17, 2020

Conversation

vblagoje
Contributor

@vblagoje vblagoje commented Nov 16, 2020

Adds a book corpus based on Shawn Presser's work. @richarddwang, the author of the original BookCorpus dataset, suggested it should be named OpenBookCorpus. I named it BookCorpusOpen so that it is easy to locate alphabetically. But, of course, we can rename it if needed.

It contains 17,868 dataset items; each item contains two fields: title and text. The title is the name of the book (just the file name), while the text contains the unprocessed book text. Note that bookcorpus is pre-segmented into sentences, while this corpus is not. This is intentional (see #486), as some users might want to further process the text themselves.
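
A minimal usage sketch, assuming the loader keeps the bookcorpusopen name and the plain_text config used in this PR (treat both as assumptions until merged):

from datasets import load_dataset

# Load the corpus added in this PR and inspect the two fields described above.
books = load_dataset("bookcorpusopen", "plain_text", split="train")
print(books[0]["title"])        # file name of the book
print(books[0]["text"][:200])   # raw, unsegmented book text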

@lhoestq and others please review this PR thoroughly. cc @shawwn

Member

@lhoestq lhoestq left a comment

That's awesome, thanks!
It looks like the test doesn't pass for the dummy data. Could you try to fix that? Maybe the glob call is not able to find the .epub.txt dummy file?

Two review threads on datasets/bookcorpusopen/bookcorpusopen.py (outdated, resolved)
@vblagoje
Contributor Author

@lhoestq I fixed the issues except for the dummy_data zip file. But I think I know why it is happening. When dummy_data.zip is unzipped, it gets saved in a /tmp directory where glob doesn't pick it up. For regular downloads, the archive gets unzipped in ~/.cache/huggingface. Could that be the reason?

@lhoestq
Member

lhoestq commented Nov 16, 2020

Nice thanks :)

When testing with the dummy data, the download_manager.download_and_extract() call returns the path to the unzipped dummy_data.zip archive. Therefore glob should be able to find your dummy .epub.txt file.
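
For reference, a simplified standalone sketch of the globbing pattern being discussed (the actual _generate_examples in the script differs; the title/text fields follow the PR description above):

import glob
import os

def generate_examples(directory):
    # `directory` is the path returned by download_manager.download_and_extract(),
    # i.e. the root of the extracted archive (or of dummy_data.zip during tests).
    glob_target = os.path.join(directory, "**/*.epub.txt")
    for idx, file_path in enumerate(sorted(glob.glob(glob_target, recursive=True))):
        with open(file_path, encoding="utf-8") as f:
            yield idx, {"title": os.path.basename(file_path), "text": f.read()}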

@vblagoje
Contributor Author

@lhoestq I understand, but for some reason it is not happening. I added logs to see where dummy_data.zip gets unzipped in /tmp, but I suppose that tmp directory is gone by the time the test process finishes. I also tried to glob for anything in _generate_examples from that directory using /* instead of **/*.epub.txt, and nothing is returned. Always an empty array.

@lhoestq
Member

lhoestq commented Nov 16, 2020

Ok, weird! I can take a look tomorrow if you want.

@vblagoje
Contributor Author

Please do, I will take a fresh look as well.

@vblagoje
Contributor Author

In _generate_examples I wrote the following:

glob_target = os.path.join(directory, "**/*.epub.txt")
print(f"Glob target {glob_target}")

And here is the test failure:

========================================================================================== FAILURES ===========================================================================================
________________________________________________________________ LocalDatasetTest.test_load_dataset_all_configs_bookcorpusopen ________________________________________________________________

self = <tests.test_dataset_common.LocalDatasetTest testMethod=test_load_dataset_all_configs_bookcorpusopen>, dataset_name = 'bookcorpusopen'

@slow
def test_load_dataset_all_configs(self, dataset_name):
    configs = self.dataset_tester.load_all_configs(dataset_name, is_local=True)
>   self.dataset_tester.check_load_dataset(dataset_name, configs, is_local=True)

tests/test_dataset_common.py:232:

tests/test_dataset_common.py:193: in check_load_dataset
    self.parent.assertTrue(len(dataset[split]) > 0)
E   AssertionError: False is not true
------------------------------------------------------------------------------------ Captured stdout call -------------------------------------------------------------------------------------
Downloading and preparing dataset book_corpus_open/plain_text (download: 1.00 MiB, generated: 1.00 MiB, post-processed: Unknown size, total: 2.00 MiB) to /var/folders/y_/6k6zhblx0k9dsdz5nd_z9x5c0000gp/T/tmpmuu0_ln2/book_corpus_open/plain_text/1.0.0...
Glob target /var/folders/y_/6k6zhblx0k9dsdz5nd_z9x5c0000gp/T/tmpm6tpvb3f/extracted/d953b414cceb4fe3985eeaf68aec2f4435f166b2edf66863d805e3825b7d336b/dummy_data/**/*.epub.txt
Dataset book_corpus_open downloaded and prepared to /var/folders/y_/6k6zhblx0k9dsdz5nd_z9x5c0000gp/T/tmpmuu0_ln2/book_corpus_open/plain_text/1.0.0. Subsequent calls will reuse this data.
------------------------------------------------------------------------------------ Captured stderr call -------------------------------------------------------------------------------------

@vblagoje
Contributor Author

And when I do os.listdir on the given directory I get:

glob_target = os.path.join(directory, "**/*.epub.txt")
print(f"Glob target {glob_target}")
print(os.listdir(path=directory))

E FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/y_/6k6zhblx0k9dsdz5nd_z9x5c0000gp/T/tmpbu_aom5q/extracted/d953b414cceb4fe3985eeaf68aec2f4435f166b2edf66863d805e3825b7d336b/dummy_data'

@lhoestq
Member

lhoestq commented Nov 17, 2020

Thanks for the info, I'm looking at it right now

@lhoestq
Member

lhoestq commented Nov 17, 2020

Ok, found the issue!

The dummy_data.zip file must be an archive of a folder named dummy_data. Currently, the dummy_data.zip is an archive of a folder named book1. In order to have a valid dummy_data.zip file, you must first take the dummy book1 folder, place it inside a folder named dummy_data, and then compress the dummy_data folder to get dummy_data.zip.
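A quick sketch of one way to produce an archive with that layout (the staging path is illustrative; any zip tool that keeps dummy_data as the top-level folder works just as well):

import shutil
from pathlib import Path

# Put the dummy book1 folder inside a top-level dummy_data folder...
staging = Path("staging")
(staging / "dummy_data").mkdir(parents=True, exist_ok=True)
shutil.copytree("book1", staging / "dummy_data" / "book1", dirs_exist_ok=True)

# ...then compress the dummy_data folder itself, so the archive root is dummy_data/.
shutil.make_archive("dummy_data", "zip", root_dir=staging, base_dir="dummy_data")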

@vblagoje
Contributor Author

Excellent, I am on it @lhoestq

Member

@lhoestq lhoestq left a comment

Awesome thank you so much for adding it :)

@vblagoje
Contributor Author

> Awesome thank you so much for adding it :)

You're welcome. All tests are green now! I needed it asap as well. Thanks for your help, @lhoestq.

@lhoestq lhoestq merged commit b15d448 into huggingface:master Nov 17, 2020
@shawwn

shawwn commented Nov 17, 2020

I just wanted to say thank you to everyone involved in making this happen! I was certain that I would have to add bookcorpusnew myself, but then @vblagoje came along and did it, and @lhoestq gave some great support in a timely fashion.

By the way @vblagoje, are you on Twitter? I'm https://twitter.com/theshawwn if you'd like to DM and say hello. Once again, thanks for doing this!

I'll mention over at soskek/bookcorpus#27 that this was merged.

@vblagoje
Contributor Author

Thank you Shawn. You did all the heavy lifting ;-)

@shawwn

shawwn commented Nov 17, 2020

@vblagoje Would you be interested in adding books3 as well? https://twitter.com/theshawwn/status/1320282149329784833

Huggingface is interested and asked me to add it, but I had a bit of trouble during setup (#790) and never got around to it. At this point you have much more experience than I do with the datasets lib.

It seems like it might simply be a matter of copy-pasting this PR, changing books1 to books3, and possibly trimming off the leading paths (each book is at e.g. the-eye/Books/Bibliotok/J/Jurassic Park.epub.txt, which is rather lengthy compared to just the file name), but the full path is probably fine, so feel free to do the least amount of work that gets the job done. Otherwise I suppose I'll get around to it eventually; thanks again!
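
For what it's worth, trimming a leading path down to just the file name is a one-liner (the example path is the one mentioned above):

import os

full_path = "the-eye/Books/Bibliotok/J/Jurassic Park.epub.txt"
print(os.path.basename(full_path))  # prints: Jurassic Park.epub.txt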

@vblagoje
Contributor Author

@shawwn I'll take a look as soon as I clear my work queue. TBH, I would likely work on making sure HF datasets has all the datasets used to train https://github.com/alexa/bort/, and these are: Wikipedia, Wiktionary, OpenWebText (Gokaslan and Cohen, 2019), UrbanDictionary, One Billion Words (Chelba et al., 2014), the news subset of Common Crawl (Nagel, 2016), and BookCorpus. cc @lhoestq

@snarb

snarb commented Apr 6, 2023

@shawwn is your books3 corpus part of any dataset now?

@shawwn

shawwn commented Apr 6, 2023

@snarb Books3 has been used in LLaMA (https://twitter.com/theshawwn/status/1643987377516580870) and in BloombergGPT (https://twitter.com/theshawwn/status/1641938293209047041). I don't know whether it's in a HuggingFace dataset yet, but you can access it via the original announcement tweet here: https://twitter.com/theshawwn/status/1320282149329784833

If you'd like to make it a huggingface dataset, I'd be grateful! I'm not sure what the process is.

LLaMA also noted that they deduplicated the books in books3, so it might be worth running some sort of dedup pass on it.

@lhoestq
Member

lhoestq commented Apr 7, 2023

It's available here already :) https://huggingface.co/datasets/the_pile_books3

@snarb

snarb commented Apr 7, 2023

@shawwn how are pictures and tables handled in such datasets? For example, in IQ tests or geometry it is hard to imagine understanding the topic without images. I want to create a dataset with a limited vocabulary, to make training an LLM possible without big money, but still get a model that is able to reason and formulate thoughts well. I am trying to use books for children, school educational resources, and the simplified wiki. Maybe you can suggest some good data sources based on your experience?

@ashmalvayani

@lhoestq the link you shared for the Hugging Face the_pile_books3 dataset doesn't have the data anymore. Can you please provide an alternate link for downloading the dataset?

@lhoestq
Member

lhoestq commented Jan 4, 2024

This dataset was taken down a few months ago for copyright infringement and is no longer accessible.

You may look into other books datasets, like Project Gutenberg (e.g. https://huggingface.co/datasets/sedthh/gutenberg_english).
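
A minimal loading sketch for the suggested alternative (the split name and column layout of sedthh/gutenberg_english are assumptions; check the dataset card):

from datasets import load_dataset

# Load the English Project Gutenberg dataset from the Hub and inspect its columns.
gutenberg = load_dataset("sedthh/gutenberg_english", split="train")
print(gutenberg.column_names)
print(gutenberg[0])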
