Add open book corpus #856

Merged
merged 4 commits into huggingface:master on Nov 17, 2020

Conversation

vblagoje
Contributor

@vblagoje vblagoje commented Nov 16, 2020

Adds a book corpus based on Shawn Presser's work. @richarddwang, the author of the original BookCorpus dataset, suggested it should be named OpenBookCorpus. I named it BookCorpusOpen so that it is easy to locate alphabetically. But, of course, we can rename it if needed.

It contains 17,868 dataset items; each item contains two fields: title and text. The title is the name of the book (just the file name), while the text contains the unprocessed book text. Note that bookcorpus is pre-segmented into sentences, while this corpus is not. This is intentional (see #486), as some users might want to further process the text themselves.
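
A minimal usage sketch, assuming the loader keeps the bookcorpusopen name and the plain_text config used in this PR (treat both as assumptions until merged):

from datasets import load_dataset

# Load the corpus added in this PR and inspect the two fields described above.
books = load_dataset("bookcorpusopen", "plain_text", split="train")
print(books[0]["title"])        # file name of the book
print(books[0]["text"][:200])   # raw, unsegmented book text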

@lhoestq and others please review this PR thoroughly. cc @shawwn

Member

@lhoestq lhoestq left a comment

That's awesome, thanks!
It looks like the test doesn't pass for the dummy data. Could you try to fix that? Maybe the glob call is not able to find the .epub.txt dummy file?

Two review threads on datasets/bookcorpusopen/bookcorpusopen.py (outdated, resolved)
@vblagoje
Contributor Author

@lhoestq I fixed the issues except for the dummy_data zip file. But I think I know why it is happening. When dummy_data.zip is unzipped, it gets saved in a /tmp directory where glob doesn't pick it up. For regular downloads, the archive gets unzipped in ~/.cache/huggingface. Could that be the reason?

@lhoestq
Member

lhoestq commented Nov 16, 2020

Nice thanks :)

When testing with the dummy data, the download_manager.download_and_extract() call returns the path to the unzipped dummy_data.zip archive. Therefore glob should be able to find your dummy .epub.txt file.
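
For reference, a simplified standalone sketch of the globbing pattern being discussed (the actual _generate_examples in the script differs; the title/text fields follow the PR description above):

import glob
import os

def generate_examples(directory):
    # `directory` is the path returned by download_manager.download_and_extract(),
    # i.e. the root of the extracted archive (or of dummy_data.zip during tests).
    glob_target = os.path.join(directory, "**/*.epub.txt")
    for idx, file_path in enumerate(sorted(glob.glob(glob_target, recursive=True))):
        with open(file_path, encoding="utf-8") as f:
            yield idx, {"title": os.path.basename(file_path), "text": f.read()}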

@vblagoje
Contributor Author

@lhoestq I understand, but for some reason it is not happening. I added logs to see where dummy_data.zip gets unzipped in /tmp, but I suppose that tmp directory is gone by the time the test process finishes. I also tried to glob for anything in _generate_examples from that directory using /* instead of **/*.epub.txt, and nothing is returned. Always an empty array.

@lhoestq
Member

lhoestq commented Nov 16, 2020

Ok, weird! I can take a look tomorrow if you want.

@vblagoje
Contributor Author

Please do, I will take a fresh look as well.

@vblagoje
Contributor Author

In _generate_examples I wrote the following:

glob_target = os.path.join(directory, "**/*.epub.txt")
print(f"Glob target {glob_target}")

And here is the test failure:

========================================================================================== FAILURES ===========================================================================================
________________________________________________________________ LocalDatasetTest.test_load_dataset_all_configs_bookcorpusopen ________________________________________________________________

self = <tests.test_dataset_common.LocalDatasetTest testMethod=test_load_dataset_all_configs_bookcorpusopen>, dataset_name = 'bookcorpusopen'

@slow
def test_load_dataset_all_configs(self, dataset_name):
    configs = self.dataset_tester.load_all_configs(dataset_name, is_local=True)
>   self.dataset_tester.check_load_dataset(dataset_name, configs, is_local=True)

tests/test_dataset_common.py:232:

tests/test_dataset_common.py:193: in check_load_dataset
    self.parent.assertTrue(len(dataset[split]) > 0)
E   AssertionError: False is not true
------------------------------------------------------------------------------------ Captured stdout call -------------------------------------------------------------------------------------
Downloading and preparing dataset book_corpus_open/plain_text (download: 1.00 MiB, generated: 1.00 MiB, post-processed: Unknown size, total: 2.00 MiB) to /var/folders/y_/6k6zhblx0k9dsdz5nd_z9x5c0000gp/T/tmpmuu0_ln2/book_corpus_open/plain_text/1.0.0...
Glob target /var/folders/y_/6k6zhblx0k9dsdz5nd_z9x5c0000gp/T/tmpm6tpvb3f/extracted/d953b414cceb4fe3985eeaf68aec2f4435f166b2edf66863d805e3825b7d336b/dummy_data/**/*.epub.txt
Dataset book_corpus_open downloaded and prepared to /var/folders/y_/6k6zhblx0k9dsdz5nd_z9x5c0000gp/T/tmpmuu0_ln2/book_corpus_open/plain_text/1.0.0. Subsequent calls will reuse this data.
------------------------------------------------------------------------------------ Captured stderr call -------------------------------------------------------------------------------------

@vblagoje
Contributor Author

And when I do os.listdir on the given directory I get:

glob_target = os.path.join(directory, "**/*.epub.txt")
print(f"Glob target {glob_target}")
print(os.listdir(path=directory))

E FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/y_/6k6zhblx0k9dsdz5nd_z9x5c0000gp/T/tmpbu_aom5q/extracted/d953b414cceb4fe3985eeaf68aec2f4435f166b2edf66863d805e3825b7d336b/dummy_data'

@lhoestq
Member

lhoestq commented Nov 17, 2020

Thanks for the info, I'm looking at it right now

@lhoestq
Member

lhoestq commented Nov 17, 2020

Ok, found the issue!

The dummy_data.zip file must be an archive of a folder named dummy_data. Currently, the dummy_data.zip is an archive of a folder named book1. In order to have a valid dummy_data.zip file, you must first take the dummy book1 folder, place it inside a folder named dummy_data, and then compress the dummy_data folder to get dummy_data.zip.
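A quick sketch of one way to produce an archive with that layout (the staging path is illustrative; any zip tool that keeps dummy_data as the top-level folder works just as well):

import shutil
from pathlib import Path

# Put the dummy book1 folder inside a top-level dummy_data folder...
staging = Path("staging")
(staging / "dummy_data").mkdir(parents=True, exist_ok=True)
shutil.copytree("book1", staging / "dummy_data" / "book1", dirs_exist_ok=True)

# ...then compress the dummy_data folder itself, so the archive root is dummy_data/.
shutil.make_archive("dummy_data", "zip", root_dir=staging, base_dir="dummy_data")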

@vblagoje
Contributor Author

Excellent, I am on it @lhoestq

Member

@lhoestq lhoestq left a comment

Awesome thank you so much for adding it :)

@vblagoje
Contributor Author

> Awesome thank you so much for adding it :)

You're welcome. All tests are green now! I needed it asap as well. Thanks for your help, @lhoestq.

@lhoestq lhoestq merged commit b15d448 into huggingface:master Nov 17, 2020
@shawwn

shawwn commented Nov 17, 2020

I just wanted to say thank you to everyone involved in making this happen! I was certain that I would have to add bookcorpusnew myself, but then @vblagoje came along and did it, and @lhoestq gave some great support in a timely fashion.

By the way @vblagoje, are you on Twitter? I'm https://twitter.com/theshawwn if you'd like to DM and say hello. Once again, thanks for doing this!

I'll mention over at soskek/bookcorpus#27 that this was merged.

@vblagoje
Contributor Author

Thank you Shawn. You did all the heavy lifting ;-)

@shawwn

shawwn commented Nov 17, 2020

@vblagoje Would you be interested in adding books3 as well? https://twitter.com/theshawwn/status/1320282149329784833

Huggingface is interested and asked me to add it, but I had a bit of trouble during setup (#790) and never got around to it. At this point you have much more experience than I do with the datasets lib.

It seems like it might simply be a matter of copy-pasting this PR, changing books1 to books3, and possibly trimming off the leading paths (each book is at e.g. the-eye/Books/Bibliotok/J/Jurassic Park.epub.txt, which is rather lengthy compared to just the file name), but the full path is probably fine, so feel free to do the least amount of work that gets the job done. Otherwise I suppose I'll get around to it eventually; thanks again!
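
For what it's worth, trimming a leading path down to just the file name is a one-liner (the example path is the one mentioned above):

import os

full_path = "the-eye/Books/Bibliotok/J/Jurassic Park.epub.txt"
print(os.path.basename(full_path))  # prints: Jurassic Park.epub.txt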

@vblagoje
Contributor Author

@shawwn I'll take a look as soon as I clear my work queue. TBH, I would likely work on making sure HF datasets has all the datasets used to train https://github.com/alexa/bort/, and these are: Wikipedia, Wiktionary, OpenWebText (Gokaslan and Cohen, 2019), UrbanDictionary, One Billion Words (Chelba et al., 2014), the news subset of Common Crawl (Nagel, 2016), and BookCorpus. cc @lhoestq

@snarb

snarb commented Apr 6, 2023

@shawwn is your books3 corpus part of any dataset now?

@shawwn

shawwn commented Apr 6, 2023

@snarb Books3 has been used in LLaMA (https://twitter.com/theshawwn/status/1643987377516580870) and in BloombergGPT (https://twitter.com/theshawwn/status/1641938293209047041). I don't know whether it's in a HuggingFace dataset yet, but you can access it via the original announcement tweet here: https://twitter.com/theshawwn/status/1320282149329784833

If you'd like to make it a huggingface dataset, I'd be grateful! I'm not sure what the process is.

LLaMA also noted that they deduplicated the books in books3, so it might be worth running some sort of dedup pass on it.

@lhoestq
Member

lhoestq commented Apr 7, 2023

It's available here already :) https://huggingface.co/datasets/the_pile_books3

@snarb

snarb commented Apr 7, 2023

@shawwn how are pictures and tables handled in such datasets? For example, in IQ tests or geometry it is hard to imagine understanding the topic without images. I want to create a dataset with a limited vocabulary, to make training an LLM possible without big money, but still get a model that is able to reason and formulate thoughts well. I am trying to use books for children, school educational resources, and the simplified wiki. Maybe you can suggest some good data sources based on your experience?

@ashmalvayani

@lhoestq the link you shared for the Hugging Face the_pile_books3 dataset doesn't have the data anymore. Can you please provide an alternate link for downloading the dataset?

@lhoestq
Member

lhoestq commented Jan 4, 2024

This dataset was taken down a few months ago for copyright infringement and is no longer accessible.

You may look into other books datasets, like Project Gutenberg (e.g. https://huggingface.co/datasets/sedthh/gutenberg_english).
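
A minimal loading sketch for the suggested alternative (the split name and column layout of sedthh/gutenberg_english are assumptions; check the dataset card):

from datasets import load_dataset

# Load the English Project Gutenberg dataset from the Hub and inspect its columns.
gutenberg = load_dataset("sedthh/gutenberg_english", split="train")
print(gutenberg.column_names)
print(gutenberg[0])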
