Add open book corpus #856
Conversation
That's awesome, thanks! It looks like the test doesn't pass for the dummy data. Could you try to fix that? Maybe the glob call is not able to find the epub.txt dummy file?
@lhoestq I fixed the issues except for the dummy_data zip file. But I think I know why it is happening. When unzipping dummy_data.zip, it gets saved in the /tmp directory, where glob doesn't pick it up. For regular downloads, the archive gets unzipped in ~/.cache/huggingface. Could that be the reason?
Nice, thanks :) When testing with the dummy data, the …
@lhoestq I understand, but for some reason it is not happening. I added logs to see where dummy_data.zip gets unzipped in /tmp, but I suppose that tmp directory is gone once the test process finishes. I also tried to glob anything from that directory in _generate_examples, using /* instead of **/*.epub.txt, and nothing is returned. Always an empty array.
Ok, weird! I can take a look tomorrow if you want.
Please do, I will take a fresh look as well.
In _generate_examples I wrote the following:
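A minimal sketch of the kind of glob call described (this is not the exact snippet; the variable names and the recursive=True flag are assumptions):

```python
import glob
import os


def _generate_examples(directory):
    # glob only expands "**" across subdirectories when recursive=True;
    # without it, "**" behaves like a plain "*".
    pattern = os.path.join(directory, "**/*.epub.txt")
    for file_path in sorted(glob.glob(pattern, recursive=True)):
        title = os.path.basename(file_path)
        with open(file_path, encoding="utf-8") as f:
            yield title, {"title": title, "text": f.read()}
```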
And here is the test failure:

=========================== FAILURES ===========================
self = <tests.test_dataset_common.LocalDatasetTest testMethod=test_load_dataset_all_configs_bookcorpusopen>, dataset_name = 'bookcorpusopen'
tests/test_dataset_common.py:232:
tests/test_dataset_common.py:193: in check_load_dataset
And when I do os.listdir on the given directory I get:

E FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/y_/6k6zhblx0k9dsdz5nd_z9x5c0000gp/T/tmpbu_aom5q/extracted/d953b414cceb4fe3985eeaf68aec2f4435f166b2edf66863d805e3825b7d336b/dummy_data'
Thanks for the info, I'm looking at it right now.
Ok, found the issue! The dummy_data.zip file must be an archive of a folder named dummy_data. Currently, dummy_data.zip is an archive of a folder named book1. To get a valid dummy_data.zip, take the dummy book1 folder, place it inside a folder named dummy_data, and then compress the dummy_data folder into dummy_data.zip.
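A minimal sketch of producing such an archive in Python (the staging path is hypothetical):

```python
import shutil

# Assumed layout on disk:
#   staging/dummy_data/book1/<the dummy epub.txt files>
# Archiving base_dir="dummy_data" keeps dummy_data/ as the zip root,
# which is the structure the dummy-data test expects.
shutil.make_archive(
    base_name="dummy_data",  # writes dummy_data.zip in the current directory
    format="zip",
    root_dir="staging",      # the directory that contains dummy_data/
    base_dir="dummy_data",   # the folder to archive, kept as the zip root
)
```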
Excellent, I am on it @lhoestq.
Awesome, thank you so much for adding it :)
You're welcome. All tests are green now! I needed it ASAP as well. Thanks for your help @lhoestq.
I just wanted to say thank you to everyone involved in making this happen! I was certain that I would have to add bookcorpusnew myself, but then @vblagoje came along and did it, and @lhoestq gave some great support in a timely fashion. By the way @vblagoje, are you on Twitter? I'm https://twitter.com/theshawwn if you'd like to DM and say hello. Once again, thanks for doing this! I'll mention over at soskek/bookcorpus#27 that this was merged.
Thank you, Shawn. You did all the heavy lifting ;-)
@vblagoje Would you be interested in adding books3 as well? https://twitter.com/theshawwn/status/1320282149329784833 Hugging Face is interested and asked me to add it, but I had a bit of trouble during setup (#790) and never got around to it. At this point you have much more experience than I do with the datasets lib. It seems like it might simply be a matter of copy-pasting this PR, changing books1 to books3, and possibly trimming off the leading paths: each book is at e.g. the-eye/Books/Bibliotok/J/Jurassic Park.epub.txt, which is rather lengthy compared to just the filename. But the full path is probably fine, so feel free to do the least amount of work that gets the job done. Otherwise I suppose I'll get around to it eventually; thanks again!
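For illustration, a quick sketch of trimming such a path down to just the filename (the example path is taken from the comment above):

```python
import os

path = "the-eye/Books/Bibliotok/J/Jurassic Park.epub.txt"
print(os.path.basename(path))  # Jurassic Park.epub.txt
```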
@shawwn I'll take a look as soon as I clear my work queue. TBH, I would likely work on making sure HF datasets has all the datasets used to train https://github.com/alexa/bort/, and these are: Wikipedia, Wiktionary, OpenWebText (Gokaslan and Cohen, 2019), UrbanDictionary, One Billion Words (Chelba et al., 2014), the news subset of Common Crawl (Nagel, 2016), and BookCorpus. cc @lhoestq
@shawwn is your books3 corpus a part of any dataset now?
@snarb Books3 has been used in LLaMA (https://twitter.com/theshawwn/status/1643987377516580870) and in BloombergGPT (https://twitter.com/theshawwn/status/1641938293209047041). I don't know whether it's in a Hugging Face dataset yet, but you can access it via the original announcement tweet here: https://twitter.com/theshawwn/status/1320282149329784833 If you'd like to make it a Hugging Face dataset, I'd be grateful! I'm not sure what the process is. LLaMA also noted that they deduplicated the books in books3, so it might be worth running some sort of dedup pass on it.
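A minimal sketch of what an exact-match dedup pass could look like (the field names are assumptions; near-duplicate detection, e.g. with MinHash, would need more work):

```python
import hashlib


def dedup_books(books):
    """Drop books whose text is byte-for-byte identical to an earlier one."""
    seen = set()
    unique = []
    for book in books:  # books: iterable of dicts with "title" and "text"
        digest = hashlib.sha256(book["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(book)
    return unique
```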
It's available here already :) https://huggingface.co/datasets/the_pile_books3
@shawwn how are pictures and tables handled in such datasets? For example, in IQ tests or geometry it is hard to imagine understanding a topic without images. I want to create a dataset with a limited vocabulary, to make it possible to train an LLM without big money, but still get a model that is able to reason and formulate thoughts well. I am trying to use books for children, school educational resources, and the simplified wiki. Maybe you can suggest some good data sources from your experience?
@lhoestq the link you shared for the Hugging Face the_pile_books3 dataset doesn't have the data anymore. Can you please provide an alternate link for downloading the dataset?
This dataset was taken down a few months ago for copyright infringement and is no longer accessible. You may look into other books datasets, like Project Gutenberg.
Adds a book corpus based on Shawn Presser's work. @richarddwang, the author of the original BookCorpus dataset, suggested it should be named OpenBookCorpus. I named it BookCorpusOpen so it is easily located alphabetically. But, of course, we can rename it if needed.
It contains 17,868 dataset items; each item contains two fields: title and text. The title is the name of the book (just the file name), while the text contains the unprocessed book text. Note that the original bookcorpus is pre-segmented into sentences, while this one is not. This is intentional (see #486), as some users might want to further process the text themselves.
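For reference, a minimal usage sketch once the dataset is merged (assuming it is published under the name bookcorpusopen):

```python
from datasets import load_dataset

# Each example has a "title" (the book's file name) and a "text"
# field (the raw, unsegmented book text).
books = load_dataset("bookcorpusopen", split="train")
print(books[0]["title"])
print(books[0]["text"][:200])
```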
@lhoestq and others please review this PR thoroughly. cc @shawwn