
Can anyone download all the files in the url list file? #24

Open
wxp16 opened this issue Nov 19, 2019 · 13 comments

Comments

@wxp16

wxp16 commented Nov 19, 2019

I tried to download the BookCorpus data. So far I have only downloaded around 5,000 books. Has anyone managed to get all the books? I am getting a lot of HTTP Error: 403 Forbidden. How can I fix this? Or can I get all of the BookCorpus data from somewhere else?

Thanks

@RinaldsViksna

Same for me as well

@RinaldsViksna

It seems this crawler is too aggressive and we got banned

@soskek
Owner

soskek commented Nov 20, 2019

Hmm. It seems that smashwords.com has made its blocking very strict. To avoid it, we would need either really patient (maybe too slow) crawling or proxy-based crawling through multiple proxy IPs, both of which are tough.
I guess crawling BookCorpus has now become really difficult.
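
For reference, here is a minimal sketch of the "patient" approach: a long fixed delay between downloads plus exponential backoff whenever the server returns 403/429. The delay values, the `requests` library, and the helper name are illustrative assumptions, not part of this repo's download scripts.

```python
# Hypothetical sketch of slow, polite crawling with backoff on 403/429.
import time
import requests

def polite_get(session, url, base_delay=30, max_retries=5):
    for attempt in range(max_retries):
        resp = session.get(url, timeout=60)
        if resp.status_code == 200:
            return resp.content
        if resp.status_code in (403, 429):
            # Back off sharply once the server starts blocking us.
            time.sleep(base_delay * (2 ** attempt))
        else:
            resp.raise_for_status()
    return None

session = requests.Session()
session.headers['User-Agent'] = 'bookcorpus-crawler (contact: you@example.com)'

url_list = []  # fill with URLs from the repo's url list file
for url in url_list:
    data = polite_get(session, url)
    time.sleep(30)  # fixed pause between books, however slow that makes the crawl
```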

@soskek
Owner

soskek commented Nov 20, 2019

I had hesitated to say this so far, but now it may help you and others in the NLP community.

I found a tweet by someone suggesting the existence of a copy of the original BookCorpus on Google Drive. The dataset seems to have 74004228 lines and 984846357 tokens, which matches the stats reported in the paper.

If you try, (of course!) please use it at your own risk.

@prakharg24

Hi.

Have you downloaded or used the bookcorpus mirror in the tweet that you just linked?

I tried downloading and working with that dataset, and I am running into an issue: I cannot tell where one book ends and another begins in the combined txt file. It is just a continuous list of sentences. (I notice that in your code you mark boundaries by inserting 4 newlines after every book; the same separator is not present in that corpus.)

Do you have any suggestions?

Thank You
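
For files produced by this repo's own pipeline, where books are separated by the run of newlines mentioned above, recovering per-book boundaries is straightforward. A hedged sketch, assuming four consecutive newline characters as the separator and a hypothetical filename:

```python
# Split a concatenated corpus into books on the assumed 4-newline separator.
with open('books_large.txt', encoding='utf-8') as f:
    text = f.read()

books = [b.strip() for b in text.split('\n\n\n\n') if b.strip()]
print(len(books), 'books recovered')
```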

@soskek
Owner

soskek commented Dec 11, 2019

I have no idea. I don't even know what was distributed first.
Most language models (even document-level ones) and other methods, including skip-thought vectors, are not troubled by the lack of boundaries between books. So the original version might have been distributed the same way as the tweeted one.
Wish you well.

@prakharg24

Ohh, I see. I was actually worried about reproducing the 'Sentence-Order Prediction' task from the paper ALBERT: A Lite BERT.

They emphasize that the two sentences taken during training come from the same document 50% of the time and from different documents 50% of the time. I will read the fine print in their paper to see whether document separation is even an issue.

Thanks anyway. The tweet link was really helpful.

@soskek
Owner

soskek commented Dec 11, 2019

That sounds like an issue. One dirty (but maybe practically good) trick is to treat nearby lines (e.g., within 100 lines) as same-document and distant lines (e.g., more than 100,000 lines apart) as different-document. Of course, it would also be good to ask the authors.
Anyway, good luck!
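
To make the heuristic concrete, here is a minimal sketch, assuming a list of sentences and the thresholds suggested above (100 / 100,000); the function name and sampling details are illustrative:

```python
import random

def sample_pair(lines, same_doc, near=100, far=100_000):
    """Pick two lines treated as same-document (close together) or
    different-document (far apart), per the heuristic above."""
    i = random.randrange(len(lines))
    if same_doc:
        j = min(len(lines) - 1, i + random.randint(1, near))
    else:
        j = random.randrange(len(lines))
        while abs(i - j) <= far:  # assumes the corpus is much longer than `far`
            j = random.randrange(len(lines))
    return lines[i], lines[j]
```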

@jsc

jsc commented Dec 18, 2019

I just tried to investigate further. For example, if you go to this URL -- https://www.smashwords.com/books/download/626006/8/latest/0/0/the-watchmakers-daughter.epub -- it says you have to have an account to read this book. Accounts are free, but it might take some work to get the crawler to use your login...
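
One low-effort way to get past that wall, sketched below, is to reuse the session cookie from a browser where you are already logged in. The cookie name and value here are placeholders you would copy from your browser's developer tools; this is not something the repo's scripts currently do.

```python
import requests

session = requests.Session()
# Placeholder cookie: copy the real name/value from a logged-in browser session.
session.cookies.set('SESSION_COOKIE_NAME', 'value-from-your-browser',
                    domain='www.smashwords.com')

url = ('https://www.smashwords.com/books/download/626006/8/latest/0/0/'
       'the-watchmakers-daughter.epub')
resp = session.get(url, timeout=60)
resp.raise_for_status()
with open('the-watchmakers-daughter.epub', 'wb') as f:
    f.write(resp.content)
```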

@BillMK

BillMK commented Dec 20, 2019

As the paper says, the dataset contains 11,038 books in total and 74,004,228 sentences, which matches the size of the tweeted dataset. So I separated the books every ~6,700 sentences (74004228/11038). Not sure whether this separation will affect accuracy or not...
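
A minimal sketch of that fixed-size split, assuming the single flat sentence-per-line file and a hypothetical filename; the chunk size follows the 74004228/11038 calculation above:

```python
N_BOOKS = 11_038

with open('books_large.txt', encoding='utf-8') as f:
    sentences = f.read().splitlines()

chunk = len(sentences) // N_BOOKS  # ~6,704 sentences per pseudo-book
pseudo_books = [sentences[i:i + chunk]
                for i in range(0, len(sentences), chunk)]
# Note: any remainder ends up in one extra short chunk at the end.
```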

@dgrahn

dgrahn commented Feb 5, 2020

I can verify, as best as possible, that the link as of today is clean. It just extracts to two large text files.

@richarddwang

richarddwang commented Jun 15, 2020

Hi all, using the text mirror mentioned in the comment above, my PR that adds BookCorpus to HuggingFace/nlp has been merged.
(The txt files have been copied to their own cloud storage.)

You should be able to download the dataset with `book = nlp.load_dataset('bookcorpus')`.
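
For completeness, a short usage sketch of that loader; the split name and record layout follow the standard `nlp` API and are assumptions rather than something stated above:

```python
import nlp

book = nlp.load_dataset('bookcorpus')  # downloads and caches the mirrored text
print(len(book['train']))              # number of sentences/lines
print(book['train'][0])                # e.g. {'text': '...'} for the first record
```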

@soskek
Owner

soskek commented Sep 5, 2020

Thank you @richarddwang! That is great work. I added a reference to nlp in the README of this repo.
