
Can anyone download all the files in the url list file? #24

Open
wxp16 opened this issue Nov 19, 2019 · 13 comments

Comments

@wxp16

wxp16 commented Nov 19, 2019

I tried to download the BookCorpus data. So far I have only downloaded around 5,000 books. Has anyone managed to get all the books? I am getting a lot of HTTP Error: 403 Forbidden. How can I fix this? Or can I get all of the BookCorpus data from somewhere else?

Thanks

@RinaldsViksna

Same for me as well

@RinaldsViksna

It seems this crawler is too aggressive and we got banned

@soskek
Owner

soskek commented Nov 20, 2019

Hmm. It seems that smashwords.com has made its blocking very strict. To avoid it, we would need either really patient (maybe too slow) crawling or proxy-based crawling through multiple proxy IPs, both of which are tough.
I guess crawling BookCorpus has now become really difficult.
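
For reference, here is a minimal sketch of the "patient" approach: a long fixed delay between downloads plus exponential backoff whenever the server returns 403/429. The delay values, the `requests` library, and the helper name are illustrative assumptions, not part of this repo's download scripts.

```python
# Hypothetical sketch of slow, polite crawling with backoff on 403/429.
import time
import requests

def polite_get(session, url, base_delay=30, max_retries=5):
    for attempt in range(max_retries):
        resp = session.get(url, timeout=60)
        if resp.status_code == 200:
            return resp.content
        if resp.status_code in (403, 429):
            # Back off sharply once the server starts blocking us.
            time.sleep(base_delay * (2 ** attempt))
        else:
            resp.raise_for_status()
    return None

session = requests.Session()
session.headers['User-Agent'] = 'bookcorpus-crawler (contact: you@example.com)'

url_list = []  # fill with URLs from the repo's url list file
for url in url_list:
    data = polite_get(session, url)
    time.sleep(30)  # fixed pause between books, however slow that makes the crawl
```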

@soskek
Owner

soskek commented Nov 20, 2019

I had hesitated to say this so far, but now it may help you and others in the NLP community.

I found a tweet by someone suggesting the existence of a copy of the original BookCorpus on Google Drive. The dataset seems to have 74004228 lines and 984846357 tokens, which matches the stats reported in the paper.

If you try, (of course!) please use it at your own risk.

@prakharg24

Hi.

Have you downloaded or used the bookcorpus mirror in the tweet that you just linked?

I tried downloading and working with that dataset, and I am running into an issue: I cannot tell where one book ends and another begins in the combined txt file. It is just a continuous list of sentences. (I notice that in your code you mark boundaries by inserting 4 newlines after every book; the same separator is not present in that corpus.)

Do you have any suggestions?

Thank You
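
For files produced by this repo's own pipeline, where books are separated by the run of newlines mentioned above, recovering per-book boundaries is straightforward. A hedged sketch, assuming four consecutive newline characters as the separator and a hypothetical filename:

```python
# Split a concatenated corpus into books on the assumed 4-newline separator.
with open('books_large.txt', encoding='utf-8') as f:
    text = f.read()

books = [b.strip() for b in text.split('\n\n\n\n') if b.strip()]
print(len(books), 'books recovered')
```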

@soskek
Owner

soskek commented Dec 11, 2019

I have no idea. I don't even know what was distributed first.
Most language models (even document-level ones) and other methods, including skip-thought vectors, are not troubled by the lack of boundaries between books. So the original version might have been distributed the same way as the tweeted one.
Wish you well.

@prakharg24

Ohh, I see. I was actually worried about reproducing the 'Sentence-Order Prediction' task from the paper ALBERT: A Lite BERT.

They emphasize that the two sentences taken during training come from the same document 50% of the time and from different documents 50% of the time. I will read the fine print in their paper to see whether document separation is even an issue.

Thanks anyway. The tweet link was really helpful.

@soskek
Owner

soskek commented Dec 11, 2019

That sounds like an issue. One dirty (but maybe practically good) trick is to treat nearby lines (e.g., within 100 lines) as same-document and distant lines (e.g., more than 100,000 lines apart) as different-document. Of course, it would also be good to ask the authors.
Anyway, good luck!
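
To make the heuristic concrete, here is a minimal sketch, assuming a list of sentences and the thresholds suggested above (100 / 100,000); the function name and sampling details are illustrative:

```python
import random

def sample_pair(lines, same_doc, near=100, far=100_000):
    """Pick two lines treated as same-document (close together) or
    different-document (far apart), per the heuristic above."""
    i = random.randrange(len(lines))
    if same_doc:
        j = min(len(lines) - 1, i + random.randint(1, near))
    else:
        j = random.randrange(len(lines))
        while abs(i - j) <= far:  # assumes the corpus is much longer than `far`
            j = random.randrange(len(lines))
    return lines[i], lines[j]
```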

@jsc

jsc commented Dec 18, 2019

I just tried to investigate further. For example, if you go to this URL -- https://www.smashwords.com/books/download/626006/8/latest/0/0/the-watchmakers-daughter.epub -- it says you have to have an account to read this book. Accounts are free, but it might take some work to get the crawler to use your login...
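
One low-effort way to get past that wall, sketched below, is to reuse the session cookie from a browser where you are already logged in. The cookie name and value here are placeholders you would copy from your browser's developer tools; this is not something the repo's scripts currently do.

```python
import requests

session = requests.Session()
# Placeholder cookie: copy the real name/value from a logged-in browser session.
session.cookies.set('SESSION_COOKIE_NAME', 'value-from-your-browser',
                    domain='www.smashwords.com')

url = ('https://www.smashwords.com/books/download/626006/8/latest/0/0/'
       'the-watchmakers-daughter.epub')
resp = session.get(url, timeout=60)
resp.raise_for_status()
with open('the-watchmakers-daughter.epub', 'wb') as f:
    f.write(resp.content)
```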

@BillMK

BillMK commented Dec 20, 2019

As the paper says, the dataset contains 11,038 books in total and 74,004,228 sentences, which matches the size of the tweeted dataset. So I separated the books every ~6,700 sentences (74004228/11038). Not sure whether this separation will affect accuracy or not...
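
A minimal sketch of that fixed-size split, assuming the single flat sentence-per-line file and a hypothetical filename; the chunk size follows the 74004228/11038 calculation above:

```python
N_BOOKS = 11_038

with open('books_large.txt', encoding='utf-8') as f:
    sentences = f.read().splitlines()

chunk = len(sentences) // N_BOOKS  # ~6,704 sentences per pseudo-book
pseudo_books = [sentences[i:i + chunk]
                for i in range(0, len(sentences), chunk)]
# Note: any remainder ends up in one extra short chunk at the end.
```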

@dgrahn

dgrahn commented Feb 5, 2020

I can verify, as best as possible, that the link as of today is clean. It just extracts to two large text files.

@richarddwang

richarddwang commented Jun 15, 2020

Hi all, using the text mirror mentioned in the comment above, my PR that adds BookCorpus to HuggingFace/nlp has been merged.
(The txt files have been copied to their own cloud storage.)

You should be able to download the dataset with `book = nlp.load_dataset('bookcorpus')`.
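
For completeness, a short usage sketch of that loader; the split name and record layout follow the standard `nlp` API and are assumptions rather than something stated above:

```python
import nlp

book = nlp.load_dataset('bookcorpus')  # downloads and caches the mirrored text
print(len(book['train']))              # number of sentences/lines
print(book['train'][0])                # e.g. {'text': '...'} for the first record
```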

@soskek
Owner

soskek commented Sep 5, 2020

Thank you @richarddwang! That is great work. I added a reference to nlp in the README of this repo.
