Can anyone download all the files in the url list file? #24
Comments
Same for me as well.
It seems this crawler is too aggressive and we got banned.
Hmm. It seems that smashwords.com has made its blocking very strict. To avoid that, we would need either really patient (maybe too slow) crawling or proxy-based crawling across multiple proxy-server IPs, both of which are tough.
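A minimal sketch of what "patient" crawling with optional proxy rotation could look like, for illustration only. The delay values, the `PROXIES` list, and the `polite_get` helper are assumptions, not part of this repository's code:

```python
# Illustrative sketch: slow, jittered requests with optional proxy rotation.
# Delays, retry counts, and PROXIES entries are made-up example values.
import random
import time

import requests

PROXIES = [
    # e.g. {"http": "http://1.2.3.4:8080", "https": "http://1.2.3.4:8080"},
]


def polite_get(url, min_delay=10.0, max_delay=30.0, max_retries=3):
    """Fetch one URL slowly, backing off hard on 403/429 responses."""
    for attempt in range(max_retries):
        # Always wait a random interval first, so requests never burst.
        time.sleep(random.uniform(min_delay, max_delay))
        proxy = random.choice(PROXIES) if PROXIES else None
        try:
            resp = requests.get(
                url,
                proxies=proxy,
                timeout=30,
                headers={"User-Agent": "Mozilla/5.0"},
            )
        except requests.RequestException:
            continue
        if resp.status_code == 200:
            return resp.content
        if resp.status_code in (403, 429):
            # Likely banned or rate-limited: wait much longer before retrying.
            time.sleep(60 * (attempt + 1))
    return None
```

Even with delays like these, crawling the full book list would take a very long time, which is what makes both options tough.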
I had hesitated to mention this so far, but now it may help you and others in the NLP community. I found a tweet in which someone suggests the existence of a copy of the original BookCorpus on Google Drive. The dataset seems to have 74004228 lines and 984846357 tokens, which matches the stats reported in the paper. If you try it, please (of course!) use it at your own risk.
Hi. Have you downloaded or used the BookCorpus mirror in the tweet you just linked? I tried downloading and working with that dataset and I am running into an issue: I cannot tell where one book ends and another starts in the complete txt file. It is just a continuous list of sentences. (I notice that in your code you mark this by inserting 4 newlines after every book; the same behavior is not present in that corpus.) Do you have any suggestions? Thank you.
I have no idea. I don't even know which version was distributed first.
Ohh I see. I was actually worried about reproducing the sentence-order prediction task from the paper ALBERT: A Lite BERT. They emphasize that the two sentences taken during training come 50% of the time from the same document and 50% of the time from different documents. I will try to read the fine print in their paper to see whether document separation is even an issue or not. Thanks anyway; the tweet link was really helpful.
That does sound like an issue. One dirty (but maybe practically good) trick is to treat nearby lines (e.g. within 100 lines) as same-document and distant lines (e.g. more than 100,000 lines apart) as different-document, as in the sketch below. Of course, it is better to ask the authors.
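A rough sketch of that line-distance heuristic, assuming a corpus loaded as one big list of sentence lines. The window sizes and the `sample_pair` helper are illustrative choices taken from the numbers above, not anything defined in this repository:

```python
# Heuristic: sentences within NEAR_WINDOW lines are treated as "same document",
# sentences more than FAR_WINDOW lines apart as "different documents".
# Assumes the corpus is much longer than FAR_WINDOW lines (BookCorpus has ~74M).
import random

NEAR_WINDOW = 100
FAR_WINDOW = 100_000


def sample_pair(lines, same_document):
    """Return a (sentence_a, sentence_b) pair under the line-distance heuristic."""
    i = random.randrange(len(lines) - 1)
    if same_document:
        # Pick a nearby line, clamped to the end of the corpus.
        j = min(i + random.randint(1, NEAR_WINDOW), len(lines) - 1)
    else:
        # Keep resampling until the second line is far enough away.
        j = random.randrange(len(lines))
        while abs(j - i) <= FAR_WINDOW:
            j = random.randrange(len(lines))
    return lines[i], lines[j]
```

This obviously mislabels pairs that straddle a book boundary within the near window, but with books averaging thousands of lines the error rate should be small.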
I just tried to investigate further. For example, if you go to this URL -- https://www.smashwords.com/books/download/626006/8/latest/0/0/the-watchmakers-daughter.epub -- it says you have to have an account to read this book. Accounts are free, but it might take some work to get the crawler to use your login...
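If someone wanted to try that, it would roughly mean giving the crawler a logged-in `requests.Session`. The login URL and form field names below are hypothetical placeholders; the real ones would have to be read off Smashwords' actual login page:

```python
# Hypothetical sketch: reuse one authenticated session for all downloads.
# LOGIN_URL and the form field names are placeholders, not verified values.
import requests

LOGIN_URL = "https://www.smashwords.com/session/login"  # placeholder


def make_session(username, password):
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    # Submit the login form so the session carries the auth cookies afterwards.
    session.post(LOGIN_URL, data={"username": username, "password": password})
    return session


# session = make_session("me@example.com", "secret")
# resp = session.get("https://www.smashwords.com/books/download/...")
```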
I can verify, as best as I can, that the link as of today is clean. It just extracts to two large text files.
Hi all, using the text mirror mentioned in the comment above, my PR that adds BookCorpus to HuggingFace/nlp has been merged. You should be able to download the dataset as follows:
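The original snippet is missing here; presumably it looked something like the following, assuming the dataset was registered under the name "bookcorpus" in the HuggingFace nlp library (later renamed to `datasets`):

```python
# Assumes the dataset identifier is "bookcorpus" in the nlp library.
import nlp

dataset = nlp.load_dataset("bookcorpus")
print(dataset["train"][0])
```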
Thank you @richarddwang! Great job. I added a reference to nlp in the README of this repo.
I tried to download the BookCorpus data. So far I have only downloaded around 5000 books. Can anyone get all the books? I get a lot of
HTTP Error: 403 Forbidden
How do I fix this? Or can I get all the BookCorpus data from somewhere else? Thanks