Could you share the processed all.txt? #23

Closed
thudzj opened this issue Oct 25, 2019 · 9 comments

Comments

thudzj commented Oct 25, 2019

Hi Sosuke,

Thanks a lot for the wonderful work! I was hoping to build the BookCorpus dataset with your crawler, but I failed to crawl the books due to some network errors, so I'm afraid I can't obtain a complete dataset. Could you please share the dataset you have, e.g. the all.txt? My email address is [email protected]. Thanks!

Zhijie

soskek (Owner) commented Oct 27, 2019

Thanks for using my code!
Unfortunately, I cannot distribute the data directly, for copyright reasons.
What kind of errors did you get?

thudzj (Author) commented Oct 27, 2019

Thanks! Something like a 403 Forbidden error.

soskek (Owner) commented Oct 27, 2019

Hmm, that sounds tough, though I'm not familiar with network conditions in China.
A possible workaround is adding a User-Agent to the headers of the opener:

opener.addheaders = [('User-agent', 'Mozilla/5.0')]

In the download_*.py script you're using, apply a fix like this:

try:
    # Python 2
    from cookielib import CookieJar
    cj = CookieJar()
    import urllib2
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib2.install_opener(opener)
    import urllib
    # urllib.urlretrieve has its own opener, so also override its
    # default User-Agent string for the downloads themselves.
    urllib.URLopener.version = 'Mozilla/5.0'
    urlretrieve = urllib.urlretrieve
except ImportError:
    # Python 3
    import http.cookiejar
    cj = http.cookiejar.CookieJar()
    import urllib.request  # 'import urllib' alone doesn't expose urllib.request
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    # urlretrieve routes through the globally installed opener,
    # so install it for the custom header to take effect.
    urllib.request.install_opener(opener)
    urlretrieve = urllib.request.urlretrieve

If nothing changes, I'm afraid I give up!
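
To check that the header is actually being sent before re-running the whole crawl, a quick sketch like this should print 200 instead of 403 (Python 3 only; BOOK_TXT_URL is a placeholder for any book link from your URL list):

# Sanity check: fetch a single book through an opener that sends the
# browser-like User-Agent. BOOK_TXT_URL below is a placeholder.
import http.cookiejar
import urllib.error
import urllib.request

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

BOOK_TXT_URL = 'https://www.smashwords.com/books/download/<id>/book.txt'
try:
    with opener.open(BOOK_TXT_URL) as resp:
        print(resp.status)   # 200 means the User-Agent fixed the 403
except urllib.error.HTTPError as err:
    print(err.code)          # still 403 suggests the headers are not the issue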

thudzj (Author) commented Oct 28, 2019

Haha, I'll try. Thank you very much for the instant reply!

thudzj closed this as completed Oct 28, 2019

tshrjn commented Nov 12, 2019

Hi there,

I'm also getting a 403 Forbidden error, even though I'm able to download successfully via wget [URL].
An example URL: https://www.smashwords.com/books/download/12640/6/latest/0/0/eliminate-your-debt-like-a-pro.txt

[Screenshot of the 403 error attached for reference]

soskek (Owner) commented Nov 13, 2019

Did you succeed with wget? I guessed that some kind of IP block had happened.
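
One way to check: request the URL you posted with and without a browser User-Agent; if both attempts fail with 403, the block is probably on the IP rather than on the headers. A rough sketch (Python 3):

# Try the same book URL with and without a browser-like User-Agent.
# If both attempts return 403, the block is likely IP-based.
import urllib.error
import urllib.request

url = ('https://www.smashwords.com/books/download/12640/6/latest/0/0/'
       'eliminate-your-debt-like-a-pro.txt')

for headers in ({}, {'User-Agent': 'Mozilla/5.0'}):
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            print(headers, '->', resp.status)
    except urllib.error.HTTPError as err:
        print(headers, '->', err.code)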


tshrjn commented Nov 13, 2019

Yes, I was able to download using wget.


tshrjn commented Nov 13, 2019

Actually, no, it fails with wget as well, and neither adding --user-agent=Lynx to wget nor the Mozilla user-agent code above in Python helps.

I'm on a us-east AWS EC2 instance.

soskek (Owner) commented Nov 20, 2019

Thank you for the information.
As #24 also reported, crawling is becoming difficult.

@thudzj By the way, as mentioned in my comment (#24 (comment)), you can try the file of unknown origin on Google Drive (at your own risk).
