Could you share the processed all.txt? #23

Closed
thudzj opened this issue Oct 25, 2019 · 9 comments

Comments

thudzj commented Oct 25, 2019

Hi Sosuke,

Thanks a lot for the wonderful work! I was hoping to build the BookCorpus dataset with your crawler, but I failed to crawl the books due to some network errors, so I'm afraid I can't obtain a complete dataset. Could you please share the dataset you have, e.g. the all.txt? My email address is [email protected]. Thanks!

Zhijie

soskek (Owner) commented Oct 27, 2019

Thanks for using my code!
Unfortunately, I cannot distribute the data directly, for copyright reasons.
What kind of errors did you get?

thudzj (Author) commented Oct 27, 2019

Thanks! Something like a 403 Forbidden error.

soskek (Owner) commented Oct 27, 2019

Hmm, that sounds tough, though I'm not familiar with network conditions in China.
A possible workaround is adding a User-Agent to the headers of the opener:

opener.addheaders = [('User-agent', 'Mozilla/5.0')]

In the download_*.py script you're using, apply a fix like this:

try:
    # Python 2
    from cookielib import CookieJar
    cj = CookieJar()
    import urllib2
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib2.install_opener(opener)
    import urllib
    # urllib.urlretrieve has its own opener, so also override its
    # default User-Agent string for the downloads themselves.
    urllib.URLopener.version = 'Mozilla/5.0'
    urlretrieve = urllib.urlretrieve
except ImportError:
    # Python 3
    import http.cookiejar
    cj = http.cookiejar.CookieJar()
    import urllib.request  # 'import urllib' alone doesn't expose urllib.request
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    # urlretrieve routes through the globally installed opener,
    # so install it for the custom header to take effect.
    urllib.request.install_opener(opener)
    urlretrieve = urllib.request.urlretrieve

If nothing changes, I'm afraid I give up!
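
To check that the header is actually being sent before re-running the whole crawl, a quick sketch like this should print 200 instead of 403 (Python 3 only; BOOK_TXT_URL is a placeholder for any book link from your URL list):

# Sanity check: fetch a single book through an opener that sends the
# browser-like User-Agent. BOOK_TXT_URL below is a placeholder.
import http.cookiejar
import urllib.error
import urllib.request

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

BOOK_TXT_URL = 'https://www.smashwords.com/books/download/<id>/book.txt'
try:
    with opener.open(BOOK_TXT_URL) as resp:
        print(resp.status)   # 200 means the User-Agent fixed the 403
except urllib.error.HTTPError as err:
    print(err.code)          # still 403 suggests the headers are not the issue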

thudzj (Author) commented Oct 28, 2019

Haha, I'll try. Thank you very much for the instant reply!

thudzj closed this as completed Oct 28, 2019

tshrjn commented Nov 12, 2019

Hi there,

I'm also getting a 403 Forbidden error, even though I'm able to download successfully via wget [URL].
An example URL: https://www.smashwords.com/books/download/12640/6/latest/0/0/eliminate-your-debt-like-a-pro.txt

[Screenshot of the 403 error attached for reference]

soskek (Owner) commented Nov 13, 2019

Did you succeed with wget? I guessed that some kind of IP block had happened.
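
One way to check: request the URL you posted with and without a browser User-Agent; if both attempts fail with 403, the block is probably on the IP rather than on the headers. A rough sketch (Python 3):

# Try the same book URL with and without a browser-like User-Agent.
# If both attempts return 403, the block is likely IP-based.
import urllib.error
import urllib.request

url = ('https://www.smashwords.com/books/download/12640/6/latest/0/0/'
       'eliminate-your-debt-like-a-pro.txt')

for headers in ({}, {'User-Agent': 'Mozilla/5.0'}):
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            print(headers, '->', resp.status)
    except urllib.error.HTTPError as err:
        print(headers, '->', err.code)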


tshrjn commented Nov 13, 2019

Yes, I was able to download using wget.


tshrjn commented Nov 13, 2019

Actually, no, it fails with wget as well, and neither adding --user-agent=Lynx to wget nor the Mozilla user-agent code above in Python helps.

I'm on a us-east AWS EC2 instance.

soskek (Owner) commented Nov 20, 2019

Thank you for the information.
As #24 also reported, crawling is becoming difficult.

@thudzj By the way, as mentioned in my comment (#24 (comment)), you can try the file of unknown origin on Google Drive (at your own risk).
