Set 'Referer' HTTP request header #2062

Merged · 2 commits merged into main from referer on Jul 22, 2024

Conversation

audiodude (Member) commented Jul 15, 2024

Fixes #2061

Tested with an Italian Wikipedia download including the article listed in the bug.

@audiodude requested a review from kelson42 on July 15, 2024 at 16:35

codecov bot commented Jul 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.44%. Comparing base (c09bc92) to head (157c2b9).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2062      +/-   ##
==========================================
+ Coverage   74.38%   74.44%   +0.06%     
==========================================
  Files          41       41              
  Lines        3146     3146              
  Branches      689      689              
==========================================
+ Hits         2340     2342       +2     
+ Misses        686      684       -2     
  Partials      120      120              


kelson42 (Collaborator) commented Jul 15, 2024

@audiodude Thank you very much for this simple but very important fix. One thing: why not just take the MediaWiki URL given as an argument to MWoffliner in place of this fake "http://localhost/"? To me this would be more correct and probably more robust.

audiodude (Member, Author)

> @audiodude Thank you very much for this simple but very important fix. One thing: why not just take the MediaWiki URL given as an argument to MWoffliner in place of this fake "http://localhost/"? To me this would be more correct and probably more robust.

Because the WMF software looks for specific values of the Referer header. Presumably 'localhost' works because it is left in for local development. It seems better to put that, as a workaround, than en.wikipedia.org.

kelson42 (Collaborator) commented Jul 16, 2024

> Because the WMF software looks for specific values of the Referer header. Presumably 'localhost' works because it is left in for local development. It seems better to put that, as a workaround, than en.wikipedia.org.

All WMF domains look like they are part of the regex, so why would a hack like "localhost" be better?

@kelson42 changed the title from "Set 'Referer' header" to "Set 'Referer' HTTP request header" on Jul 16, 2024
audiodude (Member, Author)

> All WMF domains look like they are part of the regex, so why would a hack like "localhost" be better?

It just seems better to put in a "dummy" value that is obviously a hack than a WMF domain, which is potentially misleading.

kelson42 (Collaborator) commented Jul 17, 2024

> It just seems better to put in a "dummy" value that is obviously a hack than a WMF domain, which is potentially misleading.

Not really convinced, but I guess you can argue it that way. I believe we should in any case have a small test validating this (a proper map download), so that next time we detect this kind of issue early and can fix it.

kelson42 (Collaborator)

@audiodude Maybe we could just extend saveArticles.test.ts around the London test, as https://en.wikipedia.org/wiki/London has a map?

audiodude (Member, Author)

> @audiodude Maybe we could just extend saveArticles.test.ts around the London test, as https://en.wikipedia.org/wiki/London has a map?

Added a test in downloader.test.ts. PTAL, thanks!
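(For reference, a regression test along these lines might look roughly like the sketch below. This is a minimal illustration only, not the actual test added in downloader.test.ts; the tile URL, the bare axios call, and the test names are assumptions.)

// Minimal sketch, assuming Jest and axios; the real test in downloader.test.ts may differ.
import axios from 'axios'

describe("'Referer' header workaround for Wikimedia Maps", () => {
  // Hypothetical tile URL following the maps.wikimedia.org/osm-intl/{z}/{x}/{y}.png scheme
  const tileUrl = 'https://maps.wikimedia.org/osm-intl/0/0/0.png'

  test('map tile downloads succeed when the Referer header is set', async () => {
    const resp = await axios.get(tileUrl, {
      responseType: 'arraybuffer',
      headers: { Referer: 'https://localhost/' },
    })
    expect(resp.status).toBe(200) // without the header the server answered 403 (see #2061)
    expect(resp.data.byteLength).toBeGreaterThan(0) // a non-empty tile image came back
  })
})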

kelson42 (Collaborator) left a comment

LGTM

@audiodude merged commit 6c919b0 into main on Jul 22, 2024
6 checks passed
@audiodude deleted the referer branch on July 22, 2024 at 15:04
- const mwResp = await axios(url, this.arrayBufferRequestOptions)
+ // The 'Referer' header is set to get around WMF domain origin restrictions.
+ // See: https://github.com/openzim/mwoffliner/issues/2061
+ const mwResp = await axios(url, { ...this.arrayBufferRequestOptions, headers: { Referer: 'https://localhost/' } })
bd808 commented:

Why not use the actual site/page you are scraping as the referrer instead of this fiction? That would make your scraper function the same way as any normal user agent consuming the wiki content, and allow you to avoid being seen as deliberately violating the hotlinking protections used on the Wikimedia content farm.

kelson42 (Collaborator) replied:

@bd808 That was pretty much my proposal; see the comments and responses above.
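(For context on the diff discussed above: the change spreads the downloader's existing arrayBufferRequestOptions and adds a 'Referer' header on top. The standalone sketch below shows the same pattern; the options object, the tile URL, and the function name are made up for illustration, assuming the 403 behaviour described in #2061.)

import axios, { AxiosRequestConfig } from 'axios'

// Hypothetical stand-in for this.arrayBufferRequestOptions in the Downloader class
const arrayBufferRequestOptions: AxiosRequestConfig = { responseType: 'arraybuffer' }

// Hypothetical Wikimedia Maps tile URL (the kind of resource that returned 403 in #2061)
const url = 'https://maps.wikimedia.org/osm-intl/0/0/0.png'

async function fetchMapTile(): Promise<void> {
  // Spreading the base options keeps responseType etc. intact while adding the
  // 'Referer' value accepted by the WMF hotlinking check. Note that the spread
  // replaces any 'headers' key already present in the base options.
  const mwResp = await axios(url, {
    ...arrayBufferRequestOptions,
    headers: { Referer: 'https://localhost/' },
  })
  console.log(mwResp.status) // 200 with the header; 403 without it, per #2061
}

fetchMapTile().catch((err) => console.error(err.message))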


Successfully merging this pull request may close these issues:

Wikimedia Maps HTTP 403 (acting like a bad bot) (#2061)