Use scraped site as referrer rather than a "hack" #2067

Closed
bd808 opened this issue Jul 22, 2024 · 5 comments · Fixed by #2068

Comments

@bd808

bd808 commented Jul 22, 2024

Why not use the actual site/page you are scraping as the referrer instead of this fiction? That would make your scraper behave like any normal user-agent consuming the wiki content, and it would keep you from being seen as deliberately violating the hotlinking protections used on the Wikimedia content farm.

Originally posted by @bd808 in #2062 (comment)
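
A minimal sketch of the suggestion above, assuming a hypothetical helper named `tileRequestHeaders` (it is not taken from mwoffliner's code): when the scraper fetches a resource embedded in a page, it sends the URL of that page as the `Referer` header instead of a made-up value. The article URL in the usage note is illustrative only.

```typescript
// Hypothetical helper, not mwoffliner's actual implementation.
// Builds request headers whose Referer is the page currently being scraped.
function tileRequestHeaders(scrapedPageUrl: string): Record<string, string> {
  // Drop any #fragment; fragments are not sent as part of a Referer value.
  const withoutFragment = scrapedPageUrl.split("#")[0];
  return { Referer: withoutFragment };
}
```

For example, `tileRequestHeaders("https://en.wikipedia.org/wiki/Paris#History")` yields a `Referer` of `https://en.wikipedia.org/wiki/Paris`, which could then be passed along with the map tile request.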

@kelson42
Collaborator

@audiodude I guess you already know my opinion on this! ;)

@audiodude
Member

Sorry, I was confused because mwoffliner scrapes many wikis, not just WMF ones.

But I suppose that the only wikis with map tiles will be the ones that are hosted on valid WMF domains, so this could definitely work.

@bd808
Author

bd808 commented Jul 22, 2024

> Sorry, I was confused because mwoffliner scrapes many wikis, not just WMF ones.
>
> But I suppose that the only wikis with map tiles will be the ones that are hosted on valid WMF domains, so this could definitely work.

If anyone is using mwoffliner to scrape non-Wikimedia wikis and also attempting to consume map tiles from the Wikimedia tile server as part of that scraping, they should get the 403 rejection unless their hosting is at an authorized domain. This is the point of using "the very poor protection" of checking the HTTP referrer. We want the Wikimedia movement to be able to make use of the Wikimedia tile server, but we also have limited resources to devote to that tile server and cannot scale it to serve map tiles to anyone and everyone on the Internet.

The currently implemented solution looks exactly like a bad faith actor deciding to circumvent the loose protections we have implemented. You read technical details about the service that we published in good faith to provide transparency for the Wikimedia community and then weaponized them against the very project that y'all claim to be attempting to advance. It is not a good look.
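
To make the referrer check described above concrete, here is a rough sketch of how such a check could behave. Everything in it is an assumption for illustration: the function name `statusForReferer`, the `ALLOWED_DOMAINS` list, and the choice to reject requests with a missing referrer are hypothetical and do not describe the real tile server's configuration.

```typescript
// Illustrative allow-list only; the tile server's real list is not given in this thread.
const ALLOWED_DOMAINS: string[] = ["wikipedia.org", "wikimedia.org"];

// Returns the HTTP status a referrer-checking server might answer with:
// 200 for an authorized referring host, 403 otherwise.
function statusForReferer(referer: string | undefined): number {
  // Assumption: a missing Referer header is treated as unauthorized.
  if (!referer) return 403;
  // Extract the host; URLs with userinfo ("user@host") are rejected by this simple pattern.
  const match = /^https?:\/\/([^/:?#@]+)(?::\d+)?(?:[/?#]|$)/i.exec(referer);
  if (!match) return 403;
  const host = match[1].toLowerCase();
  // Accept the domain itself or any subdomain of it.
  const allowed = ALLOWED_DOMAINS.some(
    (domain) => host === domain || host.endsWith("." + domain)
  );
  return allowed ? 200 : 403;
}
```

Under these assumptions, a request referred from a page on an allow-listed domain is served, while a request referred from any other wiki receives the 403.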

@audiodude
Member

> > Sorry, I was confused because mwoffliner scrapes many wikis, not just WMF ones.
> > But I suppose that the only wikis with map tiles will be the ones that are hosted on valid WMF domains, so this could definitely work.
>
> If anyone is using mwoffliner to scrape non-Wikimedia wikis and also attempting to consume map tiles from the Wikimedia tile server as part of that scraping they should get the 403 rejection unless their hosting is at an authorized domain. This is the point of using "the very poor protection" of checking the HTTP referrer. We want the Wikimedia movement to be able to make use of the Wikimedia tile server, but we also have limited resources to devote to that tile server and cannot scale it to serving map tiles for any and all on the Internet.

Yes, exactly. That's the original point that I missed: we shouldn't expect map tiles to show up on, say, Minecraft Wiki, and if they do, we shouldn't expect them to work.

> The currently implemented solution looks exactly like a bad faith actor deciding to circumvent the loose protections we have implemented. You read technical details about the service that we published in good faith to provide transparency for the Wikimedia community and then weaponized them against the very project that y'all claim to be attempting to advance. It is not a good look.

To be fair, our discussion does indicate that we saw the current solution as a temporary workaround while we sought to obtain the proper permissions. I think you should consider our opening of the Phabricator ticket and bringing light to the issue to be a good faith effort towards that goal.

@bd808
Author

bd808 commented Jul 23, 2024

Thanks for the quick attention, folks.

I'm also sorry if I was overly aggressive in my criticisms of the original workaround. My bad days shouldn't be everyone else's problem.
