Use scraped site as referrer rather than a "hack" #2067

bd808 · 2024-07-22T17:14:49Z

Why not use the actual site/page you are scraping as the referrer instead of this fiction? That would make your scraper function in the same way as any normal user-agent consuming the wiki content and allow you to avoid being seen as deliberately violating the hotlinking protections used on the Wikimedia content farm.
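For illustration, here is a minimal sketch of what sending the scraped page as the referrer could look like. The helper name `refererHeadersFor` and the example URL are hypothetical and are not taken from mwoffliner's code; the only behavior shown is deriving a `Referer` header value from the URL of the page being scraped.

```typescript
// Hypothetical helper (not part of mwoffliner): build request headers whose
// Referer is the wiki page currently being scraped.
function refererHeadersFor(scrapedPageUrl: string): Record<string, string> {
  const url = new URL(scrapedPageUrl);
  // Browsers do not include the fragment in the Referer header, so drop it.
  url.hash = '';
  return { Referer: url.toString() };
}

// Example usage with an assumed page URL.
const exampleHeaders = refererHeadersFor('https://en.wikipedia.org/wiki/Paris#History');
console.log(exampleHeaders.Referer);
```

The resulting headers object could then be passed to whatever HTTP client fetches the map tiles, so the request looks like one made by a browser viewing that page.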

Originally posted by @bd808 in #2062 (comment)

kelson42 · 2024-07-22T18:19:54Z

@audiodude I guess you already know my opinion on this! ;)

audiodude · 2024-07-22T19:17:29Z

Sorry, I was confused because mwoffliner scrapes many wikis, not just WMF ones.

But I suppose that the only wikis with map tiles will be the ones that are hosted on valid WMF domains, so this could definitely work.

bd808 · 2024-07-22T19:51:00Z

> Sorry, I was confused because mwoffliner scrapes many wikis, not just WMF ones.
>
> But I suppose that the only wikis with map tiles will be the ones that are hosted on valid WMF domains, so this could definitely work.

If anyone is using mwoffliner to scrape non-Wikimedia wikis and also attempting to consume map tiles from the Wikimedia tile server as part of that scraping, they should get the 403 rejection unless their hosting is at an authorized domain. This is the point of using "the very poor protection" of checking the HTTP referrer. We want the Wikimedia movement to be able to make use of the Wikimedia tile server, but we also have limited resources to devote to that tile server and cannot scale it to serving map tiles for any and all on the Internet.
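As a rough sketch of the kind of referrer check described here, the following shows an allowlist test that yields 403 for anything outside authorized domains. Everything in it is an assumption for illustration: the domain list, the function name `statusForReferer`, and the choice to reject requests with a missing or malformed header are hypothetical and do not describe the actual tile server configuration.

```typescript
// Assumption: an illustrative allowlist, NOT the real tile server's list.
const ALLOWED_DOMAINS = ['wikipedia.org', 'wikimedia.org'];

// Hypothetical check: return the HTTP status a request would receive
// based only on its Referer header.
function statusForReferer(referer: string | undefined): number {
  // Assumption: a missing header is rejected in this sketch.
  if (!referer) return 403;

  let host: string;
  try {
    host = new URL(referer).hostname;
  } catch {
    // A Referer that is not a valid absolute URL is rejected.
    return 403;
  }

  // Accept the domain itself or any subdomain of it, but not look-alikes
  // such as "evilwikipedia.org".
  const allowed = ALLOWED_DOMAINS.some(
    (domain) => host === domain || host.endsWith('.' + domain),
  );
  return allowed ? 200 : 403;
}
```

Under these assumptions, a request whose referrer is a page on an allowlisted wiki passes, while one coming from an unrelated wiki gets the 403.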

The currently implemented solution looks exactly like a bad faith actor deciding to circumvent the loose protections we have implemented. You read technical details about the service that we published in good faith to provide transparency for the Wikimedia community and then weaponized them against the very project that y'all claim to be attempting to advance. It is not a good look.

audiodude · 2024-07-22T20:00:16Z

> > Sorry, I was confused because mwoffliner scrapes many wikis, not just WMF ones.
> > But I suppose that the only wikis with map tiles will be the ones that are hosted on valid WMF domains, so this could definitely work.
>
> If anyone is using mwoffliner to scrape non-Wikimedia wikis and also attempting to consume map tiles from the Wikimedia tile server as part of that scraping, they should get the 403 rejection unless their hosting is at an authorized domain. This is the point of using "the very poor protection" of checking the HTTP referrer. We want the Wikimedia movement to be able to make use of the Wikimedia tile server, but we also have limited resources to devote to that tile server and cannot scale it to serving map tiles for any and all on the Internet.

Yes, exactly. That's the original point that I missed: we shouldn't expect map tiles to show up on, say, the Minecraft Wiki, and if they do, we shouldn't expect them to work.

> The currently implemented solution looks exactly like a bad faith actor deciding to circumvent the loose protections we have implemented. You read technical details about the service that we published in good faith to provide transparency for the Wikimedia community and then weaponized them against the very project that y'all claim to be attempting to advance. It is not a good look.

To be fair, our discussion does indicate that we saw the current solution as a temporary workaround while we sought to obtain the proper permissions. I think you should consider our opening of the Phabricator ticket and bringing the issue to light to be a good faith effort towards that goal.

bd808 · 2024-07-23T20:16:12Z

Thanks for the quick attention folks.

I'm also sorry if I was overly aggressive in my criticisms of the original workaround. My bad days shouldn't be everyone else's problem.

bd808 mentioned this issue Jul 22, 2024

Wikimedia Maps HTTP 403 (acting like a bad bot) #2061

Closed

kelson42 assigned audiodude Jul 22, 2024

kelson42 added enhancement question labels Jul 22, 2024

kelson42 added this to the 1.15.0 milestone Jul 22, 2024

audiodude modified the milestones: 1.15.0, 1.14.0 Jul 23, 2024

audiodude mentioned this issue Jul 23, 2024

Use wiki href as proper Referer header #2068

Merged

kelson42 closed this as completed in #2068 Jul 23, 2024
