Special handling for known websites (WP, youtube, ted, etc) #33
No, we've discussed that a while back and apparently we did not create a ticket, but the idea was to have a list of known websites for which we refuse requests and display a message explaining where to find the already-existing ZIMs. Switching scraper is not practical for many reasons; mainly because we have no
Sounds good to me and that was the main point, but then the response message should identify the target and the corresponding ZIM (e.g. "here is the link to en.wikipedia.org's latest ZIM available", not "go to download.kiwix.org/zim and figure it out").
Ideally, yes. It can probably be implemented in two steps so that this gets a chance to be done. At first, we can redirect to the wiki where the files are listed. Or maybe the library with the new kiwix-serve is considered easy enough? The first thing you can do is list the domains and where to point to. It's easy for those we have a category for.
This would have my preference by far, but when I look at the domains requested over the past three months (and this doc), I think we can simply send them to wikipedia_en_all.zim.
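The "list of known websites" idea above could be sketched as a simple lookup that refuses a crawl request and points at an existing ZIM instead. A minimal sketch, assuming a hypothetical `check_request` helper; the domain list and download URLs are illustrative, not the actual zimfarm configuration:

```python
from urllib.parse import urlparse

# Illustrative mapping of known domains to existing ZIM listings
# (assumed values, not the real deny-list).
KNOWN_ZIMS = {
    "en.wikipedia.org": "https://download.kiwix.org/zim/wikipedia/",
    "en.wikibooks.org": "https://download.kiwix.org/zim/wikibooks/",
    "www.youtube.com": "https://download.kiwix.org/zim/other/",
}

def check_request(url: str):
    """Return a refusal message pointing at the existing ZIM, or None."""
    host = urlparse(url).netloc.lower()
    if host in KNOWN_ZIMS:
        return (f"A ZIM for {host} already exists; "
                f"see {KNOWN_ZIMS[host]} for the latest file.")
    return None
```

This matches the two-step plan: the message can first link to a generic listing, and later be refined to point at the exact latest file per target.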
We could have a ZIM metadata "source_url" and then allow library.kiwix.org to filter on it?
Yes, that's an interesting feature for which the default behavior might be tricky: how much matching do you want? Domain? Netloc? Path? Scheme? But yeah, that would be best for us.
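The matching-granularity question above can be made concrete with a small sketch. Assuming a hypothetical `matches` helper (not part of any existing kiwix API), each level compares a different slice of the URL:

```python
from urllib.parse import urlparse

def matches(source_url: str, query_url: str, level: str = "netloc") -> bool:
    """Compare two URLs at a chosen granularity: scheme, domain, netloc, or path."""
    s, q = urlparse(source_url), urlparse(query_url)
    if level == "scheme":
        return s.scheme == q.scheme
    if level == "domain":
        # naive registrable-domain comparison: last two labels only
        return s.netloc.split(".")[-2:] == q.netloc.split(".")[-2:]
    if level == "netloc":
        return s.netloc == q.netloc
    if level == "path":
        return s.netloc == q.netloc and q.path.startswith(s.path)
    raise ValueError(f"unknown level: {level}")
```

For example, en.wikipedia.org and fr.wikipedia.org match at the "domain" level but not at "netloc", which is exactly the default-behavior choice the comment raises.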
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
I see that almost every day (and certainly several times a week) people are submitting requests for Wikipedia, Wikibooks or even YouTube.
Zimit should be able to either a) switch gears and run the corresponding scraper (YouTube), or b) directly offer the latest ZIM available (Wikipedia, Wikibooks).