[gelbooru] Favorites API Issues #5220

Hawker2 · 2024-02-19T21:59:39Z

We've discussed this before (#4769), but the Gelbooru favorites API has been inconsistent in how it returns results, and sometimes doesn't even have the same behavior for different user ids. These are all interrelated, which is why I have it as one big issue; I can split these as makes sense for you.

Right now, gallery-dl assumes that results will be sorted by date added in ascending order, and that it can reach the end of the results (and newest posts) by jumping to pid = count / 100.

Sometimes the API sorts by the wrong column
Sometimes favorites are returned sorted by the wrong column (favorite/post id rather than date added). Not much can be done about this, but in theory gallery-dl could notice this odd situation and print an error.

Favorites API returns results descending vs ascending
Sometimes favorites are sorted in the opposite order than is expected (ascending vs descending). While descending makes the most sense - show the latest additions first - gallery-dl assumes ascending based on past API behavior. Ideally it could determine which way results are sorted and adapt appropriately. A nifty trick would be to change the initial API call that gets the post count (limit=1) and change it to limit=2, then compare the "added" fields. Depending on which is greater, you can tell which way the results are sorted (assuming they're sorting on the correct column). As a bonus, if results are descending then you can start with pid=0 rather than trying to locate the end of the results - which has its own issues.

Gelbooru's post count doesn't always align with the paginated results
There appear to be gaps of some kind in the favorites API results (likely due to deleted posts); the end result is there are sometimes more pages of results than a simple division would indicate. For example:

% curl -s "https://gelbooru.com:443/index.php?page=dapi&q=index&json=1&s=favorite&id=17627&limit=1&pid=0" | jq | grep count       
    "count": 23686

If I take them 10 at a time, then favorites should end on page 2368, right? Not so fast...

% curl -s "https://gelbooru.com:443/index.php?page=dapi&q=index&json=1&s=favorite&id=17627&limit=10&pid=2368" | jq | grep favorite
  "favorite": [
      "favorite": 3872373,
      "favorite": 3872355,
      "favorite": 3866539,
      "favorite": 3860237,
      "favorite": 3860064,
      "favorite": 3860045,
      "favorite": 3860035,
      "favorite": 3860034,
      "favorite": 3860030,
      "favorite": 3860020,
% curl -s "https://gelbooru.com:443/index.php?page=dapi&q=index&json=1&s=favorite&id=17627&limit=10&pid=2369" | jq | grep favorite
  "favorite": [
      "favorite": 3859911,
      "favorite": 3859581,
      "favorite": 3858632,
      "favorite": 3858091,
      "favorite": 3855734,
      "favorite": 3852730,
      "favorite": 3852432,
      "favorite": 3852413,
      "favorite": 3851028,
      "favorite": 3850872,
% curl -s "https://gelbooru.com:443/index.php?page=dapi&q=index&json=1&s=favorite&id=17627&limit=10&pid=2370" | jq | grep favorite
  "favorite": [
      "favorite": 3848281,
      "favorite": 3847866,
      "favorite": 3846794,
      "favorite": 3846793,
      "favorite": 3846086,
      "favorite": 3845636,
      "favorite": 3844086,
      "favorite": 3843901,
      "favorite": 3843434,
      "favorite": 3843034,
% curl -s "https://gelbooru.com:443/index.php?page=dapi&q=index&json=1&s=favorite&id=17627&limit=10&pid=2371" | jq | grep favorite
  "favorite": [
      "favorite": 3842758,
      "favorite": 3842359,

A post count of 23686, taken 10 posts at a time with zero-indexing, would be assumed to finish on page 2368 (this is what gallery-dl does, except it uses pages of the API limit of 100 - as it should). But as you can see above, the actual end of results was on page 2371! And indeed even if sorting were working as expected, gallery-dl would only check out to page 236, and miss the last dozen favorites above.

One way of fixing this would be checking the number of results from the last page, and if you have API_LIMIT results returned, check the additional pages until you have a number of results that does not equal API_LIMIT (might be zero as an edge case of perfect alignment). With my suggestion above of checking the sort direction, this means this check might have to be taken either at the beginning (trying to find the end of posts to start with for an ascending sort) or at the end (proceeding all the way to the end in a descending sort).

EDIT: I thought more about this, and I think you'd actually have to go until you get zero results. It's possible to get less than API_LIMIT results - but still have more results remaining - if deleted posts are in the current result set.

gallery-dl still makes a lot of API calls even with use of an archive file
While working on troubleshooting this, I noticed that even when everything has been downloaded and is in an archive file, gallery-dl is still making 2 API calls per already-downloaded post. For instance:

[urllib3.connectionpool][debug] https://gelbooru.com:443 "GET /index.php?page=dapi&q=index&json=1&id=3850872&s=post&api_key=[redacted]&user_id=17627 HTTP/1.1" 200 None
[urllib3.connectionpool][debug] https://gelbooru.com:443 "GET /index.php?page=post&s=view&id=3850872 HTTP/1.1" 200 None

I assume this is because it's trying to download the metadata before checking the archive (or I need to add a separate archive file for metadata?) and because it needs to download the post to get the image URL and check that against the archive. To be clear, it's not downloading actual images that are already on disk; just extra API calls. However, ideally it should be able to tell that a given post ID is already in the archive just from that ID and preempt these API calls completely.

So yeah... Gelbooru's favorite API is difficult. It's not consistent, it's not paginated as would be expected, and it can end up expensive in terms of API calls. For instance, with how things stand right now with my 23686 favorites, I estimate it would take 47610 API calls (two per post, 237 pages of results, and 1 to determine initial count) before gallery-dl would even start seeing new posts to download, even with archive databases. The root cause is the inconsistent API results of course, but it would be nice if the extra API calls per post could be eliminated. Then I could churn through favorites results regardless of how they're being returned by the API.

The text was updated successfully, but these errors were encountered:

mikf · 2024-02-20T14:38:28Z

I assume this is because it's trying to download the metadata before checking the archive (or I need to add a separate archive file for metadata?) and because it needs to download the post to get the image URL and check that against the archive

Correct. This is one of gallery-dl's many flaws in its internal design. It has to gather all metadata first before it can check whether or not a file has already been downloaded.

You should, however, be able to manually skip over a concrete number of files without them causing any API calls by using --range N-, where N is the index of the first file gallery-dl will fully check and download.

Hawker2 · 2024-02-20T17:57:36Z

Correct. This is one of gallery-dl's many flaws in its internal design. It has to gather all metadata first before it can check whether or not a file has already been downloaded.

Makes sense. You could special case something to just make it work, but then you end up over time with an unmanageable number of special cases. Been there. :) And of course, this isn't an issue if gallery-dl can quickly reach the new posts and hit a skip threshold to stop sanely.

You should, however, be able to manually skip over a concrete number of files without them causing any API calls by using --range N-, where N is the index of the first file gallery-dl will fully check and download.

That's what I ended up doing, but then that means managing a running target for the range start. But it did resolve the issue for the moment - thank you.

At least under the current API behavior, detecting the sort direction would have the greatest impact. A sort in the opposite order than what's intended is always one of the most pathological cases for any algorithm.

9696neko · 2024-03-16T22:54:29Z

I also saw the out of order/wrong column problem and commented in #4769.

mikf · 2024-03-19T02:25:38Z

Gelbooru seems to be returning favorites in descending fav order now, at least it does for me for all accounts.

I've finally implemented the mentioned auto-order detection so that it'll always return favorites in descending / newest first order (0d69af9) at least as long as favorites are sorted by favorite ID / date added.

When not having to reverse favorite order, like it is right now, gsllery-dl will now only stop when the list of returned favorites is empty and no longer when the reported number of favs is reached.

I've also added a date_favorited metadata field (6d93295).

* save cookies to tempfile, then rename avoids wiping the cookies file if the disk is full * [deviantart:stash] fix 'index' metadata (mikf#5335) * [deviantart:stash] recognize 'deviantart.com/stash/…' URLs * [gofile] fix extraction * [kemonoparty] add 'revision_count' metadata field (mikf#5334) * [kemonoparty] add 'order-revisions' option (mikf#5334) * Fix imagefap extrcator * [twitter] add 'birdwatch' metadata field (mikf#5317) should probably get a better name, but this is what it's called internally by Twitter * [hiperdex] update URL patterns & fix 'manga' metadata (mikf#5340) * [flickr] add 'contexts' option (mikf#5324) * [tests] show full path for nested values 'user.name' instead of just 'name' when testing for "user": { … , "name": "…", … } * [bluesky] add 'instance' metadata field (mikf#4438) * [vipergirls] add 'like' option (mikf#4166) * [vipergirls] add 'domain' option (mikf#4166) * [gelbooru] detect returned favorites order (mikf#5220) * [gelbooru] add 'date_favorited' metadata field * Update fapello.py get fullsize image instead resized * fapello.py Fullsize image by remove ".md" and ".th" in image url, it will download fullsize of images * [formatter] fix local DST datetime offsets for ':O' 'O' would get the *current* local UTC offset and apply it to all 'datetime' objects it gets applied to. This would result in a wrong offset if the current offset includes DST and the target 'datetime' does not or vice-versa. 'O' now determines the correct local UTC offset while respecting DST for each individual 'datetime'. * [subscribestar] fix 'date' metadata * [idolcomplex] support new pool URLs * [idolcomplex] fix metadata extraction - replace legacy 'id' vales with alphanumeric ones, since the former are no longer available - approximate 'vote_average', since the real value is no longer available - fix 'vote_count' * [bunkr] remove 'description' metadata album descriptions are no longer available on album pages and the previous code erroneously returned just '0' * [deviantart] improve 'index' extraction for stash files (mikf#5335) * [kemonoparty] fix exception for '/revision/' URLs caused by 03a9ce9 * [steamgriddb] raise proper exception for deleted assets * [tests] update extractor results * [pornhub:gif] extract 'viewkey' and 'timestamp' metadata (mikf#4463) mikf#4463 (comment) * [tests] use 'datetime.timezone.utc' instead of 'datetime.UTC' 'datetime.UTC' was added in Python 3.11 and is not defined in older versions. * [gelbooru] add 'order-posts' option for favorites (mikf#5220) * [deviantart] handle CloudFront blocks in general (mikf#5363) This was already done for non-OAuth requests (mikf#655) but CF is now blocking OAuth API requests as well. * release version 1.26.9 * [kemonoparty] fix KeyError for empty files (mikf#5368) * [twitter] fix pattern for single tweet (mikf#5371) - Add optional slash - Update tests to include some non-standard tweet URLs * [kemonoparty:favorite] support 'sort' and 'order' query params (mikf#5375) * [kemonoparty] add 'announcements' option (mikf#5262) mikf#5262 (comment) * [wikimedia] suppress exception for entries without 'imageinfo' (mikf#5384) * [docs] update defaults of 'sleep-request', 'browser', 'tls12' * [docs] complete Authentication info in supportedsites.md * [twitter] prevent crash when extracting 'birdwatch' metadata (mikf#5403) * [workflows] build complete docs Pages only on gdl-org/docs deploy only docs/oauth-redirect.html on mikf.github.io/gallery-dl * [docs] document 'actions' (mikf#4543) or at least attempt to * store 'match' and 'groups' in Extractor objects * [foolfuuka] improve 'board' pattern & support pages (mikf#5408) * [reddit] support comment embeds (mikf#5366) * [build] add minimal pyproject.toml * [build] generate sdist and wheel packages using 'build' module * [build] include only the latest CHANGELOG entries The CHANGELOG is now at a size where it takes up roughly 50kB or 10% of an sdist or wheel package. * [oauth] use Extractor.request() for HTTP requests (mikf#5433) Enables using proxies and general network options. * [kemonoparty] fix crash on posts with missing datetime info (mikf#5422) * restore LD_LIBRARY_PATH for PyInstaller builds (mikf#5421) * remove 'contextlib' imports * [pp:ugoira] log errors for general exceptions * [twitter] match '/photo/' Tweet URLs (mikf#5443) fixes regression introduced in 40c0553 * [pp:mtime] do not overwrite '_mtime' for None values (mikf#5439) * [wikimedia] fix exception for files with empty 'metadata' * [wikimedia] support wiki.gg wikis * [pixiv:novel] add 'covers' option (mikf#5373) * [tapas] add 'creator' extractor (mikf#5306) * [twitter] implement 'relogin' option (mikf#5445) * [docs] update docs/configuration links (mikf#5059, mikf#5369, mikf#5423) * [docs] replace AnchorJS with custom script use it in rendered .rst documents as well as in .md ones * [text] catch general Exceptions * compute tempfile path only once * Add warnings flag This commit adds a warnings flag It can be combined with -q / --quiet to display warnings. The intent is to provide a silent option that still surfaces warning and error messages so that they are visible in logs. * re-order verbose and warning options * [gelbooru] improve pagination logic for meta tags (mikf#5478) similar to 494acab * [common] add Extractor.input() method * [twitter] improve username & password login procedure (mikf#5445) - handle more subtasks - support 2FA - support email verification codes * [common] update Extractor.wait() message format * [common] simplify 'status_code' check in Extractor.request() * [common] add 'sleep-429' option (mikf#5160) * [common] fix NameError in Extractor.request() … when accessing 'code' after an requests exception was raised. Caused by the changes in 566472f * [common] show full URL in Extractor.request() error messages * [hotleak] download files with 404 status code (mikf#5395) * [pixiv] change 'sanity_level' debug message to a warning (mikf#5180) * [twitter] handle missing 'expanded_url' fields (mikf#5463, mikf#5490) * [tests] allow filtering extractor result tests by URL or comment python test_results.py twitter:+/i/web/ python test_results.py twitter:~twitpic * [exhentai] detect CAPTCHAs during login (mikf#5492) * [output] extend 'output.colors' (mikf#2566) allow specifying ANSI colors for all loglevels (debug, info, warning, error) * [output] enable colors by default * add '--no-colors' command-line option --------- Co-authored-by: Luc Ritchie <[email protected]> Co-authored-by: Mike Fährmann <[email protected]> Co-authored-by: Herp <[email protected]> Co-authored-by: wankio <[email protected]> Co-authored-by: fireattack <[email protected]> Co-authored-by: Aidan Harris <[email protected]>

mikf added external-issue site:bug site:enhancement labels Feb 20, 2024

mikf mentioned this issue Feb 25, 2024

Enabling [booru].tags and [booru].notes trigger their requests even if the file is skipped #5238

Closed

mikf added a commit that referenced this issue Mar 18, 2024

[gelbooru] detect returned favorites order (#5220)

0d69af9

mikf added a commit that referenced this issue Mar 23, 2024

[gelbooru] add 'order-posts' option for favorites (#5220)

31e7ca7

mikf closed this as completed Jun 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[gelbooru] Favorites API Issues #5220

[gelbooru] Favorites API Issues #5220

Hawker2 commented Feb 19, 2024 •

edited

Loading

mikf commented Feb 20, 2024

Hawker2 commented Feb 20, 2024 •

edited

Loading

9696neko commented Mar 16, 2024

mikf commented Mar 19, 2024 •

edited

Loading

[gelbooru] Favorites API Issues #5220

[gelbooru] Favorites API Issues #5220

Comments

Hawker2 commented Feb 19, 2024 • edited Loading

mikf commented Feb 20, 2024

Hawker2 commented Feb 20, 2024 • edited Loading

9696neko commented Mar 16, 2024

mikf commented Mar 19, 2024 • edited Loading

Hawker2 commented Feb 19, 2024 •

edited

Loading

Hawker2 commented Feb 20, 2024 •

edited

Loading

mikf commented Mar 19, 2024 •

edited

Loading