[Feature] Autotag Optimizations #2366

Closed
kermieisinthehouse opened this issue Mar 6, 2022 · 1 comment · Fixed by #2439
Labels
bounty This issue has a bounty on it in the OpenCollective feature request feature Pull requests that add a new feature

Comments

@kermieisinthehouse
Collaborator

Autotag performance currently leaves me unable to run autotag on my image library (just under 14 million images).

I'm requesting some specific optimizations so that autotag can be greatly sped up. Note that the old bulk autotag implementation (SQLite regex based) was much faster, but it was less configurable and used more memory.

In scope:

  • A checkbox for tags / performers / studios that enables / disables eligibility to autotag. This can reduce the search space considerably AND allow people to run autotag without worrying about pollution of e.g. single name performers.
  • Memoization of compiled RE objects during an autotag run to save CPU
  • Folding all alias regexes into a single regex: all aliases map to the same underlying ID, and assuming they are somewhat similar, we can check them all in one call to the regex engine. We can generate the matching regex for each alias, then concatenate them into a master regex of the form "REGEX1|REGEX2|REGEX3|...". The final compiled object should handle this well, especially if the aliases share substrings, like "example tag phrase" and "example tag phrases".
  • A dedicated query function for the autotag task that doesn't sort by title. On large collections, SQLite sorts use a huge amount of CPU time even though the order doesn't matter here. The internal ROWID order is consistent enough.
  • Indexes for images / scenes tables as necessary for new query
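
To make the memoization and alias-folding items concrete, here is a minimal Go sketch. It is not the existing autotag code: `regexCache`, `foldAliases`, and the word-separator rule are hypothetical names and simplifications chosen for illustration, and only the standard `regexp`, `strings`, and `sync` packages are used.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
)

// regexCache memoizes compiled patterns for the lifetime of one autotag run.
// (Hypothetical helper, not part of the current codebase.)
type regexCache struct {
	mu    sync.Mutex
	cache map[string]*regexp.Regexp
}

func newRegexCache() *regexCache {
	return &regexCache{cache: make(map[string]*regexp.Regexp)}
}

// get compiles pattern on first use and returns the cached object afterwards.
func (c *regexCache) get(pattern string) (*regexp.Regexp, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if re, ok := c.cache[pattern]; ok {
		return re, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	c.cache[pattern] = re
	return re, nil
}

// foldAliases builds one case-insensitive alternation from a name and its
// aliases. The separator and boundary rules are assumptions for this sketch.
func foldAliases(names []string) string {
	parts := make([]string, 0, len(names))
	for _, n := range names {
		words := strings.Fields(n)
		if len(words) == 0 {
			continue // skip empty names so the alternation never matches ""
		}
		for i, w := range words {
			words[i] = regexp.QuoteMeta(w)
		}
		// Allow whitespace, dot, underscore or dash between words.
		parts = append(parts, strings.Join(words, `[\s._-]*`))
	}
	// Require a non-alphanumeric character (or string edge) on both sides.
	return `(?i)(?:^|[^a-z0-9])(?:` + strings.Join(parts, "|") + `)(?:$|[^a-z0-9])`
}

func main() {
	cache := newRegexCache()
	pattern := foldAliases([]string{"example tag phrase", "example tag phrases"})

	re, err := cache.get(pattern)
	if err != nil {
		panic(err)
	}
	for _, path := range []string{
		"/media/Example.Tag.Phrases/img001.jpg",
		"/media/counterexample tag phrase.jpg",
	} {
		fmt.Println(path, re.MatchString(path))
	}
}
```

With this shape, each tag, performer, or studio costs one compile per run and one match call per path, regardless of how many aliases it has. Whether the combined pattern is actually faster than separate ones would need to be measured on a real library.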

Not in scope: I am unsure of how much time the tag narrowing strategy currently saves. Is tokenizing the string and querying the database really faster than just naively doing all regex comparisons, especially if we only compile them once during task lifetime?

This was partially started in #1927, but it is not necessary to reuse any of it.

I am willing to put a decent bounty on this issue, as it is somewhat large and very useful to me.

@kermieisinthehouse kermieisinthehouse added feature Pull requests that add a new feature feature request labels Mar 6, 2022
@kermieisinthehouse kermieisinthehouse added this to the "Soon" milestone Mar 6, 2022
@WithoutPants
Collaborator

Bounty placed for $251.

@WithoutPants WithoutPants added the bounty This issue has a bounty on it in the OpenCollective label Mar 7, 2022
@WithoutPants WithoutPants modified the milestones: "Soon", Version 0.14.0 Mar 7, 2022
2 participants