-
Notifications
You must be signed in to change notification settings - Fork 374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: increase file scanning performance #486
Conversation
tagstudio/src/core/library.py
Outdated
if (ext not in ext_set and self.is_exclude_list) or ( | ||
ext in ext_set and not self.is_exclude_list | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as I left for myself in #494:
maybe we should add all files into the library (no matter if they are in the ignore list), and only filter them when querying the library. Because the ignore list can eventually change, and it's not intuitive that you have to rescan the directory once again.
tagstudio/src/core/library.py
Outdated
if (ext not in ext_set and self.is_exclude_list) or ( | ||
ext in ext_set and not self.is_exclude_list | ||
): | ||
if f.suffix.lower() not in self.ext_list and self.is_exclude_list: | ||
self.dir_file_count += 1 | ||
file = f.relative_to(self.library_dir) | ||
if file not in self.filename_to_entry_id_map: | ||
self.files_not_in_library.append(file) | ||
elif f.suffix.lower() in self.ext_list and not self.is_exclude_list: | ||
# If the file/path is not mapped in the Library: | ||
if self.filename_to_entry_id_map.get(f) is None: | ||
# Discard paths and files in unwanted folders. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd switch the logic here to avoid unnecessary code nesting. Instead of:
if thing:
if another_thing:
if yet_another_thing:
do_thing()
you can have
if not thing:
continue
if not another_thing:
continue
if not yet_another_thing:
continue
do_thing()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've now flattened out this block by using continue
breaks, however there's just one quirk...
This check in the middle of the block,
if self.filename_to_entry_id_map.get(f) is not None:
self.dir_file_count += 1
continue
was formerly part of an if-else
and serves as an early exit for the loop. I have this placed before the f.is_dir()
and other checks because performing those on files that are already counted in the library is both unnecessary and a performance killer (around 20x slower in my testing). As a result I've had to move the yield
to the top of the for
loop so it'll always get triggered each iteration.
It comes off as a bit of a code smell to me, so maybe this case is better off as a traditional conditional. Let me know what you think and I can revert it if needed. I'm also ready to cut the exclude list step depending on where we land on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you feel like you're going an extra mile just to remove the if-else, then it's probably not worth it in that particular case. But in case of if
without any else
it's usually the better option.
When I was touching this particular piece code on main
branch, I was thinking it could be easier to have loop counter, and yield
like every 20 files or so instead of keeping and resetting an extra variable:
for i, f in enumerate(self.library_dir.glob("**/*")):
...
if not i % 20:
yield self.dir_file_count
I'll leave it up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you feel like you're going an extra mile just to remove the if-else, then it's probably not worth it in that particular case. But in case of if without any else it's usually the better option.
Sounds good, I may leave it be in this case since it's a terminal branch after all but I'm better equipped for avoiding nested if x, if y, if z
, etc in the future.
When I was touching this particular piece code on main branch, I was thinking it could be easier to have loop counter, and yield like every 20 files or so instead of keeping and resetting an extra variable
The intention behind the local timers for the yield here is to throttle the number updates per second, so that the calling function doesn't find itself overwhelmed at any point in time - in this case the Qt progress bar. I found that 1/30 of a second was around the point where the number of draw calls for the progress bar no longer added additional overhead to the loop, and any further throttling had little to no impact.
The issue with only yielding every x files is that this rate becomes bound to how quickly the files are scanned, rather than what can be reasonably processed by a UI. If files scan very quickly due to better hardware or new optimizations, then this magic number would now be adding additional overhead with too many draw calls at once. And if the scanning is slower, then the magic number is overcompensating with a slow and inaccurate readout. On my machine and test library I needed to increase the magic number in the modulus operation to around 3,000 before it stopped negatively impacting performance in comparison, at which point the yield readout becomes much less useful. Not to mention that increasing it further did not offer any performance improvements over the timer assignments.
This pull request significantly increases file scanning performance, especially for rescanning libraries and for libraries with a large file to folder ratio.
Using a test library with around ~200k files (~100k filtered out via the exclusion list) on a standard 5400 RPM HDD, and found an average increase in library rescanning times of around 23x:
Much of the existing scanning code was either in an unideal order, could have been deferred elsewhere, or was straight up unnecessary. The most impactful changes I made were:
f.parts
andf.is_dir()
checks for files that weren't already mapped in the library.relative_to()
method entirely by including thelibrary_dir
inside thefilename_to_entry_id_map
keys themselves. Most of the code accessing these keys had to tack on thelibrary_dir
anyway, so this improves performance in those areas as well.Some of these techniques can be ported to
main
, however I started with the v9.4 branch since I was more familiar with this codebase and figured it would be a nice inclusion for a patch.