feat: increase file scanning performance #486

CyanVoxel · 2024-09-10T08:07:18Z

This pull request significantly increases file scanning performance, especially for rescanning libraries and for libraries with a large file to folder ratio.

Using a test library of roughly 200k files (about 100k of them filtered out via the exclusion list) on a standard 5400 RPM HDD, I found that library rescans ran about 23x faster on average:

v9.4.0:
[LIBRARY]: Scanned directories in 88.160 seconds
[LIBRARY]: Scanned directories in 87.398 seconds
[LIBRARY]: Scanned directories in 90.164 seconds
[LIBRARY]: Scanned directories in 88.031 seconds
[LIBRARY]: Scanned directories in 87.657 seconds

Optimized:
[LIBRARY]: Scanned directories in 3.823 seconds
[LIBRARY]: Scanned directories in 3.796 seconds
[LIBRARY]: Scanned directories in 3.771 seconds
[LIBRARY]: Scanned directories in 3.804 seconds
[LIBRARY]: Scanned directories in 3.901 seconds
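
As a quick check of the arithmetic, the averages of the two sets of timings above can be compared directly. This is only a sketch using the ten log values as listed; the variable names are illustrative and not part of the codebase.

```python
from statistics import mean

# Timings in seconds, copied from the log lines above.
baseline = [88.160, 87.398, 90.164, 88.031, 87.657]  # v9.4.0
optimized = [3.823, 3.796, 3.771, 3.804, 3.901]      # this branch

# Ratio of the average scan times.
speedup = mean(baseline) / mean(optimized)
print(f"{mean(baseline):.3f}s -> {mean(optimized):.3f}s ({speedup:.1f}x)")
```

The ratio of the two means comes out at a little over 23, which matches the figure quoted above.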

Much of the existing scanning code was either in a suboptimal order, could have been deferred elsewhere, or was straight up unnecessary. The most impactful changes I made were:

Only performing the f.parts and f.is_dir() checks for files that weren't already mapped in the library.
Removing the need for the relative_to() method entirely by including the library_dir inside the filename_to_entry_id_map keys themselves. Most of the code accessing these keys had to tack on the library_dir anyway, so this improves performance in those areas as well.
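
The two changes above can be sketched roughly as follows. This is a simplified illustration, not the code in library.py: the function name find_unmapped and the unwanted_folders parameter are hypothetical, and the map is assumed to be keyed by full paths that already include library_dir.

```python
from pathlib import Path


def find_unmapped(library_dir, filename_to_entry_id_map, unwanted_folders=frozenset()):
    """Return files under library_dir that have no entry in the map.

    Assumes the map keys are full paths (library_dir included), so no
    relative_to() call is needed before the lookup.
    """
    unmapped = []
    for f in Path(library_dir).glob("**/*"):
        # Cheap dictionary lookup first: already-mapped paths exit early.
        if f in filename_to_entry_id_map:
            continue
        # Only unmapped paths pay for the f.parts and f.is_dir() checks.
        if not unwanted_folders.isdisjoint(f.parts):
            continue
        if f.is_dir():
            continue
        unmapped.append(f)
    return unmapped
```

On a rescan most paths are already mapped, so under these assumptions nearly every iteration ends at the first dictionary lookup without touching the disk again.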

Some of these techniques can be ported to main; however, I started with the v9.4 branch since I was more familiar with its codebase and figured this would be a nice inclusion for a patch.

yedpodtrzitko · 2024-09-11T04:40:04Z

tagstudio/src/core/library.py

+                if (ext not in ext_set and self.is_exclude_list) or (
+                    ext in ext_set and not self.is_exclude_list
                ):


same comment as I left for myself in #494:

maybe we should add all files into the library (no matter if they are in the ignore list), and only filter them when querying the library. Because the ignore list can eventually change, and it's not intuitive that you have to rescan the directory once again.
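
One way the query-time filtering idea could look is sketched below. The function visible_entries is hypothetical and not an existing library method; it assumes ext_list holds lowercase suffixes including the dot, as in the diff above.

```python
from pathlib import Path


def visible_entries(paths, ext_list, is_exclude_list):
    """Yield only the paths that pass the current extension filter.

    The scan would store every file; this filter runs when querying,
    so changing the list takes effect without a rescan.
    """
    ext_set = set(ext_list)
    for p in paths:
        listed = Path(p).suffix.lower() in ext_set
        # Exclude mode hides listed extensions; include mode hides the rest.
        if listed != is_exclude_list:
            yield p
```

For example, `list(visible_entries(all_paths, [".txt"], True))` would hide .txt files while leaving them stored in the library.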

yedpodtrzitko · 2024-09-11T04:49:05Z

tagstudio/src/core/library.py

+                if (ext not in ext_set and self.is_exclude_list) or (
+                    ext in ext_set and not self.is_exclude_list
                ):
-                    if f.suffix.lower() not in self.ext_list and self.is_exclude_list:
-                        self.dir_file_count += 1
-                        file = f.relative_to(self.library_dir)
-                        if file not in self.filename_to_entry_id_map:
-                            self.files_not_in_library.append(file)
-                    elif f.suffix.lower() in self.ext_list and not self.is_exclude_list:
+                    # If the file/path is not mapped in the Library:
+                    if self.filename_to_entry_id_map.get(f) is None:
+                        # Discard paths and files in unwanted folders.


I'd switch the logic here to avoid unnecessary code nesting. Instead of:

    if thing:
        if another_thing:
            if yet_another_thing:
                do_thing()

you can have

    if not thing:
        continue
    if not another_thing:
        continue
    if not yet_another_thing:
        continue
    do_thing()

CyanVoxel · 2024-09-11

I've now flattened out this block by using continue statements; however, there's just one quirk.
This check in the middle of the block,

    if self.filename_to_entry_id_map.get(f) is not None:
        self.dir_file_count += 1
        continue

was formerly part of an if-else and serves as an early exit for the loop. I placed it before the f.is_dir() and other checks because performing those on files that are already counted in the library is both unnecessary and a performance killer (around 20x slower in my testing). As a result I've had to move the yield to the top of the for loop so that it's reached on every iteration.

It comes off as a bit of a code smell to me, so maybe this case is better off as a traditional conditional. Let me know what you think and I can revert it if needed. I'm also ready to cut the exclude list step depending on where we land on that.

yedpodtrzitko · 2024-09-11

If you feel like you're going the extra mile just to remove the if-else, then it's probably not worth it in that particular case. But for an if without any else, it's usually the better option.

When I was touching this particular piece of code on the main branch, I was thinking it could be easier to have a loop counter and yield every 20 files or so, instead of keeping and resetting an extra variable:

    for i, f in enumerate(self.library_dir.glob("**/*")):
        ...
        if not i % 20:
            yield self.dir_file_count

I'll leave it up to you.

CyanVoxel · 2024-09-11

> If you feel like you're going the extra mile just to remove the if-else, then it's probably not worth it in that particular case. But for an if without any else, it's usually the better option.

Sounds good. I may leave it be in this case since it's a terminal branch after all, but I'm now better equipped to avoid nested if x, if y, if z, etc. in the future.

> When I was touching this particular piece of code on the main branch, I was thinking it could be easier to have a loop counter and yield every 20 files or so, instead of keeping and resetting an extra variable

The intention behind the local timers for the yield here is to throttle the number of updates per second, so that the consumer of those updates (in this case the Qt progress bar) is never overwhelmed. I found that 1/30 of a second was around the point where the number of draw calls for the progress bar no longer added overhead to the loop, and any further throttling had little to no impact.

The issue with only yielding every x files is that this rate becomes bound to how quickly the files are scanned, rather than what can be reasonably processed by a UI. If files scan very quickly due to better hardware or new optimizations, then this magic number would now be adding additional overhead with too many draw calls at once. And if the scanning is slower, then the magic number is overcompensating with a slow and inaccurate readout. On my machine and test library I needed to increase the magic number in the modulus operation to around 3,000 before it stopped negatively impacting performance in comparison, at which point the yield readout becomes much less useful. Not to mention that increasing it further did not offer any performance improvements over the timer assignments.
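
A minimal sketch of the time-based throttling described above, for comparison with the modulus approach. The generator name throttled_progress and its parameters are illustrative assumptions and do not mirror the actual refresh_dir code; the injectable clock exists only to make the behavior easy to test.

```python
import time


def throttled_progress(items, min_interval=1 / 30, clock=time.monotonic):
    """Yield a running item count at most once per min_interval seconds."""
    count = 0
    reported = 0
    last_yield = clock()
    for _ in items:
        count += 1
        now = clock()
        # Report only when enough wall-clock time has passed, so the update
        # rate is tied to what the UI can draw, not to scan speed.
        if now - last_yield >= min_interval:
            last_yield = now
            reported = count
            yield count
    # Make sure the caller sees the final total once.
    if reported != count:
        yield count
```

With this shape, a faster disk produces larger jumps between reported counts rather than more progress bar redraws, which is the property the fixed every-N-files counter lacks.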

CyanVoxel added 2 commits September 10, 2024 00:44

feat: increase file scanning performance

e017dd8

fix: correct typo in comment

d8afa83

CyanVoxel added the Type: QoL, Status: Review Needed, and Type: File System labels Sep 10, 2024

CyanVoxel added this to the Alpha v9.4.1 milestone Sep 10, 2024

CyanVoxel added the Priority: Medium label Sep 10, 2024

yedpodtrzitko approved these changes Sep 11, 2024


yedpodtrzitko reviewed Sep 11, 2024


CyanVoxel mentioned this pull request Sep 11, 2024

cleanup the refresh_dir code, update tests #494

Merged

refactor: use continue in place of nested ifs

96248ec

CyanVoxel removed the Status: Review Needed label Sep 12, 2024

CyanVoxel merged commit 6490cc9 into Alpha-v9.4 Sep 12, 2024
8 checks passed

CyanVoxel deleted the scan-optimizations-v9.4.x branch September 12, 2024 21:53
