Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: increase file scanning performance #486

Merged
merged 3 commits into from
Sep 12, 2024

Conversation

CyanVoxel
Copy link
Member

This pull request significantly increases file scanning performance, especially for rescanning libraries and for libraries with a large file to folder ratio.

Using a test library with around ~200k files (~100k filtered out via the exclusion list) on a standard 5400 RPM HDD, and found an average increase in library rescanning times of around 23x:

v9.4.0:
[LIBRARY]: Scanned directories in 88.160 seconds
[LIBRARY]: Scanned directories in 87.398 seconds
[LIBRARY]: Scanned directories in 90.164 seconds
[LIBRARY]: Scanned directories in 88.031 seconds
[LIBRARY]: Scanned directories in 87.657 seconds

Optimized:
[LIBRARY]: Scanned directories in 3.823 seconds
[LIBRARY]: Scanned directories in 3.796 seconds
[LIBRARY]: Scanned directories in 3.771 seconds
[LIBRARY]: Scanned directories in 3.804 seconds
[LIBRARY]: Scanned directories in 3.901 seconds

Much of the existing scanning code was either in an unideal order, could have been deferred elsewhere, or was straight up unnecessary. The most impactful changes I made were:

  • Only performing the f.parts and f.is_dir() checks for files that weren't already mapped in the library.
  • Removing the need for the relative_to() method entirely by including the library_dir inside the filename_to_entry_id_map keys themselves. Most of the code accessing these keys had to tack on the library_dir anyway, so this improves performance in those areas as well.

Some of these techniques can be ported to main, however I started with the v9.4 branch since I was more familiar with this codebase and figured it would be a nice inclusion for a patch.

@CyanVoxel CyanVoxel added Type: QoL A quality of life (QoL) enhancement or suggestion Status: Review Needed A review of this is needed Type: File System File system interactions labels Sep 10, 2024
@CyanVoxel CyanVoxel added this to the Alpha v9.4.1 milestone Sep 10, 2024
@CyanVoxel CyanVoxel added the Priority: Medium An issue that shouldn't be be saved for last label Sep 10, 2024
Comment on lines 896 to 898
if (ext not in ext_set and self.is_exclude_list) or (
ext in ext_set and not self.is_exclude_list
):
Copy link
Collaborator

@yedpodtrzitko yedpodtrzitko Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as I left for myself in #494:

maybe we should add all files into the library (no matter if they are in the ignore list), and only filter them when querying the library. Because the ignore list can eventually change, and it's not intuitive that you have to rescan the directory once again.

Comment on lines 896 to 901
if (ext not in ext_set and self.is_exclude_list) or (
ext in ext_set and not self.is_exclude_list
):
if f.suffix.lower() not in self.ext_list and self.is_exclude_list:
self.dir_file_count += 1
file = f.relative_to(self.library_dir)
if file not in self.filename_to_entry_id_map:
self.files_not_in_library.append(file)
elif f.suffix.lower() in self.ext_list and not self.is_exclude_list:
# If the file/path is not mapped in the Library:
if self.filename_to_entry_id_map.get(f) is None:
# Discard paths and files in unwanted folders.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd switch the logic here to avoid unnecessary code nesting. Instead of:

if thing:
    if another_thing:
        if yet_another_thing:
            do_thing()

you can have

if not thing:
    continue
    
if not another_thing:
    continue

if not yet_another_thing:
    continue

do_thing()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now flattened out this block by using continue breaks, however there's just one quirk...
This check in the middle of the block,

if self.filename_to_entry_id_map.get(f) is not None:
      self.dir_file_count += 1
      continue

was formerly part of an if-else and serves as an early exit for the loop. I have this placed before the f.is_dir() and other checks because performing those on files that are already counted in the library is both unnecessary and a performance killer (around 20x slower in my testing). As a result I've had to move the yield to the top of the for loop so it'll always get triggered each iteration.

It comes off as a bit of a code smell to me, so maybe this case is better off as a traditional conditional. Let me know what you think and I can revert it if needed. I'm also ready to cut the exclude list step depending on where we land on that.

Copy link
Collaborator

@yedpodtrzitko yedpodtrzitko Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you feel like you're going an extra mile just to remove the if-else, then it's probably not worth it in that particular case. But in case of if without any else it's usually the better option.

When I was touching this particular piece code on main branch, I was thinking it could be easier to have loop counter, and yield like every 20 files or so instead of keeping and resetting an extra variable:

for i, f in enumerate(self.library_dir.glob("**/*")):
    
    ...
    if not i % 20:
        yield self.dir_file_count

I'll leave it up to you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you feel like you're going an extra mile just to remove the if-else, then it's probably not worth it in that particular case. But in case of if without any else it's usually the better option.

Sounds good, I may leave it be in this case since it's a terminal branch after all but I'm better equipped for avoiding nested if x, if y, if z, etc in the future.

When I was touching this particular piece code on main branch, I was thinking it could be easier to have loop counter, and yield like every 20 files or so instead of keeping and resetting an extra variable

The intention behind the local timers for the yield here is to throttle the number updates per second, so that the calling function doesn't find itself overwhelmed at any point in time - in this case the Qt progress bar. I found that 1/30 of a second was around the point where the number of draw calls for the progress bar no longer added additional overhead to the loop, and any further throttling had little to no impact.

The issue with only yielding every x files is that this rate becomes bound to how quickly the files are scanned, rather than what can be reasonably processed by a UI. If files scan very quickly due to better hardware or new optimizations, then this magic number would now be adding additional overhead with too many draw calls at once. And if the scanning is slower, then the magic number is overcompensating with a slow and inaccurate readout. On my machine and test library I needed to increase the magic number in the modulus operation to around 3,000 before it stopped negatively impacting performance in comparison, at which point the yield readout becomes much less useful. Not to mention that increasing it further did not offer any performance improvements over the timer assignments.

@CyanVoxel CyanVoxel removed the Status: Review Needed A review of this is needed label Sep 12, 2024
@CyanVoxel CyanVoxel merged commit 6490cc9 into Alpha-v9.4 Sep 12, 2024
8 checks passed
@CyanVoxel CyanVoxel deleted the scan-optimizations-v9.4.x branch September 12, 2024 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority: Medium An issue that shouldn't be be saved for last Type: File System File system interactions Type: QoL A quality of life (QoL) enhancement or suggestion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants