check_case_conflict is slow in a repo with many files/directories on windows #625

Closed
guykisel opened this issue Jul 9, 2021 · 2 comments · Fixed by #626
Comments

guykisel (Contributor) commented Jul 9, 2021

Hello!

I recently updated to a more recent version of this repo for my team's git hooks and we noticed that check_case_conflict got a lot slower. I believe I've tracked down what's going on and I have a suggested improvement, though I'm not sure if it's a great approach.

For context, this is the Legends of Runeterra game's monorepo :) It's got around 1m files and some are pretty deeply nested in directories (a little bit more on this at https://technology.riotgames.com/news/legends-runeterra-cicd-pipeline if anyone's curious)

Testing done using Python 3.6.3 on Windows 10 with pre-commit-hooks==4.0.1 and pre-commit==2.13.0

#575 also introduced checking directories, which is great! But the snippet below from check_case_conflict.py ends up being slow, because os.path.dirname is slow (at least on windows :( )

def parents(file: str) -> Iterator[str]:
    file = os.path.dirname(file)
    while file:
        yield file
        file = os.path.dirname(file)
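
To make the cost concrete, here's a small self-contained illustration (the path is made up, and the function is repeated from above so this runs standalone) of what that generator yields. Every yielded parent is another os.path.dirname call, which is where the ~3.2 million ntpath dirname calls in the profile below come from:

import os
from typing import Iterator

def parents(file: str) -> Iterator[str]:
    # same generator as above, repeated so this snippet is runnable on its own
    file = os.path.dirname(file)
    while file:
        yield file
        file = os.path.dirname(file)

# A made-up deeply nested path, similar in shape to paths in a large monorepo:
print(list(parents('game/data/cards/set1/champions/jinx.json')))
# -> ['game/data/cards/set1/champions', 'game/data/cards/set1',
#     'game/data/cards', 'game/data', 'game']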

I used cProfile to run this against a single file:

from pre_commit_hooks import check_case_conflict
import cProfile
cProfile.run('check_case_conflict.find_conflicting_filenames("Jenkinsfile")')

results:

λ py test.py
         41899740 function calls in 20.114 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.047    0.047   20.114   20.114 <string>:1(<module>)
        4    0.000    0.000    0.000    0.000 _weakrefset.py:38(_remove)
        4    0.000    0.000    0.000    0.000 _weakrefset.py:81(add)
        3    0.000    0.000    0.277    0.092 check_case_conflict.py:13(lower_set)
        3    0.186    0.062    0.277    0.092 check_case_conflict.py:14(<setcomp>)
  3189815    1.179    0.000   17.790    0.000 check_case_conflict.py:17(parents)
        2    0.000    0.000   18.831    9.416 check_case_conflict.py:24(directories_for)
        2    1.042    0.521   18.831    9.416 check_case_conflict.py:25(<setcomp>)
        1    0.147    0.147   20.066   20.066 check_case_conflict.py:28(find_conflicting_filenames)
  3189815    3.710    0.000    5.528    0.000 ntpath.py:121(splitdrive)
  3189815    7.833    0.000   15.701    0.000 ntpath.py:199(split)
  3189815    0.910    0.000   16.611    0.000 ntpath.py:240(dirname)
  3189815    0.813    0.000    1.216    0.000 ntpath.py:33(_get_bothseps)
        2    0.000    0.000    0.000    0.000 subprocess.py:1023(_internal_poll)
        2    0.000    0.000    0.022    0.011 subprocess.py:1040(wait)
        2    0.000    0.000    0.665    0.332 subprocess.py:1067(_communicate)
       22    0.000    0.000    0.000    0.000 subprocess.py:179(Close)
...

I messed with it a bit locally and this approach seems to perform significantly better:

def parents(file: str) -> Iterator[str]:
    # build each parent prefix from the path parts instead of calling
    # os.path.dirname repeatedly
    path_parts = file.split(os.sep)[:-1]
    while path_parts:
        yield os.sep.join(path_parts)
        path_parts.pop()

results:

λ py test.py
         1962384 function calls in 2.469 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.042    0.042    2.469    2.469 <string>:1(<module>)
        4    0.000    0.000    0.000    0.000 _weakrefset.py:38(_remove)
        4    0.000    0.000    0.000    0.000 _weakrefset.py:81(add)
        3    0.000    0.000    0.283    0.094 check_case_conflict.py:13(lower_set)
        3    0.195    0.065    0.283    0.094 check_case_conflict.py:14(<setcomp>)
   784800    0.369    0.000    0.670    0.000 check_case_conflict.py:17(parents)
        2    0.000    0.000    1.079    0.539 check_case_conflict.py:26(directories_for)
        2    0.409    0.204    1.079    0.539 check_case_conflict.py:27(<setcomp>)
        1    0.231    0.231    2.427    2.427 check_case_conflict.py:30(find_conflicting_filenames)
        2    0.000    0.000    0.000    0.000 subprocess.py:1023(_internal_poll)
        2    0.000    0.000    0.020    0.010 subprocess.py:1040(wait)
        2    0.000    0.000    0.695    0.348 subprocess.py:1067(_communicate)
...

But it doesn't really feel good doing this instead of using the built-in os.path stuff. I'm not really sure how correct/robust/multi-OS-compatible this is.
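
As a quick sanity check (made-up paths, not exhaustive; the helper names below are only for the comparison), the split-based version can be compared against the dirname-based one for plain relative paths joined with os.sep:

import os

def parents_dirname(file):
    # the current implementation, based on os.path.dirname
    file = os.path.dirname(file)
    while file:
        yield file
        file = os.path.dirname(file)

def parents_split(file):
    # the proposed implementation, based on splitting the path once
    path_parts = file.split(os.sep)[:-1]
    while path_parts:
        yield os.sep.join(path_parts)
        path_parts.pop()

for path in (os.path.join('a', 'b', 'c.txt'), os.path.join('x', 'y'), 'plain.txt'):
    assert list(parents_dirname(path)) == list(parents_split(path)), path

# Caveat: git and pre-commit hand the hook '/'-separated paths even on
# Windows, and split(os.sep) would not split those there, so this check
# doesn't cover that case.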

I'm curious if anyone has alternative ideas! But also if this is simply out of scope for this tool that's fine too :) Thanks for reading!

guykisel changed the title from "check_case_conflict is slow in a repo with many files/directories" to "check_case_conflict is slow in a repo with many files/directories on windows" on Jul 9, 2021
asottile (Member) commented Jul 9, 2021

@guykisel good to see you again! Sorry we made this slower! That profile looks right: pre-commit will always call the tool with forward slashes, and if git returns paths consistently then I think we can do an optimization similar to the one you've proposed.
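
A rough sketch of the kind of optimization being described (not the actual patch), assuming every path the hook sees is '/'-separated:

from typing import Iterator

def parents(file: str) -> Iterator[str]:
    # build each parent prefix from the '/'-separated parts directly,
    # instead of calling os.path.dirname over and over
    path_parts = file.split('/')
    path_parts.pop()  # drop the filename itself
    while path_parts:
        yield '/'.join(path_parts)
        path_parts.pop()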

There may be a smarter way to do this as well (for instance, grouping paths by depth might improve speed?), but I think your proposed change is probably good enough.

guykisel (Contributor, Author) commented Jul 9, 2021

Good to see you again too!

Cool, thanks for taking a look. I'll prep a PR with the proposed change.
