Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache calculated hashes for requirements and _InstallRequirementBackedCandidate #12657

Conversation

notatallshaw
Copy link
Member

@notatallshaw notatallshaw commented Apr 28, 2024

This builds on @sbidoul PR #12453.

I was profiling the scenario here #12314 (comment), and found over 30% of the time is spent calculating the hash of specifier requirements, which is mostly spent calculating the string of the install requirement, and is done repeatedly.

After this PR installing all the homeassistant wheels from local directory went from ~48 seconds to ~35 seconds, and dry-run install of apache-airflow[all] with all packages cached on Python 3.12 went from ~4 minutes to ~3 minutes.

@notatallshaw notatallshaw changed the title Eagerly calculate strings and hashes for requirements and requirement Eagerly calculate strings and hashes for requirements and _InstallRequirementBackedCandidate Apr 28, 2024
news/12657.feature.rst Outdated Show resolved Hide resolved
@notatallshaw
Copy link
Member Author

notatallshaw commented Apr 28, 2024

Even running the very simple pip install requests --dry-run --ignore-installed I see about a 1-1.5% performance improvement over several dozen runs.

@notatallshaw notatallshaw mentioned this pull request Apr 28, 2024
@pfmoore
Copy link
Member

pfmoore commented Apr 28, 2024

Why not hash on demand? Initialise _hash to None, and calculate the first time through __hash__. It's probably marginal, but it seems like more common practice.

Also, this assumes the fields that are used to compute the hash never change. Do we have anything to ensure that's the case? I'm sure it is, but it would be a little annoying if some clever hack in the future violated that invariant and we didn't catch it.

@notatallshaw
Copy link
Member Author

notatallshaw commented Apr 28, 2024

Why not hash on demand? Initialise _hash to None, and calculate the first time through __hash__. It's probably marginal, but it seems like more common practice.

I can do that, but be aware it will almost certainly be a little slower, in the scenario of a large number of packages the __hash__ function is being called hundreds of thousands of times (because this is being called by resolvelib in an O(n^2) algorithm), so even a if self._hash is not None: return self._hash is going to add a little bit of time.

And I don't think there's a scenario where the hashes aren't being used, maybe in the case of --no-deps?

Also, this assumes the fields that are used to compute the hash never change. Do we have anything to ensure that's the case?

Only that the hash is dependent on internally marked attributes of internally marked apis, nothing in pip's codebase is changing them. Of course if someone is calling pip as a library and using the internal api and changing the internally marked attributed this will break, but I think lots of other assumptions will break if these attributes are changed, including, I think, the way #12453 was implemented.

@notatallshaw
Copy link
Member Author

so even a if self._hash is not None: return self._hash is going to add a little bit of time.

I performance tested this to see if this assumption held up in the face of real data, and found in general the performance difference was within the margin of error. So I've switched to @pfmoore's suggestion of lazily calculating hashes and caching their results.

@notatallshaw notatallshaw changed the title Eagerly calculate strings and hashes for requirements and _InstallRequirementBackedCandidate Cache calculated hashes for requirements and _InstallRequirementBackedCandidate Apr 28, 2024
@ichard26 ichard26 added the type: performance Commands take too long to run label Apr 29, 2024
@notatallshaw notatallshaw force-pushed the eagerly-calculate-strings-and-hashes-for-requirements-and-candidate branch from 78c36e4 to d15be97 Compare April 29, 2024 21:49
@notatallshaw notatallshaw force-pushed the eagerly-calculate-strings-and-hashes-for-requirements-and-candidate branch from d15be97 to 1e510e3 Compare April 29, 2024 21:57
@notatallshaw
Copy link
Member Author

Fixed MyPy errors, cleaned comments, rebased

@uranusjr uranusjr merged commit 411b981 into pypa:main Apr 30, 2024
24 checks passed
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants