Improve speed of Poetry on CI #332

Closed
amotl opened this issue Jan 24, 2021 · 22 comments

amotl (Member) commented Jan 24, 2021

Hi there,

@gutzbenj tried to improve speed on GHA CI through #328 and I later chimed in through #331.

We observed different problems, specifically when caching whole Python virtualenvs using GHA's actions/cache and are discussing it with @sondrelg at snok/install-poetry#18.

However, the recommendation is actually not to do this at all, see actions/cache#175. Otherwise, it will probably create more problems than it solves. We should really trust @webknjaz and @pradyunsg here:

Usually only pip's wheel cache should be cached in GHA. Caching the whole interpreter installation is going to create problems every time they upgrade it, see actions/cache#175 (comment).

Please don't cache site-packages or entire interpreter trees, see actions/cache#175 (comment).

The reason why installing packages from the download cache into the environment takes twice as long on Windows is that Windows is just slow at handling thousands of small files, as @maphew suggested at actions/cache#175 (comment).
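In practice, that recommendation boils down to caching only pip's cache directory and keying it on the lock file. A minimal sketch of such a step with actions/cache (the path and key naming here are assumptions for illustration, not taken from our actual workflow):

```yaml
# Hypothetical workflow step: cache only pip's wheel/download cache,
# never site-packages or a whole virtualenv.
- name: Cache pip's wheel cache
  uses: actions/cache@v2
  with:
    # Default pip cache location on Linux runners (differs on macOS/Windows).
    path: ~/.cache/pip
    # Invalidate the cache whenever the lock file changes.
    key: pip-${{ runner.os }}-${{ hashFiles('poetry.lock') }}
```

This way an interpreter upgrade on the runner cannot break installed packages, because nothing installed is ever restored from the cache.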

With kind regards,
Andreas.

amotl (Member, Author) commented Jan 24, 2021

My current observation is that the process can actually be slower with caching than without:

[image: screenshot of CI run durations] -- https://github.com/earthobservations/wetterdienst/actions/runs/507624752

[image: screenshot of CI run durations] -- https://github.com/earthobservations/wetterdienst/actions/runs/507642040

amotl commented Jan 24, 2021

Regarding performance improvements observed by others, python-poetry/poetry#2184 (comment) by @TheFriendlyCoder might be important to watch. Also, some comments at python-poetry/poetry#2094 might yield more insights.

cc @sondrelg

sondrelg commented

Don't have much to add here, but I guess whether adding caching for Windows runs is faster or not should boil down to whether it is faster to retrieve a cache or to download the packages from PyPI; everything that happens after that should be the same.

Looking at this run, it seems your total cache size is only 50 MB and is retrieved in about one second, so generally I would expect that to beat PyPI. But I guess this isn't really a bottleneck either way.

amotl commented Jan 25, 2021

It really feels like Poetry performs some kind of network round trips even if all dependencies can be satisfied from the cache. This is also reflected by observations from others, see #332 (comment).

webknjaz commented

Any correct resolver would need network anyway because if you request ==1.0.0, it'll also match ==1.0.0.post0, for example. And only the index would know if that exists.

amotl commented Jan 25, 2021

@webknjaz: I see, thanks for outlining this scenario. So, if one is willing to sacrifice correctness on this aspect, might outcomes from discussions like python-poetry/poetry#1992, python-poetry/poetry#2184 or python-poetry/poetry#3322 bring some improvements here?

webknjaz commented

I guess... I gave up on poetry years ago. But with pip, you can do something like this by making a wheelhouse directory and running something like pip install --no-index -f ./wheelhouse-dir <whatever requirements you have>. It will then only use that directory for resolution, so if you cache it (populated with pip download, for example), you could achieve what you want.

amotl commented Jan 25, 2021

Thanks Sviatoslav.

We will look into that and into how it might be combined with Poetry; python-poetry/poetry#2873 also discusses that option. I also remembered pip --no-index and how that might be used in our case.

So, since we will need the synthesized requirements.txt file for installation anyway, and since we also need it to include development dependencies, maybe something along the lines of

poetry export --dev --extras=http --extras=sql --extras=excel --without-hashes --format=requirements.txt --output=requirements.txt
pip download --dest=~/.cache/wheelhouse --requirement=requirements.txt
pip download --dest=~/.cache/wheelhouse setuptools-scm wheel

When restoring the wheelhouse content into a virtualenv using

time pip install --no-index --find-links=~/.cache/wheelhouse --requirement=requirements.txt

we see that it still takes 1m13s on my macOS machine. So, this might not gain any improvements on Windows either.

However, I also see that not all packages could be downloaded as wheels:

ls ~/.cache/wheelhouse/*.tar.gz | wc -l
      19

So, a significant amount of time is spent building wheels from sdist packages when running pip install .... This becomes obvious when running pip install a second time into a newly created virtualenv: it now takes only 38s on my macOS machine. I believe this is the best we can achieve here.

So, instead of only caching ~/.cache/wheelhouse, one should also cache ~/.cache/pip.
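As an alternative to caching pip's internal cache, one could also pre-build all requirements into wheels once, so that later installs never compile sdists at all. A rough sketch (the paths are illustrative, not from our actual workflow):

```shell
# Sketch: pre-build every requirement (including sdist-only ones) into
# wheels, then install strictly from the local wheelhouse.
WHEELHOUSE="$HOME/.cache/wheelhouse"

# Build or fetch wheels for all requirements into the wheelhouse.
pip wheel --wheel-dir="$WHEELHOUSE" --requirement=requirements.txt

# Install from the wheelhouse only, without contacting the index.
pip install --no-index --find-links="$WHEELHOUSE" --requirement=requirements.txt
```

Caching the wheelhouse directory would then make every subsequent install a pure wheel-unpacking step.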

TheFriendlyCoder commented

Any correct resolver would need network anyway because if you request ==1.0.0, it'll also match ==1.0.0.post0, for example. And only the index would know if that exists.

Would you mind clarifying what you mean here? If I have a pegged revision as a dependency (i.e. "==1.0.0"), I would expect that exact pegged revision to be installed (i.e. "v1.0.0") and not some alternative compatible version (i.e. "v1.0.0.post0"). Perhaps you meant to use "~=1.0.0" or "^1.0.0" in your example, which refer to "compatible" versions and not explicitly pegged revisions.

That being said, regardless of the dependency specification that appears in the pyproject.toml file, the resolver should not need to reach out to the remote repository except when initially generating the poetry.lock file. Once that file is generated, it should contain all the metadata needed to reproduce the dependency tree without additional external resources. So, assuming that all requisite dependencies are stored in the local cache, one should be able to install the dependencies listed in the lock file without any additional network I/O.

Maybe I'm missing some nuance here so feel free to correct me if that is the case.

webknjaz commented Jan 26, 2021

Would you mind clarifying what you mean here? If I have a pegged revision as a dependency (i.e. "==1.0.0"), I would expect that exact pegged revision to be installed (i.e. "v1.0.0") and not some alternative compatible version (i.e. "v1.0.0.post0"). Perhaps you meant to use "~=1.0.0" or "^1.0.0" in your example, which refer to "compatible" versions and not explicitly pegged revisions.

No, I described exactly how the resolvers in pip and pipenv (IIRC) work. And based on https://www.python.org/dev/peps/pep-0440/#post-releases, this is by design. Maybe @pradyunsg could explain it better. Another case would be having a universal wheel locally while there's a platform-specific wheel for the same release on PyPI; the resolver has no way of knowing that there's a more precise match unless it talks to the server. Besides, if a version got marked as bad and removed or yanked, the installer should know about that too.

TheFriendlyCoder commented

No, I described exactly how the resolvers in pip and pipenv (IIRC) work. And based on https://www.python.org/dev/peps/pep-0440/#post-releases this is by design.

The section you are referring to simply describes the format and interpretation of the post-release identifier; it does not imply a specific dependency-resolution behaviour for a given tool.

Conversely, there is a section in PEP 440 that explains the expected behaviour for an exact / pegged version matcher that any compliant tool is expected to implement: https://www.python.org/dev/peps/pep-0440/#version-matching. This section does indicate that tools should allow for "strict" vs. "non-strict" comparisons, but that strict enforcement should be enabled by default. This means that if my pegged revision is "==1.0.0" and there is a 1.0.0.post1 release in the index, the latter should (by default) be ignored.

However, all this being said, after the initial dependency tree has been generated (i.e. while generating the poetry.lock file), the package manager should be able to precisely and unambiguously locate the exact matching package to deploy on subsequent runs.
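The strict matching behaviour described in that PEP 440 section can be verified with the packaging library (assuming it is available; it is the same code pip vendors for version handling):

```python
# Demonstrate PEP 440 "version matching" semantics using `packaging`.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Strict equality: a post-release does NOT satisfy "==1.0.0" ...
assert Version("1.0.0") in SpecifierSet("==1.0.0")
assert Version("1.0.0.post1") not in SpecifierSet("==1.0.0")

# ... unless prefix matching is requested explicitly with ".*".
assert Version("1.0.0.post1") in SpecifierSet("==1.0.0.*")
```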

TheFriendlyCoder commented

Another case would be having a universal wheel locally while there's a platform-specific wheel for the same index on PyPI, the resolver has no way of knowing that there's a more precise match unless it talks to the server.

Similar to my previous argument, I can't see how this would lead to any change in behaviour with respect to the dependency resolver. At the time the poetry.lock file is generated, the content of the lock file is based on the current state of the PyPI index. If the state of the index changes in any way, I would expect subsequent runs to deploy the exact version specified in the lock file, not a comparable / compatible version that may have since been added to the index.

I guess the point I'm trying to make is, any potential subtleties wrt cache management and dependency resolution should be avoidable by relying on the package metadata stored in the lock file.

TheFriendlyCoder commented

Besides, if a version got marked as bad and removed or yanked, the installer should know about that too.

I am pretty sure that pypi.org no longer allows maintainers to delete / yank / pull bad packages from the public repository. However, even if that were possible, there are many other ways for that to break dependency resolution tools (i.e., consider the case of repository mirrors), which is probably why it is prevented or strongly discouraged.

Conversely, in the case where a system has a cached dependency that has been removed from the remote repository, I would expect subsequent Poetry executions to still resolve and deploy the dependency correctly so long as that exact dependency / version is still referenced in the lock file and it can be discovered in the local cache.

While it may be debatable whether the tool should purge a package from the local cache once it has been removed from the remote repository when installing dependencies without a lock file (I can see rationale both for and against it), this need not dictate how the cache is managed when installing dependencies from a lock file.

TheFriendlyCoder commented

I would suggest basing the mechanics of cache management on comparable tools from other ecosystems, like Gradle or NuGet. Those tools mostly allow deploying packages from local caches, either on demand or when they are unable to connect to the remote repository.

webknjaz commented Feb 8, 2021

This section does indicate that tools should allow for "strict" vs. "non-strict" comparisons, but that strict enforcement should be enabled by default. This means that if my pegged revision is "==1.0.0" and there is a 1.0.0.post1 release in the index, the latter should (by default) be ignored.

I simply pointed out how things work right now without speculating about how they should work. I recommend raising issues on respective trackers of the upstream projects if you're looking for a change.

I am pretty sure that pypi.org no longer allows maintainers to delete / yank / pull bad packages from the public repository.

This is 100% incorrect. PyPI has been allowing package deletion from the index for many years, and there are no plans to disallow this (as long as the given user's role is an "owner", not just a "maintainer"; see https://pypi.org/help/#collaborator-roles).

One remark, though: deleting a version removes it from the available version lists, but not from the CDN servers that host the actual sdists and wheels.

Another remark: "yank" is not the same as "delete". It is basically a special flag telling clients like pip that using a yanked package version is strongly discouraged, without preventing them from doing so. Old pip versions ignore this flag, while new ones try to respect it when the user requests an unpinned package version; if the version is pinned, pip will go ahead and consider installing the yanked version. Also, PyPI shows red badges for yanked versions; here's an example of how this looks: https://pypi.org/project/octomachinery/#history.

And here's additional info on yanking:

Another reason for removing packages is that they contain malware/spam/namesquatting/etc., per https://www.python.org/dev/peps/pep-0541/#invalid-projects.

Is using the malware from cache more important than taking into account such security issues?

The index must be used for resolution for as long as it is enabled. If you want a local cache instead, there should simply be a mechanism for disabling the unwanted indexes (FWIW, pip already knows how to treat local dirs as indexes).

amotl commented Feb 8, 2021

Dear all,

thank you very much for your feedback on this topic. Very much appreciated.

The index must be used for resolution for as long as it is enabled. If you want a local cache instead, there should simply be a mechanism for disabling the unwanted indexes (FWIW, pip already knows how to treat local dirs as indexes).

Exactly. So I conclude that we should investigate whether Poetry also offers such an option and whether it can deliver the respective performance gains. Otherwise, going through pip as outlined at #332 (comment) is another option that worked well.

With kind regards,
Andreas.

TheFriendlyCoder commented

I can see how malware being introduced into the dependency tree would be one of those extenuating circumstances where you may want to fail a deployment even when a revision is pegged. Thanks for pointing that out.

If there was at least an option or flag to disable the index lookup, that would put the user / developer back in control, which would satisfy my use case.

I hope that this will be implemented in Poetry. Juggling multiple package management tools, using pip for some things and Poetry for others, would just introduce unnecessary complexity into the tool chain.

amotl commented Feb 9, 2021

Hi again,

it looks like discussions around this topic have been going on on the Poetry issue tracker for quite some time; see python-poetry/poetry#331 and python-poetry/poetry#559, which eventually led to python-poetry/poetry#908. The discussion continues in python-poetry/poetry#1554 and python-poetry/poetry#1070.

Finally, the main issue tracking the respective improvements seems to be python-poetry/poetry#1632, which eventually led to python-poetry/poetry#2074 by @Zyava. However, progress on this has apparently stalled, and @BrandonLWhite and @mcouthon haven't been able to take over.

Now, @ZeroAurora started another attempt with python-poetry/poetry#3624 which also relates to python-poetry/poetry#3479 by @zillionare.

I wish you the best of luck in finally tackling this, @ZeroAurora! It would be really nice if this patch could also bring in a feature to not use any index at all, e.g. through a --no-index command-line option as outlined by @webknjaz at #332 (comment), like

pip install --no-index -f ./wheelhouse-dir <whatever requirements you have>

With kind regards,
Andreas.

P.S.: I've just added a comment about our use case at python-poetry/poetry#3624 (comment).

ZeroAurora commented Feb 10, 2021

Hey everyone. Thank you for your encouragement, but this kind of issue is somewhat out of the scope of my PR. My PR focuses on those who can't connect to PyPI (or can, but only with slow bandwidth). I've mentioned a probable solution in python-poetry/poetry#3624 (comment), but I'm sorry that I can't work on this since I don't have the extra time and energy. I hope someone will work on this (or on an offline mode).
By the way, I couldn't understand why caching Poetry's cache wouldn't work 🤔

Good luck everyone! Hope this issue gets resolved!

amotl commented Feb 10, 2021

Hi @ZeroAurora,

thanks for your kind response.

This kind of issue is somewhat out of the scope of my PR.

I see. Thanks for your explanation.

By the way, I couldn't understand why caching Poetry's cache wouldn't work 🤔

It actually does, but Poetry will still reach out to the package index, and this apparently adds some slowness compared to running pip with --no-index.

I hope someone will work on this (or an offline mode).

This! 👍

With kind regards,
Andreas.

amotl commented Feb 10, 2021

@sinoroc just suggested at python-poetry/poetry#3624 (comment):

I have not exactly followed the discussion in detail, but it looks like you might be interested in simpleindex.

Basically, it allows you to easily start a local index server based on simple rules, and the files can be served from the local filesystem. So I guess maybe if configured right, it could do what you are after.

I answered:

simpleindex looks nice, indeed! But within a CI job on Travis CI or GitHub Actions, even that would mean the additional hassle of bringing up a sidecar service and wrapping it around the wheelhouse directory just for this purpose. However, it might be doable. Thank you very much for your suggestion.
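A very lightweight variant of that idea needs no dedicated index server at all: pip's --find-links accepts either a local directory or a URL pointing at a plain HTML directory listing, such as the one produced by Python's built-in http.server. A sketch (the directory and port are illustrative assumptions):

```shell
# Sketch: serve a local wheelhouse instead of running a full index server.
WHEELHOUSE="$HOME/.cache/wheelhouse"

# Variant 1: point pip directly at the directory.
pip install --no-index --find-links="$WHEELHOUSE" --requirement=requirements.txt

# Variant 2: serve a directory listing over HTTP; http.server emits a
# simple HTML page that pip's --find-links can parse.
(cd "$WHEELHOUSE" && python -m http.server 8000) &
sleep 1  # give the server a moment to start
pip install --no-index --find-links=http://localhost:8000/ --requirement=requirements.txt
```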

@amotl amotl changed the title Improve speed on CI Improve speed of Poetry on CI Feb 10, 2021
webknjaz commented

Well, you could look into dumb-pypi or bandersnatch then. But it still feels like overkill.


6 participants