pip metadata refactoring #680

slimreaper35 · 2024-10-09T07:16:14Z

My local approximate results:

(venv) ~/cachi2 (main) $ time tox -e py312

real    0m24.600s
user    0m23.449s
sys     0m1.419s

(venv) ~/cachi2 (pip-refactoring) $ time tox -e py312

real    0m13.625s
user    0m12.713s
sys     0m0.997s

Maintainers will complete the following section

Commit messages are descriptive enough
Code coverage from testing does not decrease and new code is covered
Docs updated (if applicable)
Docs links in the code are still valid (if docs were updated)

Note: if the contribution is external (not from an organization member), the CI
pipeline will not run automatically. After verifying that the CI is safe to run:

approve GitHub Actions workflows by clicking a button
approve the Red Hat Trusted App Pipeline container build by commenting /ok-to-test
(as is the standard for Pipelines as Code)

a-ovchinnikov

LGTM with some minor nitpicks.

cachi2/core/package_managers/pip.py

eskultety · 2024-10-10T13:20:25Z

There's too much going on in this single commit, so it's difficult to follow all the changes in the diff. Please introduce them gradually.

eskultety

Very much in favour of this work, needs some polishing though.

eskultety · 2024-10-10T13:25:19Z

cachi2/core/package_managers/pip.py

-                ),
-                docs=PIP_METADATA_DOC,
-            )
+    return None, None


This refactor isn't a 1:1 replacement: previously, if a name was found, it was carried over to the other source checks. Here, if you extract the name but can't find the version, your code resorts to returning (None, None), which means we'll infer the name of the project from the URL instead. That doesn't sound right.

cachi2/core/package_managers/pip.py

eskultety · 2024-10-10T13:28:15Z

cachi2/core/package_managers/pip.py

-    First, try to parse the setup.py script (if present) and extract name and version
-    from keyword arguments to the setuptools.setup() call. If either name or version
-    could not be resolved and there is a setup.cfg file, try to fill in the missing
-    values from metadata.name and metadata.version in the .cfg file.
+    repo_name = Path(repo_id.parsed_origin_url.path.removesuffix(".git")).name
+    subpath = package_dir.subpath_from_root
+    resolved_name = Path(repo_name).joinpath(subpath)
+    return canonicalize_name(str(resolved_name).replace("/", "-")).strip("-.")


This extraction should be a separate commit. Actually 2:

one to perform the extraction

one to rearrange the logic

nitpick: While at it, you can get rid of removesuffix and use stem from Path instead of name.
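As a side note on the stem nitpick, here is a minimal sketch comparing the two approaches. The helper names and sample URL paths are made up for illustration and are not taken from the PR. The two agree for paths ending in ".git", but can differ when the URL has no ".git" suffix and the repository name itself contains a dot.

```python
from pathlib import PurePosixPath


def name_via_removesuffix(url_path: str) -> str:
    # Mirrors the expression in the diff: strip a trailing ".git", then take the last component.
    return PurePosixPath(url_path.removesuffix(".git")).name


def name_via_stem(url_path: str) -> str:
    # The suggested alternative: stem drops whatever the last suffix is, not only ".git".
    return PurePosixPath(url_path).stem


# Hypothetical URL paths, for illustration only.
for url_path in ("/org/cachi2.git", "/org/cachi2", "/org/my.project.git", "/org/my.project"):
    print(url_path, name_via_removesuffix(url_path), name_via_stem(url_path))
```

For the last sample path, the removesuffix variant yields "my.project" while the stem variant yields "my", so the swap is only safe if origin URLs can be assumed to end in ".git" or repository names never contain dots.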

I would prefer not to rearrange the commits in this PR at all, so as not to spend too much time here. Each commit would require modifying the unit tests, and that is not trivial given the extremely complicated logic in there. I chose a much easier path: delete everything and write it from scratch.

I know, in GitHub the PR is basically unreviewable. Sorry for that.

I know, in GitHub the PR is basically unreviewable. Sorry for that.

First things first, GH has nothing to do with this. I usually review changes locally with 50 lines of context, and even passing flags to git to detect moved code doesn't change the situation in any way.

I would prefer not to rearrange the commits in this PR at all, so as not to spend too much time here. Each commit would require modifying the unit tests, and that is not trivial given the extremely complicated logic in there. I chose a much easier path: delete everything and write it from scratch.

The only blocker for splitting here is the unit tests, I agree. But this refactor actually changes the behavioural logic, such that we don't return the same results as previously from the pip_metadata helper function. Additionally, the tests that you introduce don't test the compound cases, which are the whole point of the pip metadata querying logic (e.g. name comes from one project file, version from another). Since it looks like you'll have to rework the unit tests quite a bit, that gives you the opportunity to split the changes into commits. I have actually tried it myself and it was pretty straightforward except for the tests, but like I said, it looks like you'll have to rework those.

cachi2/core/package_managers/pip.py

eskultety

After having carefully gone through the unit tests, which I didn't do in my first round of reviews, I think we're actually opening ourselves up to potential issues with metadata mixed across pyproject.toml, setup.py, etc.
I think that while we may cosmetically change the code and break the logic into smaller helper functions, we'll have to test the metadata querying in the compound way we're doing now.

eskultety · 2024-10-11T12:59:53Z

cachi2/core/package_managers/pip.py

+    # setup.py
+    if setup_py.exists():
+        log.debug("Checking setup.py for metadata")
+        name = setup_py.get_name()
+        version = setup_py.get_version()
+
+        if name and version:
+            return name, version
+
+    # setup.cfg
+    if setup_cfg.exists():
+        log.debug("Checking setup.cfg for metadata")
+        name = setup_cfg.get_name()
+        version = setup_cfg.get_version()
+
+    return name, version


This is still not a 1:1 refactor, and I'd argue it is a breaking change in behaviour (in rare cases, but it is). If you can't infer the version from e.g. pyproject.toml, you try setup.py; but if setup.py defines only the version and not the name, you'll overwrite the name with None and eventually fall back to inferring the name from the repo URL identifier. That is not what the previous behaviour did, and I'm not sure we want this to change.
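The difference described in this comment can be sketched in a few lines. This is only an illustration under stated assumptions: each config file is reduced to a hypothetical (name, version) tuple, and the two function names are invented here rather than taken from cachi2.

```python
from typing import Optional

# Hypothetical stand-in: the (name, version) that one config file yields,
# listed in lookup order (e.g. pyproject.toml, then setup.py, then setup.cfg).
Metadata = tuple[Optional[str], Optional[str]]


def resolve_overwriting(sources: list[Metadata]) -> Metadata:
    """Every source replaces both values, mirroring the shape of the diff above."""
    name, version = None, None
    for name, version in sources:
        if name and version:
            return name, version
    return name, version


def resolve_carrying_over(sources: list[Metadata]) -> Metadata:
    """Keep the first value found for each field and only fill in what is missing."""
    name, version = None, None
    for src_name, src_version in sources:
        name = name or src_name
        version = version or src_version
        if name and version:
            break
    return name, version


# First file defines only the name, second file defines only the version.
sources = [("foo", None), (None, "1.0.0")]
print(resolve_overwriting(sources))    # (None, '1.0.0')
print(resolve_carrying_over(sources))  # ('foo', '1.0.0')
```

With the overwriting variant the name found in the first file is lost, which is what would trigger the fallback to the repository URL.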

tests/unit/package_managers/test_pip.py

eskultety · 2024-10-11T13:50:34Z

slimreaper35 · 2024-10-14T08:59:20Z

After having carefully gone through the unit tests, which I didn't do in my first round of reviews, I think we're actually opening ourselves up to potential issues with metadata mixed across pyproject.toml, setup.py, etc.

That's a good point. The fact that we were mixing metadata from multiple project configuration files was the reason why we ended up with extremely complicated and long unit tests. Splitting name and version across multiple configuration files makes no sense on its own. In the end, we only need the name, one string, for an SBOM component.

More changes have accumulated, must take another look

eskultety · 2024-10-17T11:33:13Z

After having carefully gone through the unit tests, which I didn't do in my first round of reviews, I think we're actually opening ourselves up to potential issues with metadata mixed across pyproject.toml, setup.py, etc.

That's a good point. The fact that we were mixing metadata from multiple project configuration files was the reason why we ended up with extremely complicated and long unit tests. Splitting name and version across multiple configuration files makes no sense on its own. In the end, we only need the name, one string, for an SBOM component.

Well, what this PR just did is a breaking change from the behaviour POV, without any warning. There probably was a reason we did it this way in the past. It's true that mixing metadata is wrong; however, we allowed it, and it wasn't against ecosystem practices either, was it? (Although it is very unexpected, without a doubt.) So this can't be compared to our recent dropping of the Go vendoring flags, because those actually allowed projects to use incorrect repo setups which would not be buildable with standard toolkits the way users intended in the first place. I'm not sure that's the case here.

If we end up wanting this, then you'll have to accompany this change with a docs update (we'll also need to mention it in the release notes). That said, although I'm definitely not a fan of breaking backwards compatibility, strictly speaking SemVer [1] says:

Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable.

and so I won't stop this work based on this argument, but we'll probably need more voices in favour.

[1] https://semver.org/#semantic-versioning-specification-semver

eskultety

You still stuffed everything into commit 1. The changes can be introduced gradually by adding one unit test at a time and turning off that particular test area in the more complex unit test you're trying to kill. That way, you'd keep most things as they are until you're ready to switch, and then remove everything you don't need in a single commit. It can be done, and the diff will be much more readable IMO. I'm not fond of a complex unit test that isn't easy to replace being used as justification for squashed changes (I mentioned one option for how to do it); things can be made cleaner for the reader/reviewer.

eskultety · 2024-10-17T11:24:38Z

cachi2/core/package_managers/pip.py

@@ -460,14 +460,6 @@ def get_version(self) -> Optional[str]:
            log.warning("No project.version in pyproject.toml")
            return None

-    def check_dynamic_version(self) -> bool:


You don't explain anywhere why you're dropping this check.

Can I use similar reasoning?

People should be aware of the setup.py soft deprecation by now; do we want to hold everyone's hand? Displaying a warning for users who genuinely need setup.py (because pyproject.toml simply doesn't cut it for them, as they may have C deps) doesn't feel right. I wouldn't strictly argue against having a warning if you proposed it somewhere in the code; I'm just questioning its usefulness given the circumstances.

Even though it is not deprecated. But the version is optional so I don't see much value in the warning message.

slimreaper35 · 2024-10-17T12:09:05Z

You still stuffed everything into commit 1. The changes can be introduced gradually by adding one unit test at a time and turning off that particular test area in the more complex unit test you're trying to kill. That way, you'd keep most things as they are until you're ready to switch, and then remove everything you don't need in a single commit. It can be done, and the diff will be much more readable IMO. I'm not fond of a complex unit test that isn't easy to replace being used as justification for squashed changes (I mentioned one option for how to do it); things can be made cleaner for the reader/reviewer.

I'll try my best

slimreaper35 · 2024-10-17T13:29:53Z

Also, it might be worth discussing setup.py as:

New projects are advised to avoid setup.py configurations (beyond the minimal stub) when custom scripting during the build is not necessary. Examples are kept in this document to help people interested in maintaining or contributing to existing packages that use setup.py. Note that you can still keep most of configuration declarative in setup.cfg or pyproject.toml and use setup.py only for the parts not supported in those files (e.g. C extensions). See note.

We can at least add a warning when extracting metadata from setup.py

eskultety · 2024-10-18T08:21:10Z

We can at least add a warning when extracting metadata from setup.py

People should be aware of the setup.py soft deprecation by now; do we want to hold everyone's hand? Displaying a warning for users who genuinely need setup.py (because pyproject.toml simply doesn't cut it for them, as they may have C deps) doesn't feel right. I wouldn't strictly argue against having a warning if you proposed it somewhere in the code; I'm just questioning its usefulness given the circumstances.

eskultety · 2024-10-21T09:55:24Z

@brunoapimentel @a-ovchinnikov @taylormadore @ben-alkov any opinions on simpler yet backwards incompatible behaviour?

a-ovchinnikov · 2024-10-21T10:55:45Z

any opinions on simpler yet backwards incompatible behaviour?

Original behaviour looks somewhat strange to me -- I would think that a package which has name defined in one config, and version in another is malformed and will cause other issues as well. While possible I don't believe it is probable to find such a package. I am generally in favor of making a live test.

The change breaks the original behavior, but I am not sure it was correct to begin with. With the code as we had it before we stopped at the first found pair of name and version, but technically every location could have defined its own name and version, so a sequence of

name    version
----------------
foo     None
bar     1.0.0
baz     2.3.4

would have resulted in (foo, 1.0.0) and there is no good way of telling if it is the correct (name, version) rather than (baz, 2.3.4). Personally I would have rejected a package that has mismatching names or versions in its definition, or at least emitted a big warning.

This change makes the code a little cleaner so I am in favor of it.
I would suggest capturing the essence of this discussion and adding a big comment to the new extractor to explain that we don't want to deal with a heterogeneous (name, version) pair. In case we find that this was the wrong decision, we'll update both the code and the comment and won't need to figure that out ever again.
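To make the table above concrete, here is a minimal sketch of the two ideas raised in this comment: taking the pair only from a single source, and detecting inconsistent definitions so a caller could warn or reject. The function names are hypothetical and this is not the code from the PR; it only assumes each location yields an optional name and an optional version.

```python
from typing import Optional

Metadata = tuple[Optional[str], Optional[str]]

# The sequence from the table above, in lookup order.
SOURCES: list[Metadata] = [("foo", None), ("bar", "1.0.0"), ("baz", "2.3.4")]


def first_complete_pair(sources: list[Metadata]) -> Metadata:
    """Return the first (name, version) where both values come from the same source."""
    for name, version in sources:
        if name and version:
            return name, version
    return None, None


def distinct_values(sources: list[Metadata]) -> tuple[list[str], list[str]]:
    """Collect the distinct names and versions seen across all sources."""
    names = {name for name, _ in sources if name}
    versions = {version for _, version in sources if version}
    return sorted(names), sorted(versions)


print(first_complete_pair(SOURCES))  # ('bar', '1.0.0')

names, versions = distinct_values(SOURCES)
if len(names) > 1 or len(versions) > 1:
    print(f"warning: inconsistent metadata: names={names}, versions={versions}")
```

Under this rule the mixed result (foo, 1.0.0) can no longer occur; whether (bar, 1.0.0) or (baz, 2.3.4) is the "right" answer still cannot be decided from the files alone, which is why the warning is sketched alongside it.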

There is no context within the log warning. We don't warn users about other things when parsing package metadata (for example deprecation of setup.py). The version is an optional attribute in the SBOM. Even cachi2 uses "dynamic version". Signed-off-by: Michal Šoltis <[email protected]>

Do not mix name and version from multiple config files (pyproject.toml, setup.cfg, setup.py) and with the name from git origin remote. Drastically simplify unit tests and speed up overall time while preserving the same coverage. Signed-off-by: Michal Šoltis <[email protected]>

slimreaper35 requested review from a-ovchinnikov, eskultety and ben-alkov October 9, 2024 07:16

a-ovchinnikov approved these changes Oct 9, 2024

slimreaper35 force-pushed the pip-refactoring branch from 414e227 to 59e4f7e Compare October 9, 2024 15:55

a-ovchinnikov reviewed Oct 9, 2024

a-ovchinnikov reviewed Oct 9, 2024

slimreaper35 force-pushed the pip-refactoring branch from 59e4f7e to c66f55b Compare October 10, 2024 11:46

eskultety reviewed Oct 10, 2024

slimreaper35 force-pushed the pip-refactoring branch from c66f55b to af59166 Compare October 10, 2024 14:41

slimreaper35 requested review from a-ovchinnikov and eskultety October 10, 2024 14:41

a-ovchinnikov previously approved these changes Oct 10, 2024

eskultety reviewed Oct 11, 2024

slimreaper35 force-pushed the pip-refactoring branch from af59166 to fa86d7b Compare October 14, 2024 19:15

eskultety reviewed Oct 17, 2024

slimreaper35 force-pushed the pip-refactoring branch 2 times, most recently from 1ef1462 to b24a389 Compare October 25, 2024 11:28

pip: Make 'extracting package name' from origin URL a separate function

30c367c

Signed-off-by: Michal Šoltis <[email protected]>

slimreaper35 force-pushed the pip-refactoring branch 2 times, most recently from d64087f to 84a0060 Compare October 25, 2024 11:32

slimreaper35 requested review from eskultety and a-ovchinnikov October 25, 2024 11:43

a-ovchinnikov approved these changes Oct 25, 2024

slimreaper35 added 5 commits October 25, 2024 17:44

pip: Drop pyproject.toml dynamic version code

5c70455

The commit follows the previous one, that drops a warning when processing metadata from pyproject.toml. This piece of code is no longer needed. Signed-off-by: Michal Šoltis <[email protected]>

pip: Improve logging when parsing package metadata

78fd2d2

Signed-off-by: Michal Šoltis <[email protected]>

pip: Make 'extracting package metadata' from config files a separate …

d24c54c

…function Signed-off-by: Michal Šoltis <[email protected]>

slimreaper35 force-pushed the pip-refactoring branch from 84a0060 to 043ac2b Compare October 25, 2024 15:44
