Implement RFC7617-compliant multi-domain basic authentication #10904

Closed
wants to merge 1 commit into from

Conversation

potiuk
Contributor

@potiuk potiuk commented Feb 13, 2022

The https://datatracker.ietf.org/doc/html/rfc7617#section-2.2
defines multi-domain authentication behaviour and authentication
scopes for basic authentication. This change improves the
implementation of the multi-domain matching to be RFC7617 compliant:

  • path matching (including longest match)
  • scheme validation matching

Closes: #10902
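
For readers following the thread, the path and scheme matching listed above can be sketched roughly as follows. This is a simplified illustration and not the code in this PR: `find_credentials`, `_scope_key` and the example.com URLs are hypothetical names, and each stored URL is treated as a plain path prefix, which is an assumption rather than a literal reading of RFC 7617.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlsplit

Credentials = Tuple[str, str]


def _scope_key(url: str) -> Tuple[str, str, str]:
    """Split a URL into (scheme, host:port, path), dropping any credentials."""
    parts = urlsplit(url)
    netloc = parts.netloc.rpartition("@")[2]  # remove a user:password@ prefix
    return parts.scheme, netloc, parts.path.rstrip("/")


def find_credentials(
    url: str, known: Dict[str, Credentials]
) -> Optional[Credentials]:
    """Return the credentials whose scope is the longest path prefix of url."""
    scheme, netloc, path = _scope_key(url)
    best: Optional[Credentials] = None
    best_len = -1
    for scope_url, creds in known.items():
        s_scheme, s_netloc, s_path = _scope_key(scope_url)
        if (s_scheme, s_netloc) != (scheme, netloc):
            continue  # scheme and host:port must match exactly
        # Compare whole path segments, so "/team1" does not cover "/team10".
        if path == s_path or path.startswith(s_path + "/"):
            if len(s_path) > best_len:
                best, best_len = creds, len(s_path)
    return best
```

With scopes stored for `https://example.com/` and `https://example.com/team1`, a request under `/team1/` picks the more specific entry, while the same path over `http` matches nothing.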

@potiuk potiuk marked this pull request as draft February 13, 2022 18:12
@potiuk
Contributor Author

potiuk commented Feb 13, 2022

Apologies if I did not provide all the necessary information - I submitted this as a draft to facilitate the #10902 discussion, and I am happy to add whatever is needed to make it fully compliant with the requirements once we decide in the discussion whether this is a valid change.

@potiuk
Contributor Author

potiuk commented Feb 14, 2022

In the latest fixups I've also added a missing test for the 401 case - I am mocking the interaction with the user.

@potiuk potiuk marked this pull request as ready for review February 14, 2022 10:43
.pre-commit-config.yaml Outdated Show resolved Hide resolved
@potiuk potiuk force-pushed the fix-behaviour-of-auth-information branch from 3eca90a to b8c1120 Compare February 26, 2022 10:14
@potiuk
Contributor Author

potiuk commented Feb 26, 2022

I addressed all doc build and static check failures and rebased to latest main.

@potiuk
Contributor Author

potiuk commented Feb 26, 2022

I would love to get the workflows approved to see if all is cool.

Member

@uranusjr uranusjr left a comment

A couple of nits, this should work.

docs/html/topics/authentication.md Show resolved Hide resolved
src/pip/_internal/utils/misc.py Outdated Show resolved Hide resolved
@potiuk
Contributor Author

potiuk commented Mar 2, 2022

I think there was a timeout on some Windows test, but it seems like a flake. Let me close/reopen to rebuild it (but I believe the workflow will need to be re-approved).

@potiuk potiuk closed this Mar 2, 2022
@potiuk potiuk reopened this Mar 2, 2022
@potiuk
Contributor Author

potiuk commented Mar 3, 2022

Seems like all tests passed.

@potiuk
Contributor Author

potiuk commented Mar 11, 2022

Any chance to merge it?

@potiuk
Contributor Author

potiuk commented Mar 26, 2022

Hello, are we going to merge it?

@pradyunsg
Member

@potiuk Could you rebase this PR with --autosquash?

@potiuk
Contributor Author

potiuk commented Apr 11, 2022

Sure. No problem.

@potiuk potiuk force-pushed the fix-behaviour-of-auth-information branch from de43b07 to d0d2af6 Compare April 11, 2022 13:22
@potiuk
Contributor Author

potiuk commented Apr 11, 2022

Done.

@potiuk potiuk force-pushed the fix-behaviour-of-auth-information branch from d0d2af6 to 5d8a1fa Compare April 11, 2022 13:52
@potiuk
Contributor Author

potiuk commented Apr 11, 2022

Fixed static checks too with pre-commit. Unfortunately, rebase does not run pre-commit by default (only when there is a conflict).

@pradyunsg
Member

This is now failing test_prioritize_url_credentials_over_netrc consistently.

@potiuk
Contributor Author

potiuk commented Apr 11, 2022

Yep. Looking at it - I guess it's because of the recent changes in netrc prioritization - the test was added yesterday in eacc739

@potiuk potiuk force-pushed the fix-behaviour-of-auth-information branch from 5d8a1fa to d15d5f7 Compare April 11, 2022 21:40
@potiuk
Contributor Author

potiuk commented Apr 11, 2022

Right - the problem was that the test relied on the non-RFC7617-compliant behaviour:

Before the RFC7617 fix:

  • Initially the call was made to https://USERNAME:PASSWORD@{server.host}:{server.port}/simple and USERNAME and PASSWORD were cached for "server.host"
  • the call returned the index of files to download: /files/simple-3.0.tar.gz
  • downloading /files/simple-3.0.tar.gz re-used USERNAME/PASSWORD

After the RFC-compliant change:

  • Initially the call was made to https://USERNAME:PASSWORD@{server.host}:{server.port}/simple and USERNAME and PASSWORD were cached for "server.host/simple"
  • the call returned the index of files to download: /files/simple-3.0.tar.gz
  • downloading /files/simple-3.0.tar.gz did not use USERNAME:PASSWORD (it should not, according to the RFC)
  • the retry mechanism kicked in
  • the mock server exhausted its side effects and returned a 500 error on every subsequent retry

The fix was to change "/files/" to "/simple/files", which makes the authentication work in an RFC7617-compliant way.
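
To make the before/after difference concrete, here is a minimal sketch of the two caching strategies. It is only an illustration under stated assumptions: `legacy_lookup`, `scoped_lookup` and the `index.example:8443` host are hypothetical stand-ins for the mock server, not pip internals.

```python
from urllib.parse import urlsplit

INDEX_URL = "https://index.example:8443/simple"  # stand-in for the mock server
CREDS = ("USERNAME", "PASSWORD")


def legacy_lookup(url, cache):
    """Old behaviour: credentials are keyed by host:port only."""
    return cache.get(urlsplit(url).netloc)


def scoped_lookup(url, cache):
    """Path-scoped behaviour: the cached path must be a prefix of the request path."""
    parts = urlsplit(url)
    for (netloc, prefix), creds in cache.items():
        # Appending "/" compares whole segments, so "/simple" does not
        # accidentally cover a path such as "/simple-3.0.tar.gz".
        if parts.netloc == netloc and (parts.path + "/").startswith(prefix + "/"):
            return creds
    return None


netloc = urlsplit(INDEX_URL).netloc
legacy_cache = {netloc: CREDS}
scoped_cache = {(netloc, "/simple"): CREDS}

old_file = "https://index.example:8443/files/simple-3.0.tar.gz"
new_file = "https://index.example:8443/simple/files/simple-3.0.tar.gz"

print(legacy_lookup(old_file, legacy_cache))  # credentials are reused
print(scoped_lookup(old_file, scoped_cache))  # no credentials for /files/...
print(scoped_lookup(new_file, scoped_cache))  # credentials apply under /simple/
```

Under the host-keyed cache the `/files/` download silently reuses the credentials. Under the path-scoped cache it gets none, which is why the test only passes once the file lives under `/simple/files`.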

I believe (please correct me if I am wrong) this is in-line with https://peps.python.org/pep-0503/:

All subsequent URLs in this document will be relative to this base URL (so given PyPI’s URL, a URL of /foo/ would be https://pypi.org/simple/foo/.

I think the "index" in simple should return only URLs starting with "/simple".

Likely (if I am right) other tests in this file could also be corrected to follow the correct "/simple/files/" pattern, but I think that is not part of this PR.

@potiuk
Contributor Author

potiuk commented Apr 12, 2022

One worry here when I thought about it (and it's worth making a conscious decision):

I think this might potentially break some repos that are not exactly https://peps.python.org/pep-0503 compliant. That would happen if someone implemented such a repo with the same "mistake" as the author of the tests (i.e. exposed the files directly at the root "/" of the server while following the "/simple/" prefix approach from the PEP, and expected the authentication to be reused for those /* downloads).

There is this comment there which indicates that the implementation might host files anywhere:

There are no constraints on where the files must be hosted relative to the repository.

PyPI itself hosts the files with completely unrelated URLs. And many repositories might do the same (in which case the user/password authentication might not be at all used and likely the file urls generated are already pre-authenticated/signed or contain authentication information).

Even previously, before my fix, the user/password would only be "reused" (accidentally as it is not standard-compliant) to authenticate to download files when the files are hosted at the same URL. But that was not stated anywhere (and is against the RFC7617 for basic authentication).

So there is a small risk that some implementations accidentally rely on this (even if it is against the standards).
I think it has a "breaking" potential, but also it follows the recent history of pip to enforce standards and be a little "brave" in breaking the full compatibility if the 3rd parties are not following the established industry standards, so I think this should be perfectly fine.

But I leave it to the maintainers to decide (knowing the risks). Maybe we should add a flag that "restores" the previous behaviour, just in case?

@pfmoore
Member

pfmoore commented Apr 12, 2022

IMO we have to allow for URLs outside of those which are specified by the simple index API (such as files) to be served from any path on the server - as you noted, PEP 503 explicitly allows for this possibility with the comment

There are no constraints on where the files must be hosted relative to the repository.

In particular, I don't think it's OK to require users to specify a flag to allow that behaviour - there's no reason to assume that the user even knows the URL structure of the server, or even that it won't change (if the server changes its layout, why should the user have to add a flag?)

@potiuk
Contributor Author

potiuk commented Apr 12, 2022

IMO we have to allow for URLs outside of those which are specified by the simple index API (such as files) to be served from any path on the server - as you noted, PEP 503 explicitly allows for this possibility with the comment

They are allowed, it's just that the authentication that is specified with "/simple" now does not carry over to anything hosted under "/" - which is the "right" way according to the internet standard for basic authentication. PEP 503 does not mention anything about authentication, so it's not "against" PEP 503. However, the previous authentication approach did not conform to the RFC standard for basic authentication, which far predates any PEP.

So if anyone served files from "/" - where the URL they used was https://user:password@URL/simple/ - and expected authentication to work, this was never supposed to work (unless an additional user/password was also specified for https://URL/, which might be enough of an escape hatch). Assuming that the user:password specified for https://user:password@URL/simple/ will work for https://URL/ is just wrong. It should not (according to RFC7617).

I am not at all sure if anyone was doing it. If they did, it worked only because the authentication was not following the well-established and security-reasoned standard, so if we decide to add a flag, it is more of an escape hatch until someone fixes the problem.

But I am also OK with not providing the flag and simply switching pip to follow standards, if you are willing to take the (small, I think) risk.

@potiuk
Contributor Author

potiuk commented Apr 12, 2022

What do you propose then @pfmoore ?

@potiuk
Contributor Author

potiuk commented Apr 12, 2022

Actually - I just realized that there is no need for an escape hatch. If you really hit the problem, you can always (as expected with the standard) add the user/password for "https://URL" in pip.conf (or use other ways pip supports).

It does not have to be carried over from the "/simple" URL. So there is no need to add any switch, but it may be worth explaining in the docs what to do if you hit this problem. (I am just trying to anticipate and prepare for any questions users might have.)

I am happy to add docs for it. WDYT?

@uranusjr
Member

I’d say the situation you described is just a bug and should not have worked in the first place, so describing this in the documentation (with a link from the changelog perhaps) should be appropriate.

@pfmoore
Member

pfmoore commented Apr 12, 2022

What do you propose then @pfmoore ?

I do not consider myself a security expert, so I have no opinion on this.

@potiuk potiuk force-pushed the fix-behaviour-of-auth-information branch from d15d5f7 to d643123 Compare April 12, 2022 16:00
@potiuk
Contributor Author

potiuk commented Apr 12, 2022

I pushed a documentation update to explain it. I am not sure whether I should add a changelog entry myself or leave it up to the release manager.

@potiuk potiuk force-pushed the fix-behaviour-of-auth-information branch from d643123 to e5e9e0f Compare April 12, 2022 16:03
@potiuk
Contributor Author

potiuk commented Apr 22, 2022

Any thoughts here? I know @pradyunsg has some compatibility concerns, and I share them. I think what we might do is add an exception for the "/simple/" prefix and treat it in a non-RFC-compliant way. The risk connected with that is very low, and it would cover anyone who follows a "copy" of the standard PyPI approach but with authentication, where "/simple" and "/*" share the same authentication info.

@potiuk
Contributor Author

potiuk commented May 6, 2022

Any chance this one will make it into the upcoming release? It seems that there is at least one other issue, #10806, that is triggered by this non-compliant behaviour, and it seems merging this PR would solve it (I looked only briefly, but it seems that the problem in #10806 is triggered by non-RFC7617-compliant auth reuse).

@q0w
Contributor

q0w commented May 6, 2022

@potiuk could you pls also check this #10979 (comment), I think I've made useless change in that pr, and with RFC-compliant urls it works out of the box.

@potiuk
Contributor Author

potiuk commented May 6, 2022

@potiuk could you pls also check this #10979 (comment), I think I've made useless change in that pr, and with RFC-compliant urls it works out of the box.

I believe your change fixing #10979 was useful regardless of the RFC7617 change. The problem you described would still not be solved in this case. If you added .netrc auth at the base "URL" and .netrc would be prioritized over URL credentials, even with RFC7617 change it would work the same. RFC7617 states that authentication for "http://my-url" is also valid for "http://my-url/project/path" and (if I understand it correctly) the problem #10797 was that the .netrc configuration was used first when getting credentials to download file (and without your change it would be the same).

However, after looking in detail at your log from artifactory, I think I will have to update the fix - see the more detailed description of the worry I had in #10904 (comment). @pradyunsg was also right to be a little reserved about it.

It seems actually artifactory implements what I was afraid of a bit. The "/simple" suffix is not treated as a real part of the repository URL and makes the implementation "nearly RFC7617 compliant":

First the meta-data downloading is done with

And then the file is downloaded using:

This means that the authentication from the first request will NOT be reused in the second. The "my-pypi-repo/simple-package/" is different authentication scope than "my-pypi-repo/simple/simple-package/".

The problem here is that artifactory does not treat the first authentication request as "authentication scope". They act as if the "https://us-python.pkg.dev/$PROJECT/my-pypi-repo/" is the "authentication scope" of the first request. Even if the first URL to call is really "https://us-python.pkg.dev/$PROJECT/my-pypi-repo/simple/simple-package".

And actually - what artifactory does, makes sense. The authentication scope SHOULD be https://us-python.pkg.dev/$PROJECT/my-pypi-repo/ - this is the scope where you specify credentials. But currently we have no good way to handle this - because we just see the first URL requested by pip (which contains /simple/ prefix).

The confusion is caused, I think, by the "/simple" suffix, which is used by PyPI: on one hand it is not mandatory, but on the other hand it is widely used.

I looked at the artifactory docs and, well, there is a lot of explanation of when you should use simple and when not. They seem to be confused too: https://www.jfrog.com/confluence/display/JFROG/PyPI+Repositories. This comes from all the (I think well-known) ambiguity around "/simple" coming from https://peps.python.org/pep-0503/.

We all know that the .pypirc index-url contains the full URL including /simple, but pip search uses --index, which points to the webserver (and does not use "/simple"). This is not a problem on its own, but it makes it difficult to derive the actual "authentication scope" for the requests.

I think there is no "perfect" simple solution because of how the "/simple" prefix can be interpreted by different implementations of a PyPI repository. On one hand it is used by pip/PyPI and different implementations could have mimicked it (as jfrog did); on the other hand the real "authentication scope" is the one that the --index prefix should normally point to, and PEP 503 does not make any assertions about that. It could be, for example (theoretically), that --index-url points to "somerepo/subdir/simple" and --index (i.e. the root of the repo) is "somerepo/subdir/index". In this case the "authentication scope" SHOULD be "somerepo/subdir" rather than "somerepo/subdir/simple" - but we have no easy way to know that.

I think we can only try to guess currently. The current "guess" that is done by pip is that the authentication scope is the "domain name" (and it also does not include the scheme - http and https are equivalent). But it does not handle well the case where one domain can host multiple repos with different authentication (which is the jfrog case, actually).

My change moves the authentication scope to "somerepo/subdir/simple", which is also wrong in the case where someone (and again, jfrog ACTUALLY did it) assumes and uses the fact that the real authentication scope is "somerepo/subdir".

I think we simply should get better at guessing (because we have no way to know for sure what is the real "authentication scope").

My initial proposal is that we check if the first URL path ends with "/simple" and if so - we could set "authentication scope" to be the prefix of "/simple". Otherwise we assume full URL is the "authentication scope".
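
A rough sketch of that guess, for illustration only: `guess_auth_scope` and the `repo.example` URLs are hypothetical, and the `suffix` parameter is an assumption added here so that a non-default suffix could be passed in.

```python
from urllib.parse import urlsplit, urlunsplit


def guess_auth_scope(index_url: str, suffix: str = "/simple") -> str:
    """Guess the authentication scope for an index URL.

    If the path ends with ``suffix``, the scope is everything before it;
    otherwise the full URL (without credentials) is used as the scope.
    """
    parts = urlsplit(index_url)
    netloc = parts.netloc.rpartition("@")[2]  # never keep user:password@
    path = parts.path.rstrip("/")
    if path.endswith(suffix):
        path = path[: -len(suffix)]
    return urlunsplit((parts.scheme, netloc, path + "/", "", ""))


print(guess_auth_scope("https://user:pw@repo.example/project/my-repo/simple/"))
print(guess_auth_scope("https://repo.example/project/my-repo/index"))
```

The first call yields `https://repo.example/project/my-repo/`, the second keeps the full path and yields `https://repo.example/project/my-repo/index/`.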

I think this is not "perfect" but "good enough". For example artifactory allows (!!!) to change the suffix ..... Yeah. You can actually change "/simple" to something else :( . And we have no way of knowing that.

Ideal fix (and maybe we could tackle it together or leave for later) should be that we should be able to specify the "suffix" in PyPI as well.

@pradyunsg @uranusjr WDYT?

@potiuk potiuk marked this pull request as draft May 6, 2022 19:10
@potiuk
Contributor Author

potiuk commented May 6, 2022

Converted to Draft to avoid accidental merging.

@q0w
Contributor

q0w commented May 6, 2022

@potiuk
I mean, this test should not be failing (but it does) - get credentials from .netrc if there are none in the URL

def test_netrc(
    script: PipTestEnvironment,
    data: TestData,
    cert_factory: CertFactory,
) -> None:
    cert_path = cert_factory()
    ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
    ctx.load_cert_chain(cert_path, cert_path)
    ctx.load_verify_locations(cafile=cert_path)
    ctx.verify_mode = ssl.CERT_REQUIRED

    server = make_mock_server(ssl_context=ctx)
    server.mock.side_effect = [
        package_page(
            {
                "simple-3.0.tar.gz": "/files/simple-3.0.tar.gz",
            }
        ),
        authorization_response(str(data.packages / "simple-3.0.tar.gz")),
    ]

    url = f"https://{server.host}:{server.port}/simple"

    netrc = script.scratch_path / ".netrc"
    netrc.write_text(
        f"machine {server.host} login USERNAME password PASSWORD"
    )
    with server_running(server):
        script.environ["NETRC"] = netrc
        script.pip(
            "install",
            "--no-cache-dir",
            "--index-url",
            url,
            "--cert",
            cert_path,
            "--client-cert",
            cert_path,
            "simple",
        )
        script.assert_installed(simple="3.0")

@pradyunsg
Member

pradyunsg commented May 14, 2022

I think one of the things we can do is... to not go about fixing this in one go. Rather, let's treat this as a potentially-disruptive change that we give opt-in and opt-outs for. With that, the rollout of this would look like:

Release X:

  • Add logic for getting credentials in a RFC 7617 compliant manner.
  • Add logic to present a deprecation warning when the credential value that the current logic uses, is not aligned with what the RFC would bring up.
  • Provide an opt-in --use-feature=better-auth-handling or something.
  • We publicise this change, ensuring that we let users know that we recommend that they should opt-in to this. This will likely need someone to write a detailed and clear communication about what this affects, how to opt-in and more.

Release X+2:

  • Change the default behaviour to use the RFC 7617 compliant authentication values. The opt-in flag becomes a no-op right now.
  • Add a corresponding opt-out with an eyebrow-raising name like --use-deprecated=legacy-leaky-auth.
  • Release.
  • At this point, we'll get a barrage of user feedback about whether they like this behaviour or not. We'll need to have triage capacity for that.

Release X+4:

  • Drop the opt-in and opt-out flags.
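
The "Release X" step could look roughly like the sketch below: both lookups run, the legacy result is still returned by default, and a warning is raised when the two disagree. Everything here is hypothetical (`pick_credentials`, the `legacy`/`compliant` callables and the `use_compliant` switch standing in for the proposed opt-in flag); it is not how pip wires its feature flags.

```python
import warnings
from typing import Callable, Optional, Tuple

Credentials = Optional[Tuple[str, str]]
Lookup = Callable[[str], Credentials]


def pick_credentials(
    url: str, legacy: Lookup, compliant: Lookup, use_compliant: bool
) -> Credentials:
    """Choose credentials during the transition period."""
    old, new = legacy(url), compliant(url)
    if use_compliant:
        return new  # the user opted in to the RFC 7617 behaviour
    if old != new:
        # A real implementation would need to redact any secrets in the URL.
        warnings.warn(
            f"Credentials selected for {url} will change under RFC 7617 scoping",
            DeprecationWarning,
            stacklevel=2,
        )
    return old
```

In "Release X+2" the default for `use_compliant` would flip, and in "Release X+4" the legacy branch would be deleted.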

To set expectations here -- I'm likely not going to have time to look at this for about a month after today since I'll likely be AFK during that period. :)

@potiuk
Contributor Author

potiuk commented May 14, 2022

Yeah. I will think a bit more about it too and maybe look in the code of pip. But maybe another way will be to simply pass "actual" repository name to the method that caches the credentials.

I think the real intention here, for sites that support multiple pip repos, is that the "files" served for each repository will always be served under the "repo" path anyway. They might be served "above" the simple URL path as in the example above, but they will never be served above $PROJECT/repo part.

So in the case here:

For project/repo ${PROJECT}/my-pip-repo: https://oauth2accesstoken:****@us-python.pkg.dev/$PROJECT/my-pypi-repo/simple/simple-package
The files served might be:

But they will never, ever be served (if authentication is required to download the files) under:

Because THAT would violate the intention of separating packages and files per repo (and RFC7617).

Currently, the problem is that when we see the authentication information for the first time, we just see
https://oauth2accesstoken:****@us-python.pkg.dev/$PROJECT/my-pypi-repo/simple/simple-package - because this is the first time we see the request (and we have no idea what the actual "repo" URL is).

However, we actually know at this point that https://us-python.pkg.dev/$PROJECT/my-pypi-repo is the actual repository that we are trying to get the package information from. We simply do not pass this information down to the place where we cache authentication information.

So one other solution I see is to add extra information on the "base" URL of the repository we are actually reaching out to, to the part of code where we cache credentials - and then we could cache credentials precisely for that "repo" URL.

This would work fine regardless of the suffix chosen (whether it is simple or not) and of the path used by the multi-repo developers to serve the files, as long as the basic property of "the credentials you pass are valid only for the pip repository that you act on even if it is part of multi-repository domain" holds (which I think is pretty reasonable and actually what the intention is).

The only potential problem I see is if someone develops a solution hosting multiple pip repositories where "authentication" protected files are shared between multiple repositories in a short path:

However (unlike the previous cases) this is wrong according to RFC7617, because even in pip configuration you will not provide credentials for the whole "domain" but you will provide them for repo https://us-python.pkg.dev/$PROJECT/my-pypi-repo/ - so your intention is not to provide it for the whole domain or $PROJECT.

I think that might work


I will try to take a closer look at that while you are AFK @pradyunsg . Have fun.

@pradyunsg
Member

pradyunsg commented May 14, 2022

They might be served "above" the simple URL path as in the example above, but they will never be served above $PROJECT/repo part.

Well, it's feasible that the files that you download are behind different credentials than the index server. The following (imagined) situation is a completely valid implementation of co-operating indexes:

A internal.foo.com/pypi-team1/index/<project>.html file, that lists:

  • internal.foo.com/pypi-sources/<project>/<project>-<version>.tar.gz

A internal.foo.com/pypi-team2/index/<project>.html file, that lists:

  • internal.foo.com/pypi-built-internally/<project>/<project>-<version>-<tag>.whl

IIUC, pip's current behaviour is to use credentials for team1 in pypi-sources, team2 and pypi-built-internally packages -- since they're all behind the same hostname.

IIUC, the whole request is that we should allow users to set different credentials for each of them, picking the correct one based on which one matches the prefix best. This means that users now need to specify credentials for pypi-sources and pypi-built-internally -- something that we don't really provide a mechanism for. Reusing based on what index it came from is... wrong? I'd rather make unauthenticated requests to pypi-sources and pypi-built-internally, and fail. If there are organisational users who care about this use case, I'd prefer that they come forward and help design a solution to accommodate it.

If we require it to be under the same prefix, then we shouldn't be keeping track of what index they've come from. It's one-or-the-other.


I also prefer the "assume things are under the same prefix" model, FWIW. As long as we have the opt-in/opt-out transition model for rolling this out, I think the current implementation (which, IIUC, does that) is fine and likely better long term than a pip-specific hacky solution that makes an assumption that could be wrong.

@pradyunsg
Member

Honestly though, after all this discussion, I feel like it's probably fine to leave things as-is. This is the exact same model as the netrc file -- where the login credentials are set by (basically) a domain name.
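
For comparison, the host-keyed netrc model can be seen with Python's standard `netrc` module. This is a small sketch only: `netrc_credentials` is a hypothetical helper, and `internal.foo.com` is borrowed from the imagined example earlier in the thread.

```python
import netrc
import os
import tempfile
from typing import Optional, Tuple


def netrc_credentials(contents: str, host: str) -> Optional[Tuple[str, str]]:
    """Look up (login, password) for host in netrc-formatted contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, ".netrc")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(contents)
        # authenticators() is keyed by machine name only; there is no path.
        entry = netrc.netrc(path).authenticators(host)
    if entry is None:
        return None
    login, _account, password = entry
    return login, password


contents = "machine internal.foo.com login USERNAME password PASSWORD\n"
print(netrc_credentials(contents, "internal.foo.com"))
print(netrc_credentials(contents, "other.foo.com"))
```

The lookup takes only a host name, so every path on `internal.foo.com` resolves to the same login.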

@potiuk
Contributor Author

potiuk commented May 17, 2022

Honestly though, after all this discussion, I feel like it's probably fine to leave things as-is. This is the exact same model as the netrc file -- where the login credentials are set by (basically) a domain name.

I think that is actually wrong, as it leaks credentials. This problem is not solved.

@github-actions github-actions bot added the needs rebase or merge PR has conflicts with current master label May 25, 2022
@potiuk
Contributor Author

potiuk commented Jul 25, 2022

Actually - as you wish @pradyunsg - I thought about it, and since I do not need this myself, if the maintainers think this is not an issue and accept the risks involved, who am I to argue with it :).

Closing it then.

@potiuk potiuk closed this Jul 25, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 10, 2022
Labels
needs rebase or merge PR has conflicts with current master
Development

Successfully merging this pull request may close these issues.

Using multiple PIP indexes on the same hostname with different credentials does not work
5 participants