Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URI parser cannot parse URNs #379

Open
Reisen opened this issue Jan 2, 2020 · 4 comments
Open

URI parser cannot parse URNs #379

Reisen opened this issue Jan 2, 2020 · 4 comments

Comments

@Reisen
Copy link

Reisen commented Jan 2, 2020

It seems the URI parser stumbles with URNs. A simple test to reproduce:

fn main() {
    use http::uri::Uri;
    let uri = "urn:isbn:1238492".parse::<Uri>();
    println!("{:?}", uri);
}

Produces the output:

$ ./testcase
Err(InvalidUri(InvalidAuthority))

The authority is only present when the scheme is followed by //, otherwise the parser should be parsing everything after the first : as the path. Python gets this right, and Go parses this as Opaque (instead of Host, but still successfully parses):

Python:

>>> urlparse('urn:isbn:0451450523')
ParseResult(scheme='urn', netloc='', path='isbn:0451450523', params='', query='', fragment='')

Go:

>>> url, _ := url.Parse("urn:isbn:0451450523")
>>> fmt.Println("Host (Opaque):", url.Opaque)
Host (Opaque): isbn:0451450523
@yifanmai
Copy link

yifanmai commented Feb 1, 2020

The main challenge in implementing this is that the Uri parser allows inputs with no scheme. For example, the parser parses localhost:3000 as having no scheme and an authority of localhost:3000. This means that some inputs are ambiguous. As a hypothetical example, for the input custom:3000, it is unclear if the scheme is custom and the path is 3000, or if there is no scheme and the authority is custom:3000.

The quickest fix would be to special-case the urn scheme (http and https are already special-cased). Unfortunately this would not work with custom schemes such as in the previous example.

Is there a use case that would require supporting URNs?

@Reisen
Copy link
Author

Reisen commented Feb 3, 2020

So URNs are actually designed to be valid URIs so if the parser respected RFC 3986 they should already automatically be parseable and not need any extra support. The issue here is the current URI parser implementation isn't correct.

In your examples both localhost and custom would unambiguously be the scheme and for a URI with no scheme you would have to enter //localhost:3000 (see here).

Other languages do get this right:

Python:

>>> urlparse('localhost:8000')
ParseResult(scheme='localhost', netloc='', path='3000', params='', query='', fragment='')
>>> urlparse('//custom:3000')
ParseResult(scheme='', netloc='custom:3000', path='', params='', query='', fragment='')

Go:

>>> url, _ := url.Parse("custom:3000")
Opaque: 3000 , Host:  , Scheme: custom , Path: 
>>> url, _ := url.Parse("//custom:3000")
Opaque:  , Host: custom:3000 , Scheme:  , Path: 

My personal use case is that I receive URIs from a third party of which URNs are mixed into the incoming data. I reached for a URI implementation expecting it to work and in the end switched to different language to write the tool.

@yifanmai
Copy link

yifanmai commented Feb 6, 2020

Strangely enough, I get a different result for Python (2.7.17rc1 and 3.7.5rc1):

>>> urlparse('localhost:8000')
ParseResult(scheme='', netloc='', path='localhost:8000', params='', query='', fragment='')
>>> urlparse('//custom:3000')
ParseResult(scheme='', netloc='custom:3000', path='', params='', query='', fragment='')

While I agree with you regarding RFC 3986 conformance, the current parser is already non-conformant and does something similar to the Python implementation. So changing the behavior now might have undesirable effects on downstream hyper clients.

@Reisen
Copy link
Author

Reisen commented Feb 11, 2020

So after doing a bit of digging it seems that Python had special handling for ports which caused it to break when paths are numeric. It's also doing the wrong thing there but they did in fact fix this. Very recently it seems:

python/cpython#16839
python/cpython#661

It's likely not being backported to older releases for exactly the reason you mention, as peoples code would suddenly start breaking on only a minor release bump.

I think this should still be fixed and simply be included in a major release. Otherwise this library essentially empowers the Rust ecosystem to do the wrong thing. If Python can fix this after 20 years, it's definitely not too late here 🙂

I'm happy to try and make the changes myself if the change is considered acceptable by the hyper maintainers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants