Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug when parsing a percent symbol #235

Open
faunris opened this issue Aug 29, 2018 · 13 comments
Open

Bug when parsing a percent symbol #235

faunris opened this issue Aug 29, 2018 · 13 comments
Labels

Comments

@faunris
Copy link

faunris commented Aug 29, 2018

>>> URL('http://ex.com/%F0').path
'/%F0'
>>> URL('http://ex.com/%F0%')
URL('http://ex.com/%F0%25')
>>> URL('http://ex.com/%F0%').path
'/%25' # exptected: /%F0%25

the last percent character breaks path

python: 3.6.3
yarl: 1.2.6

@gyermolenko
Copy link
Contributor

@faunris Hi
Could you please explain why you expect /%F0% turn into /%F0%25?
As far as I know neither %F0 nor %F0% are valid url path sequences.
And percent symbol should be decoded within context of following two HEXDIGs.
(I could be wrong, but this is my current understanding)

@faunris
Copy link
Author

faunris commented Oct 30, 2018

@gyermolenko
Hello, sorry about long answer.
I think if

URL('http://ex.com/%F0').path

return '/%F0', then

URL('http://ex.com/%F0%').path

should retun '/%F0%25'

Becouse I think url decode \ encode should be equivalent

@asvetlov
Copy link
Member

I would say URL('http://ex.com/%F0%').path should produce '/%F0%'.
The current behavior is buggy

@asvetlov
Copy link
Member

I was wrong, a single percent should be percent-encoded as %25 according to RFC 3986:

Because the percent ("%") character serves as the indicator for
percent-encoded octets, it must be percent-encoded as "%25" for that
octet to be used as data within a URI.

@faunris assumption is correct:
URL('http://ex.com/%F0%').path should produce '/%F0%25'

@gyermolenko
Copy link
Contributor

imho If you absolutely need to encode custom invalid sequences you should be consistent and turn every "%" into "%25". E.g. "%F0%" into "%25F0%25".
Online url-encoders do that way.

@asvetlov
Copy link
Member

Looks viable, thanks.
My current vision is: the problem exists (while the case is pretty rare).
It should be fixed, but we need to figure out the desired behavior.
I appreciate any proposal.

@gyermolenko
Copy link
Contributor

Here are some results. I highlighted ones that I consider invalid in red.
https://docs.google.com/spreadsheets/d/1L3IKXMUh5Ya9D_PIT9ogaFeISj_SVEWMkblme3Xiyt0

Expected results correspond to my understanding of rfc3986 .
Also to 3rd party online tools (i.e. first googled result https://meyerweb.com/eric/tools/dencoder/). Although I review their results critically.

@faunris
Copy link
Author

faunris commented Dec 24, 2018

I think "%D0" is valid encoded symbol. It shouldn't additional encode.

For example:
%25 is pct-encoded
%F0 is pct-encoded
And
%D0 is pct-encoded
We don't need additional encode percent symbol for all pct-encoded group.

2.1 rfc3986

pct-encoded = "%" HEXDIG HEXDIG

2.4 rfc3986

Because the percent ("%") character serves as the indicator for
percent-encoded octets, it must be percent-encoded as "%25" for that
octet to be used as data within a URI. Implementations must not
percent-encode or decode the same string more than once, as decoding
an already decoded string might lead to misinterpreting a percent
data octet as the beginning of a percent-encoding, or vice versa in
the case of percent-encoding an already percent-encoded string.

@gyermolenko
Copy link
Contributor

@faunris

I think "%D0" is valid encoded symbol. It shouldn't additional encode.

which one? It is not ascii, it is not in unreserved url chars. Why is it valid?

must not
percent-encode or decode the same string more than once, as decoding
an already decoded string might lead to misinterpreting a percent
data octet as the beginning of a percent-encoding, or vice versa in
the case of percent-encoding an already percent-encoded string.

for me it means "turn % into %25 only once, to avoid %25 turning into %2525 and so on"

@faunris
Copy link
Author

faunris commented Dec 24, 2018

@gyermolenko

which one? It is not ascii, it is not in unreserved url chars. Why is it valid?

Becouse standart say:

pct-encoded = "%" HEXDIG HEXDIG

and in terms of standard
% 25 and %D0 valid symbol

@gyermolenko
Copy link
Contributor

Not everything that fits into "%" HEXDIG HEXDIG is valid (i.e. can be properly decoded back).
Hence my question - what string was encoded into "%D0"?

@serhiy-storchaka
Copy link
Contributor

serhiy-storchaka commented Sep 27, 2020

It is discussable how %F0 should be decoded, but decoding %25 depends on the preceding characters:

>>> u = URL('/%25'); u.raw_path, u.path
('/%25', '/%')
>>> u = URL('/%F0%25'); u.raw_path, u.path
('/%F0%25', '/%25')

Should not %25 be always decoded as %? This looks like a bug.

@serhiy-storchaka
Copy link
Contributor

There is also difference between Python and Cython implementations.

Python implementation:

>>> URL('/%/%25')
URL('/%25/%25')

Cython implementation:

>>> URL('/%/%25')
URL('/%25/%2525')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants