Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Percent-decode more stuff? #87

Closed
SimonSapin opened this issue Feb 10, 2016 · 39 comments
Closed

Percent-decode more stuff? #87

SimonSapin opened this issue Feb 10, 2016 · 39 comments

Comments

@SimonSapin
Copy link
Contributor

In #77, I was surprised that . was such a special case. It is indeed in Firefox 44, but in Chromium 48 other characters are also percent-decoded: -, _, ~, and ASCII alpha-numerics. (I don’t have other browsers to test right now.)

http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=3878

It may be worth testing other URL components too.

@annevk
Copy link
Member

annevk commented Feb 10, 2016

I think it's a special case in most browsers from when I tested this last time around.

@annevk
Copy link
Member

annevk commented Oct 19, 2016

@achristensen07 what are your ideas here? I believe @valenting wanted to do more decoding in Gecko as well: https://bugzilla.mozilla.org/show_bug.cgi?id=1197123.

It's basically a choice between this document, Safari, and Firefox and Chrome and Edge. Arguably what Chrome and Edge do is somewhat nicer since there is more canonicalization, but it could lead to subtle issues too.

@achristensen07
Copy link
Collaborator

I think I'm opposed to this change. alert(new URL("http://host/p%61th")); should show the %61, not 'a'. There are a lot of URL implementations based on old RFCs, and we should try to match their behavior as closely as possible. People sometimes want '%' in the path and often in the query, and we don't want people to have to put things like "%2525" in their URL for some user agents and not others.

@annevk
Copy link
Member

annevk commented Oct 20, 2016

@achristensen07 note that the old RFCs encourage normalizing these in the second paragraph of https://tools.ietf.org/html/rfc3986#section-2.3.

@annevk
Copy link
Member

annevk commented Oct 20, 2016

To be clear, not %25, but a lot of other ranges, including %61.

@achristensen07
Copy link
Collaborator

Sure enough, Chrome and Edge percent decode more things in the path, but not the query. This can bee seen with alert(new URL("http://host/%45%25?%45%25")); We definitely shouldn't touch the query and all browsers seem to agree on this.
I disagree with this proposal and rfc3986 because I think a canonicalized URL should always be untouched when reparsed. With this proposal, "http://host/%%36%31" would be canonicalized to "http://host/%61" which when reparsed would become "http://host/a" which is bad. Right now Chrome percent-encodes the first '%' in "http://host/%%36%31" which is strange, and Edge throws an exception.
It is clear that something needs to change and that we need to unite behavior. Right now, it is impossible to send "GET /%2e HTTP/1.1" as the first line of an HTTP request and I think that also needs to change. I'm not sure what the solution should be, but percent-decoding more characters isn't it.

@annevk
Copy link
Member

annevk commented Oct 21, 2016

Paging @foolip to see if we can get Chrome's perspective on this.

@annevk
Copy link
Member

annevk commented Oct 21, 2016

@valenting I suggest that before changing Gecko here we try to bring this thread to a conclusion first.

@achristensen07
Copy link
Collaborator

Is it common on the internet to see percent-encoded dots or other characters in the path? Why was section 2.3 added to rfc3986? Safari has not percent-decoded anything in the path, and I'm not aware of any compatibility issues, but it's possible I've missed something. A solution to this problem would be to just not percent-decode anything in the path, including %2e and %2E

@gibson042
Copy link

Question: to which of the following should "http://host/%%36%31" be equivalent?

  • "http://host/%2561" (representing the non-decodable percent as a percent and recovering immediately)
  • "http://host/%25%25361" (representing the non-decodable percent as a percent and recovering after the character responsible for its non-decodability)
  • "http://host/%25%2536%2531" (giving up on percent decoding completely upon encountering a non-decodable percent)
  • nothing (it is invalid, as in the older RFCs)
  • other (e.g., "http://host/a") (please no)

@achristensen07
Copy link
Collaborator

Safari and Firefox both leave "http://host/%%36%31" untouched when canonicalizing. They also agree on the behavior of URLs like "http://host/%2e" and I suggest we change the spec to match this. What is the rationale behind Chrome's percent-encoding only the first percent, and percent decoding some characters but not others? It leads to some confusing behavior.

@valenting
Copy link
Collaborator

valenting commented Oct 25, 2016

They also agree on the behavior of URLs like "http://host/%2e" and I suggest we change the spec to match this.

We actually tried to change it in Firefox so that we decode %2e but it got backed out due to an implementation bug. :)

@achristensen07
Copy link
Collaborator

I still think we should change the spec to not percent-decode anything in the path. Percent-decoding %2E makes it so there is only one thing that cannot be in the path. Percent-decoding more leads to strange reparsing behavior. Simpler is better, unless there's a good reason we need this.

kisg pushed a commit to paul99/webkit-mips that referenced this issue Oct 25, 2016
https://bugs.webkit.org/show_bug.cgi?id=163929

Reviewed by Alexey Proskuryakov.

LayoutTests/imported/w3c:

* web-platform-tests/url/a-element-expected.txt:
* web-platform-tests/url/a-element-xhtml-expected.txt:
* web-platform-tests/url/url-constructor-expected.txt:

Source/WebCore:

Covered by updated API tests, which show that URLParser now matches URL::parse in these cases.
Also covered by newly failing web platform tests, which were failing before URLParser was enabled.
If whatwg/url#87 is resolved we can change behavior to match.

* platform/URLParser.cpp:
(WebCore::URLParser::isSingleDotPathSegment):
(WebCore::URLParser::isDoubleDotPathSegment):
(WebCore::URLParser::consumeSingleDotPathSegment):
(WebCore::URLParser::consumeDoubleDotPathSegment):
(WebCore::URLParser::parse):
(WebCore::URLParser::isPercentEncodedDot): Deleted.

Tools:

* TestWebKitAPI/Tests/WebCore/URLParser.cpp:
(TestWebKitAPI::TEST_F):

LayoutTests:

* fast/url/path-expected.txt:
* fast/url/standard-url-expected.txt:
* fetch/fetch-url-serialization-expected.txt:



git-svn-id: http://svn.webkit.org/repository/webkit/trunk@207795 268f45cc-cd09-0410-ab3c-d52691b4dbfc
kisg pushed a commit to paul99/webkit-mips that referenced this issue Oct 25, 2016
https://bugs.webkit.org/show_bug.cgi?id=163929

Reviewed by Alexey Proskuryakov.

LayoutTests/imported/w3c:

* web-platform-tests/url/a-element-expected.txt:
* web-platform-tests/url/a-element-xhtml-expected.txt:
* web-platform-tests/url/url-constructor-expected.txt:

Source/WebCore:

Covered by updated API tests, which show that URLParser now matches URL::parse in these cases.
Also covered by newly failing web platform tests, which were failing before URLParser was enabled.
If whatwg/url#87 is resolved we can change behavior to match.

* platform/URLParser.cpp:
(WebCore::URLParser::isSingleDotPathSegment):
(WebCore::URLParser::isDoubleDotPathSegment):
(WebCore::URLParser::consumeSingleDotPathSegment):
(WebCore::URLParser::consumeDoubleDotPathSegment):
(WebCore::URLParser::parse):
(WebCore::URLParser::isPercentEncodedDot): Deleted.

Tools:

* TestWebKitAPI/Tests/WebCore/URLParser.cpp:
(TestWebKitAPI::TEST_F):

LayoutTests:

* fast/url/path-expected.txt:
* fast/url/standard-url-expected.txt:



git-svn-id: http://svn.webkit.org/repository/webkit/trunk@207805 268f45cc-cd09-0410-ab3c-d52691b4dbfc
@annevk
Copy link
Member

annevk commented Oct 25, 2016

I don't understand. When I try new URL("https://example/%2e/test") I get https://example/test in Firefox (Nightly).

@annevk
Copy link
Member

annevk commented Oct 25, 2016

And not normalizing that seems dangerous for servers, since you might be able to trick the server to go outside the root of the site.

@valenting
Copy link
Collaborator

The bug was about %2e%2e%2e - or %2eX we don't normalize that.

@achristensen07
Copy link
Collaborator

What if we kept the definition of double-dot path segments to protect servers and maybe single-dot path segments, but removed this:
If c is "%" and remaining, ASCII lowercased starts with "2e", append "." to buffer and increase pointer by two.

@annevk
Copy link
Member

annevk commented Oct 26, 2016

Yeah, that could work. I'd still really like to hear from @sleevi / @foolip.

@foolip
Copy link
Member

foolip commented Oct 26, 2016

I don't know Blink's KURL or what lies beneath well enough to judge what that change might mean. @sleevi?

Tests that exercise the difference might be useful, if they'd go from pass to fail or vice versa in Chrome it should be easier to understand where to look. https://github.com/w3c/web-platform-tests/tree/master/url is the right place I guess?

@annevk
Copy link
Member

annevk commented Oct 26, 2016

@foolip yes.

@gibson042
Copy link

Safari and Firefox both leave "http://host/%%36%31" untouched when canonicalizing.

And therefore equivalent to "http://host/%25%2536%2531", right?

@achristensen07
Copy link
Collaborator

No, none of those percents are encoded or decoded.

Also, I don't think it is adequate to protect servers by relying on the client getting rid of %2e%2e in the path. A server should have to protect itself from requests for ../../../etc. even from malicious clients that do not obey the URL spec. I don't see how parsing %2e%2e as .. protects servers.

@annevk
Copy link
Member

annevk commented Oct 27, 2016

@achristensen07 you cannot craft malicious requests like that from a browser (and therefore with the user's credentials).

@sleevi
Copy link

sleevi commented Oct 27, 2016

@annevk Sorry for the delay. The rationale for Chrome's current behaviour is, I believe, captured here in this comment block - https://cs.chromium.org/chromium/src/url/url_canon_path.cc?rcl=0&l=167

The reason for that change is explained in https://chromium.googlesource.com/chromium/src/+/6982036dc15ba85cdbf7339265f030ff7582ff3b - and the related crash.

So there are definitely implications to changing Chrome that make it harder to be an easy flip, but it's not to say it's unreasonable to explore changing.

@gibson042
Copy link

No, none of those percents are encoded or decoded.

This project claims an explicit goal of replacing RFC 3986 and RFC 3987 by aligning with contemporary implementations, presumably affecting both clients and servers as those documents do. And since percent-decoding is already specified, "http://host/%36%31" is equivalent to "http://host/61". The question is what equivalence class (if any) "http://host/%%36%31" falls into (e.g., Chrome treats it as equivalent to "http://host/%2561", one option of at least four).

@annevk
Copy link
Member

annevk commented Oct 28, 2016

@gibson042 percent-decoding is not specified by the URL Standard (at least not to a significant extent). Replacing those RFCs does not mean that we'll prescribe what they prescribe.

@annevk
Copy link
Member

annevk commented Oct 28, 2016

It seems we avoid most issues here by simply not percent-decoding (other than host, where a one-time decode pass followed by IDNA is fine as it will return failure if anything fails).

My conclusion here is to embrace the suggestion in #87 (comment) to only work with double-dot path segments and single-dot path segments and not decode %2e generally (or anything else).

We need to make sure that is covered by the testsuite.

Gecko will need to WONTFIX https://bugzilla.mozilla.org/show_bug.cgi?id=1197123 (action item for @valenting I think).

Chrome and Edge will have to change their behavior. I think in the end that will turn out to be a drastic code simplification for them since they no longer have to worry about cases such as %%300.

@annevk
Copy link
Member

annevk commented Oct 31, 2016

@sleevi I'm not sure I follow. If there's no normalization during parsing, there's no normalization in the model, and there's no normalization exposed in the API. Where do you think normalization would happen?

@sleevi
Copy link

sleevi commented Oct 31, 2016

@annevk I'm suggesting that normalization is an intrinsic necessity of Chrome, at present, and as such, ends up reflected in the API we surface. While it's possible to suggest that's changable, it's not currently a high priority - that is, the normalization that occurs is an intrinsic dependency of many of Chromium's URL-handling subsystems. If we change the parsing phases to remove normalization, this creates more issues than it solves. Put differently, if normalization is removed, it's unlikely Chromium will be able to align on this point in any reasonable timeframe.

@annevk
Copy link
Member

annevk commented Oct 31, 2016

@sleevi could you perhaps explain why it's an intrinsic necessity in some detail? And I guess also, what exactly the normalization entails?

@sleevi
Copy link

sleevi commented Oct 31, 2016

@annevk It's important to understand, as mentioned before, that our URL code is shared across a variety of Google client platforms, and, even within Chrome, shared through a variety of layers - IPC, networking, UI (and, of that, security UI as a distinct subset), and of course, the web-visible exposure provided by the URL API.

While the URL API itself has no intrinsic needs that we're immediately aware of (short of checking compat issues), what is essential for a variety of checks and comparisons is that we have a consistent set of behaviours and presentations for URLs, and that inherently necessitates normalization, such that any two URLs, if they represent the same underlying URL, are treated equivalently. This behaviour has been extensively integrated into our code, because it was one of the motivations for the library itself.

Again, that's not to say we can't change, but it's certainly not a priority, given how significant an undertaking it would be. I think @gibson042's concerns are reasonable, and I suspect that many web developers would be surprised. I have trouble accepting @achristensen07's arguments against it.

To put the counter-argument forward: Why do you feel that normalization would be bad?

@annevk
Copy link
Member

annevk commented Nov 1, 2016

I'm not necessarily opposed, if we can find a reasonable processing model that everyone can agree on and doesn't lead to subtle issues with input such as %%300. I'm just trying to explore options.

And note that the URL API is not the only bit that's web-visible. HTTP requests, <a>, location, etc. are all part of the equation.

@annevk
Copy link
Member

annevk commented Dec 29, 2016

I think for now I'm proceeding with the plan on aligning the standard with what Firefox and Safari implement.

We don't have a credible alternative proposal and the current document is aligned with nobody.

And while RFC3986 does allow for additional normalization, it doesn't prescribe it, so I think either way we're roughly aligned with that. We could mention explicitly that servers are allowed to perform additional normalization and maybe should do that at some point. Or even expose an API to clients.

annevk added a commit to web-platform-tests/wpt that referenced this issue Dec 29, 2016
annevk added a commit that referenced this issue Dec 29, 2016
This basically reverts bee5ad8 as it
did not end up shipping in Firefox and the tentative conclusion is that
percent-decoding in URLs while parsing creates more trouble than it’s
worth.

We’ll still handle %2e when it’s part of a single-dot or double-dot
path segment.

Fixes #87.
annevk added a commit that referenced this issue Dec 29, 2016
This basically reverts bee5ad8 as it
did not end up shipping in Firefox and the tentative conclusion is that
percent-decoding in URLs while parsing creates more trouble than it’s
worth.

We’ll still handle %2e when it’s part of a single-dot or double-dot
path segment.

Tests: web-platform-tests/wpt#4104.

Closes #87.
annevk added a commit to web-platform-tests/wpt that referenced this issue Jan 3, 2017
This reverts commit 47d2089 and also updates setter tests that were added later.

See whatwg/url#87 for discussion and whatwg/url#156 for the URL Standard PR.
annevk added a commit that referenced this issue Jan 3, 2017
This basically reverts bee5ad8 as it
did not end up shipping in Firefox and the tentative conclusion is that
percent-decoding in URLs while parsing creates more trouble than it’s
worth.

We’ll still handle %2e when it’s part of a single-dot or double-dot
path segment.

Tests: web-platform-tests/wpt#4104.

Closes #87.
jasnell added a commit to jasnell/node that referenced this issue Jan 4, 2017
Per a recent change to the URL spec, arbitrary %2e sequences
in URL paths that are not single or double dot segments are
not to be decoded.

Refs: whatwg/url#87
Refs: whatwg/url#156
Refs: web-platform-tests/wpt@d93247d
Fixes: nodejs#10598
jasnell added a commit to nodejs/node that referenced this issue Jan 6, 2017
Per a recent change to the URL spec, arbitrary %2e sequences
in URL paths that are not single or double dot segments are
not to be decoded.

Refs: whatwg/url#87
Refs: whatwg/url#156
Refs: web-platform-tests/wpt@d93247d
Fixes: #10598
PR-URL: #10602
Reviewed-By: Michal Zasso <[email protected]>
Reviewed-By: Italo A. Casas <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 18, 2017
Per a recent change to the URL spec, arbitrary %2e sequences
in URL paths that are not single or double dot segments are
not to be decoded.

Refs: whatwg/url#87
Refs: whatwg/url#156
Refs: web-platform-tests/wpt@d93247d
Fixes: nodejs#10598
PR-URL: nodejs#10602
Reviewed-By: Michal Zasso <[email protected]>
Reviewed-By: Italo A. Casas <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 19, 2017
Per a recent change to the URL spec, arbitrary %2e sequences
in URL paths that are not single or double dot segments are
not to be decoded.

Refs: whatwg/url#87
Refs: whatwg/url#156
Refs: web-platform-tests/wpt@d93247d
Fixes: nodejs#10598
PR-URL: nodejs#10602
Reviewed-By: Michal Zasso <[email protected]>
Reviewed-By: Italo A. Casas <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 24, 2017
Per a recent change to the URL spec, arbitrary %2e sequences
in URL paths that are not single or double dot segments are
not to be decoded.

Refs: whatwg/url#87
Refs: whatwg/url#156
Refs: web-platform-tests/wpt@d93247d
Fixes: nodejs#10598
PR-URL: nodejs#10602
Reviewed-By: Michal Zasso <[email protected]>
Reviewed-By: Italo A. Casas <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
italoacasas pushed a commit to italoacasas/node that referenced this issue Jan 27, 2017
Per a recent change to the URL spec, arbitrary %2e sequences
in URL paths that are not single or double dot segments are
not to be decoded.

Refs: whatwg/url#87
Refs: whatwg/url#156
Refs: web-platform-tests/wpt@d93247d
Fixes: nodejs#10598
PR-URL: nodejs#10602
Reviewed-By: Michal Zasso <[email protected]>
Reviewed-By: Italo A. Casas <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
ryanhaddad pushed a commit to WebKit/WebKit that referenced this issue Dec 22, 2020
https://bugs.webkit.org/show_bug.cgi?id=163929

Reviewed by Alexey Proskuryakov.

LayoutTests/imported/w3c:

* web-platform-tests/url/a-element-expected.txt:
* web-platform-tests/url/a-element-xhtml-expected.txt:
* web-platform-tests/url/url-constructor-expected.txt:

Source/WebCore:

Covered by updated API tests, which show that URLParser now matches URL::parse in these cases.
Also covered by newly failing web platform tests, which were failing before URLParser was enabled.
If whatwg/url#87 is resolved we can change behavior to match.

* platform/URLParser.cpp:
(WebCore::URLParser::isSingleDotPathSegment):
(WebCore::URLParser::isDoubleDotPathSegment):
(WebCore::URLParser::consumeSingleDotPathSegment):
(WebCore::URLParser::consumeDoubleDotPathSegment):
(WebCore::URLParser::parse):
(WebCore::URLParser::isPercentEncodedDot): Deleted.

Tools:

* TestWebKitAPI/Tests/WebCore/URLParser.cpp:
(TestWebKitAPI::TEST_F):

LayoutTests:

* fast/url/path-expected.txt:
* fast/url/standard-url-expected.txt:
* fetch/fetch-url-serialization-expected.txt:



Canonical link: https://commits.webkit.org/181656@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@207795 268f45cc-cd09-0410-ab3c-d52691b4dbfc
ryanhaddad pushed a commit to WebKit/WebKit that referenced this issue Dec 22, 2020
https://bugs.webkit.org/show_bug.cgi?id=163929

Reviewed by Alexey Proskuryakov.

LayoutTests/imported/w3c:

* web-platform-tests/url/a-element-expected.txt:
* web-platform-tests/url/a-element-xhtml-expected.txt:
* web-platform-tests/url/url-constructor-expected.txt:

Source/WebCore:

Covered by updated API tests, which show that URLParser now matches URL::parse in these cases.
Also covered by newly failing web platform tests, which were failing before URLParser was enabled.
If whatwg/url#87 is resolved we can change behavior to match.

* platform/URLParser.cpp:
(WebCore::URLParser::isSingleDotPathSegment):
(WebCore::URLParser::isDoubleDotPathSegment):
(WebCore::URLParser::consumeSingleDotPathSegment):
(WebCore::URLParser::consumeDoubleDotPathSegment):
(WebCore::URLParser::parse):
(WebCore::URLParser::isPercentEncodedDot): Deleted.

Tools:

* TestWebKitAPI/Tests/WebCore/URLParser.cpp:
(TestWebKitAPI::TEST_F):

LayoutTests:

* fast/url/path-expected.txt:
* fast/url/standard-url-expected.txt:



Canonical link: https://commits.webkit.org/181662@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@207805 268f45cc-cd09-0410-ab3c-d52691b4dbfc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

7 participants