Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(tools.uri) normalization decodes as much as possible #8140

Merged
merged 11 commits into from
Jul 26, 2022

Conversation

bungle
Copy link
Member

@bungle bungle commented Dec 2, 2021

We decide to let normalize function to decode URL-encoded string as much as possible.

PLEASE REFERER TO: #8140 (comment)

Issues resolved

Fix #7913, FTI-2904

Outdated discussion:


This is alternative to PR #8139 where we actually fix the normalization function to not do excessive percent-decoding on normalization.

When we added normalization kong.tools.uri.normalize, that function does percent-decoding on everything, except for the reserved characters.

That means that we basically percent-decode more than just the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E). (so called Unreserved Characters)

Alternative Implementation: See #8139

@bungle bungle added the pr/discussion This PR is being debated. Probably just a few details. label Dec 2, 2021
@bungle bungle requested a review from dndx December 2, 2021 11:06
@bungle bungle force-pushed the fix/uri-normalize branch 3 times, most recently from 3fc077f to a9c33a4 Compare December 2, 2021 15:42
@bungle bungle added core/router pr/please review and removed pr/discussion This PR is being debated. Probably just a few details. labels Dec 2, 2021
@bungle bungle added this to the 2.7 milestone Dec 3, 2021
@flrgh
Copy link
Contributor

flrgh commented Dec 4, 2021

Hmm. RFCs are always a chore to interpret. It seems like the original mistake was that normalization was implemented with the assumption that unreserved chars == (all chars - reserved chars), which, as you know, turns out to be incorrect (according to the RFC).

IMO instead of over-decoding and re-encoding we should just fix the logic to be more selective in our decoding, according to the chars in section 2.3:

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

I put a draft/poc of this in another branch (fix/uri-normalize-unreserved 6313dd2). Maybe take a look and tell me what you think?

@bungle
Copy link
Member Author

bungle commented Dec 7, 2021

Hmm. RFCs are always a chore to interpret. It seems like the original mistake was that normalization was implemented with the assumption that unreserved chars == (all chars - reserved chars), which, as you know, turns out to be incorrect (according to the RFC).

IMO instead of over-decoding and re-encoding we should just fix the logic to be more selective in our decoding, according to the chars in section 2.3:

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

I put a draft/poc of this in another branch (fix/uri-normalize-unreserved 6313dd2). Maybe take a look and tell me what you think?

@flrgh yes, we discussed this with @dndx. The thing is that I am not sure which one is better. More selective percent decoding or not. I am not sure can there ever be things already percent decoded in ngx.var.request_uri or not. Do you? If there could be, then those need to be normalized back to percent encoded form. Similar question goes to route.paths. Also as this is kong.tools.uri.normalize, it should support normalizing this: "/ä/a/%2e./a%2E%5f%99%af" should produce "/%C3%A4/a._%99%AF" which is done decoding, process dots, encoding approach I took here (it is difficult to do in one pass). The performance of this PR should be same as before, aka it does not do more work than before. We have already seen usages on other PRs, like yours: https://github.com/Kong/kong/pull/8129/files (if this is about to be tool for normalization it needs to do correct thing).

Also the escaping includes % in non-escape chars which is rather questionable:
https://github.com/Kong/kong/blob/master/kong/tools/uri.lua#L38

@kikito
Copy link
Member

kikito commented Dec 9, 2021

I have spoken with @bungle and for now we're removing this from the 2.7 milestone

@kikito kikito removed this from the 2.7 milestone Dec 9, 2021
@bungle
Copy link
Member Author

bungle commented Jan 18, 2022

If someone wants to work or try different approaches, here is a list that need to be taken in account:

definitions:
reserved chars: ! * ' ( ) ; : @ & = + $ , / ? % # [ ]
unreserved: - _ . ~
alphanumeric: a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9
other: * (meaning any char that is not mentioned above)

dot processing:
/. AND /./
/.. AND /../
/%2e AND /%2e/
/.%2e AND /.%2e/
/.%2E AND /.%2E/
/%2e%2E AND /%2e%2E/
etc.

merge slashes (this is generally thought to be good thing to do, but it might change semantics in some cases):
//
/a//b
etc.

given this:
/ä/a/%2e./a%2E//a/%2e/./a/../a/%2e%2E/%5f%99%af%2f%2F

should normalize (I think) to this:
/%C3%A4/a./a/_%99%AF%2F%2F

What happens there?
ä -> being percent encoded to %C3%A4
%2e -> being uppercased to %2E
// -> slashes being merged
%5f -> being percent decoded to _
/../ -> dot segment being processed
/%2e/ -> dot segment being processed
/a%2E/ -> dot being percent decoded to /a./

In general:

  • unreserved and alphanumeric should be percent decoded
  • reserved should be kept as is (no percent decoding, no percent encoding) BUT percent encoded forms should be turned to uppercase
  • rest should be percent encoded

Then you need to implement dot processing.

Current issue with kong.tools.uri.normalize (as a generic tool):

  • it over percent decodes (basically it percent decodes everything)
  • it does not percent encode

kikito
kikito previously approved these changes Jan 25, 2022
kong/tools/uri.lua Outdated Show resolved Hide resolved
kong/tools/uri.lua Outdated Show resolved Hide resolved
kong/tools/uri.lua Outdated Show resolved Hide resolved
kong/tools/uri.lua Outdated Show resolved Hide resolved
kong/tools/uri.lua Outdated Show resolved Hide resolved
@StarlightIbuki StarlightIbuki requested a review from dndx July 7, 2022 02:49
fffonion pushed a commit that referenced this pull request Jul 7, 2022
It's not a very reliable and sound way to support percent-encoding in regex. We choose to tell users that we have a normalized (standard) form to match with so there's no ambiguity.

#8140 (comment)

fix CT-344
@StarlightIbuki StarlightIbuki self-requested a review July 12, 2022 03:31
@StarlightIbuki StarlightIbuki force-pushed the fix/uri-normalize branch 2 times, most recently from e63b391 to e352d77 Compare July 18, 2022 08:34
Copy link
Member Author

@bungle bungle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at it, and it looks good to me. @flrgh, do you agree?

@StarlightIbuki StarlightIbuki changed the title fix(tools.uri) normalization function excessive percent decoding fix(tools.uri) normalization decodes as much as possible Jul 19, 2022
kong/tools/uri.lua Outdated Show resolved Hide resolved
Copy link
Contributor

@flrgh flrgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-wise, I reviewed and left a couple nitpicks. The table.new thing should probably be fixed, but the other comment on chars_to_decode is mostly just a readability gripe, so not a blocker if anyone disagrees with me about it.

Behavior-wise, there's been a whole bunch of activity on the path handling/normalization discussion since I last participated, so as long as everyone else here is in agreement about this change, it gets my 👍.

bungle and others added 11 commits July 22, 2022 10:37
@fffonion fffonion merged commit b7c082e into master Jul 26, 2022
@fffonion fffonion deleted the fix/uri-normalize branch July 26, 2022 03:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

uri_captures are url decoded
7 participants