-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify behaviour of getRequestURI(), getContextPath(), getServletPath() and getPathInfo() #18
Comments
@glassfishrobot Commented |
@glassfishrobot Commented |
@glassfishrobot Commented |
|
An additional point requiring clarification is that RFC 3986 (section 2.2) states that %nn encoded reserved characters are not equivalent to their decoded forms. i.e. "/foo/bar/" and "/foo%2Fbar" are not equivalent. This raises an additional question. When a %nn reserved character appears in any part of the URI should it be decoded or not in the return value of each of the above methods? I'm leaning towards a solution that leaves them in their %nn form and adds a utility method to the API that performs %nn decoding. |
the servlet spec in section 3.1 says:
I think when those words were written URIs were defined by RFC2396 which well defined path parameters in section 3.3: path = [ abs_path | opaque_part ]
path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
param = *pchar
pchar = unreserved | escaped |
":" | "@" | "&" | "=" | "+" | "$" | "," However, that RFC has since been obsoleted by RFC3986 which now no longer has a normative definition for path parameters and instead just says:
So this gives us two problems. Firstly I now cannot find any current specification of HTTP URL path parameters and how they should be parsed. So I think the servlet spec could be relying on the grandfathered definition from RFC2396. IS this true? Does anybody know of a current specification for them? Secondly, there are URIs that if RFC3986 is applied to (ie normalisation, then decoding and param handling) then the result is very ambiguous when received as a string. Examples include:
|
I'd love to see us make some progress on this in the next iteration of the Servlet spec. There are a lot of edge cases and interdependencies between edge cases as well as places where the meaning RFCs may not be immediately clear and/or has been misinterpreted in the past. We have a wiki so my suggestion is that we draft some text so we have a common, agreed understanding in the wiki, with examples, and then discuss the changes we wish to make to the specification document and/or the Javadoc so that the specification is aligned with that common understanding. I'll get started on that Wiki page and I'll post again here when I have something ready to start discussing. |
Wiki page has been created. I think the first thing to tackle is path parameters. https://github.com/eclipse-ee4j/servlet-api/wiki/HTTP-URIs-and-Servlet-API-methods |
I think we also need to consider security issues that are introduced by ill defined path parameters. Jetty recently "improved" our URI handling to be more compliant with RFC3986 and that just resulted in a bunch of CVEs as it created many issues. For example in the URI Fundamentally |
Agreed wholeheartedly re security issues. It is also important we clarify these ambiguities and get consistent behaviour between containers because inconsistent behaviour between containers can also be a source of vulnerabilities (if an app coded for one container works securely on that container it may not be secure on a different container).
The result is:
so my conclusion is that if we retain path parameters, we have to retain them all the way through the processing chain and if we do that the request will fail because the context path doesn't match, the servlet path doesn't match or the file doesn't exist. That is the second option in the wiki page option. The other option is to remove them early on in the processing. I've added a comparison of the options to the wiki. Despite arguing consistently for option 1 on the Tomcat lists for a number of years (and implementing it) I'm beginning to think that option 2 is the better option. |
@markt-asf I'm not entirely sure that we can separate discussion of I think an exercise that would be good to help clarify how these characters should be handled in received URIs is to better define what If the
Fundamentally, the problem with the decoded URI path string space is that once decoded, you can't tell the difference between a URIs If we want to be able to access all resources, the One way forward I see is:
|
I agree that As I have got my head around the evolution of path parameters in the URI RFCs from the very specific in RFC 1808 to the very generic in RFC 3986, I am now in agreement with you that we need to approach path parameters and %nn decoding in combination. I have a way forward in mind that I think is very similar to the way forward you expressed. I am assuming that we are starting with the path component of the URI and, apart from the process of extraction, it is unchanged. The steps are as follows:
This means files that use any reserved character other than Applications and frameworks that want to use path parameters can do so via the What I'd really like to do is to avoid deprecating One possibility would be for the application to declare which reserved characters it wished to use as delimiters.
|
I agree with a lot you say... but a few points inline...
We should also specifically prohibit %uXXXX decoding. This was a rejected proposal, but there are still some usages of it, so we still offer it as a legacy mode. Would be good to kill that completely.
The existing The other thing to consider is do we really need the decoded path broken down into servletPath and pathInfo? I see so much code that is adding those back together. How about we just provide a new
I don't like even more new methods, as the API is already very fat and many of these methods might need to be replicated in request attributes, which are already too many! So instead I think we just accept that the existing
The default mode could be
Firstly, the default servlet cannot map URIs directly to the file system and must go via a
See above. I'd prefer not to deprecate these methods as the vast majority of applications use them and have no need to ever see an path that contains a decoded reserved character. Thus the vast majority of applications will continue to work perfectly well with no code changes even if their container switches to
Remember that resources can be served from
But we do need to provide something like my proposed
URI handling should be portable and we really don't need any doubt about what are reserved characters or not, specially as the So in summary, I like the decoding approach you've outlined, but think we can achieve the outcome we want with:
|
Yes, I agree with pretty much all of that. It is great to feel that some progress is being made in this area. I wasn't aware
I'd forgotten that Regarding splitting I hear your objection regarding new methods and implications for request attributes. That is a good reason to try and minimise the number of new methods. I agree the default servlet needs to use something like My thinking around allowing applications to effectively remove some characters from the reserved set was to assist with migration with what was going to be a stricter regime. If the consensus is that that would create more ambiguity/problems than it solves, I'm happy to leave that as a potential container specific feature. I like the idea of allowing the application to operate in different modes. That provides a lot more flexibility without having to add a new set of methods for each mode. Regarding each mode:
After some further experiments with
If the returned string was then used in If we included the behaviour of |
I think the use-case for a mode like Ultimately, this scenario shows the limits of the modal approach, in that the mode really doesn't apply to the whole webapp, but should apply servlet by servlet and filter by filter. However I don't think that fine grained approach would be workable, as what do you do if a request is mapped to a servlet that is So I like your modes (and their names), but they only apply to the existing path methods and to the interpretation of the existing getResource(String) methods.
So we could get away with just |
I've been thinking about the modes and I'd really like to keep them to a minimum. I think we can get by with just a boolean legacy mode in which the When not in legacy mode then The behaviours of other modes described above can be obtained by filters or perhaps new security constraint types. To assist with that, it would be good to have a new method The I still think we need the new We could even avoid having an explicit legacy mode switch and instead just use the version of the web.xml file (with no web.xml defaulting to 6.0 semantics). In summary, in a 6.0 webapp, all the following methods would work in only-reserved-characters-encoded mode: |
I've been thinking about the impact this discussion has on mapping requests. But first I'd like to respond to your points. I agree reducing modes is good. I've no objection to dropping For
For
Which brings us nicely to the topic of mapping. In
Spring supports path parameters in the servlet path and path info. If we want to support that (and I think we should) that rules out option 3. Option 2 would mean defining how path parameters work. My concern is that there will be multiple different approaches to this currently in use. Therefore, I like option 1 which treats the path parameters as opqaue data that the application processes as it sees fit. One note on security. The URI must not be normalised after removal of path parameters else segments like We also need to decide if the vales in I had been thinking that the mode would be per web application. Your comments made think about the practicalities of per servlet or per container. I think a per container approach will create issues as different applications will inevitably update to the newer approach at different rates. With a mapping approach that ignores path parameters, I think we can safely consider mode at the web application level. I think we could implement mode at the Servlet level. I agree with your assessment that this would complicate things for Filters that are used with Servlets in different modes. I think those compications are surrmountable - essentially the servlet needs to expose the mode it is configured to use so the Filter can adjust accordingly. There are also a bunch of other ways applications could address (or just avoid) this issue. From an implementation perspective, I'm not sure per Servlet is that more (any?) more complicated than per web application. An advantage of setting the mode per Servlet is that I think that removes the need for the additional methods you proposed. I am rather wary of adding extra methods and a solution that addresses the current ambiguities without the need for extra methods is very tempting. I also like flexibility this provides for larger applications that may need to migrate to the new way of doing things in stages. I think the final part of this puzzle is What was your thinking in requring an IAE if I think we have at least mentioned all the areas affected by these changes. Can you think of any we've missed? Once we have the majority of this agreed I plan to start implementing this in Tomcat to confirm that what we come up with is practical to implement. I suspect that will result in a little fine-tuning. |
I'll digest your main responses tomorrow.... but I'll answer the easy one below.
The idea is that those methods should take precisely the strings returned from the path methods. Currently they accept decoded paths, so if there is a % character in the arg, it is considered part of the file name. If we start allowing arbitrary characters to be encoded, then we are more likely to break the few apps that do actually pass % characters. If we only accept % encoding for precisely the reserved characters that the path methods leave encoded then that a) forces us to very well define those characters; b) makes minimal changes to how those methods are used today (for the 99.999% case); c) any modes introduced are exactly the sane for what the path methods return and what the resource methods take. |
How this is handled all depends on where we see the future. Do we want an arbitrary Servlet-9.0 application to be able to continue to choose between the modes for the path methods - ie LEGACY is an entirely valid mode into the future; or is that the future of the path methods is to return the reserve-encoded form that we define in 6.0 and we only offer a legacy mode to ease the transition. I favour the later as carrying legacy modes over multiple major releases will result in the need to version the legacy modes. So if we say the path methods from 6.0 onwards are intended to return the reserve-encoded form, then I think it is entirely valid to use the existence of a < 6.0 web.xml to trigger the legacy mode. We can then add some verbage to say that a container is free to all the mode to be specifically configured and to override the web.xml. If we go this way, then we don't even need to define named modes a
Nope - I think everything is decoded, including
Yep, modulo treatment of
What about option 5. Treat ; characters other than as part of a JSESSIONID as just a normal character in a segment?
Hmmmmm maybe. But I think this just complicates things like asynchronous threads that call the path methods. Already it is a problem that an async thread looking at a request that is passed to a request dispatcher will see different values in a race with the handling of the request in and out of the dispatcher. If we make this per servlet, then an async thread will see different encoding values in a race. The real fix for this is to get rid of object-identity-wrapping so that async threads can be passed a container-wrapped request and the values they see from the path methods would never change.... but that is another discussion. Ultimately, it comes down to if this mode is something valid going forward or just some legacy help to get applications ported to the new 6.0 behaviour. I think the later, so I think having a context wide mode is fine. If there are really some filters/servlets that can run in that mode then we can provide utility wrappers that will take a 6.0 request running in the new mode and decode any reserved characters to provide the <6.0 behaviour (in fact I was already thinking of providing this mode with a container provided wrapper prior to the first dispatch - hence not making core code horridly modal).
I think the new methods I'm proposing can be justified even without this problem. Too frequently are
How about prior to implementing in Tomcat we create a branch of the api/spec and a draft PR. We can word craft the spec/javadoc/api changes we are describing above in that PR and that will ensure we are really on the same page (I think we are currently in the same chapter of the same book, so that's good). In parallel we can try out that branch in Tomcat and Jetty. |
@stuartwdouglas @markt-asf I've had a go at update the wiki. I wish there was a way to comment & suggest on the wiki like there is with PRs. Maybe we should move this to another PR so we can do so? |
I think a PR would be easier. I mostly agree with everything there (including the TODO's that need more work), and I have added an additional TODO: if %2F is allowed, we may now have double '/', '/./' and '/../' segments in the URL, should stage 3 and 4 be re-run if this is allowed? |
I'll move to a PR.... stand by..... |
Issue #18 URI path processing. Moved the wiki text to a PR for better comment and suggestions.
I have created #424. |
Define URI transport as description list with HTTP/1.0 and HTTP/3 included.
Added example table. too long and needs review
Issue #18 URI path processing Co-authored-by: Mark Thomas <[email protected]>
The final PR included an update to the Javadoc for HttpSservletRequest.getContextPath() that stated the return value would be canonicalized. However, this method has always returned exactly what the client sent. No canonicalization. No decoding. |
We've always interpreted this as returning the encoded context path, but what is configured on the server rather than what the client sent. I think returning whatever the client sent is a bit non deterministic. |
We have a narrow window of opportunity to fix this for 6.1 since I need to re-roll the release to fix the Javadoc copyright dates. I intend to simply remove the "The path will be canonicalized as per..." sentence that was added unless we managed to agree an alternative text very quickly (today). My understanding of the expected behaviour is that for a context path of
The no decoding (2nd line) is explicit in the original Javadoc. No canonicalization (3rd line) was implied. From memory that requirement originated from a need to be able to strip the context path from the return value of My proposal for a change is to align it with
|
I applied the simple fix that removed the "The path will be canonicalized as per..." sentence and re-tagged. We can agree the new wording for 6.2 in slightly slower time. |
I'm good with the simple fix for 6.1. I'm still cautious about the idea of returning what the client sent for the context path, least not that it is actually difficult to work out. But that's an ambiguity that had existed before asked nobody had complained that jetty returns /aaa for all your examples. We only return an encoded character if it needs to be encoded in they context path. But agree we can sort that out in 6.2 |
The specification is unclear for some or all of the above methods how the following should be handled.
The text was updated successfully, but these errors were encountered: