Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http/wsgi.py: do not unquote path before injecting into environ #1211

Closed
wants to merge 1 commit into from

Conversation

jgehrcke
Copy link
Contributor

This is a follow-up to #930, where we decided that WSGI environ['PATH_INFO'] should not be an unquoted (percent-decoded) path, but the undecoded (original) path. This was then fixed in aiohttp's WSGI implementation, see aio-libs/aiohttp#177 -- but it looks like we missed taking care of Gunicorn's WSGI implementation.

In the current aiohttp implementation we find (https://github.com/KeepSafe/aiohttp/blob/master/aiohttp/wsgi.py):

path_info = uri_parts.path
[...]

environ['PATH_INFO'] = path_info

However, in gunicorn/http/wsgi.py we currently still find:

environ['PATH_INFO'] = unquote_to_wsgi_str(path_info)

This leads to a class of problems where escaped and non-escaped parts are not easily distinguishable anymore on framework-level, as vividly clarified by the following issue: https://code.djangoproject.com/ticket/15718
There, the behavior is partially controllable in Apache's mod_wsgi by setting AllowEncodedSlashes On.

There, the issue appeared with Django+mod_wsgi. In our case, the issue appeared with Falcon+Gunicorn, but factually it's exactly the same problem.

I think the most important insight is from aio-libs/aiohttp#177

IIRC WSGI standard doesn't specifies path unquoting.

So, I went ahead and simply removed the unquoting operation (exactly as done in aiohttp's wsgi.py). I ran tests against Python 3.4, and nothing broke.

This change affects three worker types (async, gthread, sync):

$ grep -HR "wsgi.create" .
./gunicorn/workers/async.py:            resp, environ = wsgi.create(req, sock, addr,
./gunicorn/workers/gthread.py:            resp, environ = wsgi.create(req, conn.sock, conn.addr,
./gunicorn/workers/sync.py:            resp, environ = wsgi.create(req, client, addr,

I think we can just merge this, mainly motivated by the fact that aiohttp uses the same method.

Still, it would be nice to now add a test that breaks with the old behavior, and passes with the new one. However, I am not warm enough with the gunicorn test structure.

And of course the question is if applications out there in the world rely on that behavior ... :/

In our application we have to be able to distinguish real slashes from escaped slashes on framework level. That is, we have to use a custom branch of Gunicorn right now -- otherwise that information is lost once requests enter framework level.

@berkerpeksag
Copy link
Collaborator

Gunicorn (and other servers like Werkzeug) follows PEP 3333 it implicitly requires PATH_INFO and friends to be unquoted. If you don't unquote PATH_INFO, you are going to get broken PATH_INFO. See https://www.python.org/dev/peps/pep-3333/#url-reconstruction for the URL reconstruction algorithm:

...
url += quote(environ.get('SCRIPT_NAME', ''))
url += quote(environ.get('PATH_INFO', ''))
...

@jgehrcke
Copy link
Contributor Author

@berkerpeksag yes, the problem is the "implicitly". It's not explicit, so it leaves some room for interpretation.

Obviously, certain applications require to be able to distinguish escaped from non-escaped entities in the original request path. That's why mod_wsgi introduced the configuration variable AllowEncodedSlashes.

Now, what would you propose? I try to summarize:

Heavily related discussions:
pallets/werkzeug#477
pallets/werkzeug#797
pallets/flask#900
http://www.leakon.com/archives/865

It seems like double-encoding is the most reliable solution to that for applications.

@benoitc
Copy link
Owner

benoitc commented Feb 24, 2016

application that need the RAW URI can get it from the environment variable RAW_URI in gunicorn. This how it's done since awhile. I would stick that way so we keep complying strictly to the WSGI spec. Imo we should have the same behaviour in all workers. Thoughts?

@jgehrcke
Copy link
Contributor Author

@benoitc thanks for the pointer towards RAW_URI! After getting a better overview about this topic I concur with you saying

I would stick that way so we keep complying strictly to the WSGI spec

So, that PR can be closed, I guess. I'm not exactly sure what you mean with

we should have the same behaviour in all workers

Is that not the case as of now?

Just FYI, in our application, we are now relying on clients/API consumers to replace slashes that are supposed to be transparent to the URL template engine of our WSGI application with a %252F. Gunicorn's wsgi.py guarantees to do one level of decoding (to %2F), and we can safely perform the second encoding step (to /) after app-level URL parsing.

@tilgovi
Copy link
Collaborator

tilgovi commented Feb 24, 2016

I think when @benoitc says that every worker should behave the same he's saying that the aio worker should do the same as the others.

Does that require aio-libs/aiohttp#343 to happen or is gunicorn unquoting before aiohttp see the request?

@tilgovi
Copy link
Collaborator

tilgovi commented Mar 13, 2016

Closing per #1211 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants