Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC 49: Python 3 migration for wptserve #49

Merged
merged 8 commits into from
May 26, 2020
Merged

Conversation

Hexcles
Copy link
Member

@Hexcles Hexcles commented Apr 3, 2020

@Hexcles
Copy link
Member Author

Hexcles commented Apr 3, 2020

Tagging people who have been involved with Python 3 migration: @annevk @jgraham @stephenmcgruer @svillar @ziransun

@tabatkins
Copy link

I strongly suspect you're going to run into a lot of pernicious, annoying issues with this. You probably want to instead be stricter about always using bytes; any time I, and many other people with similar problems, have tried to be lazy about bytes vs strings, it's bitten us hard and been extremely annoying and frustrating.

I don't think there's a way to make Python3 emit bytes by default for untagged strings, but you can force Python2 to allow the b modifier, same as Py3. Unfortunately, that change causes untagged strings to switch to unicode, same as Py3, so it would imply some migration costs.

@Hexcles
Copy link
Member Author

Hexcles commented Apr 4, 2020

Thanks for the feedback, Tab. Yes I'm aware of the potential issues (see the risks) and am trying to make conscious and careful trade-offs here. That said, I could well be missing some important points, which is exactly why I think the RFC process would be very valuable here!

For the core use case (reading and serving bytes as is on the wire), my investigation suggests that most custom handlers would work without modifications. wptserve is different from bikeshed as it doesn't interact with "text" much, so we could get away with treating arbitrary data as latin-1-encoded strings, even though the semantics are wrong (printing, counting the length, etc. are broken). (Of course, we can't be sure until we try.)

I was in favour of using bytes everywhere as I believed (and still believe) that'd be the "correct" way, e.g. in web-platform-tests/wpt#13246 , but I've become convinced that it'd require a herculean effort to migrate the hundreds of custom handlers and prevent future regression. As you said, there's no way to make unprefixed strings bytes in Py3 and the only way out would be to use the b prefix (which also works / is a no-op in Python 2) when needed (you can't just add b to every string literal). I don't think it's possible to check the usage programatically using lint or more advanced static analysis, which means that humans need to examine and modify the existing code (whereas in the recommended approach, we could fix just the handlers that throw and the tests that produce different results). If we didn't need to consider Python 2 compatibility, I'd probably still prefer this approach.

On top of the migration/maintenance cost, I also considered how the standard library and a bunch of popular HTTP-related libraries handle headers. All of them chose to use str instead of trying to ensure bytes. I feel like I was trying to fight the standard library, wrapping the API calls or even as @jgraham joked forking it.

@jgraham
Copy link
Contributor

jgraham commented Apr 4, 2020

I need to look more at this on Monday, but my initial reaction is that I'm very concerned with the proposal. I think using the str type is going to cause a bunch of hard-to-understand issues.

My proposal to "fork the stdlib" wasn't a joke, but also wasn't as far-reaching as that implied; it would specifically mean not using the stdlib parts we currently use for reading and writing HTTP requests specifically.

@annevk
Copy link
Member

annevk commented Apr 4, 2020

If we allow strings these "technically byte" APIs better throw for code points higher than U+00FF. I would favor using bytes though so it's clearer what's going on. (There are a lot of tests where the specifics around this matter.)

@Hexcles
Copy link
Member Author

Hexcles commented Apr 4, 2020

If we allow strings these "technically byte" APIs better throw for code points higher than U+00FF.

That's a good point. Eventually this would throw when calling encode('latin-1'):
UnicodeEncodeError: 'latin-1' codec can't encode character '\u01ff' in position 0: ordinal not in range(256)
But we could add assertions to make this easier to understand.

There are a lot of tests where the specifics around this matter.

If you look at these examples, would those suffice the testing purpose? My cursory scan of the code base suggests that the second form is the most common. Or does it confuse you cognitively that this is a text string in Python 3?

@annevk
Copy link
Member

annevk commented Apr 6, 2020

I'm probably not the best person to ask that question as I'm intimately familiar with HTTP and text encodings. ("latin-1" might be confusing to some as on the web it means windows-1252.)

@jgraham
Copy link
Contributor

jgraham commented Apr 6, 2020

OK, I've read this in more detail now, so I think I understand the tradeoffs being made, although I wouldn't say I yet have a good feeling for whether they're the right ones to make. The suggested approach is the exact opposite to the one we've been taking with all the internal code (where we have tried to be clear about what's Text and what's Bytes and use those types consistently between 2 and 3 where possible). But it's of course possible that the different constraints here wind up giving us a different optimal solution. Having said that I also think that our requirements are very different to normal HTTP libraries, so the prior art there isn't as relevant as it might seem.

I wonder if we can enumerate all the API surface we have that deal with strings and are interfacing with the stdlib. I don't think there are so many places where we directly expose the stdlib and in those places it may be possible to wrap-or-replace. So it at least seems conceviable that the concern that we might be able to mediate access and provide a better API.

I also don't think that linting for bytes-vs-text is the only possible way to enforce that we get the correct types; runtime asserts for example would cause tests to error and we could ensure that we get a useful output.

@Hexcles
Copy link
Member Author

Hexcles commented Apr 6, 2020

I wonder if we can enumerate all the API surface we have that deal with strings and are interfacing with the stdlib.

With the help of typeshed, we should be able to get that list relatively easily.

I don't think there are so many places where we directly expose the stdlib and in those places it may be possible to wrap-or-replace. So it at least seems conceviable that the concern that we might be able to mediate access and provide a better API.

I agree that cleaning up our API surface wouldn't be hard, but my main concern is still around the custom handlers.

runtime asserts for example would cause tests to error and we could ensure that we get a useful output.

Is there a way to implement such asserts in Python 2? AFAIK, you can't actually tell the difference between b"..." and "..." at run time, which means we kinda have to bring up the whole Python 3 CI before being able to enforce such rules.

@tabatkins
Copy link

AFAIK, you can't actually tell the difference between b"..." and "..." at run time,

Correct; b or u just opt your string literal into one of the modes, and without them a plain literal chooses one according to python version and from __future__ imports; there's no way to telling whether a string was produced from a tagged literal or an untagged one.

@svillar
Copy link

svillar commented Apr 7, 2020

I was in favour of using bytes everywhere as I believed (and still believe) that'd be the "correct" way, e.g. in web-platform-tests/wpt#13246 , but I've become convinced that it'd require a herculean effort to migrate the hundreds of custom handlers and prevent future regression. As you said, there's no way to make unprefixed strings bytes in Py3 and the only way out would be to use the b prefix (which also works / is a no-op in Python 2) when needed (you can't just add b to every string literal). I don't think it's possible to check the usage programatically using lint or more advanced static analysis, which means that humans need to examine and modify the existing code (whereas in the recommended approach, we could fix just the handlers that throw and the tests that produce different results). If we didn't need to consider Python 2 compatibility, I'd probably still prefer this approach.

From my personal experience migrating the code in tools/, static analysis won't help in this case as you mentioned. Python is a dynamic language after all.

After reading your proposal I'm wondering whether the "herculean" effort you mention is about complexity or quantity. If we're talking about the former then I'd agree that making some trade-offs and KISS might be the right approach. But if the concern is the amount of required changes, then I think we should seriously consider the alternative of dealing always with bytes. These kind of "mechanical" changes allow great levels of parallelization (many people working on them at the same time) and after some time it'd be relatively easy to foresee the amount of time we'd need for the transition.

@jgraham
Copy link
Contributor

jgraham commented Apr 7, 2020

Is there a way to implement such asserts in Python 2? AFAIK, you can't actually tell the difference between b"..." and "..." at run time, which means we kinda have to bring up the whole Python 3 CI before being able to enforce such rules.

We can fully control the parsing of the code here. So we can, if we want, do things like fail the parse for unprefixed literals, or treat all literals as bytestrings unless otherwise specified (c.f. https://www.mercurial-scm.org/repo/hg/rev/1c22400db72d)

@Hexcles
Copy link
Member Author

Hexcles commented Apr 7, 2020

I'm probably not the best person to ask that question as I'm intimately familiar with HTTP and text encodings. ("latin-1" might be confusing to some as on the web it means windows-1252.)

@annevk Yeah we can use the more technically correct term iso-8859-1.

In addition to the scale, I'm concerned by the upkeeping after the massive manual conversion (which is challenging, but as @svillar pointed out, can be parallelized). I think if we were to do this, we would need to either hook into the loading mechanism to check for regression (see @jgraham 's comment above, but probably not overwrite) or set up a full Python 3 CI to compare (a subset of) results (which requires us to get almost everything else working in Python 3 first). @svillar @ziransun IIRC, we can already run wpt run and produce the JSON in Python 3 now? I'd very much prefer running Python 3 as opposed to hooking into the loading mechanism.

@stephenmcgruer
Copy link
Contributor

produce the JSON in Python 3 now

What do you mean by 'produce the JSON'?

@Hexcles
Copy link
Member Author

Hexcles commented Apr 7, 2020

Ahh sorry I meant wptreport.json, which would allow us to compare results against Python 2.

@jgraham
Copy link
Contributor

jgraham commented Apr 7, 2020

Note that we don't have to do the hg thing of actually altering the source, we can do something like tokenize the source using the built-in tokenize module to validate that all strings are specified with a prefix. We can also make that opt-in during the transition with a magic comment that we remove once we've converted all the existing callers. That means we can make things fail before tests even reach CI which reduces the number of cases where we have to explain how to interpret CI results.

@Hexcles
Copy link
Member Author

Hexcles commented Apr 7, 2020

I think we (including me) don't have enough concrete evidence (e.g. regarding how much work the bytes conversion in custom handlers would require), and it's clear from the discussions so far that most people prefer to have clear and correct semantics, so I'm proposing that we try the "alternative approach", concretely:

  • I will double check the wptserve APIs to make sure they always return bytes (note that this is basically the status quo, so I may fix the API which doesn't currently always return bytes without landing a RFC).
  • I will propose a conversion guideline for custom handlers.
  • We then try converting a bunch of custom handlers. I'm thinking maybe html/, in other words try python 3: Add python 3 support for html tests wpt#22402 again with the new guideline, if @annevk doesn't mind reviewing, or some other people volunteer.

At that point, we can consider how we'd like to proceed. We'd need both consensus around the approach and commitment to work (writing and reviewing the PRs).

I consider this a relatively low-cost experiment and we can always come back to the other approach. WDYT?

@jgraham
Copy link
Contributor

jgraham commented Apr 7, 2020

I'm happy with that plan. I think we do need a way to enforce that we get correct semantics going forward; presumably this comes under "conversion guidelines" (I also agree it may be premature to actually implement the enforcement mechansim until such a time as there's a clear decision on which path to take).

@stephenmcgruer
Copy link
Contributor

This plan SGTM, with one caveat/nit-pick:

We then try converting a bunch of custom handlers. I'm thinking maybe html/, in other words try web-platform-tests/wpt#22402 again with the new guideline, if @annevk doesn't mind reviewing, or some other people volunteer.

I am concerned that html/ is too large for a trial run, since the 'only accept bytes' approach requires us to examine and convert every single file. By my rough count, that's 50 files in html/:

$ find html -name "*.py" -not -path "*/.virtualenv/*" | wc -l
50

To me a trial would be more like: pick 5-10 Python handler files at random, and try those.

@Hexcles
Copy link
Member Author

Hexcles commented Apr 7, 2020

I clearly did not count the files :) I thought the PR included everything in html. I was also thinking that trial should be about the same size as that PR (<10 files).

@annevk
Copy link
Member

annevk commented Apr 8, 2020

@Hexcles iso-8859-1 means windows-1252 on the web too. See https://encoding.spec.whatwg.org/. In standards we use https://infra.spec.whatwg.org/#isomorphic-decode and https://infra.spec.whatwg.org/#isomorphic-encode.

Happy to help review.

@Hexcles
Copy link
Member Author

Hexcles commented Apr 27, 2020

Thanks, @annevk . I'll use "isomorphic decode/encode"; we don't really care about the character sets.

Sorry for the delay. I've done a cursory check of wptserve and found few issues (I'll send some PRs later). Keys and values in headers are always byte sequences (though the getter accepts both text and byte strings, so headers.get("cookies") works in both Python versions).

I think we can start experimenting converting a few custom handlers (I suggest we try those in web-platform-tests/wpt#22402 first), @ziransun @stephenmcgruer . I'm proposing the following guidelines:

  • Headers: always use binary strings for both keys are values; prefer adding b"" prefixes to encoding/decoding the other operands. This also applies to basic auth (which is constructed from headers): request.headers.auth.{username,password} are byte strings.
  • Request params: both keys and values for GET and POST params should be text strings binary strings (percent-decoded but not string decoded). Note that they are currently native strings and work is needed to make them binary (blocked by [wptserve] Make sure GET/POST params are always binary wpt#23284).
  • Response body: always use text strings. The writer takes both text and binary strings, so native strings would technically work most of the time but we want to prefix everything to get consistent semantics. The rules are:
    • If the string contains only ASCII characters, either type will work. Choose the one that requires less encoding/decoding (e.g. if the string is constructed from headers, then use bytes; if the string is constructed from query params, then use text).
    • If the string contains escaped code points (e.g. “\x0F”, \012”), use bytes.
    • If the string contains non-ASCII characters, use text. Note that the Response class will use UTF-8 to encode the output by default. If a handler needs other encoding, it probably already has explicitly set response.encoding.

Bottom line: make sure all strings are either always text or always bytes; all string literals in handlers should be prefixed with b"" or u"".

@stephenmcgruer
Copy link
Contributor

Thanks @Hexcles ! How about you and I work today to put together a short conversion guideline?

I suggest we try those in web-platform-tests/wpt#22402 first

To be clear, my reading of this is that you suggest we try the exact 8 files touched by that PR (as opposed to all the handlers in html/).

@jgraham
Copy link
Contributor

jgraham commented Apr 27, 2020

Response body: always use text strings. The Response class defaults to utf-8. If the handler needs other encoding, it probably already has explicitly specified the encoding.

Unless I'm missing something, that's not quite true; if we're given binary data in the Response class we pass it through. That seems like reasonable and sensible behaviour we should keep, but it does mean things could be different between py2 and py3 unless we force people to annotate strings with either u or b.

@Hexcles
Copy link
Member Author

Hexcles commented Apr 27, 2020

@jgraham I stand corrected. I've updated the guidelines above. Any other concerns?


On a second thought, I'm not sure whether request params should be text. These percent-encoded strings "should not cause UTF-8 decode without BOM or fail to return failure" and "using anything but UTF-8 decode without BOM when the input contains bytes that are not ASCII bytes might be insecure and is not recommended." (https://url.spec.whatwg.org/#percent-encoded-bytes) But I suppose tests might want to verify exactly that, so it makes sense to return binary bytes, too.

@annevk
Copy link
Member

annevk commented Apr 29, 2020

For representing the value of the name query parameter in the URL /?name=%E2%82%AC either b"\xE2\x82\xAC" or u"\u00E2\u0082\u00AC" are reasonable, but u"\u20AC" (result from UTF-8 decode) would not be. b"\x80" would be a bit clearer as in the u case the upper bound for a code point would have to be \u00FF. Hope that helps.

@annevk
Copy link
Member

annevk commented Apr 29, 2020

@Hexcles there's a dedicated cookie API for Set-Cookie? If so, it definitely would be most useful if that can set any kind of header value as otherwise a lot of cookie-related tests would simply have to use the headers API.

@Hexcles
Copy link
Member Author

Hexcles commented Apr 29, 2020

@annevk Thanks! I don't like u"\u00E2\u0082\u00AC" so I'll go with b"\xE2\x82\xAC".

There's https://web-platform-tests.org/tools/wptserve/docs/request.html#wptserve.request.Request.cookies for dealing with cookies.

ziransun added a commit to ziransun/wpt that referenced this pull request May 1, 2020
…orting guide

This is a rework on web-platform-tests#22402
This change follows the porting guide discussed in RFC49:
web-platform-tests/rfcs#49
ziransun added a commit to ziransun/wpt that referenced this pull request May 4, 2020
…orting guide

This is a rework on web-platform-tests#22402
This change follows the porting guide discussed in RFC49:
web-platform-tests/rfcs#49
@Hexcles
Copy link
Member Author

Hexcles commented May 5, 2020

I've merged web-platform-tests/wpt#23284 and web-platform-tests/wpt#23327 . Now almost everything in wptserve is binary string, with the notable exception of URLs/paths.

@ziransun 's web-platform-tests/wpt#23363 (not ready for review yet) will give us an idea of how the migration will look like.

ziransun added a commit to ziransun/wpt that referenced this pull request May 7, 2020
…orting guide

This is a rework on web-platform-tests#22402
This change follows the porting guide discussed in RFC49:
web-platform-tests/rfcs#49
@stephenmcgruer
Copy link
Contributor

stephenmcgruer commented May 7, 2020

As I understand it, web-platform-tests/wpt#23363 is now ready for review and should be able to give an idea of what the outcome of this RFC would look like. Please take a look, all.

@Hexcles
Copy link
Member Author

Hexcles commented May 7, 2020

what the outcome of this RFC would look like

Minor correction: what the outcome of the "alternative" approach in the RFC as of now would look like (I haven't updated the RFC yet)

ziransun added a commit to ziransun/wpt that referenced this pull request May 11, 2020
…orting guide

This is a rework on web-platform-tests#22402
This change follows the porting guide discussed in RFC49:
web-platform-tests/rfcs#49
@Hexcles
Copy link
Member Author

Hexcles commented May 15, 2020

Hi everyone on this thread, I've updated the RFC to reflect our latest consensus, switching the previously recommended and alternative approaches. I also incorporated the guidelines from the earlier comment, which has been tested in this trail PR, into the RFC.

Please take a look again. Thanks!

Copy link
Contributor

@stephenmcgruer stephenmcgruer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approval modulo editorial nits. I didn't review the alternative section in detail.

rfcs/wptserve_py3.md Outdated Show resolved Hide resolved
rfcs/wptserve_py3.md Outdated Show resolved Hide resolved
rfcs/wptserve_py3.md Outdated Show resolved Hide resolved
rfcs/wptserve_py3.md Outdated Show resolved Hide resolved
rfcs/wptserve_py3.md Outdated Show resolved Hide resolved
rfcs/wptserve_py3.md Outdated Show resolved Hide resolved
rfcs/wptserve_py3.md Outdated Show resolved Hide resolved
rfcs/wptserve_py3.md Outdated Show resolved Hide resolved
Copy link
Member

@foolip foolip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read it over again. We're in for a lot of work, so let's roll our sleeves up.

@foolip
Copy link
Member

foolip commented May 16, 2020

@Hexcles we discussed a method for comparing the the responses of custom handlers between python 2 and 3. Is a script for that something you expect to have in place when we begin making/reviewing changes?

@Hexcles
Copy link
Member Author

Hexcles commented May 18, 2020

@foolip I'll look into that in parallel, but don't want to wait for / block on that idea.

@stephenmcgruer
Copy link
Contributor

As far as I know, we are now at 10 days without any objections; per the RFC process I am going to merge this later this afternoon.

@stephenmcgruer stephenmcgruer merged commit 203c3c1 into master May 26, 2020
@stephenmcgruer stephenmcgruer deleted the wptserve-py3 branch May 26, 2020 11:04
stephenmcgruer pushed a commit to web-platform-tests/wpt that referenced this pull request May 26, 2020
… guide (#23363)

This is a rework on #22402
This change follows the porting guide discussed in RFC49:
web-platform-tests/rfcs#49

Co-authored-by: Robert Ma <[email protected]>
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request May 30, 2020
… Python file handlers p…, a=testonly

Automatic update from web-platform-tests
Python 3: Port some html tests following Python file handlers porting guide (#23363)

This is a rework on web-platform-tests/wpt#22402
This change follows the porting guide discussed in RFC49:
web-platform-tests/rfcs#49

Co-authored-by: Robert Ma <[email protected]>
--

wpt-commits: e60c41610959f414e24fe2360151d94953436906
wpt-pr: 23363
xeonchen pushed a commit to xeonchen/gecko that referenced this pull request May 30, 2020
… Python file handlers p…, a=testonly

Automatic update from web-platform-tests
Python 3: Port some html tests following Python file handlers porting guide (#23363)

This is a rework on web-platform-tests/wpt#22402
This change follows the porting guide discussed in RFC49:
web-platform-tests/rfcs#49

Co-authored-by: Robert Ma <[email protected]>
--

wpt-commits: e60c41610959f414e24fe2360151d94953436906
wpt-pr: 23363
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants