RFC 49: Python 3 migration for wptserve #49
Tagging people who have been involved with Python 3 migration: @annevk @jgraham @stephenmcgruer @svillar @ziransun |
I strongly suspect you're going to run into a lot of pernicious, annoying issues with this. You probably want to instead be stricter about always using bytes; any time I, and many other people with similar problems, have tried to be lazy about bytes vs strings, it's bitten us hard and been extremely annoying and frustrating. I don't think there's a way to make Python3 emit bytes by default for untagged strings, but you can force Python2 to allow the |
Thanks for the feedback, Tab. Yes I'm aware of the potential issues (see the risks) and am trying to make conscious and careful trade-offs here. That said, I could well be missing some important points, which is exactly why I think the RFC process would be very valuable here! For the core use case (reading and serving bytes as is on the wire), my investigation suggests that most custom handlers would work without modifications. wptserve is different from bikeshed as it doesn't interact with "text" much, so we could get away with treating arbitrary data as latin-1-encoded strings, even though the semantics are wrong (printing, counting the length, etc. are broken). (Of course, we can't be sure until we try.) I was in favour of using bytes everywhere as I believed (and still believe) that'd be the "correct" way, e.g. in web-platform-tests/wpt#13246 , but I've become convinced that it'd require a herculean effort to migrate the hundreds of custom handlers and prevent future regression. As you said, there's no way to make unprefixed strings bytes in Py3 and the only way out would be to use the On top of the migration/maintenance cost, I also considered how the standard library and a bunch of popular HTTP-related libraries handle headers. All of them chose to use |
I need to look more at this on Monday, but my initial reaction is that I'm very concerned with the proposal. I think using the My proposal to "fork the stdlib" wasn't a joke, but also wasn't as far-reaching as that implied; it would specifically mean not using the stdlib parts we currently use for reading and writing HTTP requests specifically. |
If we allow strings these "technically byte" APIs better throw for code points higher than U+00FF. I would favor using bytes though so it's clearer what's going on. (There are a lot of tests where the specifics around this matter.) |
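To illustrate the "throw for code points higher than U+00FF" behaviour, here is a minimal Python 3 sketch. The helper name `to_wire_bytes` is hypothetical and not part of wptserve; it only relies on the standard `iso-8859-1` codec, which refuses any code point above U+00FF.

```python
def to_wire_bytes(value):
    """Encode text into bytes, one byte per code point.

    Hypothetical helper for illustration: the iso-8859-1 codec raises
    UnicodeEncodeError for any code point above U+00FF.
    """
    return value.encode("iso-8859-1")


# U+00E9 fits in a single byte, so this succeeds.
print(to_wire_bytes("caf\u00e9"))

# U+20AC (euro sign) is above U+00FF, so this is rejected.
try:
    to_wire_bytes("\u20ac")
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)
```

The error only surfaces at the point of encoding, which is part of why using bytes throughout may make it clearer what is going on.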
That's a good point. Eventually this would throw when calling
If you look at these examples, would those suffice for testing purposes? My cursory scan of the code base suggests that the second form is the most common. Or does it confuse you that this is a text string in Python 3? |
I'm probably not the best person to ask that question as I'm intimately familiar with HTTP and text encodings. ("latin-1" might be confusing to some as on the web it means windows-1252.) |
OK, I've read this in more detail now, so I think I understand the tradeoffs being made, although I wouldn't say I yet have a good feeling for whether they're the right ones to make. The suggested approach is the exact opposite of the one we've been taking with all the internal code (where we have tried to be clear about what's Text and what's Bytes and use those types consistently between 2 and 3 where possible). But it's of course possible that the different constraints here wind up giving us a different optimal solution. Having said that, I also think that our requirements are very different from those of normal HTTP libraries, so the prior art there isn't as relevant as it might seem. I wonder if we can enumerate all the API surface we have that deals with strings and interfaces with the stdlib. I don't think there are so many places where we directly expose the stdlib, and in those places it may be possible to wrap or replace. So it at least seems conceivable that we might be able to mediate access and provide a better API. I also don't think that linting for bytes-vs-text is the only possible way to enforce that we get the correct types; runtime asserts, for example, would cause tests to error and we could ensure that we get a useful output. |
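As a rough sketch of what such a runtime assert could look like (the name `ensure_binary` and the error message are assumptions for illustration, not an existing wptserve API):

```python
def ensure_binary(value, name="value"):
    """Return value unchanged if it is bytes; otherwise raise TypeError.

    Hypothetical helper: on Python 3 this rejects text strings. On
    Python 2, `bytes` is an alias of `str`, so unprefixed literals
    would pass this check there.
    """
    if not isinstance(value, bytes):
        raise TypeError(
            "%s must be bytes, got %s" % (name, type(value).__name__)
        )
    return value


ensure_binary(b"text/html", "header value")  # passes

try:
    ensure_binary("text/html", "header value")
except TypeError as exc:
    print(exc)
```

A check like this would make a test error out with a readable message on Python 3 rather than fail later in an unrelated place.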
With the help of typeshed, we should be able to get that list relatively easily.
I agree that cleaning up our API surface wouldn't be hard, but my main concern is still around the custom handlers.
Is there a way to implement such asserts in Python 2? AFAIK, you can't actually tell the difference between |
Correct; |
From my personal experience migrating the code in After reading your proposal I'm wondering whether the "herculean" effort you mention is about complexity or quantity. If we're talking about the former then I'd agree that making some trade-offs and KISS might be the right approach. But if the concern is the number of required changes, then I think we should seriously consider the alternative of always dealing with bytes. These kinds of "mechanical" changes allow great levels of parallelization (many people working on them at the same time) and after some time it'd be relatively easy to foresee the amount of time we'd need for the transition. |
We can fully control the parsing of the code here. So we can, if we want, do things like fail the parse for unprefixed literals, or treat all literals as bytestrings unless otherwise specified (c.f. https://www.mercurial-scm.org/repo/hg/rev/1c22400db72d) |
@annevk Yeah we can use the more technically correct term In addition to the scale, I'm concerned about the upkeep after the massive manual conversion (which is challenging, but as @svillar pointed out, can be parallelized). I think if we were to do this, we would need to either hook into the loading mechanism to check for regressions (see @jgraham 's comment above, but probably not overwrite) or set up a full Python 3 CI to compare (a subset of) results (which requires us to get almost everything else working in Python 3 first). @svillar @ziransun IIRC, we can already run |
What do you mean by 'produce the JSON'? |
Ahh sorry I meant |
Note that we don't have to do the hg thing of actually altering the source, we can do something like tokenize the source using the built-in |
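A minimal sketch of the token-based check, assuming Python 3 and the standard library `tokenize` module (the function name `find_unprefixed_literals` is hypothetical). It reports string literals that carry neither a `b` nor a `u` prefix, without modifying the source:

```python
import io
import tokenize


def find_unprefixed_literals(source):
    """Return (line, literal) pairs for string literals lacking a b/u prefix."""
    found = []
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type != tokenize.STRING:
            continue
        # The prefix is everything before the first quote character.
        quote_positions = [
            pos
            for pos in (tok.string.find("'"), tok.string.find('"'))
            if pos != -1
        ]
        prefix = tok.string[: min(quote_positions)].lower()
        if "b" not in prefix and "u" not in prefix:
            found.append((tok.start[0], tok.string))
    return found


handler_source = 'a = "x"\nb = b"y"\nc = u"z"\nd = r\'w\'\n'
for line, literal in find_unprefixed_literals(handler_source):
    print("line %d: unprefixed literal %s" % (line, literal))
```

This sketch would also flag docstrings, so a real check would probably need to exempt them.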
I think we (including me) don't have enough concrete evidence (e.g. regarding how much work the bytes conversion in custom handlers would require), and it's clear from the discussions so far that most people prefer to have clear and correct semantics, so I'm proposing that we try the "alternative approach", concretely:
At that point, we can consider how we'd like to proceed. We'd need both consensus around the approach and commitment to work (writing and reviewing the PRs). I consider this a relatively low-cost experiment and we can always come back to the other approach. WDYT? |
How ironic...
I'm happy with that plan. I think we do need a way to enforce that we get correct semantics going forward; presumably this comes under "conversion guidelines" (I also agree it may be premature to actually implement the enforcement mechanism until such a time as there's a clear decision on which path to take). |
This plan SGTM, with one caveat/nit-pick:
I am concerned that
To me a trial would be more like: pick 5-10 Python handler files at random, and try those. |
I clearly did not count the files :) I thought the PR included everything in html. I was also thinking that trial should be about the same size as that PR (<10 files). |
@Hexcles iso-8859-1 means windows-1252 on the web too. See https://encoding.spec.whatwg.org/. In standards we use https://infra.spec.whatwg.org/#isomorphic-decode and https://infra.spec.whatwg.org/#isomorphic-encode. Happy to help review. |
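For reference, the Infra "isomorphic decode/encode" operations can be sketched in Python 3 as below. The helper names are defined here purely for illustration. Unlike the web's `iso-8859-1` label, Python's `iso-8859-1` codec maps each byte 0x00-0xFF to the code point of the same value, which is what the Infra definitions describe.

```python
def isomorphic_decode(data):
    """Map each byte to the code point with the same numeric value."""
    return data.decode("iso-8859-1")


def isomorphic_encode(text):
    """Map each code point (must be <= U+00FF) to the byte of the same value."""
    return text.encode("iso-8859-1")


# Every possible byte value survives a decode/encode round trip.
all_bytes = bytes(range(256))
assert isomorphic_encode(isomorphic_decode(all_bytes)) == all_bytes

# Byte 0x80 becomes U+0080, not U+20AC as windows-1252 would give.
print(hex(ord(isomorphic_decode(b"\x80"))))
```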
Thanks, @annevk . I'll use "isomorphic decode/encode"; we don't really care about the character sets. Sorry for the delay. I've done a cursory check of I think we can start experimenting with converting a few custom handlers (I suggest we try those in web-platform-tests/wpt#22402 first), @ziransun @stephenmcgruer . I'm proposing the following guidelines:
Bottom line: make sure all strings are either always text or always bytes; all string literals in handlers should be prefixed with |
Thanks @Hexcles ! How about you and I work today to put together a short conversion guideline?
To be clear, my reading of this is that you suggest we try the exact 8 files touched by that PR (as opposed to all the handlers in |
Unless I'm missing something, that's not quite true; if we're given binary data in the |
@jgraham I stand corrected. I've updated the guidelines above. Any other concerns? On second thought, I'm not sure whether request params should be text. These percent-encoded strings "should not cause UTF-8 decode without BOM or fail to return failure" and "using anything but UTF-8 decode without BOM when the input contains bytes that are not ASCII bytes might be insecure and is not recommended." (https://url.spec.whatwg.org/#percent-encoded-bytes) But I suppose tests might want to verify exactly that, so it makes sense to return bytes, too. |
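A small Python 3 sketch of why percent-decoded params are naturally bytes, using the standard library `urllib.parse.unquote_to_bytes` (this is not how wptserve parses requests, just an illustration): a percent-encoded value need not be valid UTF-8, so a test may want to see the raw bytes.

```python
from urllib.parse import unquote_to_bytes

# "%E9" percent-decodes to a single byte that is not valid UTF-8 on its own.
raw = unquote_to_bytes("%E9")
print(raw)

try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8; only the bytes form preserves it")

# A UTF-8 encoded euro sign decodes cleanly from its three bytes.
euro = unquote_to_bytes("%E2%82%AC")
print(euro.decode("utf-8") == "\u20ac")
```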
For representing the value of the name query parameter in the URL |
@Hexcles there's a dedicated cookie API for |
@annevk Thanks! I don't like There's https://web-platform-tests.org/tools/wptserve/docs/request.html#wptserve.request.Request.cookies for dealing with cookies. |
…orting guide This is a rework on web-platform-tests#22402 This change follows the porting guide discussed in RFC49: web-platform-tests/rfcs#49
I've merged web-platform-tests/wpt#23284 and web-platform-tests/wpt#23327 . Now almost everything in wptserve is a binary string, with the notable exception of URLs/paths. @ziransun 's web-platform-tests/wpt#23363 (not ready for review yet) will give us an idea of what the migration will look like. |
As I understand it, web-platform-tests/wpt#23363 is now ready for review and should be able to give an idea of what the outcome of this RFC would look like. Please take a look, all. |
Minor correction: what the outcome of the "alternative" approach in the RFC as of now would look like (I haven't updated the RFC yet) |
Approval modulo editorial nits. I didn't review the alternative section in detail.
I've read it over again. We're in for a lot of work, so let's roll our sleeves up.
@Hexcles we discussed a method for comparing the responses of custom handlers between Python 2 and 3. Is a script for that something you expect to have in place when we begin making/reviewing changes? |
@foolip I'll look into that in parallel, but don't want to wait for / block on that idea. |
As far as I know, we are now at 10 days without any objections; per the RFC process I am going to merge this later this afternoon. |
… guide (#23363) This is a rework on #22402 This change follows the porting guide discussed in RFC49: web-platform-tests/rfcs#49 Co-authored-by: Robert Ma <[email protected]>
… Python file handlers p…, a=testonly Automatic update from web-platform-tests Python 3: Port some html tests following Python file handlers porting guide (#23363) This is a rework on web-platform-tests/wpt#22402 This change follows the porting guide discussed in RFC49: web-platform-tests/rfcs#49 Co-authored-by: Robert Ma <[email protected]> -- wpt-commits: e60c41610959f414e24fe2360151d94953436906 wpt-pr: 23363