Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Handle UTF-16 escapes and invalid surrogate sequences in Python and Go #2926
fix: Handle UTF-16 escapes and invalid surrogate sequences in Python and Go #2926
Changes from 4 commits
9a6face
ddd52cc
d5ba377
396f466
4b258f2
ec6f2da
a81cea9
62e0f2e
1c2b871
2c217ae
057bf25
7e3d243
eccc1e6
9670c7f
212f3ee
79d2f11
2dba4e3
86f2b89
58a63ef
d7bdece
fb913ef
f50dc60
f8c402f
2b2900a
4db28b2
6a4e880
72cbcd8
fc82fbb
67da219
3861ee1
09f6276
f63d9b8
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it always at most one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parser should be guaranteeing its exactly one. We do something equivalent in the translator already: https://github.com/dafny-lang/dafny/blob/master/Source/DafnyCore/Verifier/Translator.ExpressionTranslator.cs#L306-L309
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet, thanks.
Single
will raise an exception if there's one than one element, right? So, nice safety check actually.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the C# string contains unpaired surrogates (not
\u
escaped, just directly, unescaped, in the string?) Is that impossible thanks to the way we parse?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It SHOULD be impossible because Dafny source files have to be UTF-8 encoded, and the scanner is decoding to
int
code points: https://github.com/dafny-lang/dafny/blob/master/Source/DafnyCore/Coco/Scanner.frame#L210That code doesn't look particularly robust in the face of invalid UTF-8 sequences mind you, but it doesn't look like it's possible to produce surrogate values.
There's also the fact that the generated scanner code includes a straight cast from
int
tochar
, which could truncate a code point to0xFFFF
. I've got a fix for that in the unicode char branch (and I should probably cut an issue for it right away) but that still won't create surrogates at least.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, sweet. I'll let you judge whether it's worth a Contract.Assert here — it might save our bacon down the line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually there's nothing to assert here: I'm specifically NOT assuming surrogates are used correctly here. The only concern is accurately translating escape sequences to target language escape sequences (or something else equivalent). I'm not aware of any bugs in translating arbitrary
char
values in the string at least.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we had an API to join, but maybe not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I did too. There's
Util.Comma
but it only produces strings rather than working withConcreteSyntaxTree
s.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right: how about
\\\uAAAA
? It won't match, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, yup - I copied this from the existing
ShortOctalEscape
andShortHexEscape
patterns which are also wrong. :PThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that both of those other bogus regex patterns will be removed entirely in the fix to #2928, since they are trying to do entirely the wrong thing anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that a bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, just one I didn't want to bring in scope - I wasn't confident it was reasonable to depend on the necessary Go package to decode UTF-16. But I'll at least cut an issue (plus C++ is an even bigger mess since it uses the 8-bit
char
type).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is back to "it might not print the right thing but at least it won't crash, and the output of print isn't part of our "contract"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. And unfortunately the more I test the more I find ugly inconsistencies on corner cases. :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#2929