-
Notifications
You must be signed in to change notification settings - Fork 260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Handle UTF-16 escapes and invalid surrogate sequences in Python and Go #2926
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch and nice fixes.
@@ -597,6 +597,8 @@ func (seq Seq) UniqueElements() Set { | |||
func (seq Seq) String() string { | |||
if seq.isString { | |||
s := "" | |||
// Note this doesn't produce the right string in UTF-8, | |||
// since it converts surrogates independently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that a bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, just one I didn't want to bring in scope - I wasn't confident it was reasonable to depend on the necessary Go package to decode UTF-16. But I'll at least cut an issue (plus C++ is an even bigger mess since it uses the 8-bit char
type).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is back to "it might not print the right thing but at least it won't crash, and the output of print isn't part of our "contract"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. And unfortunately the more I test the more I find ugly inconsistencies on corner cases. :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wr.Write("_dafny.Char("); | ||
// See comment on the StringLiteralExpr case below. | ||
if (Util.Utf16Escape.IsMatch(v)) { | ||
char c = Util.UnescapedCharacters(v, false).Single(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it always at most one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parser should be guaranteeing its exactly one. We do something equivalent in the translator already: https://github.com/dafny-lang/dafny/blob/master/Source/DafnyCore/Verifier/Translator.ExpressionTranslator.cs#L306-L309
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet, thanks. Single
will raise an exception if there's one than one element, right? So, nice safety check actually.
// but Dafny allows invalid sequences of surrogate characters. | ||
// So if any are present, just emit a sequence of the direct UTF-16 code units instead. | ||
var s = (string)str.Value; | ||
if (!str.IsVerbatim && Util.Utf16Escape.IsMatch(s)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the C# string contains unpaired surrogates (not \u
escaped, just directly, unescaped, in the string?) Is that impossible thanks to the way we parse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It SHOULD be impossible because Dafny source files have to be UTF-8 encoded, and the scanner is decoding to int
code points: https://github.com/dafny-lang/dafny/blob/master/Source/DafnyCore/Coco/Scanner.frame#L210
That code doesn't look particularly robust in the face of invalid UTF-8 sequences mind you, but it doesn't look like it's possible to produce surrogate values.
There's also the fact that the generated scanner code includes a straight cast from int
to char
, which could truncate a code point to 0xFFFF
. I've got a fix for that in the unicode char branch (and I should probably cut an issue for it right away) but that still won't create surrogates at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, sweet. I'll let you judge whether it's worth a Contract.Assert here — it might save our bacon down the line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually there's nothing to assert here: I'm specifically NOT assuming surrogates are used correctly here. The only concern is accurately translating escape sequences to target language escape sequences (or something else equivalent). I'm not aware of any bugs in translating arbitrary char
values in the string at least.
foreach (var c in Util.UnescapedCharacters(s, str.IsVerbatim)) { | ||
wr.Write(comma); | ||
wr.Write($"{(int)c}"); | ||
comma = ", "; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we had an API to join, but maybe not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I did too. There's Util.Comma
but it only produces strings rather than working with ConcreteSyntaxTree
s.
|
||
|
||
protected override void EmitStringLiteral(string str, bool isVerbatim, ConcreteSyntaxTree wr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
protected override void EmitStringLiteral(string str, bool isVerbatim, ConcreteSyntaxTree wr) { | |
protected override void EmitStringLiteral(string str, bool isVerbatim, ConcreteSyntaxTree wr) { |
Source/DafnyCore/Util.cs
Outdated
@@ -195,6 +196,9 @@ public static class Util { | |||
UnescapedCharacters(s, isVerbatimString).Iter(ch => sb.Append(ch)); | |||
return sb.ToString(); | |||
} | |||
|
|||
public static readonly Regex Utf16Escape = new Regex(@"(?<!\\)\\u([0-9a-fA-F]{4})"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right: how about \\\uAAAA
? It won't match, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, yup - I copied this from the existing ShortOctalEscape
and ShortHexEscape
patterns which are also wrong. :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that both of those other bogus regex patterns will be removed entirely in the fix to #2928, since they are trying to do entirely the wrong thing anyway.
Co-authored-by: Clément Pit-Claudel <[email protected]>
Co-authored-by: Clément Pit-Claudel <[email protected]>
Source/DafnyCore/Util.cs
Outdated
/// For example, "ab\tuv\u12345" may be broken up as ["a", "b", "\t", "u", "v", "\u1234", "5"]. | ||
/// Consecutive non-escaped characters may or may not be enumerated as a single string. | ||
/// </summary> | ||
public static IEnumerable<string> Escapes(string p, bool isVerbatimString) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better name for this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TokenizeEscapedString?
// Ensuring we're precise enough about identifying \u escapes | ||
print "I'm afraid you'll find escape quite impossible, \\u007", "\n"; | ||
print "Luckily I have this nifty gadget from my good friend, \\\u0051", "\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Callback to my favourite test case ever: https://github.com/dafny-lang/dafny/blob/master/Test/expectations/Expect.dfy#L19-L22 :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good!
At least now I can use %testDafnyForEachCompiler!
Confirming my suspicion that the problem is about feeding the JS program into node as stdin
If nothing else our test runner doesn’t seem to support them and will need more work.
@@ -2276,7 +2276,7 @@ private class GenericArrayElementLvalue : ILvalue { | |||
files.Add($"\"{Path.GetFullPath(file)}\""); | |||
} | |||
var classpath = GetClassPath(targetFilename); | |||
var psi = new ProcessStartInfo("javac", string.Join(" ", files)) { | |||
var psi = new ProcessStartInfo("javac", "-encoding UTF8 " + string.Join(" ", files)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine for now but we should check what encoding we use for writing.
Fixes #1980. Fixes #2925.
Should fix the root cause of #2934, but I'm not claiming the complete fix, as attempting to include unescaped, non-ASCII characters in a Dafny source file is revealing platform-specific headaches in the integration test runner I'd like to address in a separate PR.
Printing non-ASCII characters, especially invalid UTF-16 sequences, is still inconsistent across backends, but at least should not crash after this change. Again this is difficult to test effectively across platforms in our testing architecture.
By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.