fix: Handle UTF-16 escapes and invalid surrogate sequences in Python and Go #2926

robin-aws · 2022-10-25T23:35:00Z

Fixes #1980. Fixes #2925.

Should fix the root cause of #2934, but I'm not claiming the complete fix, as attempting to include unescaped, non-ASCII characters in a Dafny source file is revealing platform-specific headaches in the integration test runner I'd like to address in a separate PR.

Printing non-ASCII characters, especially invalid UTF-16 sequences, is still inconsistent across backends, but at least should not crash after this change. Again this is difficult to test effectively across platforms in our testing architecture.

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

cpitclaudel

Good catch and nice fixes.

Source/DafnyRuntime/DafnyRuntime.py

cpitclaudel · 2022-10-25T23:48:46Z

Source/DafnyRuntime/DafnyRuntime.go

@@ -597,6 +597,8 @@ func (seq Seq) UniqueElements() Set {
 func (seq Seq) String() string {
  if seq.isString {
    s := ""
+    // Note this doesn't produce the right string in UTF-8,
+    // since it converts surrogates independently.


Isn't that a bug?

Definitely, just one I didn't want to bring in scope - I wasn't confident it was reasonable to depend on the necessary Go package to decode UTF-16. But I'll at least cut an issue (plus C++ is an even bigger mess since it uses the 8-bit char type).

I guess this is back to "it might not print the right thing but at least it won't crash", and the output of print isn't part of our "contract"?

Yup. And unfortunately the more I test the more I find ugly inconsistencies on corner cases. :(
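
For reference, here is a minimal sketch of the difference between converting each UTF-16 code unit independently and pairing surrogates first. It is written in Python rather than Go to keep it standalone. `decode_utf16_units` and `convert_independently` are hypothetical helpers, not part of the Dafny runtime, and replacing unpaired surrogates with U+FFFD is an assumed policy chosen only for illustration.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, used here for unpaired surrogates (an assumed policy)


def decode_utf16_units(units):
    """Combine valid surrogate pairs; replace unpaired surrogates with U+FFFD."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A valid pair encodes one code point above U+FFFF.
            code_point = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(code_point))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate: not a valid code point on its own.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(chr(u))
            i += 1
    return "".join(out)


def convert_independently(units):
    """What the comment in the diff describes: each unit becomes its own character."""
    return "".join(chr(u) for u in units)
```

With the units `[0xD83D, 0xDE00]`, `decode_utf16_units` yields the single character U+1F600, while `convert_independently` yields two separate surrogate characters, which Python refuses to encode as UTF-8 under its default strict error handling.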

cpitclaudel · 2022-10-25T23:49:17Z

Source/DafnyCore/Compilers/Compiler-go.cs

+        wr.Write("_dafny.Char(");
+        // See comment on the StringLiteralExpr case below.
+        if (Util.Utf16Escape.IsMatch(v)) {
+          char c = Util.UnescapedCharacters(v, false).Single();


Why is it always at most one?

The parser should be guaranteeing it's exactly one. We do something equivalent in the translator already: https://github.com/dafny-lang/dafny/blob/master/Source/DafnyCore/Verifier/Translator.ExpressionTranslator.cs#L306-L309

Sweet, thanks. Single will raise an exception if there's more than one element, right? So, nice safety check actually.

cpitclaudel · 2022-10-25T23:51:17Z

Source/DafnyCore/Compilers/Compiler-go.cs

+        // but Dafny allows invalid sequences of surrogate characters.
+        // So if any are present, just emit a sequence of the direct UTF-16 code units instead.
+        var s = (string)str.Value;
+        if (!str.IsVerbatim && Util.Utf16Escape.IsMatch(s)) {


What happens if the C# string contains unpaired surrogates (not \u-escaped, just directly, unescaped, in the string)? Is that impossible thanks to the way we parse?

It SHOULD be impossible because Dafny source files have to be UTF-8 encoded, and the scanner is decoding to int code points: https://github.com/dafny-lang/dafny/blob/master/Source/DafnyCore/Coco/Scanner.frame#L210

That code doesn't look particularly robust in the face of invalid UTF-8 sequences mind you, but it doesn't look like it's possible to produce surrogate values.

There's also the fact that the generated scanner code includes a straight cast from int to char, which could truncate a code point to 0xFFFF. I've got a fix for that in the unicode char branch (and I should probably cut an issue for it right away) but that still won't create surrogates at least.

OK, sweet. I'll let you judge whether it's worth a Contract.Assert here — it might save our bacon down the line?

Actually there's nothing to assert here: I'm specifically NOT assuming surrogates are used correctly here. The only concern is accurately translating escape sequences to target language escape sequences (or something else equivalent). I'm not aware of any bugs in translating arbitrary char values in the string at least.

cpitclaudel · 2022-10-25T23:51:44Z

Source/DafnyCore/Compilers/Compiler-go.cs

+          foreach (var c in Util.UnescapedCharacters(s, str.IsVerbatim)) {
+            wr.Write(comma);
+            wr.Write($"{(int)c}");
+            comma = ", ";


I thought we had an API to join, but maybe not?

Yeah I did too. There's Util.Comma but it only produces strings rather than working with ConcreteSyntaxTrees.
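
As an illustration of the loop above, here is a minimal Python sketch that renders a string's UTF-16 code units as a comma-separated list of integers. `utf16_code_units` and `emit_code_units` are hypothetical names. The real compiler writes into a `ConcreteSyntaxTree` rather than returning a string, so this only shows the shape of the output.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s, passing lone surrogates through unchanged."""
    data = s.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def emit_code_units(s):
    """Render the code units as a comma-separated list of decimal integers."""
    return ", ".join(str(u) for u in utf16_code_units(s))
```

Because the separator is only placed between elements, the empty string produces an empty list body, which matches what the manual `comma` variable achieves in the diff.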

Source/DafnyCore/Compilers/Compiler-go.cs

cpitclaudel · 2022-10-25T23:52:06Z

Source/DafnyCore/Compilers/Compiler-go.cs

+
+
    protected override void EmitStringLiteral(string str, bool isVerbatim, ConcreteSyntaxTree wr) {


Suggested change

protected override void EmitStringLiteral(string str, bool isVerbatim, ConcreteSyntaxTree wr) {

protected override void EmitStringLiteral(string str, bool isVerbatim, ConcreteSyntaxTree wr) {

cpitclaudel · 2022-10-25T23:53:11Z

Source/DafnyCore/Util.cs

@@ -195,6 +196,9 @@ public static class Util {
      UnescapedCharacters(s, isVerbatimString).Iter(ch => sb.Append(ch));
      return sb.ToString();
    }
+
+    public static readonly Regex Utf16Escape = new Regex(@"(?<!\\)\\u([0-9a-fA-F]{4})");


This doesn't look right: how about \\\uAAAA? It won't match, right?

Ha, yup - I copied this from the existing ShortOctalEscape and ShortHexEscape patterns which are also wrong. :P

Note that both of those other bogus regex patterns will be removed entirely in the fix to #2928, since they are trying to do entirely the wrong thing anyway.
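
To make the failure concrete, here is a small Python check of the same pattern (Python's `re` module supports the same negative lookbehind syntax). The lookbehind rejects any `\u` preceded by a backslash, even when that backslash is itself the second half of an escaped backslash. The variable names are made up for this sketch.

```python
import re

# Same pattern as Utf16Escape in the diff above.
NAIVE_UTF16_ESCAPE = re.compile(r"(?<!\\)\\u([0-9a-fA-F]{4})")

plain = r"\uAAAA"           # a \u escape: matched, as intended
escaped_slash = r"\\uAAAA"  # escaped backslash, then literal "uAAAA": correctly not matched
missed = r"\\\uAAAA"        # escaped backslash, then a real \u escape: wrongly not matched

results = {
    text: bool(NAIVE_UTF16_ESCAPE.search(text))
    for text in (plain, escaped_slash, missed)
}
```

A single-character lookbehind cannot tell an odd run of backslashes from an even one, which is why scanning the string left to right is the safer approach.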

robin-aws · 2022-10-26T20:39:18Z

Source/DafnyCore/Util.cs

+    /// For example, "ab\tuv\u12345" may be broken up as ["a", "b", "\t", "u", "v", "\u1234", "5"].
+    /// Consecutive non-escaped characters may or may not be enumerated as a single string.
+    /// </summary>
+    public static IEnumerable<string> Escapes(string p, bool isVerbatimString) {


Better name for this method?

TokenizeEscapedString?
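
A minimal sketch of the tokenizing approach the doc comment describes, in Python. The name `tokenize_escaped_string` follows the suggestion above. The token grammar (a `\u` escape with four hex digits, any other backslash pair, or a single non-backslash character) is an assumption for illustration, and verbatim strings are ignored.

```python
import re

# Alternatives are tried left to right at each position, so a backslash always
# consumes the character that follows it. An escaped backslash therefore can
# never hide a \u escape that comes right after it.
_TOKEN = re.compile(r"\\u[0-9a-fA-F]{4}|\\.|[^\\]", re.DOTALL)


def tokenize_escaped_string(s):
    """Split s into escape sequences and single ordinary characters."""
    # Note: a lone trailing backslash matches no alternative and is dropped.
    return _TOKEN.findall(s)
```

On the doc comment's example, `tokenize_escaped_string(r"ab\tuv\u12345")` splits into `a`, `b`, `\t`, `u`, `v`, `\u1234`, `5`, and the problematic `\\\uAAAA` splits into an escaped backslash followed by a `\uAAAA` escape.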

robin-aws · 2022-10-26T20:41:00Z

Test/dafny0/Strings.dfy

+  // Ensuring we're precise enough about identifying \u escapes
+  print "I'm afraid you'll find escape quite impossible, \\u007", "\n";
+  print "Luckily I have this nifty gadget from my good friend, \\\u0051", "\n";


Callback to my favourite test case ever: https://github.com/dafny-lang/dafny/blob/master/Test/expectations/Expect.dfy#L19-L22 :)

cpitclaudel

Looks pretty good!

cpitclaudel · 2022-11-03T18:42:25Z

Source/DafnyCore/Compilers/Compiler-java.cs

@@ -2276,7 +2276,7 @@ private class GenericArrayElementLvalue : ILvalue {
        files.Add($"\"{Path.GetFullPath(file)}\"");
      }
      var classpath = GetClassPath(targetFilename);
-      var psi = new ProcessStartInfo("javac", string.Join(" ", files)) {
+      var psi = new ProcessStartInfo("javac", "-encoding UTF8 " + string.Join(" ", files)) {


This is fine for now but we should check what encoding we use for writing.
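
One way to keep the two sides consistent is to name the encoding explicitly both when writing the generated source and when invoking `javac`. Below is a minimal Python sketch of that idea, with hypothetical helper names `write_source` and `javac_command`. It only builds the command line and does not run the compiler, and it is not how the Dafny Java backend is actually structured.

```python
from pathlib import Path

SOURCE_ENCODING = "UTF-8"


def write_source(path, text):
    """Write generated source with an explicit encoding instead of the platform default."""
    Path(path).write_text(text, encoding=SOURCE_ENCODING)


def javac_command(files):
    """Build a javac argument list that declares the same encoding for reading."""
    return ["javac", "-encoding", SOURCE_ENCODING, *(str(f) for f in files)]
```

Sharing one constant between the writer and the compiler invocation means a mismatch cannot creep in if either side changes later.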

robin-aws added 2 commits October 25, 2022 11:36

Adding surrogate edge cases to Strings.dfy

9a6face

Handle UTF-16 \u escapes in Python and Go (mostly)

ddd52cc

robin-aws requested a review from cpitclaudel October 25, 2022 23:37

robin-aws added 2 commits October 25, 2022 16:38

Whitespace

d5ba377

Merge branch 'master' into git-issue-1890

396f466

robin-aws changed the title ~~Handle UTF-16 escapes and invalid surrogate sequences in Python and Go~~ fix: Handle UTF-16 escapes and invalid surrogate sequences in Python and Go Oct 25, 2022

cpitclaudel requested changes Oct 25, 2022

robin-aws and others added 5 commits October 26, 2022 13:21

PR feedback, especially replacing bogus regex

4b258f2

Merge branch 'git-issue-1890' of github.com:robin-aws/dafny into git-…

ec6f2da

…issue-1890

Apply suggestions from code review

a81cea9

Co-authored-by: Clément Pit-Claudel <[email protected]>

Update Source/DafnyRuntime/DafnyRuntime.py

62e0f2e

Co-authored-by: Clément Pit-Claudel <[email protected]>

Whitespace

1c2b871

robin-aws commented Oct 26, 2022

Merge branch 'git-issue-1890' of github.com:robin-aws/dafny into git-…

2c217ae

…issue-1890

robin-aws mentioned this pull request Oct 26, 2022

Go prints surrogate pairs incorrectly #2929

Open

robin-aws added the run-deep-tests Tells CI to run all tests label Oct 26, 2022

Poke CI

057bf25

cpitclaudel previously approved these changes Oct 27, 2022

Method rename, hopefully fixing Java testing flakiness

7e3d243

robin-aws dismissed cpitclaudel’s stale review via 7e3d243 October 27, 2022 18:12

robin-aws added 2 commits October 27, 2022 11:56

Merge branch 'master' of https://github.com/dafny-lang/dafny into git…

eccc1e6

…-issue-1890

Handle direct non-ASCII characters as well, release note

9670c7f

cpitclaudel mentioned this pull request Oct 27, 2022

Encoding mismatches result in runtime crashes in customer code compiled to Go #2934

Closed

robin-aws added 6 commits October 31, 2022 09:57

Merge branch 'master' into git-issue-1890

212f3ee

Revert attempt to fix Java stdout encoding (since Clement’s fix is in…

79d2f11

… now)

Merge branch 'git-issue-1890' of github.com:robin-aws/dafny into git-…

2dba4e3

…issue-1890

Dumb edit typo

86f2b89

Update allocated1 version of Strings.dfy

58a63ef

Merge branch 'master' of https://github.com/dafny-lang/dafny into git…

d7bdece

…-issue-1890

robin-aws added 12 commits November 1, 2022 13:36

Merge branch 'master' into git-issue-1890

fb913ef

Merge branch 'git-issue-1890' of github.com:robin-aws/dafny into git-…

f50dc60

…issue-1890

Try setting the file encoding for javac

f8c402f

Correct encoding flag for javac

2b2900a

Backing off on printing non-ASCII characters for now

4db28b2

At least now I can use %testDafnyForEachCompiler!

Whitespace, trying to debug expect failure on Windows

6a4e880

More windows expect failure debugging

72cbcd8

Adding bad expect to ManualCompile.dfy

fc82fbb

Confirming my suspicion that the problem is about feeding the JS program into node as stdin

Whoops, doesn’t work for C++

67da219

Avoiding non-ASCII characters in source

3861ee1

If nothing else our test runner doesn’t seem to support them and will need more work.

Revert ManualCompile.dfy

09f6276

Merge branch 'master' into git-issue-1890

f63d9b8

cpitclaudel reviewed Nov 3, 2022

cpitclaudel approved these changes Nov 3, 2022

robin-aws merged commit c81c3ca into dafny-lang:master Nov 3, 2022

robin-aws added a commit to robin-aws/dafny that referenced this pull request Nov 3, 2022

Saving tests case I had to take out of dafny-lang#2926

81a54d4
