Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Handle UTF-16 escapes and invalid surrogate sequences in Python and Go #2926

Merged
merged 32 commits into from
Nov 3, 2022
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
9a6face
Adding surrogate edge cases to Strings.dfy
robin-aws Oct 25, 2022
ddd52cc
Handle UTF-16 \u escapes in Python and Go (mostly)
robin-aws Oct 25, 2022
d5ba377
Whitespace
robin-aws Oct 25, 2022
396f466
Merge branch 'master' into git-issue-1890
robin-aws Oct 25, 2022
4b258f2
PR feedback, especially replacing bogus regex
robin-aws Oct 26, 2022
ec6f2da
Merge branch 'git-issue-1890' of github.com:robin-aws/dafny into git-…
robin-aws Oct 26, 2022
a81cea9
Apply suggestions from code review
robin-aws Oct 26, 2022
62e0f2e
Update Source/DafnyRuntime/DafnyRuntime.py
robin-aws Oct 26, 2022
1c2b871
Whitespace
robin-aws Oct 26, 2022
2c217ae
Merge branch 'git-issue-1890' of github.com:robin-aws/dafny into git-…
robin-aws Oct 26, 2022
057bf25
Poke CI
robin-aws Oct 26, 2022
7e3d243
Method rename, hopefully fixing Java testing flakiness
robin-aws Oct 27, 2022
eccc1e6
Merge branch 'master' of https://github.com/dafny-lang/dafny into git…
robin-aws Oct 27, 2022
9670c7f
Handle direct non-ASCII characters as well, release note
robin-aws Oct 27, 2022
212f3ee
Merge branch 'master' into git-issue-1890
robin-aws Oct 31, 2022
79d2f11
Revert attempt to fix Java stdout encoding (since Clement’s fix is in…
robin-aws Oct 31, 2022
2dba4e3
Merge branch 'git-issue-1890' of github.com:robin-aws/dafny into git-…
robin-aws Oct 31, 2022
86f2b89
Dumb edit typo
robin-aws Oct 31, 2022
58a63ef
Update allocated1 version of Strings.dfy
robin-aws Oct 31, 2022
d7bdece
Merge branch 'master' of https://github.com/dafny-lang/dafny into git…
robin-aws Nov 1, 2022
fb913ef
Merge branch 'master' into git-issue-1890
robin-aws Nov 1, 2022
f50dc60
Merge branch 'git-issue-1890' of github.com:robin-aws/dafny into git-…
robin-aws Nov 1, 2022
f8c402f
Try setting the file encoding for javac
robin-aws Nov 1, 2022
2b2900a
Correct encoding flag for javac
robin-aws Nov 1, 2022
4db28b2
Backing off on printing non-ASCII characters for now
robin-aws Nov 2, 2022
6a4e880
Whitespace, trying to debug expect failure on Windows
robin-aws Nov 2, 2022
72cbcd8
More windows expect failure debugging
robin-aws Nov 2, 2022
fc82fbb
Adding bad expect to ManualCompile.dfy
robin-aws Nov 2, 2022
67da219
Whoops, doesn’t work for C++
robin-aws Nov 2, 2022
3861ee1
Avoiding non-ASCII characters in source
robin-aws Nov 3, 2022
09f6276
Revert ManualCompile.dfy
robin-aws Nov 3, 2022
f63d9b8
Merge branch 'master' into git-issue-1890
robin-aws Nov 3, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 30 additions & 4 deletions Source/DafnyCore/Compilers/Compiler-go.cs
Original file line number Diff line number Diff line change
Expand Up @@ -2061,12 +2061,36 @@ protected override void EmitLiteralExpr(ConcreteSyntaxTree wr, LiteralExpr e) {
wr.Write((bool)e.Value ? "true" : "false");
} else if (e is CharLiteralExpr) {
var v = (string)e.Value;
wr.Write("_dafny.Char('{0}')", TranslateEscapes(v, isChar: true));
robin-aws marked this conversation as resolved.
Show resolved Hide resolved
wr.Write("_dafny.Char(");
// See comment on the StringLiteralExpr case below.
if (Util.Utf16Escape.IsMatch(v)) {
char c = Util.UnescapedCharacters(v, false).Single();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it always at most one?

Copy link
Member Author

@robin-aws robin-aws Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parser should be guaranteeing its exactly one. We do something equivalent in the translator already: https://github.com/dafny-lang/dafny/blob/master/Source/DafnyCore/Verifier/Translator.ExpressionTranslator.cs#L306-L309

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet, thanks. Single will raise an exception if there's one than one element, right? So, nice safety check actually.

wr.Write($"{(int)c}");
} else {
wr.Write("'{0}'", TranslateEscapes(v, isChar: true));
}
wr.Write(")");
} else if (e is StringLiteralExpr) {
var str = (StringLiteralExpr)e;
wr.Write("_dafny.SeqOfString(");
TrStringLiteral(str, wr);
wr.Write(")");
// It may not be possible to translate \u escapes into a valid Go string,
// since Go string literals have to be encodable in UTF-8,
// but Dafny allows invalid sequences of surrogate characters.
// So if any are present, just emit a sequence of the direct UTF-16 code units instead.
var s = (string)str.Value;
if (!str.IsVerbatim && Util.Utf16Escape.IsMatch(s)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the C# string contains unpaired surrogates (not \u escaped, just directly, unescaped, in the string?) Is that impossible thanks to the way we parse?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It SHOULD be impossible because Dafny source files have to be UTF-8 encoded, and the scanner is decoding to int code points: https://github.com/dafny-lang/dafny/blob/master/Source/DafnyCore/Coco/Scanner.frame#L210

That code doesn't look particularly robust in the face of invalid UTF-8 sequences mind you, but it doesn't look like it's possible to produce surrogate values.

There's also the fact that the generated scanner code includes a straight cast from int to char, which could truncate a code point to 0xFFFF. I've got a fix for that in the unicode char branch (and I should probably cut an issue for it right away) but that still won't create surrogates at least.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sweet. I'll let you judge whether it's worth a Contract.Assert here — it might save our bacon down the line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there's nothing to assert here: I'm specifically NOT assuming surrogates are used correctly here. The only concern is accurately translating escape sequences to target language escape sequences (or something else equivalent). I'm not aware of any bugs in translating arbitrary char values in the string at least.

wr.Write("_dafny.SeqOfChars(");
var comma = "";
foreach (var c in Util.UnescapedCharacters(s, str.IsVerbatim)) {
wr.Write(comma);
wr.Write($"{(int)c}");
comma = ", ";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had an API to join, but maybe not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I did too. There's Util.Comma but it only produces strings rather than working with ConcreteSyntaxTrees.

}
wr.Write(")");
robin-aws marked this conversation as resolved.
Show resolved Hide resolved
} else {
wr.Write("_dafny.SeqOfString(");
TrStringLiteral(str, wr);
wr.Write(")");
}
} else if (AsNativeType(e.Type) is NativeType nt) {
wr.Write("{0}({1})", GetNativeTypeName(nt), (BigInteger)e.Value);
} else if (e.Value is BigInteger i) {
Expand Down Expand Up @@ -2097,6 +2121,8 @@ void EmitIntegerLiteral(BigInteger i, ConcreteSyntaxTree wr) {
}
}



protected override void EmitStringLiteral(string str, bool isVerbatim, ConcreteSyntaxTree wr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
protected override void EmitStringLiteral(string str, bool isVerbatim, ConcreteSyntaxTree wr) {
protected override void EmitStringLiteral(string str, bool isVerbatim, ConcreteSyntaxTree wr) {

var n = str.Length;
if (!isVerbatim) {
Expand Down
4 changes: 4 additions & 0 deletions Source/DafnyCore/Util.cs
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
using System.Diagnostics.Contracts;
using System.Reactive.Disposables;
using System.Reactive.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using JetBrains.Annotations;
using Microsoft.Boogie;
Expand Down Expand Up @@ -195,6 +196,9 @@ public static string RemoveEscaping(string s, bool isVerbatimString) {
UnescapedCharacters(s, isVerbatimString).Iter(ch => sb.Append(ch));
return sb.ToString();
}

public static readonly Regex Utf16Escape = new Regex(@"(?<!\\)\\u([0-9a-fA-F]{4})");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right: how about \\\uAAAA? It won't match, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, yup - I copied this from the existing ShortOctalEscape and ShortHexEscape patterns which are also wrong. :P

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that both of those other bogus regex patterns will be removed entirely in the fix to #2928, since they are trying to do entirely the wrong thing anyway.


/// <summary>
/// Returns the characters of the well-parsed string p, replacing any
/// escaped characters by the actual characters.
Expand Down
2 changes: 2 additions & 0 deletions Source/DafnyRuntime/DafnyRuntime.go
Original file line number Diff line number Diff line change
Expand Up @@ -597,6 +597,8 @@ func (seq Seq) UniqueElements() Set {
func (seq Seq) String() string {
if seq.isString {
s := ""
// Note this doesn't produce the right string in UTF-8,
// since it converts surrogates independently.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that a bug?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, just one I didn't want to bring in scope - I wasn't confident it was reasonable to depend on the necessary Go package to decode UTF-16. But I'll at least cut an issue (plus C++ is an even bigger mess since it uses the 8-bit char type).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is back to "it might not print the right thing but at least it won't crash, and the output of print isn't part of our "contract"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. And unfortunately the more I test the more I find ugly inconsistencies on corner cases. :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for _, c := range seq.contents {
s += c.(Char).String()
}
Expand Down
13 changes: 12 additions & 1 deletion Source/DafnyRuntime/DafnyRuntime.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,13 +17,24 @@ def __get__(self, instance, owner):
def print(value):
builtins.print(string_of(value), end="")

# Dafny strings are currently sequences of UTF-16 code units.
# To make a best effort attempt at printing the right characters we attempt to decode,
# but have to allow for invalid sequences.
def string_from_utf_16(utf_16_code_units):
return b''.join([ord(c).to_bytes(2, 'little') for c in utf_16_code_units]).decode("utf-16", errors = 'replace')
robin-aws marked this conversation as resolved.
Show resolved Hide resolved

def string_of(value) -> str:
if hasattr(value, '__dafnystr__'):
return value.__dafnystr__()
elif value is None:
return "null"
elif isinstance(value, bool):
return "true" if value else "false"
elif isinstance(value, str):
# This is only for Dafny char values.
# Dafny strings are represented as Seq's of indivdual char values,
robin-aws marked this conversation as resolved.
Show resolved Hide resolved
# and Seq defines __dafnystr__.
return string_from_utf_16(value)
cpitclaudel marked this conversation as resolved.
Show resolved Hide resolved
elif isinstance(value, tuple):
return '(' + ', '.join(map(string_of, value)) + ')'
elif isinstance(value, FunctionType):
Expand Down Expand Up @@ -82,7 +93,7 @@ def UniqueElements(self):

def __dafnystr__(self) -> str:
if self.isStr:
return ''.join(self)
return string_from_utf_16(self)
return '[' + ', '.join(map(string_of, self)) + ']'

def __add__(self, other):
Expand Down
48 changes: 45 additions & 3 deletions Test/dafny0/Strings.dfy
Original file line number Diff line number Diff line change
@@ -1,4 +1,9 @@
// RUN: %dafny /compile:3 /print:"%t.print" /dprint:"%t.dprint" "%s" > "%t"
// RUN: %baredafny verify %args "%s" > "%t"
// RUN: %baredafny run --no-verify --target=cs %args "%s" >> "%t"
// RUN: %baredafny run --no-verify --target=js %args "%s" >> "%t"
// RUN: %baredafny run --no-verify --target=go %args "%s" >> "%t"
// RUN: %baredafny run --no-verify --target=java %args "%s" >> "%t"
// RUN: %baredafny run --no-verify --target=py %args "%s" >> "%t"
// RUN: %diff "%s.expect" "%t"

method Char(a: char, s: string, i: int) returns (b: char)
Expand Down Expand Up @@ -31,12 +36,31 @@ method Main()
var s, t := M(ch, ch);
print "ch = ", ch, "\n";
print "The string is: " + s + "\n";
var x, y, z := Escapes();
var x, y, z, zz := Escapes();
print "Escape X: ", x, "\n";
print "Escape Y: ", y, "\n";
print "Escape Z: ", z, "\n";
print "Escape ZZ: ", zz, "\n";
var c, d := CharEscapes();
print "Here is the end" + [c, d] + [' ', ' ', ' '] + [[d]][0] + " ", d, "\n";

var x?, y?, z? := WeirdStrings();

// Printing these invalid (in UTF-16) strings can lead to at least inconsistent
// output across the backends, but they should never crash.
// We assert that the invalid state is modelled correctly as well.
expect |x?| == 30;
expect x?[29] as int == 55296;
print "Weird string X: ", x?, "\n";
expect |y?| == 30;
expect x?[29] as int == 55296;
print "Weird string Y: ", y?, "\n";
expect |z?| > 2;
expect z?[0..2] == ['\ude0e', '\ud83d'];
print "Weird string Z: ", z?, "\n";

var c?, d? := WeirdChars();
print "These characters are quite confused: ", c?, ' ', d?, "\n";
}

method GimmieAChar(s: string) returns (ch: char)
Expand All @@ -50,12 +74,14 @@ method GimmieAChar(s: string) returns (ch: char)
}
}

method Escapes() returns (x: string, y: string, z: string)
method Escapes() returns (x: string, y: string, z: string, zz: string)
ensures |zz| > 2
{
x := "I say \"hello\" \\ you say \'good bye'";
y := @"I say ""hello"" \ you say 'good bye'";
assert x == y;
z := "There needs to be \u0052\u0026\u0044\n\tYes, sir";
zz := "\ud83d\ude0e is the UTF-16 for a very cool emoji";
}

method CharEscapes() returns (c: char, d: char)
Expand All @@ -67,3 +93,19 @@ method CharEscapes() returns (c: char, d: char)
c := '\n';
d := '*';
}

// Strings that aren't valid UTF-16 sequences
method WeirdStrings() returns (x: string, y: string, z: string)
{
x := "What even is this character: \uD800";
y := "What even is this character: " + [0xD800 as char];
assert x == y;
z := "\ude0e\ud83d is not using surrogates correctly";
}

// Surrogate code points
method WeirdChars() returns (c: char, d: char)
{
c := '\uD800';
d := 0xDFFF as char;
}
69 changes: 68 additions & 1 deletion Test/dafny0/Strings.dfy.expect
Original file line number Diff line number Diff line change
@@ -1,10 +1,77 @@

Dafny program verifier finished with 6 verified, 0 errors
Dafny program verifier finished with 8 verified, 0 errors

Dafny program verifier did not attempt verification
ch = D
The string is: DDD
Escape X: I say "hello" \ you say 'good bye'
Escape Y: I say "hello" \ you say 'good bye'
Escape Z: There needs to be R&D
Yes, sir
Escape ZZ: 😎 is the UTF-16 for a very cool emoji
Here is the end
* * *
Weird string X: What even is this character: �
Weird string Y: What even is this character: �
Weird string Z: �� is not using surrogates correctly
These characters are quite confused: � �

Dafny program verifier did not attempt verification
ch = D
The string is: DDD
Escape X: I say "hello" \ you say 'good bye'
Escape Y: I say "hello" \ you say 'good bye'
Escape Z: There needs to be R&D
Yes, sir
Escape ZZ: 😎 is the UTF-16 for a very cool emoji
Here is the end
* * *
Weird string X: What even is this character: �
Weird string Y: What even is this character: �
Weird string Z: �� is not using surrogates correctly
These characters are quite confused: � �

Dafny program verifier did not attempt verification
ch = D
The string is: DDD
Escape X: I say "hello" \ you say 'good bye'
Escape Y: I say "hello" \ you say 'good bye'
Escape Z: There needs to be R&D
Yes, sir
Escape ZZ: �� is the UTF-16 for a very cool emoji
Here is the end
* * *
Weird string X: What even is this character: �
Weird string Y: What even is this character: �
Weird string Z: �� is not using surrogates correctly
These characters are quite confused: � �

Dafny program verifier did not attempt verification
ch = D
The string is: DDD
Escape X: I say "hello" \ you say 'good bye'
Escape Y: I say "hello" \ you say 'good bye'
Escape Z: There needs to be R&D
Yes, sir
Escape ZZ: 😎 is the UTF-16 for a very cool emoji
Here is the end
* * *
Weird string X: What even is this character: ?
Weird string Y: What even is this character: ?
Weird string Z: ?? is not using surrogates correctly
These characters are quite confused: ? ?

Dafny program verifier did not attempt verification
ch = D
The string is: DDD
Escape X: I say "hello" \ you say 'good bye'
Escape Y: I say "hello" \ you say 'good bye'
Escape Z: There needs to be R&D
Yes, sir
Escape ZZ: 😎 is the UTF-16 for a very cool emoji
Here is the end
* * *
Weird string X: What even is this character: �
Weird string Y: What even is this character: �
Weird string Z: �� is not using surrogates correctly
These characters are quite confused: � �