Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Handle UTF-16 escapes and invalid surrogate sequences in Python and Go #2926

Merged
merged 32 commits into from
Nov 3, 2022

Conversation

robin-aws
Copy link
Member

@robin-aws robin-aws commented Oct 25, 2022

Fixes #1980. Fixes #2925.

Should fix the root cause of #2934, but I'm not claiming the complete fix, as attempting to include unescaped, non-ASCII characters in a Dafny source file is revealing platform-specific headaches in the integration test runner I'd like to address in a separate PR.

Printing non-ASCII characters, especially invalid UTF-16 sequences, is still inconsistent across backends, but at least should not crash after this change. Again this is difficult to test effectively across platforms in our testing architecture.

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

@robin-aws robin-aws changed the title Handle UTF-16 escapes and invalid surrogate sequences in Python and Go fix: Handle UTF-16 escapes and invalid surrogate sequences in Python and Go Oct 25, 2022
Copy link
Member

@cpitclaudel cpitclaudel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch and nice fixes.

Source/DafnyRuntime/DafnyRuntime.py Outdated Show resolved Hide resolved
Source/DafnyRuntime/DafnyRuntime.py Outdated Show resolved Hide resolved
Source/DafnyRuntime/DafnyRuntime.py Show resolved Hide resolved
@@ -597,6 +597,8 @@ func (seq Seq) UniqueElements() Set {
func (seq Seq) String() string {
if seq.isString {
s := ""
// Note this doesn't produce the right string in UTF-8,
// since it converts surrogates independently.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that a bug?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, just one I didn't want to bring in scope - I wasn't confident it was reasonable to depend on the necessary Go package to decode UTF-16. But I'll at least cut an issue (plus C++ is an even bigger mess since it uses the 8-bit char type).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is back to "it might not print the right thing but at least it won't crash, and the output of print isn't part of our "contract"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. And unfortunately the more I test the more I find ugly inconsistencies on corner cases. :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wr.Write("_dafny.Char(");
// See comment on the StringLiteralExpr case below.
if (Util.Utf16Escape.IsMatch(v)) {
char c = Util.UnescapedCharacters(v, false).Single();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it always at most one?

Copy link
Member Author

@robin-aws robin-aws Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parser should be guaranteeing its exactly one. We do something equivalent in the translator already: https://github.com/dafny-lang/dafny/blob/master/Source/DafnyCore/Verifier/Translator.ExpressionTranslator.cs#L306-L309

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet, thanks. Single will raise an exception if there's one than one element, right? So, nice safety check actually.

// but Dafny allows invalid sequences of surrogate characters.
// So if any are present, just emit a sequence of the direct UTF-16 code units instead.
var s = (string)str.Value;
if (!str.IsVerbatim && Util.Utf16Escape.IsMatch(s)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the C# string contains unpaired surrogates (not \u escaped, just directly, unescaped, in the string?) Is that impossible thanks to the way we parse?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It SHOULD be impossible because Dafny source files have to be UTF-8 encoded, and the scanner is decoding to int code points: https://github.com/dafny-lang/dafny/blob/master/Source/DafnyCore/Coco/Scanner.frame#L210

That code doesn't look particularly robust in the face of invalid UTF-8 sequences mind you, but it doesn't look like it's possible to produce surrogate values.

There's also the fact that the generated scanner code includes a straight cast from int to char, which could truncate a code point to 0xFFFF. I've got a fix for that in the unicode char branch (and I should probably cut an issue for it right away) but that still won't create surrogates at least.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sweet. I'll let you judge whether it's worth a Contract.Assert here — it might save our bacon down the line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there's nothing to assert here: I'm specifically NOT assuming surrogates are used correctly here. The only concern is accurately translating escape sequences to target language escape sequences (or something else equivalent). I'm not aware of any bugs in translating arbitrary char values in the string at least.

foreach (var c in Util.UnescapedCharacters(s, str.IsVerbatim)) {
wr.Write(comma);
wr.Write($"{(int)c}");
comma = ", ";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had an API to join, but maybe not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I did too. There's Util.Comma but it only produces strings rather than working with ConcreteSyntaxTrees.

Source/DafnyCore/Compilers/Compiler-go.cs Outdated Show resolved Hide resolved
Comment on lines 2124 to 2126


protected override void EmitStringLiteral(string str, bool isVerbatim, ConcreteSyntaxTree wr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
protected override void EmitStringLiteral(string str, bool isVerbatim, ConcreteSyntaxTree wr) {
protected override void EmitStringLiteral(string str, bool isVerbatim, ConcreteSyntaxTree wr) {

@@ -195,6 +196,9 @@ public static class Util {
UnescapedCharacters(s, isVerbatimString).Iter(ch => sb.Append(ch));
return sb.ToString();
}

public static readonly Regex Utf16Escape = new Regex(@"(?<!\\)\\u([0-9a-fA-F]{4})");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right: how about \\\uAAAA? It won't match, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, yup - I copied this from the existing ShortOctalEscape and ShortHexEscape patterns which are also wrong. :P

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that both of those other bogus regex patterns will be removed entirely in the fix to #2928, since they are trying to do entirely the wrong thing anyway.

/// For example, "ab\tuv\u12345" may be broken up as ["a", "b", "\t", "u", "v", "\u1234", "5"].
/// Consecutive non-escaped characters may or may not be enumerated as a single string.
/// </summary>
public static IEnumerable<string> Escapes(string p, bool isVerbatimString) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better name for this method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TokenizeEscapedString?

Comment on lines +65 to +67
// Ensuring we're precise enough about identifying \u escapes
print "I'm afraid you'll find escape quite impossible, \\u007", "\n";
print "Luckily I have this nifty gadget from my good friend, \\\u0051", "\n";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robin-aws robin-aws added the run-deep-tests Tells CI to run all tests label Oct 26, 2022
cpitclaudel
cpitclaudel previously approved these changes Oct 27, 2022
Copy link
Member

@cpitclaudel cpitclaudel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good!

@@ -2276,7 +2276,7 @@ private class GenericArrayElementLvalue : ILvalue {
files.Add($"\"{Path.GetFullPath(file)}\"");
}
var classpath = GetClassPath(targetFilename);
var psi = new ProcessStartInfo("javac", string.Join(" ", files)) {
var psi = new ProcessStartInfo("javac", "-encoding UTF8 " + string.Join(" ", files)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now but we should check what encoding we use for writing.

@robin-aws robin-aws merged commit c81c3ca into dafny-lang:master Nov 3, 2022
robin-aws added a commit to robin-aws/dafny that referenced this pull request Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-deep-tests Tells CI to run all tests
Projects
None yet
2 participants