Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Neither VM nor dart2js accepts null character in source code. #18090

Closed
lrhn opened this issue Apr 8, 2014 · 10 comments
Closed

Neither VM nor dart2js accepts null character in source code. #18090

lrhn opened this issue Apr 8, 2014 · 10 comments
Assignees
Labels
area-front-end Use area-front-end for front end / CFE / kernel format related issues. front-end-fasta P4 type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)

Comments

@lrhn
Copy link
Member

lrhn commented Apr 8, 2014

The Dart specification allows all Unicode characters as string content (depending on the string type and delimiters). There is no exception made for the null character (ASCII NUL, U+0000).

However, neither implementation actually allow a program that contains a literal NUL inside a string literal. The Analyzer does allow it.

Example code (you have to replace the visual NUL with a real NUL):

main() {
 print("␀");
}

Posting as single bug for both implementations in case it is really a spec change that is needed.

@gbracha
Copy link
Contributor

gbracha commented Apr 8, 2014

Is there a reason why NUL is special? Is this disallowed in the browser for example?

@iposva-google
Copy link
Contributor

Please do not file multi-area bugs. Making this the dart2js bug.


Removed Area-VM label.

@iposva-google
Copy link
Contributor

Filed issue #18100 for the VM version of this bug.

@peter-ahe-google
Copy link
Contributor

We use the NUL character to detect end of file. It is a dummy value that acts as a sentinel, which is a standard optimization technique described by Knuth. I consider it a bug in dart2js that it has this limitation, but I never got around to fixing it.

Specifying this limitation should have little impact on Dart programmers: there are other, better ways to create a string with a NUL.

I'm not particularly attached to either option. The advantage of disallowing NUL would be that it is easier to manipulate Dart source code in a C program. I believe the so-called "modified UTF-8" of the JVM is exactly a codification of this limitation.

@gbracha
Copy link
Contributor

gbracha commented Apr 8, 2014

I don't have a very strong position on the substance of the matter. That said we have a specification that is effectively frozen - TC52 voting is already in progress. The specification doubtless has bugs that will have to be corrected in future revisions, but I don't think this falls into that category.

@lrhn
Copy link
Member Author

lrhn commented Apr 9, 2014

JavaScript does not have a problem with ASCII NULL in script code.
It seems browsers may have a problem with it in HTML code, but in a separate script file, it works fine.

The "modified UTF-8" has the disadvantage of not being valid UTF-8. It is not accepted by our own UTF8.decode:

    import "dart:convert";
    main() { print(UTF8.decode([0xc0, 0x80]).codeUnitAt(0)); }

gives: FormatException: Overlong encoding of 0x0
   

@kevmoo kevmoo added type-bug Incorrect behavior (everything from a crash to more subtle misbehavior) and removed priority-unassigned labels Feb 29, 2016
@vsmenon vsmenon added the area-web Use area-web for Dart web related issues, including the DDC and dart2js compilers and JS interop. label Jul 20, 2019
@rakudrama
Copy link
Member

@johnniwinther does the CFE accept \u0000 in string literals and comments?

@lrhn
Copy link
Member Author

lrhn commented Dec 14, 2021

Most likely it does not. If I create a file containing a zero byte inside a string literal,

void main() => print("�");

as tests/language/zero_byte_test.dart, and run it with the fasta target, it fails to see the bytes after the zero-byte.
The errors are all over the place:

tests/language/zero_byte_test.dart:1:22: Error: String starting with " must end with ".
void main() => print("");
                     ^
tests/language/zero_byte_test.dart:1:22: Error: The control character U+0000 can only be used in strings and comments.
void main() => print("");
                     ^
tests/language/zero_byte_test.dart:1:24: Error: String starting with " must end with ".
void main() => print("");
                       ^^^
tests/language/zero_byte_test.dart:1:21: Error: Can't find ')' to match '('.
void main() => print("");
                    ^
tests/language/zero_byte_test.dart:1:24: Error: Expected ';' after this.
void main() => print("");
                       ^^^
tests/language/zero_byte_test.dart:2:1: Error: Unexpected token ''.

So, the parser appears to not be able to handle zero-bytes in the source (even though its error message suggests that it should).

Also, from utf8_bytes_scanner.dart:

class Utf8BytesScanner extends AbstractScanner {
  /**
   * The file content.
   *
   * The content is zero-terminated.
   */
  final List<int> bytes;

which doesn't prove that it's actually used as terminator, but it is slightly worrying.

@johnniwinther
Copy link
Member

I guess that if this is a parser issue, the analyzer should fail similarly on it.

@lrhn
Copy link
Member Author

lrhn commented Dec 14, 2021

The analyzer does fail:

unexpected analysis errors in zero_byte_test.dart:
- Line 1, column 22: SYNTACTIC_ERROR.ILLEGAL_CHARACTER
  Illegal character '0'.

- Line 1, column 22: SYNTACTIC_ERROR.UNTERMINATED_STRING_LITERAL
  Unterminated string literal.

- Line 1, column 24: SYNTACTIC_ERROR.EXPECTED_TOKEN
  Expected to find ';'.

- Line 1, column 26: SYNTACTIC_ERROR.UNTERMINATED_STRING_LITERAL
  Unterminated string literal.

- Line 2, column 1: SYNTACTIC_ERROR.EXPECTED_TOKEN
  Expected to find ')'.

@lrhn lrhn added area-front-end Use area-front-end for front end / CFE / kernel format related issues. front-end-fasta and removed web-dart2js area-web Use area-web for Dart web related issues, including the DDC and dart2js compilers and JS interop. labels Dec 21, 2021
@johnniwinther johnniwinther added P3 A lower priority bug or feature request P4 and removed P3 A lower priority bug or feature request labels Jul 4, 2022
copybara-service bot pushed a commit that referenced this issue Sep 25, 2024
…ecks

The (utf8) scanner currently has this thing where you give it a
0-terminated byte-array (i.e. you read the file, then allocate
something that's 1 bigger, copy the data, then give it to the scanner)
to 'avoid bounds checks'.
Dart still has bounds checks though - they're just implicit.

As for the string scanner ut gets a string, then creates a new string
like `string + '\x00'` - so basically the same thing.

This CL uses the 'vm:unsafe:no-bounds-checks' pragma, removing the
implicit bounds checks, adding explicit bounds checks,
saving ~73.6 mio instructions when compiling the CFE in the process:

```
Comparing snapshot #1 with snapshot #2
cycles:u: -0.9983% +/- 0.6563% (-174026333.30 +/- 114410028.98)
instructions:u: -0.3416% +/- 0.0005% (-73659267.00 +/- 108567.20)
branch-misses:u: -4.8952% +/- 2.2612% (-3172939.50 +/- 1465641.18)
```

With the scanner-benchmark with `--bytes` I get this:

```
msec task-clock:u: -1.2251% +/- 0.6355% (-50.64 +/- 26.27)
cycles:u: -1.2376% +/- 0.6385% (-223642830.80 +/- 115393789.68)
instructions:u: -2.8155% +/- 0.0000% (-1153243856.00 +/- 428.11)
seconds time elapsed: -1.2165% +/- 0.6408% (-0.05 +/- 0.03)
seconds user: -1.1539% +/- 0.6495% (-0.05 +/- 0.03)
```

With the scanner-benchmark with `--string` I get this:

```
msec task-clock:u: -7.6439% +/- 0.6628% (-366.08 +/- 31.74)
page-faults:u: -95.0034% +/- 0.0014% (-228023.50 +/- 3.41)
instructions:u: 2.1041% +/- 0.0000% (897941907.60 +/- 2082.79)
branch-misses:u: 3.2994% +/- 1.4675% (3239735.30 +/- 1440940.88)
seconds time elapsed: -7.6595% +/- 0.6610% (-0.37 +/- 0.03)
seconds user: -0.8801% +/- 0.7676% (-0.04 +/- 0.03)
seconds sys: -92.0140% +/- 2.8075% (-0.33 +/- 0.01)
MarkSweep(   old space) goes from 6 to 0
Notice combined GC time goes from 112 ms to 41 ms (notice only 1 run each).
```

Where I'll note that the 'vm:unsafe:no-bounds-checks' pragma doesn't
(yet?) work for `String.codeUnitAt`.
See https://dart-review.googlesource.com/c/sdk/+/384540
(and https://dart-review.googlesource.com/c/sdk/+/385201) for details.
I assume the relatively  big change here is caused by not allocating
a new string with a 0-byte in the end each time.

Note that the read-allocate-copy dance is still performed for the utf8
scanner in this CL as it requires changing all call-sites instead.
It will be done in a follow-up CL where the "end-of-file" int will
likely also be changed to `-1` to (I assume) allow for having the
0-byte in the middle of a file (see also the 10+ year old bug at
#18090)

Note: The pragma (currently?) only has effect in AOT and this change
will (for the utf8 scanner) make the JIT version slower
(probably by the same ~73.6 mio instructions as - at least in AOT -
the implicit check is 6 instructions and the explicit one is 3
instructions). As the pragma doesn't work in the StringScanner anyway
I expect the change to be somewhat equivalent there. Once the
read-allocate-copy dance is also removed from the utf8 scanner I expect
the combined result to be positive all around.

Update: With https://dart-review.googlesource.com/c/sdk/+/385201 landed
I get these changes:

Compiling the CFE:
```
instructions:u: -0.4520% +/- 0.0002% (-98470955.29 +/- 42253.40)
```

Scanner benchmark with `--bytes`:

```
msec task-clock:u: -2.1758% +/- 0.2316% (-92.07 +/- 9.80)
cycles:u: -2.1941% +/- 0.2283% (-405224983.11 +/- 42160655.88)
instructions:u: -3.1049% +/- 0.0000% (-1272360052.95 +/- 706.54)
branch-misses:u: 2.4718% +/- 0.5142% (2371345.23 +/- 493257.76)
seconds time elapsed: -2.1761% +/- 0.2317% (-0.09 +/- 0.01)
seconds user: -2.2071% +/- 0.2308% (-0.09 +/- 0.01)
```

Scanner benchmark with `--string`:

```
msec task-clock:u: -15.0073% +/- 0.2175% (-745.93 +/- 10.81)
page-faults:u: -95.0035% +/- 0.0003% (-228024.25 +/- 0.81)
cycles:u: -7.7986% +/- 0.2329% (-1558985588.99 +/- 46560962.79)
instructions:u: -3.7054% +/- 0.0000% (-1581977447.66 +/- 481.68)
branch-misses:u: -0.6880% +/- 0.5818% (-689453.22 +/- 583101.50)
seconds time elapsed: -15.0198% +/- 0.2170% (-0.75 +/- 0.01)
seconds user: -8.8149% +/- 0.2648% (-0.41 +/- 0.01)
seconds sys: -94.1247% +/- 1.6444% (-0.34 +/- 0.01)
MarkSweep(   old space) goes from 6 to 0
```

Change-Id: I524a21f488da7df5dc9d2cdf40112b84896ad3e0
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/383324
Reviewed-by: Brian Wilkerson <[email protected]>
Reviewed-by: Johnni Winther <[email protected]>
Commit-Queue: Jens Johansen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-front-end Use area-front-end for front end / CFE / kernel format related issues. front-end-fasta P4 type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)
Projects
None yet
Development

No branches or pull requests

9 participants