Neither VM nor dart2js accepts null character in source code. #18090

lrhn · 2014-04-08T09:39:35Z

The Dart specification allows all Unicode characters as string content (depending on the string type and delimiters). There is no exception made for the null character (ASCII NUL, U+0000).

However, neither implementation actually allow a program that contains a literal NUL inside a string literal. The Analyzer does allow it.

Example code (you have to replace the visual NUL with a real NUL):

main() {
print("␀");
}

Posting as single bug for both implementations in case it is really a spec change that is needed.

gbracha · 2014-04-08T19:28:10Z

Is there a reason why NUL is special? Is this disallowed in the browser for example?

iposva-google · 2014-04-08T19:47:26Z

Please do not file multi-area bugs. Making this the dart2js bug.

Removed Area-VM label.

iposva-google · 2014-04-08T19:54:58Z

Filed issue #18100 for the VM version of this bug.

peter-ahe-google · 2014-04-08T20:27:18Z

We use the NUL character to detect end of file. It is a dummy value that acts as a sentinel, which is a standard optimization technique described by Knuth. I consider it a bug in dart2js that it has this limitation, but I never got around to fixing it.

Specifying this limitation should have little impact on Dart programmers: there are other, better ways to create a string with a NUL.

I'm not particularly attached to either option. The advantage of disallowing NUL would be that it is easier to manipulate Dart source code in a C program. I believe the so-called "modified UTF-8" of the JVM is exactly a codification of this limitation.

gbracha · 2014-04-08T20:36:11Z

I don't have a very strong position on the substance of the matter. That said we have a specification that is effectively frozen - TC52 voting is already in progress. The specification doubtless has bugs that will have to be corrected in future revisions, but I don't think this falls into that category.

lrhn · 2014-04-09T07:28:32Z

JavaScript does not have a problem with ASCII NULL in script code.
It seems browsers may have a problem with it in HTML code, but in a separate script file, it works fine.

The "modified UTF-8" has the disadvantage of not being valid UTF-8. It is not accepted by our own UTF8.decode:

import "dart:convert";
main() { print(UTF8.decode([0xc0, 0x80]).codeUnitAt(0)); }

gives: FormatException: Overlong encoding of 0x0

rakudrama · 2021-12-14T08:49:13Z

@johnniwinther does the CFE accept \u0000 in string literals and comments?

lrhn · 2021-12-14T10:39:21Z

Most likely it does not. If I create a file containing a zero byte inside a string literal,

void main() => print("�");

as tests/language/zero_byte_test.dart, and run it with the fasta target, it fails to see the bytes after the zero-byte.
The errors are all over the place:

tests/language/zero_byte_test.dart:1:22: Error: String starting with " must end with ".
void main() => print("");
                     ^
tests/language/zero_byte_test.dart:1:22: Error: The control character U+0000 can only be used in strings and comments.
void main() => print("");
                     ^
tests/language/zero_byte_test.dart:1:24: Error: String starting with " must end with ".
void main() => print("");
                       ^^^
tests/language/zero_byte_test.dart:1:21: Error: Can't find ')' to match '('.
void main() => print("");
                    ^
tests/language/zero_byte_test.dart:1:24: Error: Expected ';' after this.
void main() => print("");
                       ^^^
tests/language/zero_byte_test.dart:2:1: Error: Unexpected token ''.

So, the parser appears to not be able to handle zero-bytes in the source (even though its error message suggests that it should).

Also, from utf8_bytes_scanner.dart:

class Utf8BytesScanner extends AbstractScanner {
  /**
   * The file content.
   *
   * The content is zero-terminated.
   */
  final List<int> bytes;

which doesn't prove that it's actually used as terminator, but it is slightly worrying.

johnniwinther · 2021-12-14T11:14:39Z

I guess that if this is a parser issue, the analyzer should fail similarly on it.

lrhn · 2021-12-14T11:27:54Z

The analyzer does fail:

unexpected analysis errors in zero_byte_test.dart:
- Line 1, column 22: SYNTACTIC_ERROR.ILLEGAL_CHARACTER
  Illegal character '0'.

- Line 1, column 22: SYNTACTIC_ERROR.UNTERMINATED_STRING_LITERAL
  Unterminated string literal.

- Line 1, column 24: SYNTACTIC_ERROR.EXPECTED_TOKEN
  Expected to find ';'.

- Line 1, column 26: SYNTACTIC_ERROR.UNTERMINATED_STRING_LITERAL
  Unterminated string literal.

- Line 2, column 1: SYNTACTIC_ERROR.EXPECTED_TOKEN
  Expected to find ')'.

…ecks The (utf8) scanner currently has this thing where you give it a 0-terminated byte-array (i.e. you read the file, then allocate something that's 1 bigger, copy the data, then give it to the scanner) to 'avoid bounds checks'. Dart still has bounds checks though - they're just implicit. As for the string scanner ut gets a string, then creates a new string like `string + '\x00'` - so basically the same thing. This CL uses the 'vm:unsafe:no-bounds-checks' pragma, removing the implicit bounds checks, adding explicit bounds checks, saving ~73.6 mio instructions when compiling the CFE in the process: ``` Comparing snapshot #1 with snapshot #2 cycles:u: -0.9983% +/- 0.6563% (-174026333.30 +/- 114410028.98) instructions:u: -0.3416% +/- 0.0005% (-73659267.00 +/- 108567.20) branch-misses:u: -4.8952% +/- 2.2612% (-3172939.50 +/- 1465641.18) ``` With the scanner-benchmark with `--bytes` I get this: ``` msec task-clock:u: -1.2251% +/- 0.6355% (-50.64 +/- 26.27) cycles:u: -1.2376% +/- 0.6385% (-223642830.80 +/- 115393789.68) instructions:u: -2.8155% +/- 0.0000% (-1153243856.00 +/- 428.11) seconds time elapsed: -1.2165% +/- 0.6408% (-0.05 +/- 0.03) seconds user: -1.1539% +/- 0.6495% (-0.05 +/- 0.03) ``` With the scanner-benchmark with `--string` I get this: ``` msec task-clock:u: -7.6439% +/- 0.6628% (-366.08 +/- 31.74) page-faults:u: -95.0034% +/- 0.0014% (-228023.50 +/- 3.41) instructions:u: 2.1041% +/- 0.0000% (897941907.60 +/- 2082.79) branch-misses:u: 3.2994% +/- 1.4675% (3239735.30 +/- 1440940.88) seconds time elapsed: -7.6595% +/- 0.6610% (-0.37 +/- 0.03) seconds user: -0.8801% +/- 0.7676% (-0.04 +/- 0.03) seconds sys: -92.0140% +/- 2.8075% (-0.33 +/- 0.01) MarkSweep( old space) goes from 6 to 0 Notice combined GC time goes from 112 ms to 41 ms (notice only 1 run each). ``` Where I'll note that the 'vm:unsafe:no-bounds-checks' pragma doesn't (yet?) work for `String.codeUnitAt`. See https://dart-review.googlesource.com/c/sdk/+/384540 (and https://dart-review.googlesource.com/c/sdk/+/385201) for details. I assume the relatively big change here is caused by not allocating a new string with a 0-byte in the end each time. Note that the read-allocate-copy dance is still performed for the utf8 scanner in this CL as it requires changing all call-sites instead. It will be done in a follow-up CL where the "end-of-file" int will likely also be changed to `-1` to (I assume) allow for having the 0-byte in the middle of a file (see also the 10+ year old bug at #18090) Note: The pragma (currently?) only has effect in AOT and this change will (for the utf8 scanner) make the JIT version slower (probably by the same ~73.6 mio instructions as - at least in AOT - the implicit check is 6 instructions and the explicit one is 3 instructions). As the pragma doesn't work in the StringScanner anyway I expect the change to be somewhat equivalent there. Once the read-allocate-copy dance is also removed from the utf8 scanner I expect the combined result to be positive all around. Update: With https://dart-review.googlesource.com/c/sdk/+/385201 landed I get these changes: Compiling the CFE: ``` instructions:u: -0.4520% +/- 0.0002% (-98470955.29 +/- 42253.40) ``` Scanner benchmark with `--bytes`: ``` msec task-clock:u: -2.1758% +/- 0.2316% (-92.07 +/- 9.80) cycles:u: -2.1941% +/- 0.2283% (-405224983.11 +/- 42160655.88) instructions:u: -3.1049% +/- 0.0000% (-1272360052.95 +/- 706.54) branch-misses:u: 2.4718% +/- 0.5142% (2371345.23 +/- 493257.76) seconds time elapsed: -2.1761% +/- 0.2317% (-0.09 +/- 0.01) seconds user: -2.2071% +/- 0.2308% (-0.09 +/- 0.01) ``` Scanner benchmark with `--string`: ``` msec task-clock:u: -15.0073% +/- 0.2175% (-745.93 +/- 10.81) page-faults:u: -95.0035% +/- 0.0003% (-228024.25 +/- 0.81) cycles:u: -7.7986% +/- 0.2329% (-1558985588.99 +/- 46560962.79) instructions:u: -3.7054% +/- 0.0000% (-1581977447.66 +/- 481.68) branch-misses:u: -0.6880% +/- 0.5818% (-689453.22 +/- 583101.50) seconds time elapsed: -15.0198% +/- 0.2170% (-0.75 +/- 0.01) seconds user: -8.8149% +/- 0.2648% (-0.41 +/- 0.01) seconds sys: -94.1247% +/- 1.6444% (-0.34 +/- 0.01) MarkSweep( old space) goes from 6 to 0 ``` Change-Id: I524a21f488da7df5dc9d2cdf40112b84896ad3e0 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/383324 Reviewed-by: Brian Wilkerson <[email protected]> Reviewed-by: Johnni Winther <[email protected]> Commit-Queue: Jens Johansen <[email protected]>

lrhn added Type-Defect web-dart2js labels Apr 9, 2014

iposva-google mentioned this issue Jun 3, 2015

Allow NUL character in Strings #18100

Closed

kevmoo added type-bug Incorrect behavior (everything from a crash to more subtle misbehavior) and removed priority-unassigned labels Feb 29, 2016

vsmenon added the area-web Use area-web for Dart web related issues, including the DDC and dart2js compilers and JS interop. label Jul 20, 2019

lrhn added area-front-end Use area-front-end for front end / CFE / kernel format related issues. front-end-fasta and removed web-dart2js area-web Use area-web for Dart web related issues, including the DDC and dart2js compilers and JS interop. labels Dec 21, 2021

johnniwinther added P3 A lower priority bug or feature request P4 and removed P3 A lower priority bug or feature request labels Jul 4, 2022

johnniwinther assigned jensjoha Jul 4, 2022

copybara-service bot closed this as completed in a086649 Sep 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Neither VM nor dart2js accepts null character in source code. #18090

Neither VM nor dart2js accepts null character in source code. #18090

lrhn commented Apr 8, 2014

gbracha commented Apr 8, 2014

iposva-google commented Apr 8, 2014

iposva-google commented Apr 8, 2014

peter-ahe-google commented Apr 8, 2014

gbracha commented Apr 8, 2014

lrhn commented Apr 9, 2014

rakudrama commented Dec 14, 2021

lrhn commented Dec 14, 2021 •

edited

Loading

johnniwinther commented Dec 14, 2021

lrhn commented Dec 14, 2021

Neither VM nor dart2js accepts null character in source code. #18090

Neither VM nor dart2js accepts null character in source code. #18090

Comments

lrhn commented Apr 8, 2014

gbracha commented Apr 8, 2014

iposva-google commented Apr 8, 2014

iposva-google commented Apr 8, 2014

peter-ahe-google commented Apr 8, 2014

gbracha commented Apr 8, 2014

lrhn commented Apr 9, 2014

rakudrama commented Dec 14, 2021

lrhn commented Dec 14, 2021 • edited Loading

johnniwinther commented Dec 14, 2021

lrhn commented Dec 14, 2021

lrhn commented Dec 14, 2021 •

edited

Loading