SpanScanner doesn't handle UTF-16 surrogate pairs #4

leonsenft · 2017-05-08T04:18:16Z

The following example demonstrates the issue:

import 'package:string_scanner/string_scanner.dart';

void main() {
  final text = '\u{12345}';
  final scanner = new SpanScanner(text);
  final start = scanner.state;

  while (!scanner.isDone) {
    scanner.readChar();
  }

  print(scanner.spanFrom(start));
}

This code throws:

RangeError: End 2 must not be greater than the number of characters in the file, 1.

I believe the issue is that package:string_scanner operates on code units, whereas package:source_span operates on code points (runes), resulting in this index mismatch for surrogate pairs.
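
The mismatch can be seen without either package. The following is a minimal sketch using only dart:core; the helper names codeUnitCount and codePointCount are hypothetical, introduced here for illustration and not part of string_scanner or source_span. It shows that a character outside the Basic Multilingual Plane is two UTF-16 code units but a single code point, which matches the "End 2" versus "1" in the error above.

```dart
// Hypothetical helper: counts UTF-16 code units, which is what
// String.length reports in Dart.
int codeUnitCount(String s) => s.length;

// Hypothetical helper: counts Unicode code points (runes).
int codePointCount(String s) => s.runes.length;

void main() {
  // U+12345 lies outside the BMP, so it is encoded as a surrogate pair.
  final text = '\u{12345}';

  print(codeUnitCount(text)); // 2
  print(codePointCount(text)); // 1
}
```

Any index computed in one unit and checked against a length measured in the other goes out of range as soon as the input contains a surrogate pair.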

I found a workaround by decoding the string first:

import 'package:source_span/source_span.dart';
import 'package:string_scanner/string_scanner.dart';

void main() {
  final text = '\u{12345}';
  final file = new SourceFile.decoded(text.codeUnits);
  final scanner = new SpanScanner.within(file.span(0));
  final start = scanner.state;

  while (!scanner.isDone) {
    scanner.readChar();
  }

  print(scanner.spanFrom(start));
}

Is the distinction between code units and code points here intentional? If so, is there a better way to ensure proper UTF-16 support? The workaround seems reasonable, although having to provide a span rather than the decoded file itself is clunky. Would adding a SpanScanner constructor that accepts a decoded file or a list of code units be a suitable solution? Or could SpanScanner optionally operate on code points instead of code units?

nex3 · 2017-05-16T21:52:46Z

I think the real issue here is that new SourceFile() should never have operated on code points in the first place. It's contrary to the rest of string handling throughout Dart, and I'd bet that all the code that actually handles spans assumes they refer to code units rather than code points.

leonsenft · 2017-05-16T22:29:33Z

I agree the issue stems from a choice made within SourceFile; however, I still think operating at the code point level is an entirely valid use case. I'll post my reasoning on dart-lang/source_span#16.

leonsenft changed the title ~~SpanScanner doesn't handle u~~ SpanScanner doesn't handle UTF-16 surrogate pairs May 8, 2017

nex3 added a commit to dart-lang/source_span that referenced this issue May 16, 2017

Deprecate the use of runes in SourceFile.

ca109fe

This behavior runs contrary to the rest of Dart's string handling, and in particular breaks string_scanner. See dart-lang/string_scanner#4.

nex3 mentioned this issue May 16, 2017

Deprecate the use of runes in SourceFile. dart-lang/source_span#16

Merged

nex3 added a commit to dart-lang/source_span that referenced this issue May 17, 2017

Deprecate the use of runes in SourceFile. (#16)

a2ad0b8

This behavior runs contrary to the rest of Dart's string handling, and in particular breaks string_scanner. See dart-lang/string_scanner#4.

nex3 added a commit that referenced this issue May 17, 2017

Don't crash on surrogate pairs.

7f9c24f

Closes #4

nex3 mentioned this issue May 17, 2017

Don't crash on surrogate pairs. #5

Merged

nex3 closed this as completed in #5 May 22, 2017

nex3 added a commit that referenced this issue May 22, 2017

Don't crash on surrogate pairs. (#5)

005fab6

Closes #4
