Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SpanScanner doesn't handle UTF-16 surrogate pairs #4

Closed
leonsenft opened this issue May 8, 2017 · 2 comments
Closed

SpanScanner doesn't handle UTF-16 surrogate pairs #4

leonsenft opened this issue May 8, 2017 · 2 comments

Comments

@leonsenft
Copy link

leonsenft commented May 8, 2017

The following example demonstrates the issue:

import 'package:string_scanner/string_scanner.dart';

void main() {
  final text = '\u{12345}';
  final scanner = new SpanScanner(text);
  final start = scanner.state;

  while (!scanner.isDone) {
    scanner.readChar();
  }

  print(scanner.spanFrom(start));
}

This code throws:

RangeError: End 2 must not be greater than the number of characters in the file, 1.

I believe the issue is that package:string_scanner operates on code units, whereas package:source_span operates on code points (runes), resulting in this index mismatch for surrogate pairs.

I found a workaround by decoding the string first:

import 'package:source_span/source_span.dart';
import 'package:string_scanner/string_scanner.dart';

void main() {
  final text = '\u{12345}';
  final file = new SourceFile.decoded(text.codeUnits);
  final scanner = new SpanScanner.within(file.span(0));
  final start = scanner.state;

  while (!scanner.isDone) {
    scanner.readChar();
  }

  print(scanner.spanFrom(start));
}

Is the code unit and code point distinction here intentional? If so is there a better way to ensure proper UTF-16 support? The workaround seems reasonable, although having to provide a span rather than the decoded file itself is clunky. Would adding a constructor to SpanScanner which accepts a decoded file or list of code units be a suitable solution? Or could SpanScanner optionally operate on code points instead of code units?

@leonsenft leonsenft changed the title SpanScanner doesn't handle u SpanScanner doesn't handle UTF-16 surrogate pairs May 8, 2017
@nex3
Copy link
Member

nex3 commented May 16, 2017

I think the real issue here is that new SourceFile() should never have operated on code points in the first place. It's contrary to the rest of string handling throughout Dart, and I'd bet that all the code that actually handles spans assumes they refer to code units rather than code points.

nex3 added a commit to dart-lang/source_span that referenced this issue May 16, 2017
This behavior runs contrary to the rest of Dart's string handling, and
in particular breaks string_scanner. See dart-lang/string_scanner#4.
@leonsenft
Copy link
Author

I agree the issue stems from a choice made within SourceFile; however I still think operating at a code point level is an entirely valid use case. I'll post my reasoning on dart-lang/source_span#16.

nex3 added a commit to dart-lang/source_span that referenced this issue May 17, 2017
This behavior runs contrary to the rest of Dart's string handling, and
in particular breaks string_scanner. See dart-lang/string_scanner#4.
nex3 added a commit that referenced this issue May 17, 2017
@nex3 nex3 closed this as completed in #5 May 22, 2017
nex3 added a commit that referenced this issue May 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants