ECMA 262: \d should only match ASCII digits #64

Open
fdutton opened this issue Apr 30, 2023 · 6 comments

fdutton commented Apr 30, 2023

Given this pattern ^\d$

This should match: 0

And this should not: ߀ (U+07C0, NKO DIGIT ZERO)
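To make the expectation concrete, here is a minimal sketch of the two behaviours. It uses java.util.regex from the JDK rather than joni, purely as an assumption-free way to show the difference between an ASCII-only \d and a Unicode-aware one; the class name DigitClassDemo and its helper methods are hypothetical.

```java
import java.util.regex.Pattern;

final class DigitClassDemo {
    // U+07C0 NKO DIGIT ZERO: a decimal digit in Unicode, but not an ASCII digit.
    static final String NKO_ZERO = "\u07C0";

    // \d in default java.util.regex mode covers only the ASCII digits 0-9.
    static boolean asciiDigit(String s) {
        return Pattern.compile("^\\d$").matcher(s).matches();
    }

    // With UNICODE_CHARACTER_CLASS, \d covers every Unicode decimal digit.
    static boolean unicodeDigit(String s) {
        return Pattern.compile("^\\d$", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("ASCII \\d on 0:        " + asciiDigit("0"));
        System.out.println("ASCII \\d on U+07C0:   " + asciiDigit(NKO_ZERO));
        System.out.println("Unicode \\d on U+07C0: " + unicodeDigit(NKO_ZERO));
    }
}
```

The ASCII-only behaviour is the one ECMA 262 asks for; the Unicode-aware behaviour is what I am seeing with Syntax.ECMAScript.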

fdutton changed the title from "[Question] Is a more recent ECMA 262 syntax supported?" to "ECMA 262: \d should only match ASCII digits" on May 1, 2023

Member

enebo commented May 1, 2023

@fdutton On JRuby we behave as you describe: with our encodings, the pattern does not match ߀ but does match 0. I am guessing you are using joni as a Java library, so perhaps something in how it is configured or called makes it behave this way?

With any extra info we can try to figure out why it works for us and, if it really does work, how we get that result.

Member

enebo commented May 1, 2023

It looks like Ruby (JRuby) explicitly restricts numerics to ASCII only: https://github.com/jruby/joni/blob/master/src/org/joni/Syntax.java#L459

Author

fdutton commented May 1, 2023

I'll write some unit tests, but this is what I am doing to work around the issue.

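import java.nio.charset.StandardCharsets;
import org.jcodings.specific.UTF8Encoding;
import org.joni.Option;
import org.joni.Regex;
import org.joni.Syntax;

// Joni is too liberal on some constructs, so expand the shorthand classes
// to the explicit sets ECMA 262 defines before compiling the pattern.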
String s = regex
    .replace("\\d", "[0-9]")
    .replace("\\D", "[^0-9]")
    .replace("\\w", "[a-zA-Z0-9_]")
    .replace("\\W", "[^a-zA-Z0-9_]")
    .replace("\\s", "[ \\f\\n\\r\\t\\v\\u00a0\\u1680\\u2000-\\u200a\\u2028\\u2029\\u202f\\u205f\\u3000\\ufeff]")
    .replace("\\S", "[^ \\f\\n\\r\\t\\v\\u00a0\\u1680\\u2000-\\u200a\\u2028\\u2029\\u202f\\u205f\\u3000\\ufeff]");

byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
this.pattern = new Regex(bytes, 0, bytes.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
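One caveat with plain String.replace is that it also fires on an escaped backslash followed by d (the pattern text \\d, meaning a literal backslash and then the letter d), and it nests brackets when the shorthand already sits inside a character class. Below is a minimal sketch of a scanning alternative. The class AsciiShorthandRewriter and its rewrite method are hypothetical names, not part of joni. To stay short it only handles \d, \D, \w and \W (the \s and \S sets above could be added the same way), and it leaves a negated shorthand inside a character class untouched, since expanding it would need nested or intersected classes.

```java
final class AsciiShorthandRewriter {
    private AsciiShorthandRewriter() {}

    /** Rewrites \d, \D, \w and \W to explicit ASCII ranges. */
    static String rewrite(String regex) {
        StringBuilder out = new StringBuilder(regex.length() + 16);
        boolean inClass = false; // true between '[' and its closing ']'
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Consume the escape as a pair so "\\\\d" (backslash, backslash, d)
                // is never mistaken for the \d shorthand.
                char next = regex.charAt(i + 1);
                String body = asciiBody(next);
                if (body == null) {
                    // Some other escape such as \\ or \[ : copy it unchanged.
                    out.append(c).append(next);
                } else if (!inClass) {
                    boolean negated = Character.isUpperCase(next);
                    out.append(negated ? "[^" : "[").append(body).append(']');
                } else if (Character.isLowerCase(next)) {
                    // Inside a class, splice the ranges in without extra brackets.
                    out.append(body);
                } else {
                    // Negated shorthand inside a class: left as-is (see lead-in).
                    out.append(c).append(next);
                }
                i += 2;
                continue;
            }
            if (c == '[' && !inClass) {
                inClass = true;
            } else if (c == ']' && inClass) {
                inClass = false;
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** Returns the ASCII ranges for a shorthand letter, or null if it is not one. */
    private static String asciiBody(char shorthand) {
        switch (shorthand) {
            case 'd': case 'D': return "0-9";
            case 'w': case 'W': return "a-zA-Z0-9_";
            default: return null;
        }
    }
}
```

The result of AsciiShorthandRewriter.rewrite(regex) would then be encoded with getBytes(StandardCharsets.UTF_8) and passed to the Regex constructor exactly as in the snippet above.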

Member

enebo commented May 1, 2023

@fdutton I don't know where the oniguruma repo is, but you could check whether the ECMAScript syntax was updated upstream. We tend to look at the Onigmo fork used by C Ruby, but we are pretty far downstream. Perhaps there is a more up-to-date syntax?

Contributor

lopex commented May 1, 2023

@enebo I think we are still on par with regard to regexp functionality. We've been tracking https://github.com/k-takata/Onigmo/graphs/contributors and there's not a lot of activity there. There have been more changes in the MRI codebase lately, though.

Contributor

lopex commented May 1, 2023

There also doesn't seem to be an ECMA syntax in either the Onigmo or the MRI repository.
