ſ(U+017F) and K(U+212A) should not be case-insensitive equivalent to S and K #141
Comments
This is the same behavior as perl:
|
Perl supports |
Hmm, it looks like the Onigmo/Ruby default mode is like Perl's ASCII mode? In perl unicode mode It looks like the only difference between modes "a" and "aa" is those two characters. I did a search for all case-insensitive equivalences, and those were the only ones where ASCII and non-ASCII characters were mixed (see below). If "aa" mode were implemented we could push for Ruby to adopt it as the default mode. So please pretty please. m(_ _)m
|
This is more complicated than I expected... in ruby, So I have no idea what the defaults are anymore. |
After reading the code I finally managed to understand that ruby default mode results for (all utf8 from U+0000 to U+FFFF).grep(regexp mode + expr).size
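The counting expression quoted above is pseudocode. A runnable sketch of the same idea might look like the following; the helper names (`bmp_chars`, `count_matches`, `non_ascii_matches`) are hypothetical and not taken from the thread, and the numbers printed for the `/i` case depend on the Ruby/Onigmo version in use.

```ruby
# Every BMP code point as a one-character UTF-8 string.
# Surrogates (U+D800..U+DFFF) are skipped because they do not form valid UTF-8 strings.
def bmp_chars
  (0x0000..0xFFFF)
    .reject { |cp| (0xD800..0xDFFF).cover?(cp) }
    .map { |cp| [cp].pack("U") }
end

# How many single characters does the regexp match?
def count_matches(regexp, chars = bmp_chars)
  chars.grep(regexp).size
end

# Which matched characters lie outside ASCII? Returned as "U+XXXX" labels.
def non_ascii_matches(regexp, chars = bmp_chars)
  chars.grep(regexp)
       .reject(&:ascii_only?)
       .map { |c| format("U+%04X", c.ord) }
end

chars = bmp_chars
puts count_matches(/[a-z]/, chars)    # case-sensitive baseline
puts count_matches(/[a-z]/i, chars)   # case-insensitive count
p non_ascii_matches(/[a-z]/i, chars)  # non-ASCII characters pulled in by /i, if any
```

Comparing the case-insensitive count with the count for an explicit `/[A-Za-z]/` shows how many characters `/i` adds beyond the ASCII alphabet.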
|
The current behavior is certainly correct in some abstract Unicode-consortium perspective, but from a practical perspective, for programmers using regular expressions, it will usually produce an incorrect and inefficient result.
For example I find the following cases problematic:
str.scan(/[a-z]/i)
Most programmers' notion of "lowercase and uppercase alphabet" does not include U+017F and U+212A.
str.scan(/\w/) != str.scan(/[a-z_\d]/i)
Most programmers would perceive these two regexes to be equivalent, except they are not.
str.scan(/<script/i)
Most programmers' notion of "html script tag" does not include <ſcript etc.
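A minimal sketch for trying the three cases above, followed by one possible ASCII-only alternative. The sample string and the helper `ascii_script_tag?` are assumptions made up for illustration, and what the first three lines print depends on the Ruby/Onigmo version.

```ruby
# Sample input containing U+017F (long s) and U+212A (Kelvin sign).
str = "s \u017F K \u212A <\u017Fcript"

# The three patterns from the list above.
p str.scan(/[a-z]/i)
p str.scan(/\w/) == str.scan(/[a-z_\d]/i)
p str.scan(/<script/i)

# Alternative: spell out both cases and drop /i, so no case folding is applied.
ascii_letter = /[A-Za-z]/
ascii_letters = str.scan(ascii_letter)
p ascii_letters # only ASCII letters are returned

# Alternative for literal strings: fold ASCII letters only, then compare.
# String#downcase(:ascii) leaves non-ASCII characters untouched.
def ascii_script_tag?(s)
  s.downcase(:ascii).include?("<script")
end

p ascii_script_tag?("<SCRIPT>")
p ascii_script_tag?("<\u017Fcript>")
```

Both alternatives avoid Unicode case folding entirely, at the cost of writing the character ranges out by hand.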
The only people who might want /[a-z]/i to match U+017F are people handling Fraktur and Gaelic languages, and in any case they would use /\p{LC}/ or such.
The only people who might want /[a-z]/i to match U+212A are ???
These two characters are very rare, so in > 99.9% of cases programmers will never encounter a problem. This makes the very tiny number of edge cases all the more tricky. And we must still pay the de-optimization penalty in all cases because these two characters are multibyte.
Related to this, I believe that /(?a:[a-z])/i (ASCII subgroup) should not match U+017F and U+212A.