ſ(U+017F) and K(U+212A) should not be case-insensitive equivalent to S and K #141

dan42 · 2019-08-06T13:56:46Z

The current behavior is certainly correct from some abstract Unicode Consortium perspective, but from a practical perspective, for programmers using regular expressions, it will usually produce an incorrect and inefficient result.

For example I find the following cases problematic:

str.scan(/[a-z]/i)
Most programmers' notion of "lowercase and uppercase alphabet" does not include U+017F and U+212A.

str.scan(/\w/) != str.scan(/[a-z_\d]/i)
Most programmers would perceive these two regexes to be equivalent, except they are not.

str.scan(/<script/i)
Most programmers' notion of "html script tag" does not include <ſcript

etc.
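The three cases above can be reproduced with a short Ruby sketch. It assumes a Ruby build whose regex engine still behaves as described in this issue; the sample string `str` is made up for illustration.

```ruby
# Hypothetical sample input: two ASCII letters, the two problem
# characters, an underscore and a digit.
long_s = "\u017F" # LATIN SMALL LETTER LONG S
kelvin = "\u212A" # KELVIN SIGN
str = "sk#{long_s}#{kelvin}_1"

# Case 1: /[a-z]/i also picks up the two non-ASCII characters.
letters = str.scan(/[a-z]/i)

# Case 2: \w is ASCII-only by default, so the two scans differ.
word_chars  = str.scan(/\w/)
class_chars = str.scan(/[a-z_\d]/i)

# Case 3: a check for "<script" also accepts the long-s spelling.
script_like = "<#{long_s}cript".match?(/<script/i)

p letters, word_chars, class_chars, script_like
```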

The only people who might want /[a-z]/i to match U+017F are people handling Fraktur and Gaelic languages, and in any case they would use /\p{LC}/ or such.

The only people who might want /[a-z]/i to match U+212A are ???

These two characters are very rare, so in > 99.9% of cases programmers will never encounter a problem. This makes the very tiny number of edge cases all the more tricky. And we must still pay the de-optimization penalty in all cases because these two characters are multibyte.

Related to this, I believe that /(?a:[a-z])/i (ascii subgroup) should not match U+017F and U+212A.
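For comparison, here is a minimal sketch of what the ASCII subgroup does today, next to a class that spells out both cases instead of relying on /i. It assumes the behavior reported in this issue; the variable names are only illustrative.

```ruby
long_s = "\u017F" # LATIN SMALL LETTER LONG S
kelvin = "\u212A" # KELVIN SIGN

# With /i, the (?a:...) group still folds the two characters onto s and k.
ascii_group = /(?a:[a-z])/i
# Listing both cases explicitly avoids case folding altogether.
explicit = /[A-Za-z]/

[long_s, kelvin].each do |ch|
  puts format("U+%04X ascii_group=%s explicit=%s",
              ch.ord, ascii_group.match?(ch), explicit.match?(ch))
end
```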

k-takata · 2019-08-08T17:43:20Z

This is the same behavior as perl:

$ perl -Mutf8 -e 'if ("ſ" =~ /(?a)s/i) {print "match"}'
match

k-takata · 2019-08-09T04:44:23Z

Perl supports /(?aa)/i, but Onigmo doesn't support it (yet).

dan42 · 2019-08-10T01:15:05Z

Hmm, it looks like the onigmo/ruby default mode is like perl's ascii mode? In perl unicode mode /\d/ is the same as /\p{Digit}/ but /\d/a is like ruby /\d/, matching only [0-9]. I apologize, I don't know the delineation between onigmo and ruby very well. But it looks like in ruby, /\w/ behaves like /\w/u but differently from /(?u)\w/. So I think that ruby enables onigmo's ascii mode by default?

It looks like the only difference between modes "a" and "aa" is those two characters. I did a search for all case-insensitive equivalences, and those were the only ones where ascii and non-ascii characters were mixed (see below). If "aa" mode was implemented we could push for ruby to adopt it as default mode. So please pretty please. m(_ _)m

["A", "a"]
["B", "b"]
["C", "c"]
["D", "d"]
["E", "e"]
["F", "f"]
["G", "g"]
["H", "h"]
["I", "i"]
["J", "j"]
["K", "k", "K"]
["L", "l"]
["M", "m"]
["N", "n"]
["O", "o"]
["P", "p"]
["Q", "q"]
["R", "r"]
["S", "s", "ſ"]
["T", "t"]
["U", "u"]
["V", "v"]
["W", "w"]
["X", "x"]
["Y", "y"]
["Z", "z"]
["µ", "Μ", "μ", "൜", "൵", "ർ", "ᵜ", "ᵵ", "ᵼ", "ⵜ", "\u2D75", "\u2D7C", "㵜", "㵵", "㵼", "䵜", "䵵", "䵼", "嵜", "嵵", "嵼", "浜", "浵", "浼", "絜", "絵", "絼", "赜", "赵", "赼", "鵜", "鵵", "鵼", "굜", "굵", "굼", "뵜", "뵵", "뵼", "최", "쵵", "쵼", "", "", "", "ﵜ", "ﵵ", "ﵼ"]
["À", "à"]
["Á", "á"]
["Â", "â"]
["Ã", "ã"]
["Ä", "ä"]
["Å", "å", "Å"]
["Æ", "æ"]
["Ç", "ç"]
etc, all non-ascii
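A search along these lines can be sketched in Ruby. This is not the script used for the list above; it is a smaller variant that only checks the 26 ASCII letters against the Basic Multilingual Plane, and the names `bmp`, `classes` and `mixed` are made up for the example.

```ruby
# Every BMP code point as a one-character UTF-8 string; surrogates
# (U+D800..U+DFFF) are skipped because they are not valid characters.
bmp = (0..0xFFFF).reject { |cp| (0xD800..0xDFFF).cover?(cp) }
                 .map { |cp| [cp].pack("U") }

# For each ASCII letter, collect every character that matches it under /i.
classes = ("a".."z").to_h do |letter|
  [letter, bmp.grep(/[#{letter}]/i)]
end

# Keep only the classes that mix ASCII and non-ASCII characters.
mixed = classes.reject { |_, members| members.all?(&:ascii_only?) }
mixed.each { |letter, members| p [letter, members] }
```

If the engine behaves as reported here, only the classes for k and s should be printed.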

dan42 · 2019-08-10T01:45:19Z

This is more complicated than I expected... in ruby, /[[:alpha:]]/ behaves like /(?u)[[:alpha:]]/ but differently from /(?a)[[:alpha:]]/

So I have no idea what the defaults are anymore.

dan42 · 2019-08-26T19:50:04Z

After reading the code I finally managed to understand that ruby default mode (?d) is a mix between unicode mode (?u) and ascii mode (?a). If mode (?aa) was added to Onigmo it would be possible to switch ruby default mode to a mix of (?u) and (?aa) instead.
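The mix can be seen on two sample characters. The sketch below is only an illustration: the sample characters and the `checks` table are chosen for the example, and the default mode is written as a regex with no mode prefix.

```ruby
arabic_three = "\u0663" # ARABIC-INDIC DIGIT THREE, a non-ASCII digit
e_acute      = "\u00E9" # e with acute accent, a non-ASCII letter

# Expression => sample character to test it against.
checks = {
  "\\d"         => arabic_three,
  "[[:digit:]]" => arabic_three,
  "\\w"         => e_acute,
  "[[:alpha:]]" => e_acute,
}

# Label => inline prefix; the default mode uses no prefix at all.
modes = { "default" => "", "(?u)" => "(?u)", "(?a)" => "(?a)" }

results = checks.to_h do |expr, sample|
  row = modes.to_h { |label, prefix| [label, Regexp.new(prefix + expr).match?(sample)] }
  [expr, row]
end

results.each { |expr, row| puts "#{expr.ljust(12)} #{row}" }
```

The shorthand escapes follow (?a) by default, while the POSIX brackets follow (?u).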

results for (all utf8 from U+0000 to U+FFFF).grep(regexp mode + expr).size

expr         (?d)   (?u)   (?a)   (?aa)  comment
\d           10     370    10
\w           63     50567  63
\s           5      24     5
[[:digit:]]  370    370    10
[[:word:]]   50561  50561  63
[[:alpha:]]  49655  49655  52
[[:blank:]]  18     18     2
[[:space:]]  24     24     5
[A-Za-z]     52     52     52
(?i)[a-z]    54     54     54     52     U+017F and U+212A
a\b          1      1      1
あ\b         1      1      0
st           0      0      0
(?i)st       2      2      2      0      ligatures U+FB05 and U+FB06
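A rough way to regenerate a few rows of this table is sketched below. It is a reconstruction from the description above, not the original script; the default mode is written as an empty prefix, and the larger counts depend on the Unicode version bundled with the Ruby being used, so they may differ from the numbers shown here.

```ruby
# Every BMP code point as a one-character UTF-8 string, skipping the
# surrogate range, which cannot be encoded as valid UTF-8.
bmp = (0..0xFFFF).reject { |cp| (0xD800..0xDFFF).cover?(cp) }
                 .map { |cp| [cp].pack("U") }

modes = ["", "(?u)", "(?a)"] # "" stands for the default mode
exprs = ["\\d", "\\w", "[[:digit:]]", "[A-Za-z]", "(?i)[a-z]"]

# For each expression, count the matching characters in each mode.
counts = exprs.to_h do |expr|
  [expr, modes.map { |mode| bmp.grep(Regexp.new(mode + expr)).size }]
end

counts.each { |expr, row| puts [expr, *row].join("\t") }
```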

tonco-miyazawa mentioned this issue May 23, 2022

An error occurred while using case-insensitive option (?i) . #158

Open
