note about character counts [I18N-ISSUE-496] #4

aphillips · 2015-10-23T15:58:01Z

http://www.w3.org/International/track/issues/496 [I18N-ISSUE-496]

http://www.w3.org/TR/2015/WD-findtext-20151015/#introduction

In the introduction there is this note:

For character counts in ranges, what exactly would be counted as a character? Unicode code points? Graphemes?

This is an important distinction.

I would suggest that the most commonly needed offset will be in either (a) Unicode code points or (b) UTF-16 code units.

The latter would be best for JavaScript and DOM access, which are based on UTF-16 and thus would allow direct application of APIs such as substring().

The former would be better from a pure API perspective and for computing things such as string length in "characters".

Grapheme clusters can be complex and, while APIs may wish to find grapheme boundaries or to avoid splitting within a grapheme, it is rarely the case that API access should be in these terms. Indeed, in some cases, it may be desirable to find text within a grapheme rather than the entire thing.
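
To make the difference between options (a) and (b) concrete, here is a minimal sketch using only standard String behaviour. The helper names `countCodeUnits` and `countCodePoints` are made up for illustration and are not part of any spec or API; the sample string is the one tested later in this thread.

```javascript
// Sample string: U+20BB7 (outside the BMP), U+91CE, U+5C4B.
const sample = "𠮷野屋";

// (b) UTF-16 code units: this is what String.prototype.length reports
// and what substring() offsets refer to.
function countCodeUnits(text) {
  return text.length;
}

// (a) Unicode code points: the built-in string iterator advances one
// code point at a time, so counting its steps counts code points.
function countCodePoints(text) {
  let count = 0;
  for (const _ of text) count += 1;
  return count;
}

console.log(countCodeUnits(sample));  // 4
console.log(countCodePoints(sample)); // 3

// A code-unit offset can be passed straight to substring():
// the second character starts at code-unit offset 2, not 1.
console.log(sample.substring(2, 3) === "野"); // true
```

With code-point offsets, the same lookup would first need a conversion step to code units before calling substring().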

tkanai · 2015-11-12T02:21:27Z

Indeed, (a) is the appropriate and ideal solution, but I can't find a strong reason to isolate the FindText API from other HTML APIs, especially the Range object.
From that perspective alone, I think we have to take (b). But can we expect the other APIs (or ES7?) to be updated to use code points in the near future? If so, let's align with that movement and take (a).

iherman · 2015-11-12T04:52:26Z

@tkanai you make a good point that we have to align with what current HTML APIs, i.e., current browsers, do. If we get to the point where browsers implement this (or part of it), that alignment becomes crucial. I have no idea (and nobody knows in detail, I guess) where ES7 will go; we can update the spec if that change really occurs.

B.t.w., I wonder whether this note does not also affect the editing distance.

tkanai · 2015-11-12T07:01:15Z

Here are the test results of newly introduced String functions in ES6.

var yoshinoya = "𠮷野屋";
The string consists of three letters. The first letter (U+20BB7) is outside the Unicode BMP, which means it cannot be represented in 16 bits.

var identical = yoshinoya === String.fromCodePoint(0x20BB7, 0x91ce, 0x5c4b) ? "yes" : "no"; /// yes
var identical = yoshinoya === String.fromCharCode(0xd842, 0xdfb7, 0x91ce, 0x5c4b) ? "yes" : "no"; /// yes
I guess fromCodePoint() is a function which splits each arg greater than 0xFFFF in two (a surrogate pair) and passes the results to fromCharCode(). It then generates a String object on a code-unit basis regardless of where the values came from.
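
The surrogate-pair split for a code point above 0xFFFF can be sketched as below. This is only an illustration of the UTF-16 encoding arithmetic, not a claim about how any engine implements fromCodePoint(); `toSurrogatePair` is a hypothetical helper name.

```javascript
// Hypothetical helper: split a supplementary code point
// (U+10000..U+10FFFF) into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xffff || codePoint > 0x10ffff) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x20bb7);
console.log(high.toString(16), low.toString(16)); // d842 dfb7

// The two code units together produce the same string as fromCodePoint().
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x20bb7)); // true
```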

yoshinoya.codePointAt(0).toString(16); /// 20bb7
yoshinoya.charCodeAt(0).toString(16); /// d842
Looks good.

yoshinoya.codePointAt(1).toString(16); /// dfb7 !!!
yoshinoya.charCodeAt(1).toString(16); /// dfb7

Not good. I was expecting codePointAt() to index on a code-point basis, but it appears to still index on a code-unit basis.

Regarding editing distance, I think codePointAt() would work for it, but it calls for custom indexing that shifts the index whenever the obtained code falls in a specific range, such as the low-surrogate range.
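
A rough sketch of such custom indexing, assuming the index is advanced by two code units whenever codePointAt() returns a value above 0xFFFF. The helper names `toCodePoints` and `codeUnitOffset` are hypothetical and only illustrate the idea.

```javascript
// Hypothetical helper: collect code points by stepping codePointAt()
// manually, skipping the low surrogate of each surrogate pair.
function toCodePoints(text) {
  const codePoints = [];
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i);
    codePoints.push(cp);
    i += cp > 0xffff ? 2 : 1; // a supplementary code point spans two code units
  }
  return codePoints;
}

// Hypothetical helper: convert a code-point offset into the code-unit
// offset that codePointAt(), charCodeAt() and substring() expect.
function codeUnitOffset(text, codePointOffset) {
  let units = 0;
  let seen = 0;
  while (seen < codePointOffset && units < text.length) {
    units += text.codePointAt(units) > 0xffff ? 2 : 1;
    seen += 1;
  }
  return units;
}

const shopName = "𠮷野屋";
console.log(toCodePoints(shopName).map((cp) => cp.toString(16))); // [ '20bb7', '91ce', '5c4b' ]
console.log(codeUnitOffset(shopName, 1)); // 2
console.log(shopName.codePointAt(codeUnitOffset(shopName, 1)).toString(16)); // 91ce
```

An edit-distance routine could then compare the arrays returned by `toCodePoints` instead of comparing raw code units.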

aphillips changed the title ~~note about character counts~~ note about character counts [I18N-ISSUE-496] Oct 23, 2015

azaroth42 added the i18n-needs-resolution label Oct 26, 2015

tkanai mentioned this issue May 16, 2016

(model) vague definition of character position for text position selector w3c/web-annotation#206

Closed

r12a mentioned this issue Feb 18, 2020

note about character counts w3c/i18n-activity#102

Closed
