(model) vague definition of character position for text position selector #206
Comments
Related to this discussion: We clearly need to be more explicit in the description. |
I wonder whether the section on normalization in the character model is relevant here (although that text is primarily aimed at matching). I.e., the content is supposed to be transformed into NFC, and the text position would then be understood relative to the result of that transformation. Cc: @r12a @aphillips |
I expect that the i18n WG will discuss this and provide a more formal answer. In the meantime, maybe this can help: The above links make the fundamental point that text pointers should use character boundaries, not bytes. Having said that, because of backwards compatibility requirements, Unicode often allows two canonically equivalent forms, such as U+00E1 LATIN SMALL LETTER A WITH ACUTE vs. U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT. So there are cases where, if you are matching text containing á, you'd want to normalise the representation (usually to a precomposed form) to make the match work. If you are simply pointing to a position in the text, however, I'm not sure that you need to normalise. On the other hand, you may want to take into account the fact that U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT is not something that ought to be split by a selection. For this, you need to consider the text as a series of grapheme clusters. |
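To illustrate the matching point above, here is a minimal Python sketch (not part of the original comment). It assumes NFC as the target form, and the helper name `nfc_equal` is made up for this example.

```python
import unicodedata

precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # U+0061 followed by U+0301 COMBINING ACUTE ACCENT


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw strings differ code point by code point, so a naive match fails.
raw_match = precomposed == decomposed

# After normalization the two canonically equivalent forms compare equal.
normalized_match = nfc_equal(precomposed, decomposed)
```

Here `raw_match` is `False` and `normalized_match` is `True`, which is why matching usually wants normalization even if plain position pointing may not.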
@iherman Transforms such as NFC might actually interfere with the intention of web annotation, since the normalization of the text has the potential to change the number of code points/code units in the text, making it harder to identify the location intended. In addition, normalization might make some selections (in a purposefully de-normalized text) impossible. It is usually best to define offset in terms of Unicode code points. I see @r12a just provided a bunch of reference while typing this, so I'll leave it at that. |
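A small Python sketch of the concern above (added for illustration, with a made-up sample string): normalizing to NFC changes the number of code points, so an offset computed against the original text no longer points at the same place.

```python
import unicodedata

# "café au lait" written with a decomposed é (e + U+0301); sample text is an assumption.
text = "cafe\u0301 au lait"
normalized = unicodedata.normalize("NFC", text)

# Python strings are indexed by code point, so these are code point offsets.
offset_raw = text.index("au")         # offset in the text as stored
offset_nfc = normalized.index("au")   # offset after NFC normalization

# The combining accent was folded into its base letter, shifting everything after it.
shift = offset_raw - offset_nfc
```

With this sample, `offset_raw` is 6 and `offset_nfc` is 5, so a position selector computed on one form would be off by one on the other.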
Discussed on the telco of 2016.05.06: follow the advice of @aphillips and @r12a and use code points. The example in the comment will be reused in the document. |
At this moment, HTML APIs and JavaScript do NOT support code-point-based string indexing at all. What I take from the recommendations, then, is that web annotation clients running in browsers need to walk through the text from the beginning for both indexing and text selection. Is my understanding correct? Here is what I have pointed out in FindText API Issue #4. |
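The "walk through the text from the beginning" approach can be sketched as follows. This is an illustrative Python sketch, not code from any annotation client; the function names are hypothetical, and it assumes the input contains no lone surrogates.

```python
def code_point_to_utf16_offset(text: str, cp_offset: int) -> int:
    """Convert a code point offset to a UTF-16 code unit offset by linear scan."""
    if not 0 <= cp_offset <= len(text):
        raise ValueError("offset out of range")
    units = 0
    for ch in text[:cp_offset]:
        # Code points above U+FFFF need a surrogate pair (two code units).
        units += 2 if ord(ch) > 0xFFFF else 1
    return units


def utf16_to_code_point_offset(text: str, unit_offset: int) -> int:
    """Convert a UTF-16 code unit offset back to a code point offset."""
    units = 0
    for index, ch in enumerate(text):
        if units == unit_offset:
            return index
        units += 2 if ord(ch) > 0xFFFF else 1
        if units > unit_offset:
            raise ValueError("offset falls inside a surrogate pair")
    if units == unit_offset:
        return len(text)
    raise ValueError("offset out of range")


sample = "a\U0001F600b"  # 'a', one non-BMP character, 'b'
units_before_b = code_point_to_utf16_offset(sample, 2)
points_before_b = utf16_to_code_point_offset(sample, 3)
```

In this sample, `b` sits at code point offset 2 but at UTF-16 code unit offset 3. Both conversions are linear in the length of the text, which is the cost the comment above is asking about.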
Discussed it again at the F2F 17.05.16, and kept to the previous resolution. |
I'd like to bring to this issue's attention a recent query against the EPUBCFI specification, which may be used as fragment identifiers within this spec, where it was (tentatively?) decided that character positions be defined in terms of UTF-16 code units. Is this specification in danger of being too difficult to implement atop the current state of deployed ECMAScript and browser DOM APIs (where UTF-16 code units are the lingua franca)? Are changes to browser DOM APIs (for example, allowing Unicode code point indexing within DOM Ranges and others) imminent? |
Just trying to reproduce the discussion at the first F2F meeting: there are JS libraries available to handle code points properly. I.e., although all this may be difficult to implement on top of the current, built-in JS in browsers, it can be implemented nevertheless (note that this is exactly the role of the Candidate Recommendation phase: to check whether the specification is implementable...). The feeling at the meeting was that we had better be forward-looking in this respect. (In environments other than browsers the issue seems easy to handle, because other languages seem to be more advanced in this respect...) |
Just to forward additional information from the IDPF EPUB WG issue list, w3c/epub-specs#555 (comment) may be of interest here (this is the issue @mark-buer referred to concerning EPUBCFI). (Not being an expert in the area just forwarding the information...) |
Interesting, but not a deal breaker for Text * Selector being defined in terms of code units. The equivalent plain text fragment URI spec is defined in terms of code points, so either way we would be at odds with one of them.
|
As @r12a mentioned, it is probably best to address grapheme cluster boundaries instead of character boundaries. If you go down the UTF-16 path, then implementations should use UTF-16 and not UCS-2. It's worth noting the problems that JavaScript traditionally has with characters outside the BMP. |
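For readers unfamiliar with the UTF-16 versus UCS-2 distinction mentioned above, here is a short illustrative Python sketch (helper names are hypothetical). It shows how a character outside the BMP becomes a surrogate pair in UTF-16, which a UCS-2 view of the text would treat as two separate "characters".

```python
def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


clef = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
units = utf16_code_units(clef)
pair_ok = is_high_surrogate(units[0]) and is_low_surrogate(units[1])
```

Here `units` is `[0xD834, 0xDD1E]`: one code point, two code units. An offset that lands between the two units points into the middle of a character.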
I agree that code points is the best. It is a bit of work in JavaScript (without a library), but will make it much easier for everything outside a browser. |
(chair hat off) Working in UTF-16 code units has certain advantages, particularly for JavaScript programmers. Some downsides of defining things in UTF-16 code units should be kept in mind:
On the flip side, a number of other specifications do specify things in terms of UTF-16 and UTF-16 is JavaScript's native encoding internally. It may be that the additional implementation complexity of counting code points turns out not to be worth the overhead. If you do go with code units, be sure that it is clear that this does not extend to code units in various legacy (non-Unicode) character encodings that are still sometimes used for storing resources used on the Web. |
I prefer code units. IIRC Maciej pointed out at the last TPAC that Find Text should be compatible with DOM Range. I can't agree more. I understand that code points look safe, but when a position points to the middle of a grapheme cluster, it looks to me that the problems are similar to when it points to the middle of a code point. Assuming we all want graphemes to be handled properly, I see little benefit in code points. The use of graphemes has benefits, but UAX#29 allows tailoring, which can make the pointer ambiguous. Each spec would then have to require appropriate adjustment. Recently we fixed Range.getClientRects() to handle graphemes correctly. It's true that needing to fix all such cases is troublesome, but code points don't save us from doing it anyway. |
Decision reached with the I18N WG (2016-05-26): on their advice, use code points. The Anno WG accepted this. The I18N WG will provide additional text warning about the possible pitfalls; this will be added as a note in the document. See: http://www.w3.org/2016/05/26-i18n-irc#T15-40-45 and http://www.w3.org/2016/05/26-i18n-irc#T15-50-25 |
Have added this text for now, and will replace with better text provided by i18n when available.
|
This has currently been added to the Text Quote Selector. Shouldn't it be placed in the Text Position Selector instead? |
I put it as part of the normalization text, which is then referenced from Text Position Selector. |
The text position selector spec doesn't define the exact meaning of the term "character position".
There are many possible definitions. "Character position" might be measured in Unicode code points, UTF-8 code units, UTF-16 code units, etc.
Interoperability issues will result if different implementations assume incompatible meanings.
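As a rough illustration of how far these definitions can diverge, the following Python sketch (added here, with an arbitrary sample string and a hypothetical helper name) measures the same text in each of the three units.

```python
def lengths(text: str) -> dict:
    """Measure text in three candidate units for "character position"."""
    return {
        "code_points": len(text),                         # Python str is code point indexed
        "utf8_code_units": len(text.encode("utf-8")),     # one code unit per byte
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,  # two bytes per code unit
    }


# 'a' (ASCII), 'é' (U+00E9, in the BMP), and U+1F600 (outside the BMP)
sample = "a\u00e9\U0001F600"
measured = lengths(sample)
```

For this sample the result is 3 code points, 7 UTF-8 code units, and 4 UTF-16 code units, so an offset such as "position 3" means a different place under each definition.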