incorrect cursor_pos for completion requests with non-BMP Unicode characters #259

stevengj · 2017-05-16T03:36:02Z

As discussed in JuliaLang/IJulia.jl#541, it appears that Jupyter is sending an incorrect cursor_pos for completion requests when the string contains non-BMP Unicode characters.

For example, if I try to tab-complete a cell containing 𝐚𝐚𝐚𝐚𝐚, it sends cursor_pos=10 even though the string has just 5 characters.

For these non-BMP characters, the UTF-16 encoding used by JavaScript uses two 16-bit code units per character, rather than one as for most characters. So, maybe JavaScript's Unicode encoding is leaking into the messages (contrary to the spec, which explicitly says that cursor_pos is in "unicode characters").
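
To make the mismatch concrete, here is a minimal Python sketch. It assumes Python 3.3+, where `len` counts code points, and it uses the UTF-16 encoded length only as a stand-in for what a JavaScript frontend would count; the variable names are illustrative.

```python
# Five copies of U+1D41A (MATHEMATICAL BOLD SMALL A), a non-BMP character.
text = "\U0001D41A" * 5

# Python 3.3+ strings are indexed by code point ("unicode characters").
code_points = len(text)

# Number of 16-bit UTF-16 code units, which is what JavaScript string
# indices and lengths count. Each code unit is 2 bytes.
utf16_units = len(text.encode("utf-16-le")) // 2

print(code_points, utf16_units)
```

The second number matches the cursor_pos=10 reported above, while the first is the character count the spec appears to call for.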


Carreau · 2017-05-16T04:04:41Z

Hum, thanks for catching this.
If you had asked me, I would have said it was bytes, not characters.

(BTW, I heard good things about the talk you gave today; it seems like you never stop coding, even during conferences.)

minrk · 2017-05-16T11:30:54Z

This is indeed a bug in CodeMirror itself, where getCursor() counts 𝐚 as having length 2. Most other multi-byte characters I have tried correctly have length 1 (e.g. é or あ). I think CodeMirror just needs to learn about some more combining characters. From this issue, CodeMirror has to handle all combining characters, etc. itself, since the underlying JavaScript is no help. My guess is that this is just a Unicode combining range that CM doesn't know about yet.

stevengj · 2017-05-16T12:13:44Z

@minrk, I think you are confusing combining characters with surrogate pairs.

JavaScript (hence CodeMirror) uses the UTF-16 encoding to store Unicode strings: each string is stored as an array of 16-bit "code units", so every character is "multi-byte". Originally, people thought 16 bits would be enough to encode all of Unicode, and hence each code unit would equal one character. Unfortunately, people eventually realized that 16 bits was not enough, and newer characters (non-BMP characters) are encoded as two 16-bit code units (a "surrogate pair").

Both あ (U+3042) and é (U+00e9) are in the BMP, and hence encode as one 2-byte code unit, whereas 𝐚 (U+01d41a) is a newer addition (non-BMP) and requires two 2-byte code units (a surrogate pair) in UTF-16.
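
The arithmetic behind a surrogate pair can be sketched in a few lines of Python. The helper name `utf16_code_units` is hypothetical, and the sketch assumes a single-code-point input; it follows the standard UTF-16 encoding rule rather than any particular library's implementation.

```python
from typing import List


def utf16_code_units(ch: str) -> List[int]:
    """Return the UTF-16 code units for a string holding one code point."""
    cp = ord(ch)
    if cp < 0x10000:
        # BMP character: a single 16-bit code unit.
        return [cp]
    # Non-BMP character: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)   # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)  # low (trail) surrogate
    return [high, low]


for ch in ("é", "あ", "\U0001D41A"):
    print("U+%04X" % ord(ch), [hex(u) for u in utf16_code_units(ch)])
```

For 𝐚 this yields the pair 0xD835, 0xDC1A, so the character occupies two positions in a JavaScript string.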

String indices and string lengths in JavaScript are measured in 16-bit code units, not characters, and apparently this is what CodeMirror is reporting.

stevengj · 2017-05-16T12:14:26Z

(This is one of the reasons why UTF-16 sucks: non-BMP characters are rare enough that bugs like this go undetected for a long time, especially since most programmers don't understand the encoding.)

stevengj · 2017-05-16T12:23:00Z

Fortunately, it is simple to write a conversion routine that calculates character indices from UTF-16 code-unit indices and vice versa. I will do that in IJulia for now as a workaround, and presumably you will want to do something like that in Jupyter as well. (Or in CodeMirror? But since JavaScript "natively" wants to use UTF-16 indices, they may be reluctant to change.)
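
A conversion of this kind might look like the following Python sketch. The function names are hypothetical, and this is not the IJulia or Jupyter code; it assumes the input is a Python 3 `str` (indexed by code point) and that an offset landing in the middle of a surrogate pair should round up to the next character.

```python
def utf16_index_to_char_index(text: str, utf16_index: int) -> int:
    """Convert an offset in UTF-16 code units to an offset in code points."""
    units = 0
    for char_index, ch in enumerate(text):
        if units >= utf16_index:
            return char_index
        # Non-BMP code points take two UTF-16 code units, all others one.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


def char_index_to_utf16_index(text: str, char_index: int) -> int:
    """Convert an offset in code points to an offset in UTF-16 code units."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:char_index])


cell = "\U0001D41A" * 5
print(utf16_index_to_char_index(cell, 10))  # cursor at end of the cell
print(char_index_to_utf16_index(cell, 5))
```

Both directions are linear in the length of the text, which should be fine for cell-sized strings.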

stevengj · 2017-05-16T14:58:25Z

Note that Python has a mess here, too: in Python 2, len(u'𝐚𝐚𝐚𝐚𝐚') is 10 (on systems where Python uses UTF-16), while in Python 3.3+ len(u'𝐚𝐚𝐚𝐚𝐚') is 5.

minrk · 2017-05-19T22:36:48Z

@stevengj thanks for clarifying! I think it makes the most sense for the protocol spec to be 'characters', so if CodeMirror is returning indices in UTF-16 code units, it ought to be the responsibility of the Jupyter JavaScript to deal with surrogate pairs and turn them into actual character offsets.

minrk · 2017-05-19T22:51:18Z

Should be fixed by jupyter/notebook#2509 if I understood the spec correctly. I was able to reproduce these errors using the IPython kernel with 𝐚𝐚𝐚𝐚𝐚, which that PR fixes.

minrk · 2017-06-22T12:18:44Z

5.2 spec is published with this in jupyter-client 5.1.

Carreau added this to the 5.1 milestone May 16, 2017

minrk mentioned this issue May 16, 2017

incorrect cursor position for bold math text codemirror/codemirror5#4750

Closed

minrk mentioned this issue May 19, 2017

handle surrogate pairs in character offsets jupyter/notebook#2509

Merged

rgbkrk mentioned this issue May 22, 2017

handle surrogate pairs in character offsets nteract/nteract#1706

Closed

takluyver closed this as completed in jupyter/notebook#2509 May 23, 2017

minrk mentioned this issue May 23, 2017

describe cursor_pos ambiguity and bump protocol to 5.2 #262

Merged

minrk mentioned this issue Jun 8, 2017

Protect against hypothetical future where javascript stops using surrogate pairs jupyter/notebook#2560

Merged

stevengj mentioned this issue Jun 12, 2019

incorrect cursor_pos for non-BMP characters in Jupyter protocol nteract/hydrogen#807

Closed

xzackli mentioned this issue Dec 29, 2020

tab completion of surrogate pairs causes error, possible regression of #2255 jupyterlab/jupyterlab#9524

Open

stevengj mentioned this issue Feb 10, 2023

incorrect cursor_pos for completion requests with non-BMP Unicode characters jupyterlab/jupyterlab#13961

Open
