Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect cursor_pos for completion requests with non-BMP Unicode characters #259

Closed
stevengj opened this issue May 16, 2017 · 9 comments · Fixed by jupyter/notebook#2509
Milestone

Comments

@stevengj
Copy link

stevengj commented May 16, 2017

As discussed in JuliaLang/IJulia.jl#541, it appears that Jupyter is sending an incorrect cursor_pos for completion requests when the string contains non-BMP Unicode characters.

For example, if I try to tab-complete the a cell containing 𝐚𝐚𝐚𝐚𝐚, it sends cursor_pos=10 even though the string has just 5 characters.

For these these non-BMP characters, the UTF-16 encoding used by JavaScript uses two 16-bit code units per character, rather than one as for most characters. So, maybe JavaScript's Unicode encoding is leaking into the messages (contrary to the spec, which explicitly says that cursor_pos is in "unicode characters").

@Carreau Carreau added this to the 5.1 milestone May 16, 2017
@Carreau
Copy link
Member

Carreau commented May 16, 2017

Hum, thanks for catching this.
You would have asked me, I would have said it was bytes, not characters.

(BTW heard good things about the talk you gave today, seem like you never stop to code , even during conferences. )

@minrk
Copy link
Member

minrk commented May 16, 2017

This is indeed a bug in CodeMirror itself, where getCursor() counts 𝐚 as having length 2. Most other multi-byte characters I have tried correctly have length 1 (e.g. é or あ). I think CodeMirror just needs to learn about some more combining characters. From this issue, CodeMirror has to handle all combining characters, etc. itself since the underlying javascript is no help. My guess is this is just a unicode combining-range that CM doesn't know about yet.

@stevengj
Copy link
Author

@minrk, I think you are confusing combining characters with surrogate pairs.

JavaScript (hence CodeMirror) uses the UTF-16 encoding to store Unicode strings: each string is stored as an array of 16-bit "code units", so every character is "multi-byte". Originally, people thought 16 bits would be enough to encode all of Unicode, and hence each code unit would equal one character. Unfortunately, people eventually realized that 16 bits was not enough, and newer characters (non-BMP characters) are encoded as two 16-bit code units (a "surrogate pair").

Both (U+3042) and é (U+00e9) are in the BMP, and hence encode as one 2-byte code unit, whereas 𝐚 (U+01d41a) is a newer addition (non-BMP) and requires two 2-byte code units (a surrogate pair) in UTF-16.

String indices and string lengths in JavaScript are measured in 16-bit code units, not characters, and apparently this is what CodeMirror is reporting.

@stevengj
Copy link
Author

stevengj commented May 16, 2017

(This is one of the reasons why UTF-16 sucks: non-BMP characters are rare enough that bugs like this go undetected for a long time, especially since most programmers don't understand the encoding.)

@stevengj
Copy link
Author

Fortunately, it is simple to write a conversion routine that calculates character indices from UTF-16 code-unit indices and vice versa. I will do that in IJulia for now as a workaround, and presumably you will want to do something like that in Jupyter as well. (Or in CodeMirror? But since JavaScript "natively" wants to use UTF-16 indices, they may be reluctant to change.)

@stevengj
Copy link
Author

Note that Python has a mess here, too: in Python 2, len(u'𝐚𝐚𝐚𝐚𝐚') is 10 (on systems where Python uses UTF-16), while in Python 3.3+ len(u'𝐚𝐚𝐚𝐚𝐚') is 5.

@minrk
Copy link
Member

minrk commented May 19, 2017

@stevengj thanks for clarifying! I think it makes the most sense for the protocol spec to be 'characters', so if CodeMirror indices are returning UTF-16 units, it ought to be the responsibility of the Jupyter javascript to deal with surrogate pairs and turn that into actual character offsets.

@minrk
Copy link
Member

minrk commented May 19, 2017

Should be fixed by jupyter/notebook#2509 if I understood the spec correctly. I was able to reproduce these errors using the IPython kernel with 𝐚𝐚𝐚𝐚𝐚, which that PR fixes.

@minrk
Copy link
Member

minrk commented Jun 22, 2017

5.2 spec is published with this in jupyter-client 5.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants