incorrect cursor_pos for completion requests with non-BMP Unicode characters #259
Comments
Hmm, thanks for catching this. (BTW, I heard good things about the talk you gave today; it seems like you never stop coding, even during conferences.)
This is indeed a bug in CodeMirror itself, where |
@minrk, I think you are confusing combining characters with surrogate pairs. JavaScript (hence CodeMirror) uses the UTF-16 encoding to store Unicode strings: each string is stored as an array of 16-bit "code units", so every character is "multi-byte". Originally, people thought 16 bits would be enough to encode all of Unicode, and hence each code unit would equal one character. Unfortunately, people eventually realized that 16 bits was not enough, and newer characters (non-BMP characters) are encoded as two 16-bit code units (a "surrogate pair"). Both string indices and string lengths in JavaScript are measured in 16-bit code units, not characters, and apparently this is what CodeMirror is reporting.
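To make the code-unit versus character distinction concrete, here is a minimal Python 3 sketch. The helper names `utf16_length` and `surrogate_pair` are made up for illustration, and the test character is assumed to be U+1D41A (mathematical bold small a), which is what the example string in this issue appears to contain.

```python
from typing import Tuple


def utf16_length(text: str) -> int:
    """Number of 16-bit code units needed to store `text` in UTF-16.

    This is what JavaScript's `String.length` would report for the same text.
    """
    # utf-16-le emits exactly 2 bytes per code unit and no byte-order mark.
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(ch: str) -> Tuple[int, int]:
    """Split one non-BMP character into its (high, low) surrogate code units."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        raise ValueError("BMP characters are stored as a single code unit")
    offset = cp - 0x10000             # 20-bit value split across the pair
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


# Assumed to match the example in this issue: five copies of U+1D41A.
cell = "\U0001D41A" * 5
print(len(cell))           # code points, as counted by Python 3
print(utf16_length(cell))  # UTF-16 code units, as counted by JavaScript
print([hex(u) for u in surrogate_pair("\U0001D41A")])
```

In Python 3, `len` counts code points, so the two counts differ only when the text contains non-BMP characters.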
(This is one of the reasons why UTF-16 sucks: non-BMP characters are rare enough that bugs like this go undetected for a long time, especially since most programmers don't understand the encoding.)
Fortunately, it is simple to write a conversion routine that calculates character indices from UTF-16 code-unit indices and vice versa. I will do that in IJulia for now as a workaround, and presumably you will want to do something like that in Jupyter as well. (Or in CodeMirror? But since JavaScript "natively" wants to use UTF-16 indices, they may be reluctant to change.)
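A conversion routine of the kind described might look like the sketch below. It is not the IJulia or Jupyter implementation; the function names are hypothetical, and it assumes Python 3, where `str` is indexed by code point. An offset that falls in the middle of a surrogate pair is rounded up to the next character boundary, which is a design choice of this sketch.

```python
def utf16_to_char_offset(text: str, utf16_offset: int) -> int:
    """Convert an offset in UTF-16 code units to an offset in code points."""
    units = 0
    for char_index, ch in enumerate(text):
        if units >= utf16_offset:
            return char_index
        # Non-BMP characters take two code units (a surrogate pair).
        units += 2 if ord(ch) > 0xFFFF else 1
    # Offset is at or past the end of the text.
    return len(text)


def char_to_utf16_offset(text: str, char_offset: int) -> int:
    """Convert an offset in code points to an offset in UTF-16 code units."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:char_offset])


# Five non-BMP characters (assumed U+1D41A), cursor at the end of the cell.
cell = "\U0001D41A" * 5
print(utf16_to_char_offset(cell, 10))  # offset in characters
print(char_to_utf16_offset(cell, 5))   # offset in UTF-16 code units
```

Both functions are linear in the length of the text, which should be acceptable for a single notebook cell.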
Note that Python has a mess here, too: in Python 2, |
@stevengj thanks for clarifying! I think it makes the most sense for the protocol spec to be 'characters', so if CodeMirror is returning indices in UTF-16 units, it ought to be the responsibility of the Jupyter JavaScript to deal with surrogate pairs and turn that into actual character offsets.
Should be fixed by jupyter/notebook#2509 if I understood the spec correctly. I was able to reproduce these errors using the IPython kernel with 𝐚𝐚𝐚𝐚𝐚, which that PR fixes.
5.2 spec is published with this in jupyter-client 5.1.
As discussed in JuliaLang/IJulia.jl#541, it appears that Jupyter is sending an incorrect cursor_pos for completion requests when the string contains non-BMP Unicode characters.
For example, if I try to tab-complete a cell containing 𝐚𝐚𝐚𝐚𝐚, it sends cursor_pos=10 even though the string has just 5 characters.
For these non-BMP characters, the UTF-16 encoding used by JavaScript uses two 16-bit code units per character, rather than one as for most characters. So, maybe JavaScript's Unicode encoding is leaking into the messages (contrary to the spec, which explicitly says that cursor_pos is in "unicode characters").