Location APIs for characters #1788

Closed
kddnewton opened this issue Nov 8, 2023 · 3 comments · Fixed by #1809
Labels
enhancement New feature or request

Comments

@kddnewton
Collaborator

We store everything as byte offsets from the start of the source. This is correct and not going to change.

However, lots of our consumers are dealing with encoded source strings and would like to be able to use the [] method without having to convert byte offsets into character offsets themselves. (Debatably, these should be called "grapheme_clusters", but I'm going with characters for now because it's shorter.)
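For illustration (plain Ruby on a hypothetical source string, not Prism API), a byte offset cannot be passed to [] directly once the source contains multibyte characters:

source = "π = 1"        # "π" is 2 bytes in UTF-8, so the source is 6 bytes but 5 characters
source.byteslice(5, 1)  # => "1"  byte offset 5, which is what Prism stores
source[5]               # => nil  [] indexes by character, and there is no character 5
source[4]               # => "1"  the character offset is 4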

The APIs that would be nice to provide:

  • Location#start_character_offset
  • Location#end_character_offset
  • Location#start_character_column
  • Location#end_character_column

A simple implementation of these would be:

class Location
  # Number of characters from the start of the source to the start/end of this location.
  def start_character_offset = source.source.byteslice(0, start_offset).length
  def end_character_offset = source.source.byteslice(0, end_offset).length

  # Number of characters from the start of the location's line to the start/end of this location.
  def start_character_column = source.slice(source.line_offset(start_offset), start_column).length
  def end_character_column = source.slice(source.line_offset(end_offset), end_column).length
end

However, this would mean a lot of things being recalculated any time any of these APIs were called. My hypothesis is that if you're calling these APIs once, it's likely you'll call them again. In that case I think we should implement some caching on the Source object mapping byte offsets to character offsets and use that cache in these computations.
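One possible shape for that cache (just a sketch; the character_offset method name on Source is hypothetical, not a committed API):

class Source
  # Lazily maps a byte offset to its character offset and memoizes the result,
  # so repeated lookups for the same offset don't re-slice the source string.
  def character_offset(byte_offset)
    @character_offsets ||= Hash.new do |cache, offset|
      cache[offset] = source.byteslice(0, offset).length
    end
    @character_offsets[byte_offset]
  end
end

class Location
  def start_character_offset = source.character_offset(start_offset)
  def end_character_offset = source.character_offset(end_offset)
end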

@eregon
Member

eregon commented Nov 9, 2023

However, lots of our consumers are dealing with encoded source strings and would like to be able to use the [] method without having to convert byte offsets into character offsets themselves. (Debatably, these should be called "grapheme_clusters", but I'm going with characters for now because it's shorter.)

String#[] doesn't use grapheme clusters but codepoints (equivalent to Ruby's characters).
So character or codepoint in the name is fine if these offsets are meant to be used with String#[].
OTOH I don't think using String#[] is a good idea, since it implies byte->character and character->byte conversions. For that, using String#byteslice is a lot more efficient.
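A small illustration of the difference (plain Ruby, assuming a UTF-8 string):

s = "e\u0301"        # "e" + combining acute accent
s.grapheme_clusters  # => ["é"]  one grapheme cluster
s.length             # => 2      but two codepoints / Ruby characters
s[0]                 # => "e"    String#[] indexes by codepoint, not grapheme cluster
s.byteslice(0, 1)    # => "e"    byteslice takes byte offsets directly, no conversion needed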

I'm not sure about editors and debuggers: do they use byte offsets, codepoint offsets, "UTF-16 code unit" offsets, UTF-16 codepoint offsets, or grapheme cluster offsets? Maybe @vinistock / @joeldrapper know?

From a quick search I found microsoft/language-server-protocol#376, so it sounds like LSP used UTF-16 code units, and it may now be possible to use UTF-8 byte offsets or Unicode codepoints (the same as Ruby characters) per microsoft/language-server-protocol#376 (comment)?

@vinistock
Contributor

For editors, it depends on the editor's implementation. Language servers negotiate the encoding during the initial handshake, but the server can only select an encoding from the list of possible encodings advertised by the client.

For UTF-8 clients, the concept of characters matches exactly what Ruby does: regardless of how many bytes a Unicode character takes, it is still considered length 1.
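For example (plain Ruby, using a precomposed "é"):

"é".length    # => 1  one character, matching a UTF-8 client's position
"é".bytesize  # => 2  even though it takes 2 bytes in UTF-8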

For UTF-16 clients based on NodeJS, the situation is a little different. Based on my investigations with @nirvdrum and the context in microsoft/vscode#20170, NodeJS has an optimization in how it calculates the length of strings. This optimization leads to a weird scenario where Unicode characters often have length 2, but regular characters have length 1. This is a known problem for language servers, where we need to perform offset adjustments if the client is UTF-16 (Rust Analyzer and the Ruby LSP use similar mechanisms for counting the UTF-16 length of the Unicode character). So I guess in this case it would be considered a code unit?
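A quick way to see the mismatch from Ruby (an illustration of the kind of adjustment involved, not the Ruby LSP's actual code):

emoji = "😀"                           # U+1F600, outside the Basic Multilingual Plane
emoji.length                           # => 1  Ruby character / codepoint
emoji.encode("UTF-16LE").bytesize / 2  # => 2  UTF-16 code units, which is what a NodeJS client counts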

it may now be possible to use UTF-8 byte offsets or Unicode codepoints

Unless I misunderstood something, I don't believe this is the case. The client announces the supported encodings and then the server has the opportunity to pick one of them. But VS Code only supports UTF-16 for example.

@eregon
Member

eregon commented Nov 9, 2023

So I guess in this case it would be considered a code unit?

Yes, exactly, that's what UTF-16 code unit means, so a codepoint encoded in UTF-16 can be either 1 or 2 UTF-16 code units.
Another way to say it is that Java/Node.js use UCS-2 and not true UTF-16, since most of their APIs deal in code units instead of codepoints.
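Concretely, a codepoint above U+FFFF encodes as a surrogate pair of two code units (plain Ruby illustration):

"😀".encode("UTF-16BE").unpack("n*")  # => [55357, 56832], i.e. [0xD83D, 0xDE00]: one codepoint, two code units
"a".encode("UTF-16BE").unpack("n*")   # => [97]: one codepoint, one code unit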

Unless I misunderstood something, I don't believe this is the case. The client announces the supported encodings and then the server has the opportunity to pick one of them. But VS Code only supports UTF-16 for example.

Right, client=VS Code and server=ruby-lsp, so ruby-lsp gets to choose, except that currently VS Code as the client only announces UTF-16, which is unfortunate. Rereading this comment, it's clear UTF-16 is a mandatory encoding, so UTF-16 code units need to be handled in the server anyway.

Regarding UTF-16 code units, I think that's outside the scope of Prism and something that makes more sense for the consuming tools to deal with.

For #1770, adding character/codepoint offsets seems convenient (for the case where the other side cannot deal with byte offsets and needs codepoint offsets; otherwise, using byte offsets would be far more efficient, as mentioned in my first comment). codepoint would be the clearer and better-defined term, since character is quite ambiguous; OTOH Ruby's String has each_char and chars (but also each_codepoint and codepoints).
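In Ruby itself the two notions coincide, which is part of why either name could work (a plain-Ruby illustration, not Prism API):

s = "e\u0301π"       # "e" + combining acute accent + "π"
s.chars.length       # => 3
s.codepoints         # => [101, 769, 960]  same count: Ruby characters are codepoints
s.grapheme_clusters  # => ["é", "π"]       grapheme clusters group differently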
