Location APIs for characters #1788

Closed
kddnewton opened this issue Nov 8, 2023 · 3 comments · Fixed by #1809
Labels
enhancement New feature or request

Comments

@kddnewton
Collaborator

We store everything as byte offsets from the start of the source. This is correct and not going to change.

However, lots of our consumers are dealing with encoded source strings and would like to be able to use the [] method without having to convert byte offsets into character offsets themselves. (Debatably, these should be called "grapheme_clusters", but I'm going with characters for now because it's shorter.)
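For illustration (plain Ruby on a hypothetical source string, not Prism API), a byte offset cannot be passed to [] directly once the source contains multibyte characters:

source = "π = 1"        # "π" is 2 bytes in UTF-8, so the source is 6 bytes but 5 characters
source.byteslice(5, 1)  # => "1"  byte offset 5, which is what Prism stores
source[5]               # => nil  [] indexes by character, and there is no character 5
source[4]               # => "1"  the character offset is 4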

The APIs that would be nice to provide:

  • Location#start_character_offset
  • Location#end_character_offset
  • Location#start_character_column
  • Location#end_character_column

A simple implementation of these would be:

class Location
  # Number of characters from the start of the source to the start/end of this location.
  def start_character_offset = source.source.byteslice(0, start_offset).length
  def end_character_offset = source.source.byteslice(0, end_offset).length

  # Number of characters from the start of the location's line to the start/end of this location.
  def start_character_column = source.slice(source.line_offset(start_offset), start_column).length
  def end_character_column = source.slice(source.line_offset(end_offset), end_column).length
end

However, this would mean a lot of things being recalculated any time any of these APIs were called. My hypothesis is that if you're calling these APIs once, it's likely you'll call them again. In that case I think we should implement some caching on the Source object mapping byte offsets to character offsets and use that cache in these computations.
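One possible shape for that cache (just a sketch; the character_offset method name on Source is hypothetical, not a committed API):

class Source
  # Lazily maps a byte offset to its character offset and memoizes the result,
  # so repeated lookups for the same offset don't re-slice the source string.
  def character_offset(byte_offset)
    @character_offsets ||= Hash.new do |cache, offset|
      cache[offset] = source.byteslice(0, offset).length
    end
    @character_offsets[byte_offset]
  end
end

class Location
  def start_character_offset = source.character_offset(start_offset)
  def end_character_offset = source.character_offset(end_offset)
end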

@eregon
Member

eregon commented Nov 9, 2023

However, lots of our consumers are dealing with encoded source strings and would like to be able to use the [] method without having to convert byte offsets into character offsets themselves. (Debatably, these should be called "grapheme_clusters", but I'm going with characters for now because it's shorter.)

String#[] doesn't use grapheme clusters but codepoints (equivalent to Ruby's characters).
So character or codepoint in the name is fine if these offsets are meant to be used with String#[].
OTOH I don't think using String#[] is a good idea, since it implies byte->character and character->byte conversions. For that, using String#byteslice is a lot more efficient.
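A small illustration of the difference (plain Ruby, assuming a UTF-8 string):

s = "e\u0301"        # "e" + combining acute accent
s.grapheme_clusters  # => ["é"]  one grapheme cluster
s.length             # => 2      but two codepoints / Ruby characters
s[0]                 # => "e"    String#[] indexes by codepoint, not grapheme cluster
s.byteslice(0, 1)    # => "e"    byteslice takes byte offsets directly, no conversion needed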

I'm not sure about editors and debuggers: do they use byte offsets, codepoint offsets, "UTF-16 code unit" offsets, UTF-16 codepoint offsets, or grapheme cluster offsets? Maybe @vinistock / @joeldrapper know?

From a quick search I found microsoft/language-server-protocol#376, so it sounds like LSP used UTF-16 code units, and it may now be possible to use UTF-8 byte offsets or Unicode codepoints (the same as Ruby characters) per microsoft/language-server-protocol#376 (comment)?

@vinistock
Contributor

For editors, it depends on the editor's implementation. Language servers negotiate the encoding during the initial handshake, but the server can only select an encoding from the list of possible encodings advertised by the client.

For UTF-8 clients, the concept of characters matches exactly what Ruby does: regardless of how many bytes a Unicode character takes, it is still considered length 1.
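For example (plain Ruby, using a precomposed "é"):

"é".length    # => 1  one character, matching a UTF-8 client's position
"é".bytesize  # => 2  even though it takes 2 bytes in UTF-8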

For UTF-16 clients based on NodeJS, the situation is a little different. Based on my investigations with @nirvdrum and the context in microsoft/vscode#20170, NodeJS has an optimization in how it calculates the length of strings. This optimization leads to a weird scenario where Unicode characters often have length 2, but regular characters have length 1. This is a known problem for language servers, where we need to perform offset adjustments if the client is UTF-16 (Rust Analyzer and the Ruby LSP use similar mechanisms for counting the UTF-16 length of the Unicode character). So I guess in this case it would be considered a code unit?
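A quick way to see the mismatch from Ruby (an illustration of the kind of adjustment involved, not the Ruby LSP's actual code):

emoji = "😀"                           # U+1F600, outside the Basic Multilingual Plane
emoji.length                           # => 1  Ruby character / codepoint
emoji.encode("UTF-16LE").bytesize / 2  # => 2  UTF-16 code units, which is what a NodeJS client counts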

it may now be possible to use UTF-8 byte offsets or Unicode codepoints

Unless I misunderstood something, I don't believe this is the case. The client announces the supported encodings and then the server has the opportunity to pick one of them. But VS Code only supports UTF-16 for example.

@eregon
Member

eregon commented Nov 9, 2023

So I guess in this case it would be considered a code unit?

Yes, exactly, that's what UTF-16 code unit means, so a codepoint encoded in UTF-16 can be either 1 or 2 UTF-16 code units.
Another way to say it is that Java/Node.js use UCS-2 and not true UTF-16, since most of their APIs deal in code units instead of codepoints.
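Concretely, a codepoint above U+FFFF encodes as a surrogate pair of two code units (plain Ruby illustration):

"😀".encode("UTF-16BE").unpack("n*")  # => [55357, 56832], i.e. [0xD83D, 0xDE00]: one codepoint, two code units
"a".encode("UTF-16BE").unpack("n*")   # => [97]: one codepoint, one code unit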

Unless I misunderstood something, I don't believe this is the case. The client announces the supported encodings and then the server has the opportunity to pick one of them. But VS Code only supports UTF-16 for example.

Right, client=VS Code and server=ruby-lsp, so ruby-lsp gets to choose, except that currently VS Code as the client only announces UTF-16, which is unfortunate. Rereading this comment, it's clear UTF-16 is a mandatory encoding, so UTF-16 code units need to be handled in the server anyway.

Regarding UTF-16 code units, I think that's outside the scope of Prism and something that makes more sense for the consuming tools to deal with.

For #1770, adding character/codepoint offsets seems convenient (for the case where the other side cannot deal with byte offsets and needs codepoint offsets; otherwise, using byte offsets would be far more efficient, as mentioned in my first comment). codepoint would be the clearer and better-defined term, since character is quite ambiguous; OTOH Ruby's String has each_char and chars (but also each_codepoint and codepoints).
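In Ruby itself the two notions coincide, which is part of why either name could work (a plain-Ruby illustration, not Prism API):

s = "e\u0301π"       # "e" + combining acute accent + "π"
s.chars.length       # => 3
s.codepoints         # => [101, 769, 960]  same count: Ruby characters are codepoints
s.grapheme_clusters  # => ["é", "π"]       grapheme clusters group differently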
