
Clarify measurement unit in CompletionsRequest.column #285

Closed
rillig opened this issue Jun 12, 2022 · 17 comments · Fixed by #320
Assignees
Labels
clarification Protocol clarification
Milestone

Comments

@rillig
Contributor

rillig commented Jun 12, 2022

Is the column measured in:

  • bytes
  • UTF-16 code units
  • UTF-32 code units
  • visible characters (whatever a "character" may be)

Is 'column' even a good word, or should it rather be 'index'?
Is this column 1-based (depending on the initial handshake), or is it always 0-based?

@weinand weinand self-assigned this Jun 22, 2022
@weinand weinand added the clarification Protocol clarification label Jun 22, 2022
@connor4312
Member

connor4312 commented Jun 24, 2022

The base should be determined by the initial handshake.

I believe column numbers, here and throughout the protocol, should be treated as UTF-8 code units. @weinand @roblourens thoughts?

@rillig
Contributor Author

rillig commented Jun 24, 2022

should be treated as UTF-8 code points

There is no such concept as "UTF-8 code points".

There's only a Unicode code point, which is an integer number from the range 0 to 0x10_FFFF. Such a code point has a name and a plethora of other properties. It does not have an encoding though.

UTF-8 is one possible encoding that encodes a single Unicode code point as a sequence of numbers from the range 0 to 0xFF. Each Unicode encoding is based on a code unit, which is:

  • uint8_t for UTF-8
  • uint16_t for UTF-16
  • uint32_t for UTF-32

Mixing the abstract numbers and their encoding doesn't make sense.

Instead, choose a measurement unit from the below list:

  • UTF-32 code units
  • UTF-16 code units (the "natural" storage unit of Java and JavaScript)
  • UTF-8 code units (the "natural" storage unit of Go)
  • Visible columns in a fixed-width font (this one would get very messy)
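To make the difference between these units concrete, here is a small Python sketch (illustrative only, not part of any DAP implementation) computing the offset of b in the string a𐐀b under each encoding. U+10400 (𐐀) takes 4 UTF-8 code units, 2 UTF-16 code units, and 1 UTF-32 code unit:

```python
# Offset of "b" in "a𐐀b" measured in code units of each Unicode encoding.
text = "a\U00010400b"
prefix = text[: text.index("b")]  # everything before "b"

utf8_units = len(prefix.encode("utf-8"))            # 1 + 4 = 5
utf16_units = len(prefix.encode("utf-16-le")) // 2  # 1 + 2 = 3
utf32_units = len(prefix.encode("utf-32-le")) // 4  # 1 + 1 = 2

print(utf8_units, utf16_units, utf32_units)  # 5 3 2
```

So the "column" of b is 5, 3, or 2 depending on the unit chosen, which is exactly the ambiguity this issue asks to resolve.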

@connor4312
Member

Thanks for the correction: code units, not code points.

@weinand
Contributor

weinand commented Jun 24, 2022

@connor4312 we should use the same as LSP.
Encodings are irrelevant here. DAP returns completion proposals as strings; columns denote the individual characters within a string.
Renaming "column" is not an option.

@connor4312
Member

connor4312 commented Jun 24, 2022

On LSP:

Prior to 3.17 the offsets were always based on a UTF-16 string representation. So for a string of the form a𐐀b, the character offset of the character a is 0, the character offset of 𐐀 is 1, and the character offset of b is 3, since 𐐀 is represented using two code units in UTF-16.

Since 3.17 clients and servers can agree on a different string encoding representation (e.g. UTF-8). The client announces its supported encoding via the client capability general.positionEncodings. The value is an array of position encodings the client supports, with decreasing preference (e.g. the encoding at index 0 is the most preferred one).

To stay backwards compatible the only mandatory encoding is UTF-16 represented via the string utf-16. The server can pick one of the encodings offered by the client and signals that encoding back to the client via the initialize result’s property capabilities.positionEncoding. If the string value utf-16 is missing from the client’s capability general.positionEncodings servers can safely assume that the client supports UTF-16. If the server omits the position encoding in its initialize result the encoding defaults to the string value utf-16. Implementation considerations: since the conversion from one encoding into another requires the content of the file / line the conversion is best done where the file is read which is usually on the server side.

I suspect that implementors are doing things differently here. It looks like VS Code is using UTF-16 columns in the REPL at least, as completing on 𐐀 results in a completions request at the 1-based column 3.

Encoding is relevant since, as in this case, an individual character might span more than a single "column" depending on its width and the encoding.
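The VS Code observation above can be reproduced with a short Python sketch (the helper name is invented for illustration): with the caret just after 𐐀, a 1-based UTF-16 column comes out as 3, because U+10400 occupies two UTF-16 code units (a surrogate pair).

```python
# Hypothetical helper: UTF-16 column for a caret sitting after
# `codepoint_index` code points of `text`.
def utf16_column(text: str, codepoint_index: int, columns_start_at_1: bool = True) -> int:
    units = len(text[:codepoint_index].encode("utf-16-le")) // 2
    return units + 1 if columns_start_at_1 else units

print(utf16_column("\U00010400", 1))  # 3, matching the REPL observation above
```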

@puremourning
Contributor

@connor4312 we should use the same as LSP.

Encodings are irrelevant here. DAP returns completion proposals as strings; columns denote the individual characters within a string.

Renaming "column" is not an option.

Doing what LSP does is super controversial. And not ideal.

microsoft/language-server-protocol#376

@roblourens
Member

vscode's implementation would be however the editor describes columns, which I think would be using UTF-16 code units since that's what you get in JS. Maybe we need this in the spec.

@connor4312
Member

If we're okay spec'ing something different than what VS Code currently implements in DAP, I would prefer to normalize on UTF-8 units as a more 'modern' and conceptually simpler standard.

@roblourens
Member

If we're going to change it, we would have to add the same infrastructure for a client/server agreeing on what to use, and that doesn't seem worth it.

@weinand
Contributor

weinand commented Jun 27, 2022

In the context of the completions request it only makes sense that the column argument represents "Unicode code points", since there is no need to address individual bytes or words within the various UTF encodings.

Gedankenexperiment: instead of using a column argument we could use another string prefixText argument to denominate a location within the text. In that case there would be no need to talk about encodings, right?

@rillig
Contributor Author

rillig commented Jun 27, 2022

Gedankenexperiment: instead of using a column argument we could use another string prefixText argument to denominate a location within the text. In that case there would be no need to talk about encodings, right?

That's an elegant approach. For further symmetry, the text could be split into prefixText and suffixText, thereby avoiding repetition of part of the text. If suffixText is expected to be empty in most cases, it can also be made optional.
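As a rough sketch of that idea (the argument names prefixText/suffixText are the hypothetical ones from this thread, not part of DAP):

```python
# Hypothetical shape of completions arguments that replace "column"
# with prefixText/suffixText, splitting the text at the caret.
def make_completions_args(text: str, cursor: int) -> dict:
    args = {"prefixText": text[:cursor]}
    suffix = text[cursor:]
    if suffix:                  # suffixText stays optional when empty
        args["suffixText"] = suffix
    return args

print(make_completions_args("print(x", 7))  # {'prefixText': 'print(x'}
```

Because the location is conveyed as string content rather than as an integer offset, no agreement on a measurement unit would be needed.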

@roblourens
Member

That could help for completions, but every request that talks about columns could still have the confusion, right?

@weinand
Contributor

weinand commented Jun 28, 2022

Of course, the "thought experiment" was not meant as a suggestion to change "column" to "prefixText" (because that would be a breaking change and we would have to introduce new "capabilities").

I tried to explain that if a string prefixText were an acceptable replacement for column, and the equality prefixText == text.substring(0, column) held, then we would know the measurement of column.
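The equality indeed pins down the unit: which prefix text.substring(0, column) yields depends on how column is counted. A small Python sketch (Python slices count code points, while a JS-style substring counts UTF-16 code units):

```python
# For text = "a𐐀b" and column = 3:
#  - counted in code points (Python slicing), the prefix is the whole string;
#  - counted in UTF-16 code units (JS String.prototype.substring), it is "a𐐀",
#    because 𐐀 is two UTF-16 units.
text = "a\U00010400b"
column = 3

prefix_codepoints = text[:column]
prefix_utf16 = text.encode("utf-16-le")[: column * 2].decode("utf-16-le")

print(repr(prefix_codepoints))  # 'a𐐀b'
print(repr(prefix_utf16))       # 'a𐐀'
```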

@puremourning
Contributor

puremourning commented Jun 28, 2022

Long story short, we've discussed this before and it was stated to be UTF-16 code units: I asked about this earlier and @weinand said it's the same as LSP, which uses UTF-16 code units (#91 (comment)). I raised puremourning/vimspector#96 at the time.

Fortunately, I haven't actually implemented vimspector/issue/96 so I'm open to changing it, but obviously I can't speak for other clients.


then we know the measurement of column

Kinda, if and only if we can define what len( string ) means and thus what substring_of( string, 0, column ) means. This is the crux of the challenge: in C, offsets and strlen() work on bytes. IIRC it differs somewhat by implementation language, even for languages that "natively" use Unicode strings. Examples:

  • Java: String.length() returns the number of char values (i.e. code units, I believe UTF-16)
  • Python: len() of str returns number of codepoints (not code units in some encoding)
  • Javascript: String.length returns code units (again, utf-16 ?)

A practical approach is indeed to use the number of codepoints in some specific encoding. As the LSP maintainers have found, this quickly becomes a religious debate. The challenge of specifying it in terms of codepoints is that the whole thing gets complicated when you look at combining marks, grapheme clusters and oh my.

My personal preference is codepoints because my client happens to be written in python, but "utf-8 code units" (byte offset into utf-8 encoded version of the line) seems a popular choice (modulo loud dissenters).
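As a small illustration of the grapheme-cluster caveat mentioned above (a Python sketch; Python's len() counts code points):

```python
# Even "code points" doesn't match "visible characters": a combining mark
# adds a code point (and UTF-8 bytes) without adding a visible column.
cafe = "cafe\u0301"               # "café" written as e + U+0301 COMBINING ACUTE ACCENT

print(len(cafe))                  # 5 code points
print(len(cafe.encode("utf-8")))  # 6 UTF-8 code units (bytes)
# Yet a fixed-width terminal shows only 4 visible characters.
```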

@weinand weinand added this to the July 2022 milestone Jul 12, 2022
@weinand
Contributor

weinand commented Jul 14, 2022

@puremourning, as pointed out by @rillig above, there is no concept of an "encoding specific code point".
Do you mean "code units"?

@puremourning
Contributor

Yes, 'Number of code units in some specific encoding' is what I should have written.

@weinand
Contributor

weinand commented Aug 11, 2022

Since the issue at hand asks for "Clarify measurement unit in CompletionsRequest.column", I first did an assessment of the status quo:

Assessment:

The description of CompletionsRequest.column says:

The character position for which to determine the completion proposals.

Since there is no mention of "encodings" or "code units" anywhere in the DAP, we can assume that the original intent of "character" was a high-level concept like "visible character". In Unicode terminology this would probably translate to "code points" or, more abstractly, to "grapheme clusters".

However, real world DAP implementations (both clients and debug adapters) don't implement this high level concept. Instead many of these DAP implementations use programming languages (JS, TS, C#, Java) where in-memory strings are ordered sequences of 16-bit unsigned integer values (UTF-16 code units). It is therefore natural that these implementations interpret "character positions" as offsets into the sequences of UTF-16 code units.

Please note that UTF-16 is only the in-memory encoding (which is relevant when implementing DAP). Sending DAP JSON over a communication mechanism (or storing it on disk) will most likely use UTF-8 encoding, but the decoding is typically done on a different layer and results in in-memory strings that use the encoding of the underlying programming language's string type (e.g. UTF-16).

Bottom line:

Today the "de-facto" measurement unit of CompletionsRequest.column is UTF-16 code units, and whether it is 0- or 1-based is determined by the initial handshake (i.e. the columnsStartAt1 argument of the initialize request).
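Under this bottom-line reading, a client written in a language whose strings are sequences of code points (e.g. Python) would convert an incoming DAP column back to a string index roughly like this (hypothetical helper, not part of any DAP SDK):

```python
# Map a DAP "column" (UTF-16 code units, base decided by the
# columnsStartAt1 handshake flag) to a Python string index (code points).
def dap_column_to_index(line: str, column: int, columns_start_at_1: bool = True) -> int:
    target = column - 1 if columns_start_at_1 else column  # 0-based UTF-16 offset
    units = 0
    for index, ch in enumerate(line):
        if units >= target:
            return index
        units += 2 if ord(ch) > 0xFFFF else 1  # astral code points are surrogate pairs
    return len(line)

# "a𐐀b": 1-based UTF-16 column 4 points at "b" (a = 1 unit, 𐐀 = 2 units).
print(dap_column_to_index("a\U00010400b", 4))  # 2
```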

Proposal:

The clarification need is not confined to CompletionsRequest.column but applies to the following DAP elements as well:

Events:
	Output
Requests:
	BreakpointLocations
	Completions
	GotoTargets
Types:
	Breakpoint
	BreakpointLocation
	CompletionItem 
	Scope
	SourceBreakpoint
	StackFrame
	StepInTarget

The documentation for these DAP elements will be updated to reflect the "Bottom line" statement from above.

Since the measurement unit of "column" properties was not clearly specified before, it is unclear whether existing clients or debug adapters need to update their implementations. No implementation changes are planned for VS Code and its embedded js-debug.

Whether there is a need for introducing a configurable measurement unit for "column" properties can be discussed in an independent feature request.

What do you think?
