
Clarify measurement unit in CompletionsRequest.column #285

Closed
rillig opened this issue Jun 12, 2022 · 17 comments · Fixed by #320
Assignees
Labels
clarification Protocol clarification
Milestone

Comments

@rillig
Contributor

rillig commented Jun 12, 2022

Is the column measured in:

  • bytes
  • UTF-16 code units
  • UTF-32 code units
  • visible characters (whatever a "character" may be)

Is 'column' even a good word, or should it rather be 'index'?
Is this column 1-based (depending on the initial handshake), or is it always 0-based?

@weinand weinand self-assigned this Jun 22, 2022
@weinand weinand added the clarification Protocol clarification label Jun 22, 2022
@connor4312
Member

connor4312 commented Jun 24, 2022

The base should be determined by the initial handshake.

I believe column numbers, here and throughout the protocol, should be treated as UTF-8 code units. @weinand @roblourens thoughts?

@rillig
Contributor Author

rillig commented Jun 24, 2022

should be treated as UTF-8 code points

There is no such concept as "UTF-8 code points".

There's only a Unicode code point, which is an integer number from the range 0 to 0x10_FFFF. Such a code point has a name and a plethora of other properties. It does not have an encoding though.

UTF-8 is one possible encoding that encodes a single Unicode code point as a sequence of numbers from the range 0 to 0xFF. Each Unicode encoding is based on a code unit, which is:

  • uint8_t for UTF-8
  • uint16_t for UTF-16
  • uint32_t for UTF-32

Mixing the abstract numbers and their encoding doesn't make sense.

Instead, choose a measurement unit from the below list:

  • UTF-32 code units
  • UTF-16 code units (the "natural" storage unit of Java and JavaScript)
  • UTF-8 code units (the "natural" storage unit of Go)
  • Visible columns in a fixed-width font (this one would get very messy)
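To make the difference between these units concrete, here is a small Python sketch (illustrative only, not part of any DAP implementation) computing the offset of b in the string a𐐀b under each encoding. U+10400 (𐐀) takes 4 UTF-8 code units, 2 UTF-16 code units, and 1 UTF-32 code unit:

```python
# Offset of "b" in "a𐐀b" measured in code units of each Unicode encoding.
text = "a\U00010400b"
prefix = text[: text.index("b")]  # everything before "b"

utf8_units = len(prefix.encode("utf-8"))            # 1 + 4 = 5
utf16_units = len(prefix.encode("utf-16-le")) // 2  # 1 + 2 = 3
utf32_units = len(prefix.encode("utf-32-le")) // 4  # 1 + 1 = 2

print(utf8_units, utf16_units, utf32_units)  # 5 3 2
```

So the "column" of b is 5, 3, or 2 depending on the unit chosen, which is exactly the ambiguity this issue asks to resolve.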

@connor4312
Member

Thanks for the correction: code units, not code points.

@weinand
Contributor

weinand commented Jun 24, 2022

@connor4312 we should use the same as LSP.
Encodings are irrelevant here. DAP returns completion proposals as strings; columns denote the individual characters within a string.
Renaming "column" is not an option.

@connor4312
Member

connor4312 commented Jun 24, 2022

On LSP:

Prior to 3.17 the offsets were always based on a UTF-16 string representation. So for a string of the form a𐐀b, the character offset of the character a is 0, the character offset of 𐐀 is 1, and the character offset of b is 3, since 𐐀 is represented using two code units in UTF-16.

Since 3.17 clients and servers can agree on a different string encoding representation (e.g. UTF-8). The client announces its supported encoding via the client capability general.positionEncodings. The value is an array of position encodings the client supports, with decreasing preference (e.g. the encoding at index 0 is the most preferred one).

To stay backwards compatible the only mandatory encoding is UTF-16 represented via the string utf-16. The server can pick one of the encodings offered by the client and signals that encoding back to the client via the initialize result’s property capabilities.positionEncoding. If the string value utf-16 is missing from the client’s capability general.positionEncodings servers can safely assume that the client supports UTF-16. If the server omits the position encoding in its initialize result the encoding defaults to the string value utf-16. Implementation considerations: since the conversion from one encoding into another requires the content of the file / line the conversion is best done where the file is read which is usually on the server side.

I suspect that implementors are doing things differently here. It looks like VS Code is using UTF-16 columns in the REPL at least, as completing on 𐐀 results in a completions request at the 1-based column 3.

Encoding is relevant since, as in this case, an individual character might span more than a single "column" depending on its width and the encoding.
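The VS Code observation above can be reproduced with a short Python sketch (the helper name is invented for illustration): with the caret just after 𐐀, a 1-based UTF-16 column comes out as 3, because U+10400 occupies two UTF-16 code units (a surrogate pair).

```python
# Hypothetical helper: UTF-16 column for a caret sitting after
# `codepoint_index` code points of `text`.
def utf16_column(text: str, codepoint_index: int, columns_start_at_1: bool = True) -> int:
    units = len(text[:codepoint_index].encode("utf-16-le")) // 2
    return units + 1 if columns_start_at_1 else units

print(utf16_column("\U00010400", 1))  # 3, matching the REPL observation above
```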

@puremourning
Contributor

@connor4312 we should use the same as LSP.

Encodings are irrelevant here. DAP returns completion proposals as strings; columns denote the individual characters within a string.

Renaming "column" is not an option.

Doing what LSP does is super controversial. And not ideal.

microsoft/language-server-protocol#376

@roblourens
Member

vscode's implementation would be however the editor describes columns, which I think would be using UTF-16 code units since that's what you get in JS. Maybe we need this in the spec.

@connor4312
Member

If we're okay spec'ing something different than what VS Code currently implements in DAP, I would prefer to normalize on UTF-8 units as a more 'modern' and conceptually simpler standard.

@roblourens
Member

If we're going to change it, we would have to add the same infrastructure for a client/server agreeing on what to use, and that doesn't seem worth it.

@weinand
Contributor

weinand commented Jun 27, 2022

In the context of the completions request it only makes sense that the column argument represents "Unicode code points", since there is no need to address individual bytes or words within the various UTF encodings.

Gedankenexperiment: instead of using a column argument we could use another string prefixText argument to denominate a location within the text. In that case there would be no need to talk about encodings, right?

@rillig
Contributor Author

rillig commented Jun 27, 2022

Gedankenexperiment: instead of using a column argument we could use another string prefixText argument to denominate a location within the text. In that case there would be no need to talk about encodings, right?

That's an elegant approach. For further symmetry, the text could be split into prefixText and suffixText, thereby avoiding repetition of part of the text. If suffixText is expected to be empty in most cases, it can also be made optional.
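As a rough sketch of that idea (the argument names prefixText/suffixText are the hypothetical ones from this thread, not part of DAP):

```python
# Hypothetical shape of completions arguments that replace "column"
# with prefixText/suffixText, splitting the text at the caret.
def make_completions_args(text: str, cursor: int) -> dict:
    args = {"prefixText": text[:cursor]}
    suffix = text[cursor:]
    if suffix:                  # suffixText stays optional when empty
        args["suffixText"] = suffix
    return args

print(make_completions_args("print(x", 7))  # {'prefixText': 'print(x'}
```

Because the location is conveyed as string content rather than as an integer offset, no agreement on a measurement unit would be needed.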

@roblourens
Member

That could help for completions, but every request that talks about columns could still have the confusion, right?

@weinand
Contributor

weinand commented Jun 28, 2022

Of course, the "thought experiment" was not meant as a suggestion to change "column" to "prefixText" (because that would be a breaking change and we would have to introduce new "capabilities").

I tried to explain that if a string prefixText were an acceptable replacement for column, and the equality prefixText == text.substring(0, column) held, then we would know the measurement of column.
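The equality indeed pins down the unit: which prefix text.substring(0, column) yields depends on how column is counted. A small Python sketch (Python slices count code points, while a JS-style substring counts UTF-16 code units):

```python
# For text = "a𐐀b" and column = 3:
#  - counted in code points (Python slicing), the prefix is the whole string;
#  - counted in UTF-16 code units (JS String.prototype.substring), it is "a𐐀",
#    because 𐐀 is two UTF-16 units.
text = "a\U00010400b"
column = 3

prefix_codepoints = text[:column]
prefix_utf16 = text.encode("utf-16-le")[: column * 2].decode("utf-16-le")

print(repr(prefix_codepoints))  # 'a𐐀b'
print(repr(prefix_utf16))       # 'a𐐀'
```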

@puremourning
Contributor

puremourning commented Jun 28, 2022

Long story short, we've discussed this before and it was stated to be UTF-16 code units: I asked about this earlier and @weinand said it's the same as LSP, which uses UTF-16 code units (#91 (comment)). I raised puremourning/vimspector#96 at the time.

Fortunately, I haven't actually implemented vimspector/issue/96 so I'm open to changing it, but obviously I can't speak for other clients.


then we know the measurement of column

Kinda, if and only if we can define what len( string ) means and thus what substring_of( string, 0, column ) means. This is the crux of the challenge: in C, offsets and strlen() work on bytes. IIRC it differs somewhat by implementation language, even for languages that "natively" use Unicode strings. Examples:

  • Java: String.length() returns the number of char values (i.e. code units, I believe UTF-16)
  • Python: len() of str returns number of codepoints (not code units in some encoding)
  • Javascript: String.length returns code units (again, utf-16 ?)

A practical approach is indeed to use the number of codepoints in some specific encoding. As the LSP maintainers have found, this quickly becomes a religious debate. The challenge of specifying it in terms of codepoints is that the whole thing gets complicated when you look at combining marks, grapheme clusters and oh my.

My personal preference is codepoints because my client happens to be written in python, but "utf-8 code units" (byte offset into utf-8 encoded version of the line) seems a popular choice (modulo loud dissenters).
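As a small illustration of the grapheme-cluster caveat mentioned above (a Python sketch; Python's len() counts code points):

```python
# Even "code points" doesn't match "visible characters": a combining mark
# adds a code point (and UTF-8 bytes) without adding a visible column.
cafe = "cafe\u0301"               # "café" written as e + U+0301 COMBINING ACUTE ACCENT

print(len(cafe))                  # 5 code points
print(len(cafe.encode("utf-8")))  # 6 UTF-8 code units (bytes)
# Yet a fixed-width terminal shows only 4 visible characters.
```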

@weinand weinand added this to the July 2022 milestone Jul 12, 2022
@weinand
Contributor

weinand commented Jul 14, 2022

@puremourning, as pointed out by @rillig above, there is no concept of an "encoding specific code point".
Do you mean "code units"?

@puremourning
Contributor

Yes, 'Number of code units in some specific encoding' is what I should have written.

@weinand
Contributor

weinand commented Aug 11, 2022

Since the issue at hand asks for "Clarify measurement unit in CompletionsRequest.column", I first did an assessment of the status quo:

Assessment:

The description of CompletionsRequest.column says:

The character position for which to determine the completion proposals.

Since there is no mention of "encodings" or "code units" anywhere in the DAP, we can assume that the original intent of "character" was a high-level concept like "visible character". In Unicode terminology this would probably translate to "code points" or, more abstractly, to "grapheme clusters".

However, real world DAP implementations (both clients and debug adapters) don't implement this high level concept. Instead many of these DAP implementations use programming languages (JS, TS, C#, Java) where in-memory strings are ordered sequences of 16-bit unsigned integer values (UTF-16 code units). It is therefore natural that these implementations interpret "character positions" as offsets into the sequences of UTF-16 code units.

Please note that UTF-16 is only the in-memory encoding (which is relevant when implementing DAP). Sending DAP JSON over a communication mechanism (or storing it on disk) will most likely use UTF-8 encoding, but the decoding is typically done on a different layer and results in in-memory strings that use the encoding of the underlying programming language's string type (e.g. UTF-16).

Bottom line:

Today the "de-facto" measurement unit of CompletionsRequest.column is UTF-16 code units, and whether it is 0- or 1-based is determined by the initial handshake (i.e. the columnsStartAt1 argument of the initialize request).
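Under this bottom-line reading, a client written in a language whose strings are sequences of code points (e.g. Python) would convert an incoming DAP column back to a string index roughly like this (hypothetical helper, not part of any DAP SDK):

```python
# Map a DAP "column" (UTF-16 code units, base decided by the
# columnsStartAt1 handshake flag) to a Python string index (code points).
def dap_column_to_index(line: str, column: int, columns_start_at_1: bool = True) -> int:
    target = column - 1 if columns_start_at_1 else column  # 0-based UTF-16 offset
    units = 0
    for index, ch in enumerate(line):
        if units >= target:
            return index
        units += 2 if ord(ch) > 0xFFFF else 1  # astral code points are surrogate pairs
    return len(line)

# "a𐐀b": 1-based UTF-16 column 4 points at "b" (a = 1 unit, 𐐀 = 2 units).
print(dap_column_to_index("a\U00010400b", 4))  # 2
```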

Proposal:

The clarification need is not confined to CompletionsRequest.column but applies to the following DAP elements as well:

Events:
	Output
Requests:
	BreakpointLocations
	Completions
	GotoTargets
Types:
	Breakpoint
	BreakpointLocation
	CompletionItem 
	Scope
	SourceBreakpoint
	StackFrame
	StepInTarget

The documentation for these DAP elements will be updated to reflect the "Bottom line" statement from above.

Since the measurement unit of "column" properties was not clearly specified before, it is unclear whether existing clients or debug adapters need to update their implementations. No implementation changes are planned for VS Code and its embedded js-debug.

Whether there is a need for introducing a configurable measurement unit for "column" properties can be discussed in an independent feature request.

What do you think?
