How to implement LSP for a multi-language kernel (SoS)? #282

Open
BoPeng opened this issue Jun 14, 2020 · 19 comments

Comments

@BoPeng

BoPeng commented Jun 14, 2020

Many thanks for your great work on language server support. I have just tried jupyterlab-lsp, which works great for Python and R, but unfortunately does not work for SoS, a multi-language kernel that I have developed.

The idea behind SoS is that it is a superkernel that sits between the frontend and other kernels (see this illustration for details). It allows the use of multiple kernels in one notebook (through sos-notebook for classic Jupyter and jupyterlab-sos for JupyterLab), and allows data exchange among live kernels.

The reason why jupyterlab-lsp does not work with SoS is simple: it does not know what language SoS is. To solve this problem, there needs to be some way for SoS to notify jupyterlab-lsp of the language used in each cell. I can work on both the frontend and the backend (e.g. write a language server for SoS), but I am not sure whether cell-level language support is possible at all with jupyterlab-lsp. I would appreciate any insight from the developers on whether and how this can be done. Thanks.

@bollwyvl
Collaborator

Yeah, the Language Server Protocol Specification doesn't say anything about multi-language documents, so we're kinda shooting in the dark here. Further, basically 0 language servers care about Jupyter's JSON format, or any special syntax kernel authors have added on top of their host language(s).

Our detection is currently based on existing Jupyter approaches like file extension sniffing, contents manager introspection, or in the case of notebooks, the kernel and/or notebook metadata. If everything is just sos, we can't do much for you, nor do we offer many hooks into this system at this point.

Presently, we do handle a small number of transclusions on the front-end, on a kernel-by-kernel basis, and it's rather deeply embedded inside the code. #191 discusses some approaches on how to normalize this, as a set of regular expressions + templates, or maybe some portable grammar and declarative transformation rules. If that's adopted, whether it's handled on the server side or the client side, there will be some hooks to extend it, ideally without having to rebuild the client (ha).

In a related effort, #268 (rough draft of an implementation on #278) suggests changing jupyter_lsp into a kernel, which handles all the management of language servers. If that approach is adopted, and your kernel supports kernel comms, you might be able to reuse the machinery there and offer your own solution... while that PoC presently treats the language server kernel as a singleton, it's important to me to not inject more "our way or the highway" pieces into the architecture: even for a single language server kernel implementation, it is important to be able to launch multiple instances that handle different documents, again without restarting your whole system.

However, as you've created a multi-language kernel with a special syntax, you've basically created a new language, which is certainly not unique: allthekernels, pidgy, metakernel are all in the same boat, protocol-and-bits-on-disk-wise. In all these cases, you might end up having to create a multi-language Language Server. There are a number of toolkits for different languages for doing so, e.g. pygls or vscode-languageserver-node, which might then in turn have to handle spinning up other language servers, as you really don't want to be writing all these things yourself. Costs aside, an investment in writing a Language Server can pay dividends through usability in any Language Server client.

Finally, there are also a number of upstream discussions occurring around this that may be worth your time to peruse:

@krassowski
Member

krassowski commented Jun 14, 2020

Just a super fast thought from me: we may want to support this case, and it would be super easy if we settle on per-cell language definitions, but it requires a longer discussion and a consensus in the wider Jupyter community.

Will elaborate next weekend

@rezaeir

rezaeir commented Jun 14, 2020

@krassowski I think if lsp had this per-cell language definition it could work with SoS without much work on the SoS side because SoS kernel in a cell could be treated as another python kernel and its other functions don't have much of an overlap with lsp functionality. Am I wrong? @BoPeng

@BoPeng
Author

BoPeng commented Jun 14, 2020

The problem could potentially be solved at the backend or frontend level.

If I am to implement an sos-language-server, it will of course try to start and use other language servers and act as a proxy. However, the language server protocol might not allow the passing of meta information to the server, so the sos language server might not be able to know the language of the content being passed. Hopefully the situation is not as bad as @bollwyvl said, "shooting in the dark".

It appears easier and cleaner to implement this at the frontend level, since jupyterlab-lsp is designed to work with multiple language servers anyway. It should be good enough for jupyterlab-lsp to know which language server to talk to at the cell level. SoS currently has some customized messages for changing the cell-level kernel (e.g. https://github.com/vatlab/jupyterlab-sos/blob/master/src/index.ts#L497), so it could be quite trivial, as @krassowski pointed out, if jupyterlab-lsp provides a hook/API for jupyterlab-sos to dynamically change the language of the kernel. I can work on a PR if this is allowed by the architecture and acceptable to the team.
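To make the kind of hook being asked for concrete, here is a minimal sketch of a per-cell language registry. Every name in it (CellLanguageRegistry, setLanguage, getLanguage, onLanguageChanged) is hypothetical; nothing like this is confirmed to exist in jupyterlab-lsp or jupyterlab-sos. The idea is only that a kernel extension reports a language per cell and the LSP client looks it up.

```typescript
// Hypothetical sketch: these names are assumptions, not an existing API.
type LanguageListener = (cellId: string, language: string) => void;

class CellLanguageRegistry {
  private languages = new Map<string, string>();
  private listeners: LanguageListener[] = [];

  constructor(private defaultLanguage: string) {}

  /** Called by a kernel extension when a cell switches to another kernel. */
  setLanguage(cellId: string, language: string): void {
    if (this.languages.get(cellId) === language) {
      return; // nothing changed, so nobody is notified
    }
    this.languages.set(cellId, language);
    for (const listener of this.listeners) {
      listener(cellId, language);
    }
  }

  /** Called by the LSP client to pick the language server for a cell. */
  getLanguage(cellId: string): string {
    return this.languages.get(cellId) ?? this.defaultLanguage;
  }

  /** Lets the LSP client react, e.g. by moving the cell to another document. */
  onLanguageChanged(listener: LanguageListener): void {
    this.listeners.push(listener);
  }
}

const registry = new CellLanguageRegistry('python');
const changes: string[] = [];
registry.onLanguageChanged((cellId, language) => {
  changes.push(`${cellId}:${language}`);
});
registry.setLanguage('cell-1', 'r');
registry.setLanguage('cell-1', 'r'); // duplicate report, ignored
```

Cells that never report a language fall back to the default, so single-language notebooks would be unaffected by such a hook.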

@bollwyvl
Collaborator

However, the language server protocol might not allow the passing of meta information to the server,

lsp had this per-cell language definition

language server protocol might not allow the passing of meta information to the server

I wouldn't hold your breath trying to get changes into LSP! I may be very mistaken, but you'd have to make the case in such pitches very strongly that it would benefit microsoft and vscode pretty directly, and probably land some reference implementation there.

per-cell language definitions

While useful, this doesn't solve the larger problem of per-token transclusions, e.g. line magics, or query languages embedded in strings (#197). Further, this would probably require a breaking change to nbformat, and probably the jupyter kernel messaging protocol, neither of which like to be changed much.

so the sos language server might not be able to know the language of the content being passed

Assuming your files-on-disk can be statically analyzed by sos-language-server: the way it would work for a "pure" language server today:

  • user installs jupyter-lsp and sos-language-server
    • sos-language-server registers itself for whatever file extensions, mime types, and codemirror modes you created for the language sos:
      • we support traitlets (e.g. jupyter_notebook_config.json) and setuptools entry_points
  • jupyter-lsp would advertise the sos spec on its REST API
  • when a sos kernel session gets started, jupyterlab-lsp, finding the sos declaration, would open a new websocket for sos, to be used for all sos documents
  • jupyterlab-lsp would start the LSP session with initialize
    • jupyter-lsp would proxy this and all messages verbatim to sos-language-server
    • sos-language-server
  • jupyterlab-lsp would finish setup with workspace/didChangeConfiguration (Language server configuration using the Advanced Settings Editor #245), textDocument/didOpen, etc.
    • sos-language-server would:
      • parse the sos syntax (with access to the whole file)
      • determine which actual language server should be started/configured
        • hopefully being able to reuse the configuration machinery from jupyter-lsp
      • start and send initialize to each of those languages
      • wait for all of those to start up
      • transform the messages coming back, potentially sending them over the WebSocket
        • one of the first messages received is usually textDocument/publishDiagnostics
      • finally, merge all those responses, and send them back
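The parse, delegate, and merge steps above can be sketched in a few lines. This is only an illustration under loud assumptions: the `#!lang <name>` marker line is made up for the example and is not SoS syntax, and the delegates are plain synchronous functions standing in for real language servers, which would be spoken to over LSP.

```typescript
interface Diagnostic { line: number; message: string; source: string; }
interface Fragment { language: string; startLine: number; lines: string[]; }
// Stand-in for a delegate language server (assumption: synchronous, diagnostics only).
type Delegate = (code: string) => Diagnostic[];

/** Split a document into per-language fragments at made-up `#!lang <name>` marker lines. */
function splitByLanguage(text: string, defaultLanguage: string): Fragment[] {
  const fragments: Fragment[] = [];
  let current: Fragment = { language: defaultLanguage, startLine: 0, lines: [] };
  text.split('\n').forEach((line, index) => {
    const match = /^#!lang (\w+)$/.exec(line);
    if (match) {
      if (current.lines.length > 0) fragments.push(current);
      current = { language: match[1], startLine: index + 1, lines: [] };
    } else {
      current.lines.push(line);
    }
  });
  if (current.lines.length > 0) fragments.push(current);
  return fragments;
}

/** Send each fragment to its delegate and shift results back to document lines. */
function mergeDiagnostics(
  text: string,
  defaultLanguage: string,
  delegates: Map<string, Delegate>
): Diagnostic[] {
  const merged: Diagnostic[] = [];
  for (const fragment of splitByLanguage(text, defaultLanguage)) {
    const delegate = delegates.get(fragment.language);
    if (!delegate) continue; // no server known for this language
    for (const diagnostic of delegate(fragment.lines.join('\n'))) {
      merged.push({ ...diagnostic, line: diagnostic.line + fragment.startLine });
    }
  }
  return merged.sort((a, b) => a.line - b.line);
}

// Fake "server": flags every line that contains TODO.
const flagTodo = (source: string): Delegate => code => {
  const found: Diagnostic[] = [];
  code.split('\n').forEach((line, index) => {
    if (line.includes('TODO')) found.push({ line: index, message: 'TODO found', source });
  });
  return found;
};

const delegates = new Map<string, Delegate>([
  ['python', flagTodo('fake-python')],
  ['r', flagTodo('fake-r')]
]);
const documentText = ['x = 1', '#!lang r', 'y <- 2', 'z <- 3 # TODO'].join('\n');
const merged = mergeDiagnostics(documentText, 'python', delegates);
```

The line shift is the part a real proxy cannot skip: each delegate only sees its own fragment, so every position in a response has to be translated back before it is returned to the client.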

appears easier, and cleaner to implement this at the frontend level

That's your call: as an extension to an extension to a client, the stuff would "only" work with jupyterlab-lsp, and only with the version of jupyterlab we support, and therefore would need to be upgraded in pretty tight lockstep with the Lab version. No doubt you could write your stuff in such a way that the "guts" could be used in another client.

dynamically change the language of the kernel. I can work on a PR if this is allowed by the architecture, and acceptable to the team.

As I mentioned, have a look at #191. If, instead of requiring hacking a bunch of typescript (which, yes, we should of course allow, expose, and dogfood to implement any of the below), sos could do one or more of:

  • Put in A Folder some schema-constrained JSON, some nunjucks templates, or a portable grammar which gets exposed by jupyter_lsp
  • propose and implement a Kernel comm target, e.g. jupyter.lsp.transclusions which sos can use

...to mostly-statically describe "ways to transform code and into what language". The kernel-based approach could potentially offer said code transformation dynamically. This would support these concepts in a way that jupyterlab-lsp would only be a reference implementation, not the only implementation.
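As a rough illustration of what "regular expressions + templates" could mean, the sketch below applies a JSON-style rule to a piece of host code. The rule shape (language, pattern, hostTemplate) and the `run_r` placeholder are assumptions invented for this example; they are not the schema discussed in #191, nor an existing jupyterlab-lsp format.

```typescript
// Hypothetical rule shape: plain data, so it could live in a JSON file.
interface TransclusionRule {
  language: string;     // language of the extracted code
  pattern: string;      // regular expression source with one capture group
  hostTemplate: string; // what remains in the host code; "$1" is the capture
}

interface ForeignFragment { language: string; code: string; }

/** Apply each rule in turn, collecting foreign code and rewriting the host code. */
function applyRules(
  code: string,
  rules: TransclusionRule[]
): { hostCode: string; foreign: ForeignFragment[] } {
  const foreign: ForeignFragment[] = [];
  let hostCode = code;
  for (const rule of rules) {
    const regex = new RegExp(rule.pattern, 'gm');
    hostCode = hostCode.replace(regex, (_whole: string, captured: string) => {
      foreign.push({ language: rule.language, code: captured });
      // split/join avoids special "$" handling in String.prototype.replace
      return rule.hostTemplate.split('$1').join(captured);
    });
  }
  return { hostCode, foreign };
}

// Example rule for an `%R <code>` line magic; `run_r` is a made-up placeholder.
const rules: TransclusionRule[] = [
  { language: 'r', pattern: '^%R (.*)$', hostTemplate: "run_r('$1')" }
];

const result = applyRules('x = 1\n%R y <- x + 1\nprint(x)', rules);
// result.hostCode is "x = 1\nrun_r('y <- x + 1')\nprint(x)"
```

Because the rules are plain data, a kernel package could ship them without any TypeScript, which is the point of the declarative option.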

@BoPeng
Author

BoPeng commented Jun 14, 2020

@bollwyvl Thanks for all the info. Let me dive into language server (protocol and implementation) and source code of jupyterlab-lsp before getting back to you.

@krassowski
Member

@BoPeng just wanted to let you know that I worked hard on restructuring the source code to make it more pleasant to look at. Also potentially of interest to you is the improved cell-level syntax highlighting that we added here: https://github.com/krassowski/jupyterlab-lsp/pull/319. Please let us know if you are still interested in working on bridging SoS with jupyterlab-lsp - we are always happy to help!

@BoPeng
Author

BoPeng commented Sep 11, 2020

Yes, this is on my TODO list, even relatively high, but I am swamped with other obligations (covid related projects, not surprisingly) and have not been able to work on this.

@BoPeng
Author

BoPeng commented Dec 14, 2020

I had another look at the problem, and it is likely that an sos language server, as @bollwyvl suggested, is the best way to proceed. It would be a larger project than my current bandwidth allows, so it will take a while before sos users can make use of language servers.

@krassowski
Member

Okay, instead of creating sos-language-server, why don't we just use per-cell language-server as we already do with cell magics for IPython? This should be simple to implement.

@westurner

[...] why don't we just use per-cell language-server as we already do with cell magics for IPython? This should be simple to implement.

Are there any obstacles?

@westurner

westurner commented Sep 23, 2021

jupyterlab/debugger could/should/must also support multi-language notebooks. Are there similarities in implementation of the multi-language abstractions for LSP and for jupyterlab/debugger DAP support?

@BoPeng
Author

BoPeng commented Sep 23, 2021

Okay, instead of creating sos-language-server, why don't we just use per-cell language-server as we already do with cell magics for IPython? This should be simple to implement.

That will make things much easier for SoS. SoS currently uses kernel metadata to specify the kernel of each cell, but I am willing to change that to whatever will be used by jupyterlab-lsp.

BTW, congratulations on the merge of jupyter/enhancement-proposals#72 !

@denvesi

denvesi commented Sep 24, 2021

Okay, instead of creating sos-language-server, why don't we just use per-cell language-server as we already do with cell magics for IPython? This should be simple to implement.

@krassowski I would be interested in implementing this. I am a student currently writing my master's thesis, and the project I am working on would benefit from supporting language servers. Unfortunately, the current state of the LSP plugin (if I understand it correctly) doesn't fit our use case, because we use multiple languages in one notebook. Per-cell language servers would solve this issue, so I would like to contribute. I am not the most experienced developer, though, and I need to get a bit more familiar with the existing code, so a little guidance, or at least a general idea of how to solve this, would be very much appreciated. :)

@krassowski
Member

You are very welcome to work on it. I will be available to help and guide you if you run into any problems, though I may have a longer response time than usual, as the next two weeks are very busy for me. I will try to write up something with references to the code over the weekend.

@denvesi

denvesi commented Sep 26, 2021

Thanks! That sounds great! It may take some time, because I am just at the beginning of my thesis, but I will try my best. Some references would be very helpful indeed.

@denvesi

denvesi commented Oct 27, 2021

You are very welcome to do work on it. I will be available to help and guide you if you run into any problems, though I may have longer response time than usual as next two weeks are very busy for me. I will try write up something with references to the code over the weekend.

@krassowski Just a little update: I am still busy with some other parts of my thesis, but I'll have time to work on this issue soon. I know you're busy and I don't want to bother you, but I would really appreciate it if you could write a little guidance regarding the code and a general idea for solving the problem. That would help me a lot. Thanks in advance!

@krassowski
Member

Very quickly: at the relevant implementation level, each cell (and each file editor, but that is not relevant here) is represented by ICodeBlockOptions:

export interface ICodeBlockOptions {
  ce_editor: CodeEditor.IEditor;
  value: string;
}

Code blocks are appended one by one by VirtualDocument.append_code_block():

append_code_block(
  block: ICodeBlockOptions,
  editor_shift: CodeEditor.IPosition = { line: 0, column: 0 },
  virtual_shift?: CodeEditor.IPosition
) {
  let cell_code = block.value;
  let ce_editor = block.ce_editor;
  if (this.isDisposed) {
    console.warn('Cannot append code block: document disposed');
    return;
  }
  let source_cell_lines = cell_code.split('\n');
  let { lines, foreign_document_map, skip_inspect } = this.prepare_code_block(
    block,
    editor_shift
  );
  for (let i = 0; i < lines.length; i++) {
    this.virtual_lines.set(this.last_virtual_line + i, {
      skip_inspect: skip_inspect[i],
      editor: ce_editor,
      // TODO this is incorrect, wont work if something was extracted
      source_line: this.last_source_line + i
    });
  }
  for (let i = 0; i < source_cell_lines.length; i++) {
    this.source_lines.set(this.last_source_line + i, {
      editor_line: i,
      editor_shift: {
        line: editor_shift.line - (virtual_shift?.line || 0),
        column:
          i === 0 ? editor_shift.column - (virtual_shift?.column || 0) : 0
      },
      // TODO: move those to a new abstraction layer (DocumentBlock class)
      editor: ce_editor,
      foreign_documents_map: foreign_document_map,
      // TODO this is incorrect, wont work if something was extracted
      virtual_line: this.last_virtual_line + i
    });
  }
  this.last_virtual_line += lines.length;
  // one empty line is necessary to separate code blocks, next 'n' lines are to silence linters;
  // the final cell does not get the additional lines (thanks to the use of join, see below)
  this.line_blocks.push(lines.join('\n') + '\n');
  // adding the virtual lines for the blank lines
  for (let i = 0; i < this.blank_lines_between_cells; i++) {
    this.virtual_lines.set(this.last_virtual_line + i, {
      skip_inspect: [this.id_path],
      editor: ce_editor,
      source_line: null
    });
  }
  this.last_virtual_line += this.blank_lines_between_cells;
  this.last_source_line += source_cell_lines.length;
}

append_code_block calls VirtualDocument.prepare_code_block to extract fragments of code (which may be in different languages). The extraction itself is implemented in VirtualDocument.extract_foreign_code, which appends the foreign code to the appropriate foreign virtual document:

extract_foreign_code(
  block: ICodeBlockOptions,
  editor_shift: CodeEditor.IPosition
) {
  let foreign_document_map = new Map<
    CodeEditor.IRange,
    IVirtualDocumentBlock
  >();
  let cell_code = block.value;
  for (let extractor of this.foreign_extractors) {
    // first, check if there is any foreign code:
    if (!extractor.has_foreign_code(cell_code)) {
      continue;
    }
    let results = extractor.extract_foreign_code(cell_code);
    let kept_cell_code = '';
    for (let result of results) {
      if (result.foreign_code !== null) {
        let foreign_document = this.choose_foreign_document(extractor);
        foreign_document_map.set(result.range, {
          virtual_line: foreign_document.last_virtual_line,
          virtual_document: foreign_document,
          editor: block.ce_editor
        });
        let foreign_shift = {
          line: editor_shift.line + result.range.start.line,
          column: editor_shift.column + result.range.start.column
        };
        foreign_document.append_code_block(
          {
            value: result.foreign_code,
            ce_editor: block.ce_editor
          },
          foreign_shift,
          result.virtual_shift
        );
      }
      if (result.host_code != null) {
        kept_cell_code += result.host_code;
      }
    }
    // not breaking - many extractors are allowed to process the code, one after each other
    // (think JS and CSS in HTML, or %R inside of %%timeit).
    cell_code = kept_cell_code;
  }
  return { cell_code_kept: cell_code, foreign_document_map };
}

There is also a notion of standalone snippets: even if consecutive cells use the same language, sometimes we do not want to merge them into the same virtual document (e.g. %%python magic which upon execution spawns a new interpreter so it is independent of any previous %%python magics); this is handled by:

private choose_foreign_document(extractor: IForeignCodeExtractor) {
  let foreign_document: VirtualDocument;
  // if not standalone, try to append to existing document
  let foreign_exists = this.foreign_documents.has(extractor.language);
  if (!extractor.standalone && foreign_exists) {
    foreign_document = this.foreign_documents.get(extractor.language);
    this.unused_documents.delete(foreign_document);
  } else {
    // if standalone, try to re-use existing connection to the server
    let unused_standalone = this.unused_standalone_documents.get(
      extractor.language
    );
    if (extractor.standalone && unused_standalone.length > 0) {
      foreign_document = unused_standalone.pop();
      this.unused_documents.delete(foreign_document);
    } else {
      // if (previous document does not exists) or (extractor produces standalone documents
      // and no old standalone document could be reused): create a new document
      foreign_document = this.open_foreign(
        extractor.language,
        extractor.standalone,
        extractor.file_extension
      );
    }
  }
  return foreign_document;
}

Back to appending code blocks: ICodeBlockOptions does not pass any cell metadata (is not even aware of cell existence) - it only passes the value and the reference to the editor. To condition extraction of virtual documents on cell metadata this needs to be passed too. The actual append operations are executed in:

/**
 * Update all the virtual documents, emit documents updated with root document if succeeded,
 * and resolve a void promise. The promise does not contain the text value of the root document,
 * as to avoid an easy trap of ignoring the changes in the virtual documents.
 */
public async update_documents(blocks: ICodeBlockOptions[]): Promise<void> {
  let update = new Promise<void>(async (resolve, reject) => {
    // defer the update by up to 50 ms (10 retrials * 5 ms break),
    // awaiting for the previous update to complete.
    await until_ready(() => this.can_update(), 10, 5).then(() => {
      if (this.isDisposed || !this.virtual_document) {
        resolve();
      }
      try {
        this.is_update_in_progress = true;
        this.update_began.emit(blocks);
        this.virtual_document.clear();
        for (let code_block of blocks) {
          this.block_added.emit({
            block: code_block,
            virtual_document: this.virtual_document
          });
          this.virtual_document.append_code_block(code_block);
        }
        this.update_finished.emit(blocks);
        if (this.virtual_document) {
          this.document_updated.emit(this.virtual_document);
          this.virtual_document.maybe_emit_changed();
        }
        resolve();
      } catch (e) {
        this.console.warn('Documents update failed:', e);
        reject(e);
      } finally {
        this.is_update_in_progress = false;
      }
    });
  });
  this.update_done = update;
  return update;
}

The blocks passed in are constructed from the adapter's editors:

public update_documents() {
  if (this.isDisposed) {
    this.console.warn('Cannot update documents: adapter disposed');
    return;
  }
  return this.virtual_editor.virtual_document.update_manager.update_documents(
    this.editors.map(ce_editor => {
      return {
        ce_editor: ce_editor,
        value: this.virtual_editor.get_editor_value(ce_editor)
      };
    })
  );
}

which for notebooks are:

get editors(): CodeEditor.IEditor[] {
  if (this.isDisposed) {
    return;
  }
  let notebook = this.widget.content;
  this.ce_editor_to_cell.clear();
  if (notebook.isDisposed) {
    return [];
  }
  return notebook.widgets
    .filter(cell => cell.model.type === 'code')
    .map(cell => {
      this.ce_editor_to_cell.set(cell.editor, cell);
      return cell.editor;
    });
}

and for file editors there is only one editor:

get editors(): CodeEditor.IEditor[] {
  return [this.editor.editor];
}

@krassowski
Member

We have to make the information on cell metadata available to the code extracting foreign virtual documents, so it might make sense to generalize the editors() getter so that it returns an object which includes both CodeEditor.IEditor and metadata. We may want to have this as a separate getter and reimplement get editors() as a simple extraction from the result of that new getter for backward compatibility.

Or we may want to go all-in, rewrite this code from scratch, and release a new major version.
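The first, backward-compatible option could look roughly like the sketch below. It is self-contained, so IEditor and ICell are minimal stand-ins for the real JupyterLab types, and the names editors_with_metadata, IEditorWithMetadata and languageOf are hypothetical. Reading the language from a `kernel` metadata key is also an assumption, made only to have something concrete to read.

```typescript
// Minimal stand-ins; the real CodeEditor.IEditor and Cell types are much richer.
interface IEditor { value: string; }
interface ICell {
  type: 'code' | 'markdown';
  editor: IEditor;
  metadata: Record<string, unknown>;
}

// Hypothetical shape returned by the new getter.
interface IEditorWithMetadata {
  ce_editor: IEditor;
  metadata: Record<string, unknown>;
}

class NotebookAdapterSketch {
  constructor(private cells: ICell[]) {}

  /** New getter: each code editor together with the metadata of its cell. */
  get editors_with_metadata(): IEditorWithMetadata[] {
    return this.cells
      .filter(cell => cell.type === 'code')
      .map(cell => ({ ce_editor: cell.editor, metadata: cell.metadata }));
  }

  /** Existing getter, kept for backward compatibility as a simple extraction. */
  get editors(): IEditor[] {
    return this.editors_with_metadata.map(entry => entry.ce_editor);
  }
}

/** Assumption: the per-cell language is stored under a `kernel` metadata key. */
function languageOf(entry: IEditorWithMetadata, fallback: string): string {
  const kernel = entry.metadata['kernel'];
  return typeof kernel === 'string' ? kernel.toLowerCase() : fallback;
}

const adapter = new NotebookAdapterSketch([
  { type: 'code', editor: { value: 'x <- 1' }, metadata: { kernel: 'R' } },
  { type: 'markdown', editor: { value: '# notes' }, metadata: {} },
  { type: 'code', editor: { value: 'x = 1' }, metadata: {} }
]);
const cellLanguages = adapter.editors_with_metadata.map(entry =>
  languageOf(entry, 'python')
);
```

Existing callers of editors keep working unchanged, while the code that builds ICodeBlockOptions could switch to the richer getter and pass the metadata along to the extractors.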

One thing I very much want to include is the reference to the cell (its identifier) as a comment in the virtual document content so that we can reliably translate back-and-forth between the virtual document and the cells, enabling full-blown refactoring as described in #467. It might or might not be beneficial to rewrite the virtual document to live on the backend, but I think that we should first try to implement it in TypeScript.
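A minimal sketch of the cell-identifier idea, assuming a made-up marker format (`# cell-id: <id>`) and hypothetical function names: each cell is preceded by a comment line carrying its identifier, and a position in the virtual document is translated back by scanning upwards for the nearest marker.

```typescript
interface CellSource { id: string; code: string; }
interface CellPosition { cellId: string; line: number; }

/** Concatenate cells, putting a comment with the cell identifier before each one. */
function buildVirtualDocument(cells: CellSource[], commentPrefix = '#'): string {
  return cells
    .map(cell => `${commentPrefix} cell-id: ${cell.id}\n${cell.code}`)
    .join('\n');
}

/** Translate a virtual line back to (cell, line in cell); null for marker lines. */
function toCellPosition(
  virtualText: string,
  virtualLine: number,
  commentPrefix = '#'
): CellPosition | null {
  const lines = virtualText.split('\n');
  if (virtualLine < 0 || virtualLine >= lines.length) {
    return null;
  }
  const marker = `${commentPrefix} cell-id: `;
  for (let i = virtualLine; i >= 0; i--) {
    if (lines[i].startsWith(marker)) {
      if (i === virtualLine) {
        return null; // the marker itself does not belong to any cell line
      }
      return { cellId: lines[i].slice(marker.length), line: virtualLine - i - 1 };
    }
  }
  return null;
}

const virtualText = buildVirtualDocument([
  { id: 'a1', code: 'x = 1\ny = 2' },
  { id: 'b2', code: 'print(x)' }
]);
const secondLineOfA1 = toCellPosition(virtualText, 2);
const firstLineOfB2 = toCellPosition(virtualText, 4);
const markerLine = toCellPosition(virtualText, 3);
```

This naive version would be confused by user code that happens to start with the marker text, and the comment prefix depends on the language, so a real implementation would need a less ambiguous marker.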


6 participants