Offline spell checker for VSCode #20266

Open
bartosz-antosik opened this issue Feb 9, 2017 · 23 comments
Labels
feature-request Request for new features or functionality languages-basic Basic language support issues
Milestone

Comments

@bartosz-antosik

bartosz-antosik commented Feb 9, 2017

Hello (first time contributing here)

There are a few offline spell checkers among VSCode extensions, but they are based on seriously faulty JavaScript implementations of the Hunspell spell checker.

Hunspell is nowadays probably the most widespread standard for the spell-checking layer. It is used on macOS, Linux and in some software (e.g. LibreOffice) on Windows. It is also used by both Atom and Sublime Text. There is an enormous collection of polished dictionaries for Hunspell.

There exist some JavaScript implementations that refer to Hunspell's name, but in fact they do not implement the critical functionality: the lexical parser. I have verified these three:

hunspell-spellchecker
Typo.js
nspell

All three follow, more or less, the simple idea of loading the dictionary into memory (into an associative table, a.k.a. dictionary; an object, to be precise). They use Hunspell's affixes (.aff file) to create ALL variants of the words found in the dictionary (.dic file) and then store them in memory. When checking spelling, the dictionary is simply asked whether the word exists or not. Simple, but it has these implications:

  1. Loading takes a lot of time;
  2. It takes a lot of memory too;
  3. Memory consumption causes them to crash with dictionaries that have a more extensive affix system (two out of the three mentioned; the third does not consume all of the affixes).

For example, when running hunspell-spellchecker (there is a SpellChecker extension based on it) with the English dictionary ("en_US", 62K+ words in the dictionary), memory consumption peaks at 500 MB and stays constantly above 250 MB. It crashes with the Polish dictionary ("pl_PL", 300K+ words in the dictionary) after reaching about 1.5 GB of memory consumed (there are reports of other dictionaries doing the same), with a "JavaScript heap out of memory" message hidden well under the hood. Hunspell has a lexical parser which allows it to use these two sets (dictionary and affixes) "on the fly", without the need to merge them and thereby explode memory consumption and load time.
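The difference between the two approaches can be sketched as follows. This is a minimal illustration, not Hunspell's actual algorithm: the `SuffixRule` shape, the three rules and the three stems are hypothetical simplifications, and real .aff rules also carry conditions, flags, prefixes and cross-products.

```typescript
// Hypothetical, heavily simplified suffix rule: remove `strip` from the end of
// the stem, then append `add`. Real Hunspell .aff rules are far richer.
interface SuffixRule {
  strip: string;
  add: string;
}

const rules: SuffixRule[] = [
  { strip: "", add: "s" },
  { strip: "", add: "ed" },
  { strip: "", add: "ing" },
];
const stems = ["walk", "talk", "jump"];

// Approach A: expand every stem with every rule up front and keep all forms
// in memory. Memory grows roughly with stems x rules.
function expandAll(stemList: string[], ruleList: SuffixRule[]): Set<string> {
  const forms = new Set<string>(stemList);
  for (const stem of stemList) {
    for (const rule of ruleList) {
      if (stem.endsWith(rule.strip)) {
        forms.add(stem.slice(0, stem.length - rule.strip.length) + rule.add);
      }
    }
  }
  return forms;
}

// Approach B: keep only the stems; at lookup time undo each rule and test
// whether the resulting candidate stem is in the dictionary.
function checkOnTheFly(word: string, stemSet: Set<string>, ruleList: SuffixRule[]): boolean {
  if (stemSet.has(word)) return true;
  for (const rule of ruleList) {
    if (word.endsWith(rule.add)) {
      const candidate = word.slice(0, word.length - rule.add.length) + rule.strip;
      if (stemSet.has(candidate)) return true;
    }
  }
  return false;
}

const expanded = expandAll(stems, rules);
const stemSet = new Set(stems);
```

With 3 stems and 3 rules, approach A already stores 12 entries while approach B stores 3; with 300K+ stems and a rich affix system the gap is what exhausts the heap.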

There is a good spell checker component for node.js, which is actually a set of bindings for the native spell checkers on macOS (NSSpellChecker), Linux (Hunspell) and Windows (Spell Check API in Windows 8+, Hunspell in earlier versions):

https://github.com/atom/node-spellchecker

It is alas a native module.

I have built a spell checker using this module. I would rather not publish it because it is quite pointless:

  • The extension will (silently) stop working every time Electron or node gets a version bump, and I cannot guarantee I will always be around to rebuild binary dependencies quickly;
  • Rebuilding binary dependencies is quite a hassle;
  • I am unable to reasonably maintain binary dependencies for all three platforms (macOS, Linux & Windows) - there already is an extension which uses this module, but it provides binary dependencies for macOS only;
  • Even if I produced the node-spellchecker module using node-pre-gyp with binaries for various platforms, if I understand things correctly an extension cannot (easily?) install dependent modules using npm (which could also imply having a proper C++ toolchain around in case the node-pre-gyp packaged binaries are not sufficient). Binaries are packaged and simply get downloaded along with the extension.

So I would like you to consider doing something about it.

There are a few paths I can imagine; among them, two are the most obvious:

  1. Build the node-spellchecker module along with the VSCode and make it available among "standard" modules that extension developers can count upon (this could result in more than one spell checker extension e.g. for spelling text or latex documents, comments in code etc.);
  2. Provide a way to use native modules among extensions' dependencies.

I am most probably no one to discuss pros or cons of these alternatives, there are maybe other alternatives that I cannot see, but I think that with the evidence provided it is clear that unless something changes the answer to the question in the title is MOST PROBABLY NOT!

@Tyriar
Member

Tyriar commented Feb 9, 2017

I'd personally like to see spell checking using native platform API bundled eventually.

@Tyriar Tyriar added the feature-request Request for new features or functionality label Feb 9, 2017
@bartosz-antosik
Author

Do I understand correctly that by native you mean implemented in JavaScript/TypeScript?

I understand your point of view, I would probably prefer it too.

But please have a look at the Hunspell GitHub page and consider how many years of active development it took to get it where it is now. I doubt that someone could just "rewrite" it.

Anyway, for the time being existing spellcheckers are problematic and it may push away e.g. people who use VSCode to write technical docs, latex papers etc.

Having node-spellchecker accessible with compatible binaries, which can only realistically be achieved by bundling it with VSCode, would cure not even most but all of the above mentioned issues. Later when native API with comparable quality appears you can switch and it will not generate a lot of trouble for extensions developers because it is only a handful of calls.

@Tyriar
Member

Tyriar commented Feb 10, 2017

I mean native as in platform; talking to macOS, Windows, Linux services if available, and falling back to another implementation if not.

@bartosz-antosik
Author

That's EXACTLY what node-spellchecker module does!

@bartosz-antosik
Author

Somehow I got the impression that you tend to have everything in pure JavaScript and that native module support is not going to appear anytime soon. Sorry about the misinterpretation.

@Tyriar
Member

Tyriar commented Feb 10, 2017

Badly worded on my part 😄

@rebornix
Member

rebornix commented Mar 22, 2017

First of all, sincere thanks to @bartosz-antosik: your work is really thorough and an awesome guide to spell checking. I spent some time investigating this feature this iteration and here are my thoughts and todo items.

How we ship it

Firstly, the spell checking process should work in a separate process, without blocking the core or extension host pipeline.

Secondly, there are two ways to talk to native code: a node native module or a standalone script with an interactive console. The former is easy to do; the only catch is that you have to recompile every time the node/v8 version changes. To fix that we just need to put the extension into Code's folder, either in core or in our builtin extension folder. The benefit is obvious: we don't need to talk to C++ code painfully and can always stay inside NodeJS. But there are several issues that should be taken care of before we do that:

  1. Since we need to bundle Hunspell into Code, how does it affect our installer size/build time?
  2. Right now node-spellchecker's api is not even async.
  3. It should still be executed in a separate process. There are quite a few perf issues on Atom/node-spellchecker side.

The second way to solve this problem is running a standalone interactive script, which talks to the system API and is compiled for each architecture/platform. Then our NodeJS code, either Core or an extension, can talk to it through standard IO or, even better, a socket. The script will be running in a new NodeJS process and we can easily make all the spell checking async.
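A minimal sketch of how messages could be framed over standard IO or a socket, assuming one JSON document per line. The `CheckRequest` and `CheckResponse` shapes and the `LineDecoder` helper are hypothetical; the thread does not define a wire format, and process spawning is left out.

```typescript
// Hypothetical message shapes, introduced for illustration only.
interface CheckRequest {
  id: number;
  text: string;
}
interface CheckResponse {
  id: number;
  misspelled: Array<{ start: number; end: number }>;
}

// One JSON document per line. JSON.stringify escapes newlines inside strings,
// so a raw "\n" can only ever be a message boundary.
function encode(message: CheckRequest | CheckResponse): string {
  return JSON.stringify(message) + "\n";
}

// Accumulates raw chunks from a pipe or socket and returns every complete
// line received so far; a trailing partial line stays buffered.
class LineDecoder {
  private buffer = "";

  push(chunk: string): string[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? "";
    return lines.filter((line) => line.length > 0);
  }
}
```

The `id` field lets the caller match responses to requests when several checks are in flight, which is what makes the exchange asynchronous from the editor's point of view.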

I started with the second solution. Even if this problem is solved perfectly, we still face quite a few issues around the experience and maturity of spell checking on different platforms, including but not limited to:


Spell Check API

On macOS and Windows (8 and above), the system provides builtin spell check support. Their behaviors vary, but they both support the following common functionality:

  • check whether a word is misspelled
  • get the ranges of misspellings in a text
  • get the corrections for a misspelled word

In addition to the above features, macOS, Hunspell and Windows disagree with each other on several APIs:

Ignore word

  • Windows.
    • ✋ Right now node-spellchecker is using https://msdn.microsoft.com/en-us/library/windows/desktop/hh869774(v=vs.85).aspx . It treats the provided word as though it were part of the original dictionary. The drawback of this API is that once it adds the word to the dictionary, you can no longer delete it through the API. A workaround is modifying the file C:\Users---UserName---\AppData\Roaming\Microsoft\Spelling\en-US\default.dic and reloading. I'd say let's use ignore.
    • If we use https://msdn.microsoft.com/en-us/library/windows/desktop/hh869783(v=vs.85).aspx to ignore words, its lifetime is just the session.
    • ✋ node-spellchecker // NB: ISpellChecker has no way to remove words from the dictionary , it means node-spellchecker thinks right now you can't remove words from the dictionary, even though they are added by you :(
  • Hunspell.
    • Life time: When using Hunspell, this will not modify the .dic file; custom words must be added each time the spellchecker is created. Use a custom dictionary file.
  • macOS https://developer.apple.com/reference/appkit/nsspellchecker/1534837-learnword
    • Life time: until you ignore it.
  • If we want to have the same experience in different platforms, we may want to use ignore on Windows and then users always have their ignore list on their preferred storage.

Conclusion: On Windows and Hunspell, ignore words temporarily and each time we initialize a spell check process, set the ignore list on the fly. As on macOS you can always remove words from the dictionary, let's trust it.
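The conclusion above can be sketched as a thin wrapper that owns the ignore list and re-applies it each time a checker is created. The `SpellBackend` interface, the `IgnoringChecker` class and the case-insensitive matching are assumptions made for illustration; they are not part of node-spellchecker or any platform API.

```typescript
// Hypothetical minimal backend interface; real bindings expose more than this.
interface SpellBackend {
  isMisspelled(word: string): boolean;
}

// Wraps any backend with an ignore list that the editor stores itself and
// passes in on every start-up, so the platform dictionary is never modified.
class IgnoringChecker {
  private readonly backend: SpellBackend;
  private readonly ignored: Set<string>;

  constructor(backend: SpellBackend, storedIgnoreList: string[]) {
    this.backend = backend;
    // Assumption: ignored words match case-insensitively.
    this.ignored = new Set(storedIgnoreList.map((w) => w.toLowerCase()));
  }

  ignore(word: string): void {
    this.ignored.add(word.toLowerCase());
  }

  unignore(word: string): void {
    this.ignored.delete(word.toLowerCase());
  }

  isMisspelled(word: string): boolean {
    return !this.ignored.has(word.toLowerCase()) && this.backend.isMisspelled(word);
  }

  // What the caller would persist to its own storage between sessions.
  exportIgnoreList(): string[] {
    return Array.from(this.ignored).sort();
  }
}

// Toy backend for demonstration: it knows two words only.
const toyBackend: SpellBackend = {
  isMisspelled: (word) => ["hello", "world"].indexOf(word.toLowerCase()) === -1,
};
```

Because removal happens in the wrapper, the "cannot remove a word once added" limitation noted for Windows never comes into play.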


Builtin language support

  • Builtin language on macOS and Windows
    • macOS: It depends on the languages you install on your system. You can go to System Preferences -> Keyboard -> Text -> Spelling and sort the dictionaries. As macOS has automatic language detection, you may want a good ordering if you depend on automatic language support.
    • Windows: A black box. What I learned from user feedback and bug reports on Spell Check is that it's part of the language pack. We don't have a chance to touch anything; maybe that's why Atom always uses Hunspell on Windows.
  • Builtin language dictionary we ship with Hunspell. Basically we can ship en_us and en_gb (less than 1Mb in total) and try to download dictionary when necessary.

What's the experience of setting up dictionaries for another language which has no builtin support?

  • Hunspell, it's customizable as you can provide a path to your dictionary.
  • macOS. Users can put dictionaries into ~/Library/Spell. We can't programmatically ask macOS Cocoa API to load dictionary from another place but we may want to touch that folder directly.
  • Windows. Users have to install target language package.

How to spell check text which contains multiple languages, automatically?

System already has some native support, but they behave differently and the experience is not charming.


Dictionaries

  • Hunspell Dictionaries we ship along with the product
  • Dictionary compatibility: As you can't set another kind of dictionary to Windows, the only way to have a custom dictionary is using Hunspell.

Both Chrome and Firefox ship with en-US dictionary (for English users). Chrome will download any dictionary users require ( see https://cs.chromium.org/chromium/src/chrome/browser/spellchecker/spellcheck_hunspell_dictionary.cc?dr=C&q=chrome/dict&l=238 ), and Firefox fetches dictionaries from https://dxr.mozilla.org/mozilla-central/source/browser/app/profile/firefox.js#77.

Conclusion: Ship with en-US (because most of time you are coding in English) and maybe ship with one user preferred language (for example, maybe one day users can get a Chinese version of VS Code directly and it has Chinese dictionary builtin). For other requests, provide a stable/high available dictionary download service. Atom now downloads dictionaries from Google's service (which is used by Chrome), however that service is not available in some countries and regions.

Exception list/known Words

  • How to ignore a word, where are they stored, life time? See above.
    They should be stored in a separate file and each time a spell checker instance is created, update the ignore list on the fly.

  • How can users set an exception for known words: through the context menu, command palette, or light bulb code action?


Spell Checker

  • Can users choose their favorite Spell Checker, eg, use Hunspell even on Windows? It can be useful if users who work on the same product but on different platforms want to align with each other about spell checking.

We can ship Hunspell in all platforms and users can choose to use Hunspell or not.


Settings

Open questions about how we define the settings for spell checking.

  • Setting for Dictionary, Path, etc.
  • Can it be a setting file in workspace/user space?
  • Can it be part of Workspace/User Setting?
  • Scopes for spell checking
    • Based on programming language id.
    • Based on TextMate Token Type(String, Comments, Other)
    • Based on TextMate Grammar (text.html.markdown, etc)
  • Spell check actions
    • Light bulb and red squiggles
    • Context menu
    • Command: Fix misspelled words
    • Auto fix
    • Can the spell checker be used for autocomplete/intellisense?

@bartosz-antosik
Author

Thanks @rebornix for kind words & analysis which I like a lot.

I would like to refer to few points of your analysis as it looks like maybe I do not understand one or more things.

Excuse me if I am very off at points but I know very little about node.js and the whole environment.

Synchronous/Asynchronous Interface

About this sync/async interface: are events (e.g. onDidOpenTextDocument, onDidChangeTextDocument, onDidChangeVisibleTextEditors) asynchronous or not?

If they are, then why bother whether node-spellchecker's interface is or is not?

If they are not then not only spell checking engine should be asynchronous but all the extension code that reacts to events to parse text & select parts to spell that calls the engine should be too, should it not?

What takes time in spelling is parsing a document, possibly large, and eliminating parts that should not be spelled (suppose latex commands or parts of code that should be skipped to spell comments & strings etc.). And I reckon it should be left up to the extension, not the speller, to decide what to do with a particular document type.

There is one more thing to consider here: Word lookup is quick. Suggestions are slow.

The spellcheckers that I have used are quick to look a word up to test whether it is spelled correctly and slow (over 10 times slower on average) to produce suggestions. The current approach, e.g. in my spell checker extension, is to spell check and feed the diagnostic collection with suggestions; there is also an option to just flag misspelled words and look up suggestions on the provideCodeActions event.

So do I understand correctly that either all parts of the process should be async or it does not matter much whether spelling engine is?
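The "lookup is quick, suggestions are slow" split might look like the sketch below. The `SpellEngine` interface, `findMisspellings` and `LazySuggestions` are hypothetical names, and the word pattern is a crude ASCII-only assumption; in an extension, the second step would be triggered from something like the provideCodeActions event mentioned above.

```typescript
// Hypothetical engine interface: a fast lookup and a slow suggestion call.
interface SpellEngine {
  isCorrect(word: string): boolean;
  suggest(word: string): string[];
}

interface Misspelling {
  word: string;
  start: number;
  end: number;
}

// Pass 1: flag misspelled words only. No suggestion calls happen here.
function findMisspellings(text: string, engine: SpellEngine): Misspelling[] {
  const result: Misspelling[] = [];
  const wordPattern = /[A-Za-z']+/g; // assumption: ASCII words only
  let match: RegExpExecArray | null;
  while ((match = wordPattern.exec(text)) !== null) {
    const word = match[0];
    if (!engine.isCorrect(word)) {
      result.push({ word, start: match.index, end: match.index + word.length });
    }
  }
  return result;
}

// Pass 2: compute suggestions only when asked for, caching them per word so
// the slow call runs at most once for each distinct misspelling.
class LazySuggestions {
  private readonly cache = new Map<string, string[]>();
  private readonly engine: SpellEngine;

  constructor(engine: SpellEngine) {
    this.engine = engine;
  }

  get(word: string): string[] {
    let cached = this.cache.get(word);
    if (cached === undefined) {
      cached = this.engine.suggest(word);
      this.cache.set(word, cached);
    }
    return cached;
  }
}
```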

Ignoring Words

About custom/known/ignored words: I would consider offloading this to the extension! I don't know about the rest of the world, but I would love them to be manageable like the rest of VSCode's configuration. All three (MS/iOS/hunspell) place them no one knows where, and it is an additional pain to transfer them to another location or manage them in the context of the document type.

Language Scope in a Document

I like the idea of multiple languages inside one document a lot. It seemed crazy to me at first, but the more I think about it, the more doable it seems. The only way I can think of, though, is content/comment driven language switching. Again, the extension should decide about this, as this information can, for instance, be extracted from a latex document in quite a different way than from another document type.

@rebornix
Member

@bartosz-antosik thanks for your reply. About the async/sync problem, I'm referring to function calls to native code; they are not async right now. But it's not a problem, as in nodejs we can always use setTimeout or similar to mitigate it. Not a big deal.

Word Lookup/Suggestions

I like your idea of separating word look up and generate suggestions and thanks again for your perf testing. Postponing suggestion lookup to code action provider makes sure we only do minimal calculation. And you are right, this can be an option as the only catch of this feature is users can't have a general view of misspell suggestions in Problems View.

Another thing about perf is where to do the calculation, doing all the math in native code can be faster but sending a large portion of data to native code can cost time as well. We need good testing to find the balance.

Ignoring words

System Spell Checker stores the ignoring words on the fly and yes we'll hide them from users.

Multi language

macOS has its in-house language detection, which works reasonably well for me, but Windows doesn't. Comments, strings and technical documents are the most likely cases where users may need multi-language support. We can either switch languages automatically, or maybe even spawn multiple spell check processes for different languages.

@rebornix rebornix changed the title from "Will there ever be a decent offline spell checker for VSCode?" to "Offline spell checker for VSCode" Mar 22, 2017
@Jason3S

Jason3S commented Mar 23, 2017

Hello

I'm the author of Code Spell Checker extension and cspell linter (used by the extension).

Why

I did not intend to write a spell checker. I wrote it because I needed one that worked with source code and didn't find a built in checker. So the fact that you are considering having a spell checker built in is wonderful. It would have saved me a bunch of effort. :-)

To be honest, it was a fun exercise. It needed to load fast and execute fast. It needed to limit memory consumption and work with very large dictionaries. Spelling suggestions needed to be quick and applicable. Importantly, I wanted it to run on all platforms. I was able to achieve all of these things.

How it works

I did not choose any of the Hunspell solutions due to speed and memory concerns. The Hunspell format is designed for compact representation of words with common prefix and suffix patterns. The Hunspell .dic and .aff are deliberately easy for adding words by hand. The format is not designed for easy lookup or searching. Which is why the open source javascript solutions are very slow and use a lot of memory.

Instead I wrote a hunspell file reader that would output all the word combinations. This list of words is compiled into a compact format designed for lookup speed and calculating suggestions. At its core is a Trie which is optimized into a Deterministic Acyclic Finite State Automaton.

This process of compiling is rather expensive, which is why it is done offline and only the compiled dictionaries are shipped with the extension.

Word Lookup and Suggestions

Word lookup is O(m) where m is the length of the word. It is a very simple process of walking the Trie. Suggestions are done using a modified Levenshtein algorithm that minimizes recalculation and culls candidates by not walking down branches in the Trie whose minimum possible error is greater than the allowed error threshold.
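A minimal sketch of the idea, assuming a plain (unminimized) trie and the textbook Levenshtein recurrence rather than cspell's modified version or its DAFSA. `TrieNode`, `buildTrie`, `hasWord` and `suggest` are illustrative names, not cspell's API.

```typescript
class TrieNode {
  children = new Map<string, TrieNode>();
  isWord = false;
}

function buildTrie(words: string[]): TrieNode {
  const root = new TrieNode();
  for (const word of words) {
    let node = root;
    for (let i = 0; i < word.length; i++) {
      const ch = word.charAt(i);
      let next = node.children.get(ch);
      if (next === undefined) {
        next = new TrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.isWord = true;
  }
  return root;
}

// O(m) lookup: one child step per character of the word.
function hasWord(root: TrieNode, word: string): boolean {
  let node: TrieNode | undefined = root;
  for (let i = 0; i < word.length; i++) {
    node = node.children.get(word.charAt(i));
    if (node === undefined) return false;
  }
  return node.isWord;
}

// Walks the trie while carrying one row of the edit-distance table per node.
// A branch is abandoned as soon as the smallest value in its row exceeds
// maxDistance, because no extension of that prefix can get back under it.
function suggest(root: TrieNode, target: string, maxDistance: number): string[] {
  const results: string[] = [];
  const firstRow: number[] = [];
  for (let i = 0; i <= target.length; i++) firstRow.push(i);

  function walk(node: TrieNode, prefix: string, prevRow: number[]): void {
    for (const [ch, child] of node.children) {
      const row: number[] = [prevRow[0] + 1];
      for (let i = 1; i <= target.length; i++) {
        const substitution = prevRow[i - 1] + (target.charAt(i - 1) === ch ? 0 : 1);
        row.push(Math.min(row[i - 1] + 1, prevRow[i] + 1, substitution));
      }
      const word = prefix + ch;
      if (child.isWord && row[target.length] <= maxDistance) results.push(word);
      if (Math.min(...row) <= maxDistance) walk(child, word, row);
    }
  }

  walk(root, "", firstRow);
  return results.sort();
}
```

The pruning is what keeps suggestions cheap: most branches of a large dictionary are cut off after only a few characters.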

Things to consider

Most of the work was not writing the spell checker. Checking words and making spelling suggestions is rather easy. Most of the work came from the configuration options. Where possible, the system is configuration driven.

Each programming language has its own combination of dictionaries and settings. In the linter fashion, the spell checker also allows for in code flags and settings.

Programming Language Dictionaries

I ended up creating dictionaries that included keywords and common symbols for several programming languages. These dictionaries can be combined based upon the context.

For example a .cpp file will use the following dictionaries: cpp, companies, softwareTerms, misc, filetypes, and wordsEn.

As you can see, I even needed a dictionary for common software terms, because standard Hunspell dictionaries do not include most software terms.
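Combining dictionaries by context could be expressed roughly like this. Only the "cpp" list is taken from the comment above; the "plaintext" entry, the default list and the function names are placeholders for illustration and do not reflect cspell's actual configuration format.

```typescript
type Dictionary = Set<string>;

// Only the "cpp" list comes from the comment above; the "plaintext" entry and
// the default list are placeholders.
const dictionariesByLanguage = new Map<string, string[]>([
  ["cpp", ["cpp", "companies", "softwareTerms", "misc", "filetypes", "wordsEn"]],
  ["plaintext", ["wordsEn"]],
]);
const defaultDictionaries = ["softwareTerms", "wordsEn"];

// Picks the dictionary names for a language id, falling back to the default.
function selectDictionaries(languageId: string): string[] {
  return dictionariesByLanguage.get(languageId) ?? defaultDictionaries;
}

// A word passes if any selected dictionary that is loaded contains it.
function isKnownWord(word: string, names: string[], loaded: Map<string, Dictionary>): boolean {
  const lower = word.toLowerCase();
  return names.some((name) => loaded.get(name)?.has(lower) === true);
}
```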

Programming Language Grammar awareness

I did not make my spell checker aware of the programming language grammar or syntax. There are some really cool things that are possible. Like having strings be in French while the code is in English and the comments are in Spanish. Other things like not spell checking 3rd party imports. Yet, I found this more work than I had time to spend.

As an extension writer, I was wishing for access to the language grammar used by the colorizers.

Linter Style

I think it is worth noting that a spell checker is usable in a Continuous Integration environment. Think of it this way: anyplace you might want to use tslint, a spell checker might be useful.

@Jason3S

Jason3S commented Mar 23, 2017

Questions

  1. How do you plan on parsing the code to send it to the spell checker? Spell checkers do not like camelCase or snake_case.
  2. How do you plan on solving the multi language issue? Where the code and comments are in English while the strings are in Spanish?
  3. What is the plan for project and user level word lists?
  4. If a user adds their own words to the dictionary, will they be included in the suggestions?
  5. Reading the discussion, it looks like the plan is to call the spell checker one word at a time. Won't that be very slow?
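For the first question, one possible pre-processing step is to split identifiers into plain words before they reach the spell checker. This is a rough sketch under the assumption of ASCII identifiers; `splitIdentifier` is a hypothetical helper, not part of any extension discussed here.

```typescript
// Splits an identifier into lowercase words at case changes and at any
// non-letter character (underscores, hyphens, digits).
function splitIdentifier(identifier: string): string[] {
  return identifier
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // camelCase -> camel Case
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2") // HTTPServer -> HTTP Server
    .split(/[^A-Za-z]+/)
    .filter((part) => part.length > 0)
    .map((part) => part.toLowerCase());
}
```

Collecting the distinct words of a whole document this way and checking each once would also address the last question about per-word call overhead.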

@matklad

matklad commented Mar 7, 2019

How do you plan on parsing the code to send it to the spell checker? Spell checkers do not like camelCase or snake_case.

Note that this is programming-language dependent and, for this reason, it makes sense to make the spellchecker itself part of the platform and expose the language-dependent parts via LSP. Here's a list of things which could be handled by a language server but can't reasonably be handled by a spell checker extension alone:

For markup languages, dealing with sub-word markup. For example, in asciidoctor I can write
**A**plicaton to make the first letter bold, and it'd be cool if the spellchecker saw this as an error.

For all languages, the language server needs to unescape string literals and strip // from comments.

For all languages, there should be a language-specific built-in dictionary.

For statically typed languages, spell checking should be done only for definitions, and not for references: catching misspellings in the references is the job of compiler and code completion.
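To illustrate just the comment-stripping point, here is a deliberately naive sketch that pulls the text out of `//` line comments and keeps offsets into the original source so diagnostics can be mapped back. `extractLineComments` is a hypothetical helper; it does not recognize `//` inside string literals, which is exactly the kind of detail a real language server would handle.

```typescript
interface TextRange {
  text: string;
  offset: number; // position of `text` in the original source
}

// Naive extraction of `//` line comments. Simplification: a `//` that occurs
// inside a string literal is wrongly treated as a comment start.
function extractLineComments(source: string): TextRange[] {
  const ranges: TextRange[] = [];
  const pattern = /\/\/[ \t]?(.*)$/gm;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const text = match[1] ?? "";
    if (text.length > 0) {
      ranges.push({ text, offset: match.index + match[0].length - text.length });
    }
  }
  return ranges;
}
```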

@FDiskas

FDiskas commented Apr 18, 2019

I'm sorry - but why not use Chrome's internal spell checker?
There is a good library to help implement that
https://www.npmjs.com/package/electron-spellchecker

@jrieken jrieken added the editor label Oct 9, 2019
@elcste

elcste commented Feb 11, 2020

Electron 8 includes support for the built-in Chromium spellchecker. Maybe now this feature would be easier?

@borekb

borekb commented Feb 18, 2020

This looks like a primary issue for built-in spell checking in VSCode so if it's going to happen with the new Electron 8.0 capabilities, I'd like to add a few notes:

  • @bartosz-antosik's excellent Spell Right extension is important for my workflow and one awesome thing about it is that it can distinguish contexts, for example, I can spell-check code comments but not the code itself. If VSCode provides some sort of spellchecking API, it would be great if Spell Right could hook into it but keep its specific super-powers.

  • Authoring commit messages would be much better with spell checking. In fact, I often switch to an external editor if the commit message is longer, which is a workaround I'd love to get rid of (see "SCM: Support input box spell checking", #35571).

  • There are VSCode extensions that would benefit from spellchecking greatly, for example, GitHub Pull Requests (there's a feature request here: Spell check comments vscode-pull-request-github#1487).

@oschulz

oschulz commented May 22, 2020

I guess the improved spell-checking capabilities of Electron v9.0 would be an ideal basis for VS-Code built-in spell-checking? I would love to have that - haven't found a reliable spell-checking extension yet that works under VS-code remote development.

@alanlivio

Microsoft also has "Microsoft Editor Service", which works for both browser and desktop. Is there any way to use it in vscode?

@alexdima alexdima added languages-basic Basic language support issues and removed editor labels Oct 15, 2021
@Lemmingh
Contributor

The discussion above about how to ship a spell checker does not appear to have reached a conclusion. What about WASM? All major engines have supported WASM since 2017 according to the MDN compatibility data.

Someone has successfully compiled Hunspell as WASM: https://github.com/kwonoj/hunspell-asm . The Base64-encoded WASM binary of Hunspell is only about 780 kB, so there should be little difficulty in bundling.

@Talia-K-Loos

+1

I came here to say this. Just a selectable spelling dictionary would do for me, even.

I'd use it for text files, markdown files, and most especially for files that are of the "git commit" language type.

@Pindar777

Interesting discussion!
I'm fond of https://marketplace.visualstudio.com/items?itemName=valentjn.vscode-ltex
It is very helpful but takes a huge amount of storage.

@AshleyT3

A mild +1 for at least rudimentary VSCode spellcheck out of the box if it seems reasonable given overall user asks. An office-like app has great spellcheck but won't start due to license check requirements if it has been offline for a long time. I prefer simple text files to avoid heavy client issues like that. VSCode supports this but without spellchecking out of the box. For certain note-taking cases, I look elsewhere... or perhaps copy/paste to an office app w/spellcheck at the next chance. While I +1 this, it is not a push as though I'm waiting with anticipation for this... VSCode gets tons of usage in so many areas... I'd hardly complain about where it is at today... so a mild +1 if there happen to be tons of others who +1 and it makes overall sense. Hope this helps, thanks.
