
Use damerau-levenshtein algorithm for edit distance #15689

Closed

Conversation

weswigham (Member)

This PR is opened against #15507 to propose using the Damerau-Levenshtein edit distance algorithm instead of plain Levenshtein distance. The primary differentiator is that the Damerau variant treats an adjacent transposition as a single edit operation, lowering the edit distance for such matches. Additionally, I've tweaked the algorithm to consider transpositions ever so slightly cheaper than additions/deletions/substitutions, so that a transposed similar name, all other options considered, is guaranteed to come first.

In effect, this means that in a situation like this one (attached as a test case to highlight the difference):

var twnty;
var twenty;
var which = tewnty; 

With normal Levenshtein we'd suggest twnty, but with Damerau-Levenshtein both twnty and twenty are equally scored. With my slight alteration to the weights, twenty ranks slightly cheaper and becomes the suggestion, which, I argue, is the better suggestion. In the general case, I believe a transposition is a more likely keyboard error than an omission, addition, or substitution, all other things equal; unfortunately I only have my gut and experience to back me up on this, not data.

A few new identifiers also got suggestions in the baselines with this change (which looks desirable), thanks to the often-lowered cost.

The downside of the Damerau-Levenshtein algorithm is that it needs to keep one extra row of the table around (to score a transposition, it needs the score from two rows back and two columns back), and it makes a few extra comparisons to check for transpositions. However, with the top-level bucketing of obviously bad matches, I think the extra cost is likely worth the improved suggestion ranking.

let tmp = current[j - 1] + INSERT_COST; // insertion
if (tmp < min) min = tmp;
if ((tmp = previous[j - 1] + (eq ? 0 : CHANGE_COST)) < min) min = tmp; // substitution
if ((tmp = (i > 1 && j > 1 && s1.charCodeAt(i - 1) === s2.charCodeAt(j - 2) && s1.charCodeAt(i - 2) === s2.charCodeAt(j - 1)) ? (prevprev[j - 2] + (eq ? 0 : TRANSPOSE_COST)) : Infinity) < min) min = tmp; // transposition
Member

To be honest, I'm lost on the above four lines. Splitting into a few more lines, explaining what those lines do, and using more descriptive variable names would definitely help here. 😃

Member Author

Sorry! This used to be in a Math.min call in the original implementation before I added the 4th case. Perf-wise, the manual comparisons are faster on most runtimes than the min call, and this was just easier to debug. :P

I'm not sure how to rename the variables usefully (tmp -> potentialDist?), but can see how a comment on each contribution would help. (First line is cost of a deletion, next an insert, then substitution, then transposition)
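The four candidate costs described above can be sketched as a self-contained rolling-row implementation. This is an illustrative sketch, not the PR's exact code: the function name, the rolling-array names, and the weight values are my own assumptions.

```typescript
// Sketch of restricted Damerau-Levenshtein (optimal string alignment) with
// rolling rows. Weights and names are illustrative, not the PR's exact code.
const INSERT_COST = 1;
const DELETE_COST = 1;
const CHANGE_COST = 1;
const TRANSPOSE_COST = 0.95; // slightly cheaper so transposed names win ties

function damerauLevenshtein(s1: string, s2: string): number {
    let previous: number[] = [];
    for (let j = 0; j <= s2.length; j++) previous[j] = j * INSERT_COST;
    let prevPrev: number[] = []; // the extra row of history the Damerau variant needs
    for (let i = 1; i <= s1.length; i++) {
        const current: number[] = [i * DELETE_COST];
        for (let j = 1; j <= s2.length; j++) {
            const eq = s1.charCodeAt(i - 1) === s2.charCodeAt(j - 1);
            let min = previous[j] + DELETE_COST;            // deletion
            let tmp = current[j - 1] + INSERT_COST;         // insertion
            if (tmp < min) min = tmp;
            tmp = previous[j - 1] + (eq ? 0 : CHANGE_COST); // substitution
            if (tmp < min) min = tmp;
            // Transposition: needs the score from two rows and two columns back.
            if (i > 1 && j > 1 &&
                s1.charCodeAt(i - 1) === s2.charCodeAt(j - 2) &&
                s1.charCodeAt(i - 2) === s2.charCodeAt(j - 1)) {
                tmp = prevPrev[j - 2] + TRANSPOSE_COST;
                if (tmp < min) min = tmp;
            }
            current[j] = min;
        }
        prevPrev = previous;
        previous = current;
    }
    return previous[s2.length];
}
```

With these weights, `damerauLevenshtein("tewnty", "twenty")` scores 0.95 (one transposition) while `damerauLevenshtein("tewnty", "twnty")` scores 1 (one deletion), which is the ranking flip described in the PR description.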

@DanielRosenwasser (Member)

I do like the new changes, though there's some new noise (Date <-> data, but that's expected).

@sandersn what are your thoughts?

@@ -196,7 +196,7 @@ tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(476,57): error T
tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(477,29): error TS2304: Cannot find name 'ValueLocation'.
tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(478,29): error TS2304: Cannot find name 'hasFlag'.
tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(478,55): error TS2304: Cannot find name 'VarFlags'.
tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(480,21): error TS2552: Cannot find name 'FieldSymbol'. Did you mean 'fieldSymbol'?
tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(480,21): error TS2304: Cannot find name 'FieldSymbol'.
Member

Regression

@weswigham (Member Author) May 9, 2017

This one disappeared because the other one, earlier in the file, had a suggestion found, and there's a cap on how many suggestions are produced?

Member

Ah, that's fine then

@@ -658,7 +658,7 @@ tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(828,13): error T
!!! error TS2304: Cannot find name 'StringHashTable'.
modType = new ModuleType(enclosedTypes, ambientEnclosedTypes);
~~~~~~~~~~
!!! error TS2304: Cannot find name 'ModuleType'.
!!! error TS2552: Cannot find name 'ModuleType'. Did you mean 'moduleDecl'?
Member

This seems too "far"

Member Author

Right now we're calculating edit distance in a case-insensitive way, so the edit distance here is 4, which is the maximum allowable distance for a string of length 10 (I didn't change that heuristic). TBQH I think this one would have appeared even before I changed the underlying algorithm. I also think there's value in including case in the distance heuristic - it's just more information.

In any case, since transpositions are cheaper, it's probably safe to lower the heuristic for maximum edit distance.

@sandersn (Member) commented May 9, 2017

@RyanCavanaugh points out that we should probably merge spelling-correction first. I'll do that and then comment on this PR.

@@ -1,4 +1,4 @@
tests/cases/conformance/internalModules/exportDeclarations/ModuleWithExportedAndNonExportedFunctions.ts(28,13): error TS2339: Property 'fn2' does not exist on type 'typeof A'.
tests/cases/conformance/internalModules/exportDeclarations/ModuleWithExportedAndNonExportedFunctions.ts(28,13): error TS2551: Property 'fn2' does not exist on type 'typeof A'. Did you mean 'fng'?
Member

After discussing what we want from this feature with @RyanCavanaugh, here's how we decided to summarise it: suggest corrections for misspellings that you "can't see".

That means we want very high precision (no false positives) and don't care about recall (it's OK to miss some fixes). It also means that short identifiers should only suggest a few things. Specifically, 3-character identifiers should only suggest additions, basically to cover the foo → foo_foo / foo1 case. fn2 → fng and bar → baz are obviously bad.
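The policy for short identifiers could be expressed as a filter like the following. This is a hypothetical sketch: the function names and the "subsequence" formalization of "additions only" are my own illustration, not the compiler's actual code.

```typescript
// Hypothetical sketch of the stated policy: for identifiers of 3 characters
// or fewer, only suggest candidates reachable by pure additions, i.e. the
// misspelled name must appear as a subsequence of the candidate.
function isSubsequence(short: string, long: string): boolean {
    let i = 0;
    for (const ch of long) {
        if (i < short.length && ch === short[i]) i++;
    }
    return i === short.length;
}

function allowSuggestion(misspelled: string, candidate: string): boolean {
    if (misspelled.length > 3) return true; // longer names: normal distance rules
    return isSubsequence(misspelled, candidate); // short names: additions only
}
```

Under this rule, `allowSuggestion("foo", "foo1")` and `allowSuggestion("foo", "foo_foo")` pass, while `allowSuggestion("fn2", "fng")` and `allowSuggestion("bar", "baz")` are rejected, matching the examples above.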

Member Author

We can actually adjust the algorithm to reflect this! We can just reduce the cost of an addition relative to a substitution, such that a suggestion with one character added is preferred to a suggestion with one character substituted.

The reason this suggestion was added, on inspection (since it doesn't use any transpositions), is because you weighted a substitution as 2 edits in your algorithm, whereas when I rewrote it I kept it at 1. Should I just revert the weight of substitutions to 2?

@weswigham (Member Author) May 9, 2017

Ah this is the same reason why the other result @RyanCavanaugh pointed out appeared, too. Previously it had a distance of 8 with the old weighting of substitutions, which is why when I reverted it to 1 it appeared!

Member

Yes. Substitutions should be more expensive than inserts and deletes.

Member

Yeah, re-baseline with the boosted sub cost and I'll take another look at the results.

@weswigham (Member Author)

@sandersn I've matched the weight you had for substitutions; now the baseline noise is nil.

@weswigham weswigham force-pushed the damerau-ive-come-to-bargain branch from 31e0d20 to ed873e3 Compare May 9, 2017 17:43
@weswigham weswigham changed the base branch from spelling-correction to master May 9, 2017 17:43
@weswigham (Member Author)

@sandersn rebased onto master and swapped the PR to be against master.

@sandersn (Member) left a comment

I don't think we should merge this until we know that special-casing substitutions will help with real misspellings.

Couple of notes:

  1. We should have a RWC entry of some realy mispleled javascipt code bases so we can test suggestions. I'm not sure where to get these since github generally hosts "correct" code.
  2. My experience with research in dialectology [1] showed that basic Levenshtein distance is pretty good, and complicating the basic algorithm improves only a little. That result might not apply to programming-language spelling suggestions though!
  3. Because of (2), I'd rather experiment with Double Metaphone to see whether its suggestions would be noticeably better. I didn't want to add a dependency to typescript, though.

[1] http://www.tandfonline.com/doi/abs/10.1080/09296170802514138

if (distance > worstDistance) {
continue;
}
if (distance < 3) {
if (distance <= MIN_LEV_DIST) {
Member

For efficiency's sake, we should keep the early exit at 2, to include insertions (and double insertions).

Member Author

That would make my example above revert to the prior behavior where we pick twnty over twenty because it's "good enough", when it's really not.
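For context, the candidate scan under discussion can be sketched roughly as follows. Everything here is a hypothetical illustration: the function name, the 0.4 length cap, and the threshold parameter are my own stand-ins for the bucketing and early exit described in this thread, not the compiler's actual code.

```typescript
// Hypothetical sketch of the candidate scan: bucket out obviously bad
// matches with a length-relative cap, track the best candidate so far, and
// bail out early when one is within the early-exit threshold.
function pickSuggestion(
    misspelled: string,
    candidates: string[],
    distance: (a: string, b: string) => number,
    earlyExitThreshold: number,
): string | undefined {
    const cap = Math.floor(misspelled.length * 0.4); // illustrative cap
    let best: string | undefined;
    let bestDistance = Infinity;
    for (const candidate of candidates) {
        const d = distance(misspelled, candidate);
        if (d > cap) continue;                         // obviously bad match
        if (d <= earlyExitThreshold) return candidate; // good enough, stop scanning
        if (d < bestDistance) {
            bestDistance = d;
            best = candidate;
        }
    }
    return best;
}
```

The trade-off debated above lives in `earlyExitThreshold`: a higher value exits sooner but can lock in a "good enough" candidate before a slightly cheaper transposed match is seen.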

const TRANSPOSE_COST = 0.95; // Transpositions slightly cheaper than insertion/deletion/substitution to promote transposed errors
export const MIN_LEV_DIST = TRANSPOSE_COST;
export function dameraulevenshtein(s1: string, s2: string): number {
if (s1.length > s2.length) [s1, s2] = [s2, s1]; // Swap arguments so s1 is always the shorter one, to use minimal memory
Member

I don't think optimisations of the n**2 algorithm are as worthwhile as just not calling it as much. (Of course I don't actually know — measurement is the right thing to do here.)

@weswigham (Member Author) May 9, 2017

We were already reusing our arrays to save memory (and complicating the code a little to do so), so why not go all the way and use the minimal memory for the algorithm?

@weswigham (Member Author) commented May 9, 2017

@sandersn You mean transpositions, right? Since substitutions you were already special-casing with the increased weight?

Anyways, while I agree that a bigram approach might yield the best suggestions, and that basic Levenshtein distance is decent (I've used it), Damerau-Levenshtein distance is a common improvement for estimating keyboard-based editing errors (it is just a Levenshtein distance that scores transpositions as a cheaper edit).

While I ended up editing the original code to use comparisons instead of Math.min and to use constants for weights, the only real differences between the original algorithm and the Damerau variant are the extra row of history and the extra check that detects transpositions and reduces their cost. I also provided a compelling example where strict Levenshtein distance produces a subpar suggestion by comparison; I think the compute cost of the improvement is actually pretty low.

I've seen variants with configurable per-pair weights (e.g., a substitution of f for g can cost less than f for m, with weight based on keyboard distance), which could again slightly improve suggestions, but generating those weights is way more work than I feel is reasonable (and would require including a big table of weights in our source).

@sandersn (Member) commented May 9, 2017

Yep, I meant transpositions.

I do agree that worrying about the efficiency of an n**2 algorithm doesn't make sense. It's more important to me to keep the code simple.

I'd like to see more convincing data to justify the additional code complexity. Specifically (1) improvement in some existing code base (2) a more convincing example.

For (2), we're not suggesting arbitrary corrections — there's a small lexical space for most errors. I'd like to see a convincing pair of declarations that differs by one character, and a mistake that transposes the characters of the longer identifier.

I thought something like

var ex1Interface
var exInterface

But I can't come up with a convincing transposition. Maybe `exI1nterface`?

@weswigham (Member Author) commented May 9, 2017

@sandersn I end up transposing characters I type all the time. Common transpositions (ironically, since I've probably done it twice while writing this) include swapping n and s (hands type independently; one hand types a bit faster than the other, so characters come out of order). Also (again, a transposition error I made and corrected while typing this) o and f while writing of.

So, for example:

var ForKeyword;
var OfKeyword;
var which = fOKeyword;

I do also think we're discarding valuable information by discarding case information 🤷‍♂️ .

Re:

But I can't come up with a convincing transposition. Maybe `exI1nterface`?

There's only a difference between the algorithms if the potential transposition overlaps with an added/deleted character in one of the potential options.

The net effect is such that a suggestion which has the right characters, but in the wrong order, is preferred over a suggestion which has a differing number of characters.
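That net effect can be checked with a small self-contained sketch: a full-matrix edit distance where passing a transposition cost turns plain Levenshtein into the restricted Damerau variant. The function and weights are my illustration, using lowercased forms of the ForKeyword example since the current distance calculation is case-insensitive.

```typescript
// Sketch: full-matrix edit distance; passing a transposition cost turns
// plain Levenshtein into the restricted Damerau variant.
function editDistance(a: string, b: string, transposeCost?: number): number {
    const rows: number[][] = [[]];
    for (let j = 0; j <= b.length; j++) rows[0][j] = j;
    for (let i = 1; i <= a.length; i++) {
        rows[i] = [i];
        for (let j = 1; j <= b.length; j++) {
            const eq = a[i - 1] === b[j - 1];
            let min = Math.min(
                rows[i - 1][j] + 1,                // deletion
                rows[i][j - 1] + 1,                // insertion
                rows[i - 1][j - 1] + (eq ? 0 : 1), // substitution
            );
            if (transposeCost !== undefined && i > 1 && j > 1 &&
                a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
                min = Math.min(min, rows[i - 2][j - 2] + transposeCost);
            }
            rows[i][j] = min;
        }
    }
    return rows[a.length][b.length];
}

// Plain Levenshtein prefers "forkeyword" (1 insertion) over "ofkeyword"
// (2 substitutions); the Damerau variant prefers "ofkeyword" (one
// transposition at 0.95), i.e. right characters in the wrong order win.
editDistance("fokeyword", "forkeyword");       // plain: 1
editDistance("fokeyword", "ofkeyword");        // plain: 2
editDistance("fokeyword", "forkeyword", 0.95); // damerau: 1
editDistance("fokeyword", "ofkeyword", 0.95);  // damerau: 0.95
```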

Re:

improvement in some existing code base

I don't think an existing code base normally contains spelling errors, since it wouldn't compile then 😸

@RyanCavanaugh (Member)

Typos in the wild from StackOverflow (why can't Angular programmers spell?):

timeooutProvider → timeoutProvider
ngif → ngIf
angluar → angular
ation → action
itemStockPositonService → itemStockPositionService
parent.bankDrafts → $parent.bankDrafts
posts.categoria → post.categoria
HTTP → Http
colums → columns
lable → label
taskService → tasksService
extCheckbox → ext-Checkbox

http://stackoverflow.com/questions/25011510/how-can-i-inject-timeout-into-my-service/25011566#25011566
http://stackoverflow.com/questions/40426432/angular-2-cant-bind-to-ngif-since-it-isnt-a-known-property/40426449#40426449
http://stackoverflow.com/questions/43208950/angular2-typescript-error-ts2307-cannot-find-module-angular2-forms/43213158?s=44|1.0266#43213158
http://stackoverflow.com/questions/42393555/angular-2-data-binding-with-function-call/42394390#42394390
http://stackoverflow.com/questions/38849212/property-xxxservice-does-not-exists-on-type-xxxdetailspage/38849714#38849714
http://stackoverflow.com/questions/40702109/knockout-typescript-table-not-rendering-data/40703624#40703624
http://stackoverflow.com/questions/42867925/data-binding-angular-2-object-object-not-show-on-view/42868000#42868000
http://stackoverflow.com/questions/38227744/angular-2-typescript-cant-resolve-all-parameters-for-tooltipservice/38227787#38227787
http://stackoverflow.com/questions/39608919/ng2-table-not-working-with-latest-angular2-version/39609411?s=69|0.9176#39609411
http://stackoverflow.com/questions/37667397/reusing-ng2-components-in-different-nativescript-layout-positions/37697997?s=79|0.9001#37697997
http://stackoverflow.com/questions/38987032/angular-2-block-scoped-variable-issues/38990019?s=100|0.8663#38990019
http://stackoverflow.com/questions/39080823/how-to-use-angular-1-directives-in-angular-2/42252391?s=127|0.7663#42252391

@mhegazy (Contributor) commented May 23, 2017

Given the additional value we get from this change over the plain old Liechtenstein implementation, versus the complexity added, I would say it does not seem to be worth doing. In the future, we could upgrade to Metaphone 2 to get better suggestions here.

@mhegazy mhegazy closed this May 23, 2017
@sandersn (Member)

💡 "Liechtenstein" -- did you mean "Levenshtein"?

@microsoft microsoft locked and limited conversation to collaborators Jun 14, 2018