
Use damerau-levenshtein algorithm for edit distance #15689

Closed

Conversation

weswigham (Member)

This PR is opened against #15507 to propose using the Damerau-Levenshtein edit distance algorithm instead of plain Levenshtein distance. The primary differentiator is that the Damerau variant treats an adjacent transposition as a single edit operation, lowering the edit distance for such matches. Additionally, I've tweaked the algorithm to consider transpositions ever so slightly cheaper than additions/deletions/substitutions, so that a transposed similar name, all other options considered, is guaranteed to come first.

In effect, this means that in a situation like this one (attached as a test case to highlight the difference):

var twnty;
var twenty;
var which = tewnty; 

With normal Levenshtein we'd suggest twnty, but with Damerau-Levenshtein both twnty and twenty are equally scored. With my slight alteration to the weights, twenty ranks slightly cheaper and becomes the suggestion, which, I argue, is the better suggestion. In the general case, I believe a transposition is a more likely keyboard error than an omission, addition, or substitution, all other things equal; unfortunately I only have my gut and experience to back me up on this, not data.

A few new identifiers also got suggestions in the baselines with this change (which looks desirable), thanks to the often-lowered cost.

The downside of the Damerau-Levenshtein algorithm is that it needs to keep one extra row of the table around (to score a transposition, it needs the score from two rows back and two columns back), and it makes a few extra comparisons to check for transpositions. However, with the top-level bucketing of obviously bad matches, I think the extra cost is likely worth the improved suggestion ranking.

let tmp = current[j - 1] + INSERT_COST; // insertion
if (tmp < min) min = tmp;
if ((tmp = previous[j - 1] + (eq ? 0 : CHANGE_COST)) < min) min = tmp; // substitution
if ((tmp = (i > 1 && j > 1 && s1.charCodeAt(i - 1) === s2.charCodeAt(j - 2) && s1.charCodeAt(i - 2) === s2.charCodeAt(j - 1)) ? (prevprev[j - 2] + (eq ? 0 : TRANSPOSE_COST)) : Infinity) < min) min = tmp; // transposition
Member

To be honest, I'm lost on the above four lines. Splitting into a few more lines, explaining what those lines do, and using more descriptive variable names would definitely help here. 😃

Member Author

Sorry! This used to be in a Math.min call in the original implementation before I added the 4th case. Perf-wise, the manual comparisons are faster on most runtimes than the min call, and this was just easier to debug. :P

I'm not sure how to rename the variables usefully (tmp -> potentialDist?), but can see how a comment on each contribution would help. (First line is cost of a deletion, next an insert, then substitution, then transposition)
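The four candidate costs described above can be sketched as a self-contained rolling-row implementation. This is an illustrative sketch, not the PR's exact code: the function name, the rolling-array names, and the weight values are my own assumptions.

```typescript
// Sketch of restricted Damerau-Levenshtein (optimal string alignment) with
// rolling rows. Weights and names are illustrative, not the PR's exact code.
const INSERT_COST = 1;
const DELETE_COST = 1;
const CHANGE_COST = 1;
const TRANSPOSE_COST = 0.95; // slightly cheaper so transposed names win ties

function damerauLevenshtein(s1: string, s2: string): number {
    let previous: number[] = [];
    for (let j = 0; j <= s2.length; j++) previous[j] = j * INSERT_COST;
    let prevPrev: number[] = []; // the extra row of history the Damerau variant needs
    for (let i = 1; i <= s1.length; i++) {
        const current: number[] = [i * DELETE_COST];
        for (let j = 1; j <= s2.length; j++) {
            const eq = s1.charCodeAt(i - 1) === s2.charCodeAt(j - 1);
            let min = previous[j] + DELETE_COST;            // deletion
            let tmp = current[j - 1] + INSERT_COST;         // insertion
            if (tmp < min) min = tmp;
            tmp = previous[j - 1] + (eq ? 0 : CHANGE_COST); // substitution
            if (tmp < min) min = tmp;
            // Transposition: needs the score from two rows and two columns back.
            if (i > 1 && j > 1 &&
                s1.charCodeAt(i - 1) === s2.charCodeAt(j - 2) &&
                s1.charCodeAt(i - 2) === s2.charCodeAt(j - 1)) {
                tmp = prevPrev[j - 2] + TRANSPOSE_COST;
                if (tmp < min) min = tmp;
            }
            current[j] = min;
        }
        prevPrev = previous;
        previous = current;
    }
    return previous[s2.length];
}
```

With these weights, `damerauLevenshtein("tewnty", "twenty")` scores 0.95 (one transposition) while `damerauLevenshtein("tewnty", "twnty")` scores 1 (one deletion), which is the ranking flip described in the PR description.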

@DanielRosenwasser (Member)

I do like the new changes, though there's some new noise (Date <-> data, but that's expected).

@sandersn what are your thoughts?

@@ -196,7 +196,7 @@ tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(476,57): error T
tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(477,29): error TS2304: Cannot find name 'ValueLocation'.
tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(478,29): error TS2304: Cannot find name 'hasFlag'.
tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(478,55): error TS2304: Cannot find name 'VarFlags'.
tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(480,21): error TS2552: Cannot find name 'FieldSymbol'. Did you mean 'fieldSymbol'?
tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(480,21): error TS2304: Cannot find name 'FieldSymbol'.
Member

Regression

@weswigham (Member Author) May 9, 2017

This one disappeared because the other one, earlier in the file, had a suggestion found, and there's a cap on how many suggestions are produced?

Member

Ah, that's fine then

@@ -658,7 +658,7 @@ tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(828,13): error T
!!! error TS2304: Cannot find name 'StringHashTable'.
modType = new ModuleType(enclosedTypes, ambientEnclosedTypes);
~~~~~~~~~~
!!! error TS2304: Cannot find name 'ModuleType'.
!!! error TS2552: Cannot find name 'ModuleType'. Did you mean 'moduleDecl'?
Member

This seems too "far"

Member Author

Right now we're calculating edit distance in a case-insensitive way, so the edit distance here is 4, which is the maximum allowable distance for a string of length 10 (I didn't change that heuristic). TBQH I think this one would have appeared even before I changed the underlying algorithm. I also think there's value in including case in the distance heuristic - it's just more information.

In any case, since transpositions are cheaper, it's probably safe to lower the heuristic for maximum edit distance.

@sandersn (Member) commented May 9, 2017

@RyanCavanaugh points out that we should probably merge spelling-correction first. I'll do that and then comment on this PR.

@@ -1,4 +1,4 @@
tests/cases/conformance/internalModules/exportDeclarations/ModuleWithExportedAndNonExportedFunctions.ts(28,13): error TS2339: Property 'fn2' does not exist on type 'typeof A'.
tests/cases/conformance/internalModules/exportDeclarations/ModuleWithExportedAndNonExportedFunctions.ts(28,13): error TS2551: Property 'fn2' does not exist on type 'typeof A'. Did you mean 'fng'?
Member

After discussing what we want from this feature with @RyanCavanaugh, here's how we decided to summarise it: suggest corrections for misspellings that you "can't see".

That means we want very high precision (no false positives) and don't care about recall (it's OK to miss some fixes). It also means that short identifiers should only suggest a few things. Specifically, 3-character identifiers should only suggest additions, basically to cover the foo → foo_foo / foo1 case. fn2 → fng and bar → baz are obviously bad.
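The policy for short identifiers could be expressed as a filter like the following. This is a hypothetical sketch: the function names and the "subsequence" formalization of "additions only" are my own illustration, not the compiler's actual code.

```typescript
// Hypothetical sketch of the stated policy: for identifiers of 3 characters
// or fewer, only suggest candidates reachable by pure additions, i.e. the
// misspelled name must appear as a subsequence of the candidate.
function isSubsequence(short: string, long: string): boolean {
    let i = 0;
    for (const ch of long) {
        if (i < short.length && ch === short[i]) i++;
    }
    return i === short.length;
}

function allowSuggestion(misspelled: string, candidate: string): boolean {
    if (misspelled.length > 3) return true; // longer names: normal distance rules
    return isSubsequence(misspelled, candidate); // short names: additions only
}
```

Under this rule, `allowSuggestion("foo", "foo1")` and `allowSuggestion("foo", "foo_foo")` pass, while `allowSuggestion("fn2", "fng")` and `allowSuggestion("bar", "baz")` are rejected, matching the examples above.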

Member Author

We can actually adjust the algorithm to reflect this! We can just reduce the cost of an addition relative to a substitution, such that a suggestion with one character added is preferred to a suggestion with one character substituted.

The reason this suggestion was added, on inspection (since it doesn't use any transpositions), is because you weighted a substitution as 2 edits in your algorithm, whereas when I rewrote it I kept it at 1. Should I just revert the weight of substitutions to 2?

@weswigham (Member Author) May 9, 2017

Ah this is the same reason why the other result @RyanCavanaugh pointed out appeared, too. Previously it had a distance of 8 with the old weighting of substitutions, which is why when I reverted it to 1 it appeared!

Member

Yes. Substitutions should be more expensive than inserts and deletes.

Member

Yeah, re-baseline with the boosted sub cost and I'll take another look at the results.

@weswigham (Member Author)

@sandersn I've matched the weight you had for substitutions; now the baseline noise is nil.

@weswigham weswigham force-pushed the damerau-ive-come-to-bargain branch from 31e0d20 to ed873e3 Compare May 9, 2017 17:43
@weswigham weswigham changed the base branch from spelling-correction to master May 9, 2017 17:43
@weswigham (Member Author)

@sandersn rebased onto master and swapped the PR to be against master.

@sandersn (Member) left a comment

I don't think we should merge this until we know that special-casing substitutions will help with real misspellings.

Couple of notes:

  1. We should have a RWC entry of some realy mispleled javascipt code bases so we can test suggestions. I'm not sure where to get these since github generally hosts "correct" code.
  2. My experience with research in dialectology [1] showed that basic Levenshtein distance is pretty good, and complicating the basic algorithm improves only a little. That result might not apply to programming-language spelling suggestions though!
  3. Because of (2), I'd rather experiment with Double Metaphone to see whether its suggestions would be noticeably better. I didn't want to add a dependency to typescript, though.

[1] http://www.tandfonline.com/doi/abs/10.1080/09296170802514138

if (distance > worstDistance) {
continue;
}
if (distance < 3) {
if (distance <= MIN_LEV_DIST) {
Member

For efficiency's sake, we should keep the early exit at 2, to include insertions (and double insertions).

Member Author

That would make my example above revert to the prior behavior where we pick twnty over twenty because it's "good enough", when it's really not.
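For context, the candidate scan under discussion can be sketched roughly as follows. Everything here is a hypothetical illustration: the function name, the 0.4 length cap, and the threshold parameter are my own stand-ins for the bucketing and early exit described in this thread, not the compiler's actual code.

```typescript
// Hypothetical sketch of the candidate scan: bucket out obviously bad
// matches with a length-relative cap, track the best candidate so far, and
// bail out early when one is within the early-exit threshold.
function pickSuggestion(
    misspelled: string,
    candidates: string[],
    distance: (a: string, b: string) => number,
    earlyExitThreshold: number,
): string | undefined {
    const cap = Math.floor(misspelled.length * 0.4); // illustrative cap
    let best: string | undefined;
    let bestDistance = Infinity;
    for (const candidate of candidates) {
        const d = distance(misspelled, candidate);
        if (d > cap) continue;                         // obviously bad match
        if (d <= earlyExitThreshold) return candidate; // good enough, stop scanning
        if (d < bestDistance) {
            bestDistance = d;
            best = candidate;
        }
    }
    return best;
}
```

The trade-off debated above lives in `earlyExitThreshold`: a higher value exits sooner but can lock in a "good enough" candidate before a slightly cheaper transposed match is seen.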

const TRANSPOSE_COST = 0.95; // Transpositions slightly cheaper than insertion/deletion/substitution to promote transposed errors
export const MIN_LEV_DIST = TRANSPOSE_COST;
export function dameraulevenshtein(s1: string, s2: string): number {
if (s1.length > s2.length) [s1, s2] = [s2, s1]; // Swap arguments so s1 is always the shorter one, to use minimal memory
Member

I don't think optimisations of the n**2 algorithm are as worthwhile as just not calling it as much. (Of course I don't actually know — measurement is the right thing to do here.)

@weswigham (Member Author) May 9, 2017

We were already reusing our arrays to save memory (and complicating the code a little to do so), so why not go all the way and use the minimal memory for the algorithm?

@weswigham (Member Author) commented May 9, 2017

@sandersn You mean transpositions, right? Since substitutions you were already special-casing with the increased weight?

Anyways, while I agree that a bigram approach might yield the best suggestions, and that basic Levenshtein distance is decent (I've used it), Damerau-Levenshtein distance is a common improvement for estimating keyboard-based editing errors (it is just a Levenshtein distance that scores transpositions as a cheaper edit).

While I ended up editing the original code to use comparisons instead of Math.min and to use constants for weights, the only real differences between the original algorithm and the Damerau variant are the extra row of history and the extra check that detects transpositions and reduces their cost. I also provided a compelling example where strict Levenshtein distance produces a subpar suggestion by comparison; I think the compute cost of the improvement is actually pretty low.

I've seen variants with configurable per-pair weights (e.g., a substitution of f for g can cost less than f for m, with weight based on keyboard distance), which could again slightly improve suggestions, but generating those weights is way more work than I feel is reasonable (and would require including a big table of weights in our source).

@sandersn (Member) commented May 9, 2017

Yep, I meant transpositions.

I do agree that worrying about the efficiency of an n**2 algorithm doesn't make sense. It's more important to me to keep the code simple.

I'd like to see more convincing data to justify the additional code complexity. Specifically (1) improvement in some existing code base (2) a more convincing example.

For (2), we're not suggesting arbitrary corrections — there's a small lexical space for most errors. I'd like to see a convincing pair of declarations that differs by one character, and a mistake that transposes the characters of the longer identifier.

I thought something like

var ex1Interface
var exInterface

But I can't come up with a convincing transposition. Maybe `exI1nterface`?

@weswigham (Member Author) commented May 9, 2017

@sandersn I end up transposing characters I type all the time. Common transpositions (ironically, since I've probably done it twice while writing this) include swapping n and s (hands type independently; one hand types a bit faster than the other, so characters come out of order). Also (again, a transposition error I made and corrected while typing this) o and f while writing of.

So, for example:

var ForKeyword;
var OfKeyword;
var which = fOKeyword;

I do also think we're discarding valuable information by discarding case information 🤷‍♂️ .

Re:

But I can't come up with a convincing transposition. Maybe `exI1nterface`?

There's only a difference between the algorithms if the potential transposition overlaps with an added/deleted character in one of the potential options.

The net effect is such that a suggestion which has the right characters, but in the wrong order, is preferred over a suggestion which has a differing number of characters.
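That net effect can be checked with a small self-contained sketch: a full-matrix edit distance where passing a transposition cost turns plain Levenshtein into the restricted Damerau variant. The function and weights are my illustration, using lowercased forms of the ForKeyword example since the current distance calculation is case-insensitive.

```typescript
// Sketch: full-matrix edit distance; passing a transposition cost turns
// plain Levenshtein into the restricted Damerau variant.
function editDistance(a: string, b: string, transposeCost?: number): number {
    const rows: number[][] = [[]];
    for (let j = 0; j <= b.length; j++) rows[0][j] = j;
    for (let i = 1; i <= a.length; i++) {
        rows[i] = [i];
        for (let j = 1; j <= b.length; j++) {
            const eq = a[i - 1] === b[j - 1];
            let min = Math.min(
                rows[i - 1][j] + 1,                // deletion
                rows[i][j - 1] + 1,                // insertion
                rows[i - 1][j - 1] + (eq ? 0 : 1), // substitution
            );
            if (transposeCost !== undefined && i > 1 && j > 1 &&
                a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
                min = Math.min(min, rows[i - 2][j - 2] + transposeCost);
            }
            rows[i][j] = min;
        }
    }
    return rows[a.length][b.length];
}

// Plain Levenshtein prefers "forkeyword" (1 insertion) over "ofkeyword"
// (2 substitutions); the Damerau variant prefers "ofkeyword" (one
// transposition at 0.95), i.e. right characters in the wrong order win.
editDistance("fokeyword", "forkeyword");       // plain: 1
editDistance("fokeyword", "ofkeyword");        // plain: 2
editDistance("fokeyword", "forkeyword", 0.95); // damerau: 1
editDistance("fokeyword", "ofkeyword", 0.95);  // damerau: 0.95
```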

Re:

improvement in some existing code base

I don't think an existing code base normally contains spelling errors, since it wouldn't compile then 😸

@RyanCavanaugh (Member)

Typos in the wild from StackOverflow (why can't Angular programmers spell?):

timeooutProvider → timeoutProvider
ngif → ngIf
angluar → angular
ation → action
itemStockPositonService → itemStockPositionService
parent.bankDrafts → $parent.bankDrafts
posts.categoria → post.categoria
HTTP → Http
colums → columns
lable → label
taskService → tasksService
extCheckbox → ext-Checkbox

http://stackoverflow.com/questions/25011510/how-can-i-inject-timeout-into-my-service/25011566#25011566
http://stackoverflow.com/questions/40426432/angular-2-cant-bind-to-ngif-since-it-isnt-a-known-property/40426449#40426449
http://stackoverflow.com/questions/43208950/angular2-typescript-error-ts2307-cannot-find-module-angular2-forms/43213158?s=44|1.0266#43213158
http://stackoverflow.com/questions/42393555/angular-2-data-binding-with-function-call/42394390#42394390
http://stackoverflow.com/questions/38849212/property-xxxservice-does-not-exists-on-type-xxxdetailspage/38849714#38849714
http://stackoverflow.com/questions/40702109/knockout-typescript-table-not-rendering-data/40703624#40703624
http://stackoverflow.com/questions/42867925/data-binding-angular-2-object-object-not-show-on-view/42868000#42868000
http://stackoverflow.com/questions/38227744/angular-2-typescript-cant-resolve-all-parameters-for-tooltipservice/38227787#38227787
http://stackoverflow.com/questions/39608919/ng2-table-not-working-with-latest-angular2-version/39609411?s=69|0.9176#39609411
http://stackoverflow.com/questions/37667397/reusing-ng2-components-in-different-nativescript-layout-positions/37697997?s=79|0.9001#37697997
http://stackoverflow.com/questions/38987032/angular-2-block-scoped-variable-issues/38990019?s=100|0.8663#38990019
http://stackoverflow.com/questions/39080823/how-to-use-angular-1-directives-in-angular-2/42252391?s=127|0.7663#42252391

@mhegazy (Contributor) commented May 23, 2017

Given the additional value we get from this change over the plain old Liechtenstein implementation, versus the complexity added, I would say it does not seem to be worth doing. In the future, we could upgrade to Metaphone 2 to get better suggestions here.

@mhegazy mhegazy closed this May 23, 2017
@sandersn (Member)

💡 "Liechtenstein" -- did you mean "Levenshtein"?

@microsoft microsoft locked and limited conversation to collaborators Jun 14, 2018