Use damerau-levenshtein algorithm for edit distance #15689
Conversation
src/compiler/utilities.ts (outdated diff)
let tmp = current[j - 1] + INSERT_COST;
if (tmp < min) min = tmp;
if ((tmp = previous[j - 1] + (eq ? 0 : CHANGE_COST)) < min) min = tmp;
if ((tmp = (i > 1 && j > 1 && s1.charCodeAt(i - 1) === s2.charCodeAt(j - 2) && s1.charCodeAt(i - 2) === s2.charCodeAt(j - 1)) ? (prevprev[j - 2] + (eq ? 0 : TRANSPOSE_COST)) : Infinity) < min) min = tmp;
To be honest, I'm lost on the above four lines. Splitting into a few more lines, explaining what those lines do, and using more descriptive variable names would definitely help here. 😃
Sorry! This used to be in a `Math.min` call in the original implementation before I added the 4th case. Perf-wise, the manual comparisons are faster on most runtimes than the `min` call, and this was just easier to debug. :P
I'm not sure how to rename the variables usefully (`tmp` -> `potentialDist`?), but I can see how a comment on each contribution would help. (The first line is the cost of a deletion, the next an insert, then substitution, then transposition.)
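For reference, here is one way the excerpt could read with a comment per contribution, as suggested. This is only a sketch against the snippet above: `min`, `eq`, `i`, `j`, `s1`, `s2`, the three rows, and the cost constants are all assumed to come from the surrounding loop, and `min` is assumed to already hold the candidate cost computed on the line just above the excerpt.

// Candidate: reach cell (i, j) with an insertion.
let potentialDist = current[j - 1] + INSERT_COST;
if (potentialDist < min) min = potentialDist;

// Candidate: reach cell (i, j) with a substitution (free when the current characters match).
potentialDist = previous[j - 1] + (eq ? 0 : CHANGE_COST);
if (potentialDist < min) min = potentialDist;

// Candidate: reach cell (i, j) with a transposition. Only valid when the last two
// characters of the s1 prefix are the last two of the s2 prefix in swapped order.
const isTransposition = i > 1 && j > 1 &&
    s1.charCodeAt(i - 1) === s2.charCodeAt(j - 2) &&
    s1.charCodeAt(i - 2) === s2.charCodeAt(j - 1);
if (isTransposition) {
    potentialDist = prevprev[j - 2] + (eq ? 0 : TRANSPOSE_COST);
    if (potentialDist < min) min = potentialDist;
}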
I do like the new changes, though there's some new noise. @sandersn what are your thoughts?
@@ -196,7 +196,7 @@ tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(476,57): error T
 tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(477,29): error TS2304: Cannot find name 'ValueLocation'.
 tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(478,29): error TS2304: Cannot find name 'hasFlag'.
 tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(478,55): error TS2304: Cannot find name 'VarFlags'.
-tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(480,21): error TS2552: Cannot find name 'FieldSymbol'. Did you mean 'fieldSymbol'?
+tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(480,21): error TS2304: Cannot find name 'FieldSymbol'.
Regression
This one disappeared because the other one earlier in the file got a suggestion, and there's a cap on how many suggestions are produced?
Ah, that's fine then
@@ -658,7 +658,7 @@ tests/cases/conformance/parser/ecmascript5/parserRealSource7.ts(828,13): error T
 !!! error TS2304: Cannot find name 'StringHashTable'.
     modType = new ModuleType(enclosedTypes, ambientEnclosedTypes);
               ~~~~~~~~~~
-!!! error TS2304: Cannot find name 'ModuleType'.
+!!! error TS2552: Cannot find name 'ModuleType'. Did you mean 'moduleDecl'?
This seems too "far"
Since right now we're calculating edit distance in a case-insensitive way, the edit distance is 4, which is the maximum allowable distance for a string of length 10 (I didn't change that heuristic). TBQH I think this one should have appeared even before I changed the underlying algorithm. I also think there's value in including case in the distance heuristic - it's just more information.
In any case, since transpositions are cheaper, it's probably safe to lower the heuristic for maximum edit distance.
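For context, the exact cap isn't quoted in this thread; a sketch of what such a heuristic could look like is below, where the 0.4 factor is an assumption chosen only because it matches "distance 4 for a length-10 name".

// Hypothetical cap on suggestion distance: candidates further than ~40% of the
// misspelled name's length are rejected. For a 10-character name this allows
// distances up to 4, matching the case discussed above.
function worstAllowableDistance(name: string): number {
    return Math.floor(name.length * 0.4);
}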
@RyanCavanaugh points out that we should probably merge
@@ -1,4 +1,4 @@
-tests/cases/conformance/internalModules/exportDeclarations/ModuleWithExportedAndNonExportedFunctions.ts(28,13): error TS2339: Property 'fn2' does not exist on type 'typeof A'.
+tests/cases/conformance/internalModules/exportDeclarations/ModuleWithExportedAndNonExportedFunctions.ts(28,13): error TS2551: Property 'fn2' does not exist on type 'typeof A'. Did you mean 'fng'?
After discussing what we want from this feature with @RyanCavanaugh, here's how we decided to summarise it: suggest corrections for misspellings that you "can't see".
That means we want very high precision (no false positives) and don't care about recall (it's ok to miss some fixes). It also means that short identifiers should only suggest a few things. Specifically, 3-character identifiers should only suggest additions, basically to cover the `foo` → `_foo` / `foo1` case. `fn2` → `fng` and `bar` → `baz` are obvious.
We can actually adjust the algorithm to reflect this! We can just reduce the cost of an addition relative to a substitution, such that a suggestion with one character added is preferred to a suggestion with one character substituted.
The reason this suggestion was added, on inspection (since it doesn't use any transpositions), is that you weighted a substitution as 2 edits in your algorithm, whereas when I rewrote it I set it to 1. Should I just revert the weight of substitutions to 2?
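A sketch of the relative weights being discussed (the constant names are assumptions; only the transposition value appears verbatim later in this thread):

// Hypothetical cost constants. With substitutions weighted as two edits, a
// one-character addition ("foo" -> "foo1", cost 1) ranks ahead of a
// one-character substitution ("fn2" -> "fng", cost 2), which is the behaviour
// described above.
const INSERT_COST = 1;
const DELETE_COST = 1;
const CHANGE_COST = 2;       // substitution weighted as two edits
const TRANSPOSE_COST = 0.95; // transposition slightly cheaper than a single edit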
Ah, this is the same reason the other result @RyanCavanaugh pointed out appeared, too. Previously it had a distance of 8 with the old weighting of substitutions, which is why it appeared when I lowered the weight to 1!
Yes. Substitutions should be more expensive than inserts and deletes.
Yeah, re-baseline with the boosted sub cost and I'll take another look at the results.
@sandersn I've matched the weight you had for substitutions; now the baseline noise is nil.
Force-pushed from 31e0d20 to ed873e3
@sandersn rebased onto master and swapped the PR to be against master.
I don't think we should merge this until we know that special-casing substitutions will help with real misspellings.
Couple of notes:
1. We should have a RWC entry of some realy mispleled javascipt code bases so we can test suggestions. I'm not sure where to get these since github generally hosts "correct" code.
2. My experience with research in dialectology [1] showed that basic Levenshtein distance is pretty good, and complicating the basic algorithm improves only a little. That result might not apply to programming-language spelling suggestions though!
3. Because of (2), I'd rather experiment with Double Metaphone to see whether its suggestions would be noticeably better. I didn't want to add a dependency to typescript, though.
[1] http://www.tandfonline.com/doi/abs/10.1080/09296170802514138
 if (distance > worstDistance) {
     continue;
 }
-if (distance < 3) {
+if (distance <= MIN_LEV_DIST) {
for efficiency's sake, we should keep the early exit at 2, to include insertions (and double insertions).
That would make my example above revert to the prior behavior where we pick `twnty` over `twenty` because it's "good enough", when it's really not.
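To make the trade-off concrete, here is a rough sketch of the candidate loop under discussion (everything except `dameraulevenshtein` and `MIN_LEV_DIST` is an assumed name, not the PR's actual code, and the `worstDistance` pre-filter from the diff above is omitted):

function bestSuggestion(misspelling: string, candidates: readonly string[]): string | undefined {
    let bestDistance = Infinity;
    let best: string | undefined;
    for (const candidate of candidates) {
        const distance = dameraulevenshtein(misspelling, candidate);
        if (distance >= bestDistance) continue; // not an improvement over what we have
        bestDistance = distance;
        best = candidate;
        // Early exit: with MIN_LEV_DIST === TRANSPOSE_COST (0.95), only an exact
        // transposition match can short-circuit the search, so a distance-1 candidate
        // seen earlier can still be displaced by a cheaper transposed match later.
        // An early exit at 2 would accept any distance-1 or distance-2 candidate as
        // "good enough" the moment it is seen.
        if (distance <= MIN_LEV_DIST) break;
    }
    return best;
}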
const TRANSPOSE_COST = 0.95; // Transpositions slightly cheaper than insertion/deletion/substitution to promote transposed errors
export const MIN_LEV_DIST = TRANSPOSE_COST;
export function dameraulevenshtein(s1: string, s2: string): number {
    if (s1.length > s2.length) [s1, s2] = [s2, s1]; // Swap arguments so s2 is always the short one to use minimal memory
I don't think optimisations of the n**2 algorithm are as worthwhile as just not calling it as much. (Of course I don't actually know — measurement is the right thing to do here.)
We were already reusing our arrays to save memory (and complicating the code a little to do so), so why not go all the way and use the minimal memory for the algorithm?
@sandersn You mean transpositions, right? Since substitutions you were already special-casing with the increased weight? Anyways, while I agree that a bigram approach might yield the best suggestions, and I also agree that basic Levenshtein distance is decent (I've used it), the Damerau-Levenshtein distance is a common improvement for estimating keyboard-based editing errors (it is just Levenshtein distance that treats a transposition as a single, cheaper edit). I did end up editing the original code to use comparisons instead of `Math.min`, as noted above.
Yep, I meant transpositions. I do agree that worrying about the efficiency of an n**2 algorithm doesn't make sense. It's more important to me to keep the code simple. I'd like to see more convincing data to justify the additional code complexity. Specifically, (1) improvement in some existing code base, (2) a more convincing example. For (2), we're not suggesting arbitrary corrections - there's a small lexical space for most errors. I'd like to see a convincing pair of declarations that differs by one character, and a mistake that transposes the characters of the longer identifier. I thought of something like

var ex1Interface
var exInterface

but I can't come up with a convincing transposition. Maybe `exI1nterface`?
@sandersn I end up transposing characters I type all the time; common transpositions for me include swapping adjacent letters in an identifier (ironically, since I've probably done it twice while writing this). So, for example:

var ForKeyword;
var OfKeyword;
var which = fOKeyword;

I do also think we're discarding valuable information by discarding case information 🤷.

Re: complicating the basic algorithm improving only a little - there's only a difference between the algorithms if the potential transposition overlaps with an added/deleted character in one of the potential options. The net effect is such that a suggestion which has the right characters, but in the wrong order, is preferred over a suggestion which has a differing number of characters.

Re: improvement in some existing code base - I don't think an existing code base normally contains spelling errors, since it wouldn't compile then 😸
Weighing the additional value we get from this change over the plain old Liechtenstein implementation against the complexity added, I would say it does not seem to be worth doing. In the future, we could upgrade to Metaphone 2 to get better suggestions here.
💡 "Liechtenstein" -- did you mean "Levenshtein"? |
This PR is opened against #15507 to propose using the Damerau-Levenshtein edit distance algorithm instead of plain Levenshtein distance - the primary differentiator being that the Damerau variant finds adjacent transpositions and counts them as a single edit operation, lowering the edit distance for that kind of mistake. Additionally, I've further tweaked the algorithm to consider transpositions ever so slightly cheaper than additions/deletions/substitutions, so that a transposed similar name, all else being equal, is guaranteed to come first.
In effect, this means that in a situation like this one (attached as a test case to highlight the difference):
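The attached test case isn't reproduced here; the sketch below is a reconstruction of the kind of situation being described, where the misspelling `twnety` is an assumed identifier chosen because it is one deletion away from `twnty` and one adjacent transposition away from `twenty`:

declare var twnty: number;
declare var twenty: number;

// 'twnety' is one deletion from 'twnty' but only one transposition from 'twenty'.
const total = twnety + 1; // error TS2552: Cannot find name 'twnety'. Did you mean ...?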
With normal levenshtein we'd suggest `twnty`, but with `damerau-levenshtein` both `twnty` and `twenty` are equally scored, and with my slight alteration to the weights, `twenty` ranks slightly cheaper and becomes the suggestion, which, I argue, is the better suggestion (and in the general case, I believe a transposition to be a more likely error on a keyboard than an omission, addition, or substitution, all other things equal - unfortunately I only have my gut and experience to back me up on this, not data).

A few new things also got suggestions in the baselines with this change (desirable, looks like) thanks to the often lowered cost.
The downside to the `damerau-levenshtein` algorithm is that it needs to keep an extra row of the table around (since to score a transposition it needs the score from two rows and two columns back), and it needs to make a few extra comparisons to check for transpositions. However, with the top-level bucketing of obviously bad matches, I think the extra cost is likely worth the improved suggestion ranking.
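For readers who want the whole recurrence in one place, here is a compact, self-contained sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance under the weights discussed in this thread. It is illustrative only, not the code from src/compiler/utilities.ts: that version reuses its rows and uses manual comparisons rather than Math.min, and the constant names and values other than TRANSPOSE_COST are assumptions.

const INSERT_COST = 1;
const DELETE_COST = 1;
const CHANGE_COST = 2;       // substitutions weighted as two edits
const TRANSPOSE_COST = 0.95; // transpositions slightly cheaper than any single edit

function damerauLevenshtein(s1: string, s2: string): number {
    let prevPrev = new Array<number>(s2.length + 1).fill(0); // row i - 2 (unused until i > 1)
    let previous = new Array<number>(s2.length + 1);          // row i - 1
    let current = new Array<number>(s2.length + 1);           // row i
    // Row 0: turn the empty prefix of s1 into prefixes of s2 by insertions alone.
    for (let j = 0; j <= s2.length; j++) previous[j] = j * INSERT_COST;
    for (let i = 1; i <= s1.length; i++) {
        current[0] = i * DELETE_COST; // column 0: delete every character of the s1 prefix
        for (let j = 1; j <= s2.length; j++) {
            const eq = s1.charCodeAt(i - 1) === s2.charCodeAt(j - 1);
            let min = previous[j] + DELETE_COST;                           // delete s1[i - 1]
            min = Math.min(min, current[j - 1] + INSERT_COST);             // insert s2[j - 1]
            min = Math.min(min, previous[j - 1] + (eq ? 0 : CHANGE_COST)); // substitute
            if (i > 1 && j > 1 &&
                s1.charCodeAt(i - 1) === s2.charCodeAt(j - 2) &&
                s1.charCodeAt(i - 2) === s2.charCodeAt(j - 1)) {
                // The extra row mentioned above: a transposition reads two rows and two columns back.
                min = Math.min(min, prevPrev[j - 2] + TRANSPOSE_COST);
            }
            current[j] = min;
        }
        [prevPrev, previous, current] = [previous, current, prevPrev]; // rotate the three rows
    }
    return previous[s2.length]; // after the final rotation the last computed row is in `previous`
}

// With these weights, the reconstructed example above works out as described:
// damerauLevenshtein("twnety", "twnty")  === 1     (one deletion)
// damerauLevenshtein("twnety", "twenty") === 0.95  (one transposition), so 'twenty' wins.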