SequenceMatcher & autojunk - false negative #90825
The following two strings are identical other than the text "UNIQUESTRING". 0.99830220713073 0.99830220713073 As you can see, Ratio is basically 0. Remove either of the UNIQUESTRING pieces and it goes up to 0.98 (correct)... Remove both and you get 1.0 (correct)
If I add Autojunk considers these items to be "popular" in that string: If I remove UNIQUESTRING from They're identical! In both scenarios, I don't pretend to understand what the module is doing in any detail, but this certainly seems like a false positive/negative. Python 3.8.10
(Like the idiot I am, the example code is wrong.) In place of "UNIQUESTRING", any unique 3-character string triggers it (QQQ, EEE, ZQU...). And in those cases you get a ratio of 0.008! (and 0.993 in the other direction!)
Gah. I mean 0.008 in both directions. I'm just going to be quiet now. :-)
SequenceMatcher looks for the longest _contiguous_ match. "UNIQUESTRING" isn't the longest by far when autojunk is False, but is the longest when autojunk is True. All those bpopular characters then effectively prevent finding a longer match than 'QUESTR' (capital 'I' is also in bpopular) directly. The effects of autojunk can be surprising, and it would have been better if it were False by default. But I don't see anything unexpected here. Learn from experience and force it to False yourself ;-) BTW, it was introduced as a way to greatly speed up comparing files of code, viewing them as sequences of lines. In that context, autojunk is rarely surprising and usually helpful. But it more often backfires when comparing strings (viewed as sequences of characters) :-(
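A minimal sketch of the effect described above, using made-up strings rather than the ones from the report. It assumes a second sequence of at least 200 characters, which is the length at which the autojunk heuristic kicks in, and a unique marker (here `Q`) placed at the start of one string and at the end of the other:

```python
from difflib import SequenceMatcher

# Hypothetical inputs, not the strings from the report: the same
# 300-character body, with a unique marker at opposite ends.
body = "ab" * 150
first = "Q" + body
second = body + "Q"

# autojunk=True is the default. Because `second` has 200+ elements,
# 'a' and 'b' are treated as "popular" and dropped from the match index.
with_autojunk = SequenceMatcher(None, first, second)
without_autojunk = SequenceMatcher(None, first, second, autojunk=False)

print(sorted(with_autojunk.bpopular))  # the characters autojunk set aside
print(with_autojunk.ratio())           # tiny: only the marker can anchor a match
print(without_autojunk.ratio())        # close to 1: the whole body matches
```

With autojunk on, the only character left to seed a match is the marker, and since it sits at opposite ends of the two strings the match cannot be extended into the shared body.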
I still don't get how UNIQUESTRING is the longest even with autojunk=True, but that's an implementation detail and I'll trust you that it's working as expected. Given this, I'd suggest the following then:
Put simply: The current docs aren't helpful to users who don't have text matching expertise, nor do they emphasise the huge caveat that autojunk=True raises.
We can't change defaults without superb reason - Python has millions of users, and changing the output of code "that works" is almost always a non-starter. Improvements to the docs are welcome. In your example, try running this code after using autojunk=True:
That shows how '\nUN' and on & on. 'QUESTR' is the longest common contiguous substring remaining.
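The snippet referred to above is not reproduced here. As a rough sketch of that kind of inspection (the helper name `unpopular_runs` and the sample text are made up for illustration), the following splits a sequence at every character in `bpopular` and lists the runs that remain available for seeding a match:

```python
from difflib import SequenceMatcher


def unpopular_runs(seq, popular):
    """Return the maximal runs of `seq` that contain no element of `popular`."""
    runs, pending = [], ""
    for ch in seq:
        if ch in popular:
            # A popular character ends the current run, if there is one.
            if pending:
                runs.append(pending)
                pending = ""
        else:
            pending += ch
    if pending:
        runs.append(pending)
    return runs


# Hypothetical sample: 400 characters of repeated lowercase text plus a marker.
text = "the quick brown fox " * 20 + "UNIQUESTRING"
sm = SequenceMatcher(None, text, text)  # autojunk=True by default

runs = unpopular_runs(text, sm.bpopular)
print(runs)
```

In this sample every lowercase letter and the space end up in `bpopular`, so only the uppercase marker survives as a run. In the reported strings, capital 'I' was itself popular, which would cut the marker into shorter pieces.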
It would help if |