SequenceMatcher & autojunk - false negative #90825

jonathan-lp · 2022-02-06T21:58:46Z

BPO	46667
Nosy	@tim-one

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2022-02-06.21:58:45.529>
labels = ['3.8', 'type-bug', 'library']
title = 'SequenceMatcher & autojunk - false negative'
updated_at = <Date 2022-02-07.18:14:16.156>
user = 'https://bugs.python.org/jonathan-lp'

bugs.python.org fields:

activity = <Date 2022-02-07.18:14:16.156>
actor = 'tim.peters'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2022-02-06.21:58:45.529>
creator = 'jonathan-lp'
dependencies = []
files = []
hgrepos = []
issue_num = 46667
keywords = []
message_count = 6.0
messages = ['412673', '412675', '412676', '412677', '412752', '412779']
nosy_count = 2.0
nosy_names = ['tim.peters', 'jonathan-lp']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue46667'
versions = ['Python 3.8']

jonathan-lp · 2022-02-06T21:58:45Z

The following two strings are identical other than the text "UNIQUESTRING".
UNIQUESTRING is at the start of first and at the end of second.
Running the below gives the following output:

0.99830220713073
0.99830220713073
0.023769100169779286 # ratio

As you can see, Ratio is basically 0. Remove either of the UNIQUESTRING pieces and it goes up to 0.98 (correct)... Remove both and you get 1.0 (correct)

from difflib import SequenceMatcher

first = """
UNIQUESTRING
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum
"""


second = """

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum  UNIQUESTRING
"""

sm = SequenceMatcher(None, first, second, autojunk=False)
print(sm.real_quick_ratio())
print(sm.quick_ratio())
print(sm.ratio())

print()

sm2 = SequenceMatcher(None, second, first, autojunk=False)
print(sm2.real_quick_ratio())
print(sm2.quick_ratio())
print(sm2.ratio())

If I add autojunk=False, then I get a correct looking ratio (0.98...), however from my reading of the autojunk docs, UNIQUESTRING shouldn't be triggering it. Furthermore, looking in the code, as far as I can see autojunk is having no effect...

Autojunk considers these items to be "popular" in that string:
{'n', 'p', 'a', 'h', 'e', 'u', 'I', 'r', 'k', 'g', 'y', 'm', 'c', 'd', 't', 'l', 'o', 's', ' ', 'i'}

If I remove UNIQUESTRING from first, this is the autojunk popular set:
{'c', 'p', 'a', 'u', 'r', 'm', 'k', 'g', 'I', 'd', ' ', 'o', 'h', 't', 'e', 'i', 'l', 's', 'y', 'n'}

They're identical!

In both scenarios, b2j is also identical.

I don't pretend to understand what the module is doing in any detail, but this certainly seems like a false positive/negative.

Python 3.8.10

jonathan-lp · 2022-02-06T22:06:38Z

(Like the idiot I am, the example code is wrong. autojunk parameter should not be set for either of them to get the stated wrong results).

In place of "UNIQUESTRING", any unique 3 character string triggers it (QQQ, EEE, ZQU...). And in those cases you get a ratio of 0.008! (and 0.993 in the other direction!)

jonathan-lp · 2022-02-06T22:08:13Z

Gah. I mean 0.008 in both directions. I'm just going to be quiet now. :-)

tim-one · 2022-02-06T22:42:42Z

SequenceMatcher looks for the longest _contiguous_ match. "UNIQUESTRING" isn't the longest by far when autojunk is False, but is the longest when autojunk is True. All those bpopular characters then effectively prevent finding a longer match than 'QUESTR' (capital 'I" is also in bpopular) directly.

The effects of autojunk can be surprising, and it would have been better if it were False by default. But I don't see anything unexpected here. Learn from experience and force it to False yourself ;-) BTW, it was introduced as a way to greatly speed comparing files of code, viewing them as sequences of lines. In that context, autojunk is rarely surprising and usually helpful. But it more often backfires when comparing strings (viewed as sequences of characters) :-(

jonathan-lp · 2022-02-07T15:01:52Z

I still don't get how UNIQUESTRING is the longest even with autojunk=True, but that's an implementation detail and I'll trust you that it's working as expected.

Given this, I'd suggest the following then:

Autojunk=False should be the default unless there's some reason to believe SequenceMatcher is mostly used for code comparisons.
If - for whatever reason - the default can't be changed, I'd suggest a nice big docs "Warning" (at a minimum a "Note") saying something like "The default autojunk=True is not suitable for normal string comparison. See autojunk for more information".
Human-friendly doc explanation for autojunk. The current explanation is only going to be helpful to the tiny fraction of users who understand the algorithm. Your explanation is a good start:
"Autojunk was introduced as a way to greatly speed comparing files of code, viewing them as sequences of lines. But it more often backfires when comparing strings (viewed as sequences of characters)"

Put simply: The current docs aren't helpful to users who don't have text matching expertise, nor do they emphasise the huge caveat that autojunk=True raises.

tim-one · 2022-02-07T18:14:16Z

We can't change defaults without superb reason - Python has millions of users, and changing the output of code "that works" is almost always a non-starter.

Improvements to the docs are welcome.

In your example, try running this code after using autojunk=True:

pending = ""
for ch in first:
    if ch in sm.bpopular:
        if pending:
            print(repr(pending))
            pending = ""
    else:
        pending += ch
print(repr(pending))

That shows how first is effectively broken into tiny pieces given that the "popular" chaaracters act like walls. Here's the start of the output:

'\nUN'
'QUESTR'
'NG\nL'
'x'
'f'
'.'
'L'
'b'
"'"
'x'
'v'
'1500'
','

and on & on. `QUESTER' is the longest common contiguous substring remaining.

jeanas · 2023-02-12T00:57:55Z

It would help if get_close_matches supported an autojunk parameter. Right now, you have to copy its source code and add autojunk=False in the SequenceMatcher constructor.

jonathan-lp mannequin added 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Feb 6, 2022

ezio-melotti transferred this issue from another repository Apr 10, 2022

jeanas mentioned this issue Feb 12, 2023

Fuzzy matching sometimes misses obvious candidates python-babel/babel#969

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SequenceMatcher & autojunk - false negative #90825

SequenceMatcher & autojunk - false negative #90825

jonathan-lp mannequin commented Feb 6, 2022

jonathan-lp mannequin commented Feb 6, 2022

jonathan-lp mannequin commented Feb 6, 2022

jonathan-lp mannequin commented Feb 6, 2022

tim-one commented Feb 6, 2022

jonathan-lp mannequin commented Feb 7, 2022

tim-one commented Feb 7, 2022

jeanas commented Feb 12, 2023

SequenceMatcher & autojunk - false negative #90825

SequenceMatcher & autojunk - false negative #90825

Comments

jonathan-lp mannequin commented Feb 6, 2022

jonathan-lp mannequin commented Feb 6, 2022

jonathan-lp mannequin commented Feb 6, 2022

jonathan-lp mannequin commented Feb 6, 2022

tim-one commented Feb 6, 2022

jonathan-lp mannequin commented Feb 7, 2022

tim-one commented Feb 7, 2022

jeanas commented Feb 12, 2023