Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SequenceMatcher & autojunk - false negative #90825

Open
jonathan-lp mannequin opened this issue Feb 6, 2022 · 7 comments
Open

SequenceMatcher & autojunk - false negative #90825

jonathan-lp mannequin opened this issue Feb 6, 2022 · 7 comments
Labels
3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@jonathan-lp
Copy link
Mannequin

jonathan-lp mannequin commented Feb 6, 2022

BPO 46667
Nosy @tim-one

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2022-02-06.21:58:45.529>
labels = ['3.8', 'type-bug', 'library']
title = 'SequenceMatcher & autojunk - false negative'
updated_at = <Date 2022-02-07.18:14:16.156>
user = 'https://bugs.python.org/jonathan-lp'

bugs.python.org fields:

activity = <Date 2022-02-07.18:14:16.156>
actor = 'tim.peters'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2022-02-06.21:58:45.529>
creator = 'jonathan-lp'
dependencies = []
files = []
hgrepos = []
issue_num = 46667
keywords = []
message_count = 6.0
messages = ['412673', '412675', '412676', '412677', '412752', '412779']
nosy_count = 2.0
nosy_names = ['tim.peters', 'jonathan-lp']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue46667'
versions = ['Python 3.8']

@jonathan-lp
Copy link
Mannequin Author

jonathan-lp mannequin commented Feb 6, 2022

The following two strings are identical other than the text "UNIQUESTRING".
UNIQUESTRING is at the start of first and at the end of second.
Running the below gives the following output:

0.99830220713073
0.99830220713073
0.023769100169779286 # ratio

0.99830220713073
0.99830220713073
0.023769100169779286 # ratio

As you can see, Ratio is basically 0. Remove either of the UNIQUESTRING pieces and it goes up to 0.98 (correct)... Remove both and you get 1.0 (correct)

from difflib import SequenceMatcher

first = """
UNIQUESTRING
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum
"""


second = """

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum  UNIQUESTRING
"""

sm = SequenceMatcher(None, first, second, autojunk=False)
print(sm.real_quick_ratio())
print(sm.quick_ratio())
print(sm.ratio())

print()

sm2 = SequenceMatcher(None, second, first, autojunk=False)
print(sm2.real_quick_ratio())
print(sm2.quick_ratio())
print(sm2.ratio())

If I add autojunk=False, then I get a correct looking ratio (0.98...), however from my reading of the autojunk docs, UNIQUESTRING shouldn't be triggering it. Furthermore, looking in the code, as far as I can see autojunk is having no effect...

Autojunk considers these items to be "popular" in that string:
{'n', 'p', 'a', 'h', 'e', 'u', 'I', 'r', 'k', 'g', 'y', 'm', 'c', 'd', 't', 'l', 'o', 's', ' ', 'i'}

If I remove UNIQUESTRING from first, this is the autojunk popular set:
{'c', 'p', 'a', 'u', 'r', 'm', 'k', 'g', 'I', 'd', ' ', 'o', 'h', 't', 'e', 'i', 'l', 's', 'y', 'n'}

They're identical!

In both scenarios, b2j is also identical.

I don't pretend to understand what the module is doing in any detail, but this certainly seems like a false positive/negative.

Python 3.8.10

@jonathan-lp jonathan-lp mannequin added 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Feb 6, 2022
@jonathan-lp
Copy link
Mannequin Author

jonathan-lp mannequin commented Feb 6, 2022

(Like the idiot I am, the example code is wrong. autojunk parameter should not be set for either of them to get the stated wrong results).

In place of "UNIQUESTRING", any unique 3 character string triggers it (QQQ, EEE, ZQU...). And in those cases you get a ratio of 0.008! (and 0.993 in the other direction!)

@jonathan-lp
Copy link
Mannequin Author

jonathan-lp mannequin commented Feb 6, 2022

Gah. I mean 0.008 in both directions. I'm just going to be quiet now. :-)

@tim-one
Copy link
Member

tim-one commented Feb 6, 2022

SequenceMatcher looks for the longest _contiguous_ match. "UNIQUESTRING" isn't the longest by far when autojunk is False, but is the longest when autojunk is True. All those bpopular characters then effectively prevent finding a longer match than 'QUESTR' (capital 'I" is also in bpopular) directly.

The effects of autojunk can be surprising, and it would have been better if it were False by default. But I don't see anything unexpected here. Learn from experience and force it to False yourself ;-) BTW, it was introduced as a way to greatly speed comparing files of code, viewing them as sequences of lines. In that context, autojunk is rarely surprising and usually helpful. But it more often backfires when comparing strings (viewed as sequences of characters) :-(

@jonathan-lp
Copy link
Mannequin Author

jonathan-lp mannequin commented Feb 7, 2022

I still don't get how UNIQUESTRING is the longest even with autojunk=True, but that's an implementation detail and I'll trust you that it's working as expected.

Given this, I'd suggest the following then:

  • Autojunk=False should be the default unless there's some reason to believe SequenceMatcher is mostly used for code comparisons.

  • If - for whatever reason - the default can't be changed, I'd suggest a nice big docs "Warning" (at a minimum a "Note") saying something like "The default autojunk=True is not suitable for normal string comparison. See autojunk for more information".

  • Human-friendly doc explanation for autojunk. The current explanation is only going to be helpful to the tiny fraction of users who understand the algorithm. Your explanation is a good start:
    "Autojunk was introduced as a way to greatly speed comparing files of code, viewing them as sequences of lines. But it more often backfires when comparing strings (viewed as sequences of characters)"

Put simply: The current docs aren't helpful to users who don't have text matching expertise, nor do they emphasise the huge caveat that autojunk=True raises.

@tim-one
Copy link
Member

tim-one commented Feb 7, 2022

We can't change defaults without superb reason - Python has millions of users, and changing the output of code "that works" is almost always a non-starter.

Improvements to the docs are welcome.

In your example, try running this code after using autojunk=True:

pending = ""
for ch in first:
    if ch in sm.bpopular:
        if pending:
            print(repr(pending))
            pending = ""
    else:
        pending += ch
print(repr(pending))

That shows how first is effectively broken into tiny pieces given that the "popular" chaaracters act like walls. Here's the start of the output:

'\nUN'
'QUESTR'
'NG\nL'
'x'
'f'
'.'
'L'
'b'
"'"
'x'
'v'
'1500'
','

and on & on. `QUESTER' is the longest common contiguous substring remaining.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@jeanas
Copy link
Contributor

jeanas commented Feb 12, 2023

It would help if get_close_matches supported an autojunk parameter. Right now, you have to copy its source code and add autojunk=False in the SequenceMatcher constructor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

2 participants