Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

findspam.py: body_text_repeated(): phrase repeated at beginning of body #7002

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 28 additions & 0 deletions findspam.py
Original file line number Diff line number Diff line change
Expand Up @@ -794,6 +794,34 @@ def misleading_link(s, site):
return False, ''


# noinspection PyUnusedLocal,PyMissingTypeHints,PyTypeChecker
@create_rule("text repeated in {}", title=False, body_summary=True, max_rep=10000, max_score=10000)
def body_text_repeated(s, site):
"""
Do some hacks to reduce the need for regex backtracking for this rule
"""
s = s.rstrip("\n")
if s.startswith("<p>") and s.endswith("</p>"):
s = s[3:-4]
initial_words = regex.match(r"\A([^\W_]+)[\W_]+([^\W_]+)[\W_]+([^\W_]+)", s)
if not initial_words:
return False, ""
escaped_initial_words = [regex.escape(x) for x in initial_words.groups()]
period = regex.match(
Copy link
Contributor

@makyen makyen May 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add regex.cache_all(False) prior to this regex.match() and regex.cache_all(True) after it to prevent the single-use regexes compiled here from filling the regex package's internal cache of compiled regular expressions.

Note: this is probably something which we should be doing in a substantial number of places in the code, but I was reminded when creating PR #7012 that it should be used here. Basically, we should be doing that wherever we use a regex which is unique per-post. We should also change things in individual functions/modules to separately store compiled regexes which are used repeatedly in order to reduce the number which we rely on the regex package to cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've struck-out my earlier comments here and closed an open PR I had, as regex.cache_all() is the wrong way to handle this here, or anywhere in SD, due to threading. I'll be writing a helper function which can be used for explicit regex compiles to force the implicit regex cache to not be used. I'll update this again once I've created that (soon). I'll also create an issue in the regex package asking to expose an option to be able to not use the implicit cache on a per-compile basis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened a PR, #7025, which adds regex_compile_no_cache() to helpers.py and imports it to findspam.py. That function can be used to compile a regular expression and not have it placed in the regex package's implicit cache. Using that function, the code here could be something like:

    period_regex = regex_compile_no_cache(
        r"\A%s[\W_]+%s[\W_]+%s[\W_]+(.{1,40}?)%s[\W_]+%s[\W_]+%s(?=$|[\W_])" % (
            tuple(escaped_initial_words * 2))
    period = period_regex.match(s)
    if not period:
        return False, ""
    period_words = regex.split(r"[\W_]+", period.groups(0)[0])
    escaped_words = escaped_initial_words + [
        regex.escape(x) for x in period_words]
    repeats_regex = regex_compile_no_cache(r"\A(" + r"[\W_]+".join(escaped_words) + r"[\W_]*){10,}")
    repeats = repeats_regex.match(s)

r"\A%s[\W_]+%s[\W_]+%s[\W_]+(.{1,40}?)%s[\W_]+%s[\W_]+%s(?=$|[\W_])" % (
tuple(escaped_initial_words * 2)), s)
if not period:
return False, ""
period_words = regex.split(r"[\W_]+", period.groups(0)[0])
escaped_words = escaped_initial_words + [
regex.escape(x) for x in period_words]
repeats_regex = r"\A(" + r"[\W_]+".join(escaped_words) + r"[\W_]*){10,}"
repeats = regex.match(repeats_regex, s)
Copy link
Contributor

@makyen makyen May 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same request here regarding regex.cache_all(False) prior to this regex.match() and regex.cache_all(True) after it.

Copy link
Contributor

@makyen makyen May 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously, it would be possible to use only a single pair of regex.cache_all(False) and regex.cache_all(True) to cover both post-specific uses of a regular expression.

This also touches on the larger issue of how we use the regex package's cache, particularly within the detections. The larger issue is that we don't want to be using it for regexes which are individually generated specific to each post, but we do want any static regular expressions not to have to be recompiled for every use (e.g. through using the regex package's cache). There are a couple of different approaches which we could take in general. I'll open an Issue to cover/discuss the topic.

if repeats:
return True, "Body contains repeated phrase '%s'" % repeats.groups(0)[0]
return False, ""


# noinspection PyUnusedLocal,PyMissingTypeHints,PyTypeChecker
@create_rule("repeating words in {}", max_rep=11, stripcodeblocks=True)
def has_repeating_words(s, site):
Expand Down
1 change: 1 addition & 0 deletions test/test_findspam.py
Original file line number Diff line number Diff line change
Expand Up @@ -136,6 +136,7 @@
('homoglyph phone numbers 07', '<p>Some 1-844i8O2i7S3S fbody</p>', 'a username', 'math.stackexchange.com', False, False, True),
('homoglyph phone numbers 08', '<p>Some 844-8O2-7S3S foobody</p>', 'a username', 'math.stackexchange.com', False, False, True),
('Multiple consecutive homoglyph numbers 1', '<p>SomeI-888-884-Olll 888-884-OIII +I-972-S34-S446 972-S34-S446 I-628-21S-2I66 628-21S-2l66 1-844i8O2i7S3S 844a8O2a7S3S body</p>', 'a username', 'math.stackexchange.com', False, False, True),
('repeated body test', 'need enough interesting text to avoid few unique characters rule' * 15, 'luser', 'stackoverflow.com', False, False, True),
])
def test_findspam(title, body, username, site, body_is_summary, is_answer, expected_spam):
post = Post(api_response={'title': title, 'body': body,
Expand Down