Infinite loop on re.findall #101

ncfavier · 2021-12-15T06:09:59Z

Sometimes clipster goes into 100% CPU usage. I attached gdb to the Python process and found that it was stuck in the call to re.findall. I don't know if this is a Python bug or an instance of catastrophic backtracking. I have extract_patterns disabled so the only potential culprits are the regexes for URIs and emails, r'\b\S+://\S+\b' and r'\b\S+\@\S+\.\S+\b'. The latter looks like it could be problematic because \S matches @ and ., but I couldn't find a pathological input.

I'm disabling extract_uris and extract_emails as a workaround.
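For anyone who wants to hunt for a pathological input, here is a minimal timing sketch using the two patterns quoted above. The helper names (`time_findall`, `probe`) and the `"a@"` test unit are made up for illustration and are not part of clipster. The idea is to feed the regex one long whitespace-free token of repeated delimiters, so the first `\S+` has to retry at every `@`, and to watch how the time grows as the token doubles in length. This only probes for bad scaling; it does not claim to reproduce the hang reported here.

```python
import re
import time

# The two patterns quoted in the issue, unchanged.
URI_RE = re.compile(r'\b\S+://\S+\b')
EMAIL_RE = re.compile(r'\b\S+\@\S+\.\S+\b')


def time_findall(pattern, text):
    """Return (elapsed_seconds, matches) for one findall call."""
    start = time.perf_counter()
    matches = pattern.findall(text)
    return time.perf_counter() - start, matches


def probe(pattern, unit, sizes=(50, 100, 200)):
    """Time findall on `unit` repeated n times, for each n in `sizes`.

    `unit` is a hypothetical building block such as "a@": repeating it
    gives a single token with many candidate split points and no final
    match, which forces the regex engine to backtrack.
    """
    results = []
    for n in sizes:
        elapsed, matches = time_findall(pattern, unit * n)
        results.append((n, elapsed, len(matches)))
    return results


if __name__ == "__main__":
    for n, secs, count in probe(EMAIL_RE, "a@"):
        print(f"n={n:4d}  {secs:.4f}s  matches={count}")
```

If the elapsed time grows much faster than the input length, the pattern is a plausible culprit for long clipboard contents, even when no single short input looks pathological.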


mrichar1 · 2021-12-15T09:30:25Z

Hi - thanks for putting in the time to do some debugging!

I can well believe this is a backtracking issue - both regexes leave room for 'mis-matching', because \S can also swallow the delimiters the pattern is looking for. For URLs it could be something weird like http::////http::////.com

The easy fix is probably to choose better regexes for these. I'll look at patching these today, and then we probably just have to wait and see if the infinite loop resurfaces.

Just going to mention #86 as it may be related...

mrichar1 · 2021-12-15T10:24:26Z

That's clipster updated with some new regexes - I've tried to find ones which are specific enough, but not so horrifically complicated that they'll lead to more problems. Now we just have to wait and see if the bug is still reproducible!
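To illustrate what "specific enough" can mean here, the sketch below shows one way to tighten the two patterns. These are hypothetical examples, not the regexes from the commit referenced in this issue, and the names `STRICT_URI_RE` and `STRICT_EMAIL_RE` are invented. The assumption is that the ambiguity comes from `\S` being able to match the separators themselves, so each part gets a character class that excludes its own delimiter.

```python
import re

# Hypothetical tightened patterns, NOT the ones committed to clipster.

# Scheme: a letter followed by letters, digits, '+', '.' or '-'.
# The scheme cannot contain ':' or '/', so for a given start position
# there is only one place the "://" separator can sit.
STRICT_URI_RE = re.compile(r'\b[A-Za-z][A-Za-z0-9+.\-]*://\S+\b')

# Local part excludes '@'; domain labels exclude '@' and '.', so the
# '@' and each '.' can only be matched by the literal separators.
STRICT_EMAIL_RE = re.compile(r'\b[^\s@]+@[^\s@.]+(?:\.[^\s@.]+)+\b')


def extract(text):
    """Return (uris, emails) found in `text` using the stricter patterns."""
    return STRICT_URI_RE.findall(text), STRICT_EMAIL_RE.findall(text)


if __name__ == "__main__":
    uris, emails = extract("see https://example.com/x or mail user@example.com")
    print(uris, emails)
```

The trade-off is that stricter classes reject some inputs the original patterns accepted, such as the odd http::////http::////.com string mentioned earlier, so it is worth checking them against real clipboard contents before relying on them.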

ncfavier · 2022-03-14T22:11:35Z

Haven't run into it again, closing

ncfavier · 2022-03-14T22:14:11Z

Oh but I still have extraction disabled, so it means nothing. Guess no one else ran into it in the meantime :)

mrichar1 added the bug label Dec 15, 2021

mrichar1 added a commit that referenced this issue Dec 15, 2021

Update regexes to avoid potential catastrophic backtracking (#101).

98a1625

ncfavier closed this as completed Mar 14, 2022

mrichar1 mentioned this issue Apr 28, 2022

Double URL entries #60

Open
