Infinite loop on re.findall #101

Closed
ncfavier opened this issue Dec 15, 2021 · 4 comments

@ncfavier

Sometimes clipster goes into 100% CPU usage. I attached gdb to the Python process and found that it was stuck in the call to re.findall. I don't know if this is a Python bug or an instance of catastrophic backtracking. I have extract_patterns disabled so the only potential culprits are the regexes for URIs and emails, r'\b\S+://\S+\b' and r'\b\S+\@\S+\.\S+\b'. The latter looks like it could be problematic because \S matches @ and ., but I couldn't find a pathological input.
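One way to hunt for a pathological input is to time re.findall on inputs of growing length and watch how the runtime scales. The sketch below uses the two patterns quoted above; the helper names (time_findall, probe) and the repeated unit 'a@' are assumptions made for illustration and are not taken from clipster. The idea behind 'a@' is that it contains many @ characters but no '.', so both \S+ groups have many ways to split the text before the match finally fails.

```python
import re
import time

# The two patterns quoted in the report above.
URI_RE = re.compile(r'\b\S+://\S+\b')
EMAIL_RE = re.compile(r'\b\S+\@\S+\.\S+\b')


def time_findall(pattern, text):
    """Run pattern.findall(text) and return (matches, elapsed_seconds)."""
    start = time.perf_counter()
    matches = pattern.findall(text)
    return matches, time.perf_counter() - start


def probe(pattern, unit, sizes=(50, 100, 200)):
    """Time findall on unit repeated n times, for each n in sizes.

    Returns a list of (input_length, elapsed_seconds) pairs. The default
    sizes are kept small on purpose so the probe itself cannot hang.
    """
    results = []
    for n in sizes:
        text = unit * n
        _, elapsed = time_findall(pattern, text)
        results.append((len(text), elapsed))
    return results


if __name__ == "__main__":
    # Hypothetical probe input: many '@' characters, no '.' anywhere.
    for length, seconds in probe(EMAIL_RE, "a@"):
        print(f"{length:6d} chars: {seconds:.4f}s")
```

If the elapsed time grows much faster than the input length as the sizes double, that points to backtracking in the pattern rather than a bug in Python. A large clipboard selection of that shape could then look like an infinite loop even though the match would eventually finish.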

I'm disabling extract_uris and extract_emails as a workaround.

mrichar1 added the bug label Dec 15, 2021
@mrichar1
Owner

Hi - thanks for putting in the time to do some debugging!

I can well believe this is a backtracking issue. Both regexes leave room for 'mis-matching': for URLs it could be something weird like http::////http::////.com

The easy fix is probably to choose better regexes for these. I'll look at patching these today, and then we probably just have to wait and see if the infinite loop resurfaces.

Just going to mention #86 as it may be related...

@mrichar1
Owner

That's clipster updated with some new regexes - I've tried to find ones which are specific enough but not so horrifically complicated that they'll lead to more problems. Now we just have to wait and see if the bug is still reproducible!
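For readers wondering what "specific enough" might look like, here is a minimal sketch. These are not the regexes that went into clipster; the patterns and the names STRICT_URI_RE, STRICT_EMAIL_RE and extract are hypothetical. The point is only that each piece uses a character class that excludes its own delimiter ('@', '.' or ':'), so the engine has far fewer ways to split the same text than with \S+.

```python
import re

# Hypothetical stricter patterns, for illustration only (not clipster's).
# URI: a scheme made of letters, digits, '+', '.', '-', then '://', then a
# run of characters that excludes whitespace, angle brackets and quotes.
STRICT_URI_RE = re.compile(r'\b[A-Za-z][A-Za-z0-9+.-]*://[^\s<>"\']+')

# Email: the local part cannot contain '@', and each domain label cannot
# contain '.', so there is only one way to place the delimiters.
STRICT_EMAIL_RE = re.compile(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b')


def extract(text):
    """Return (uris, emails) found in text using the stricter patterns."""
    return STRICT_URI_RE.findall(text), STRICT_EMAIL_RE.findall(text)


if __name__ == "__main__":
    print(extract("see https://example.com/x and mail user@example.com now"))
    print(extract("http::////http::////.com"))
```

The trade-off is that stricter patterns reject some inputs the \S+ versions accepted, for example the odd http::////http::////.com string mentioned earlier no longer counts as a URI.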

@ncfavier
Author

Haven't run into it again, closing

@ncfavier
Author

Oh but I still have extraction disabled, so it means nothing. Guess no one else ran into it in the meantime :)
