Confusing regexp in .textui.colored.clean() #170
Comments
Not really sure where you're seeing duplicates... that is a unique list of symbols / characters. If you separate it out, they are all unique:
Maybe a diagram would make more sense: railroad diagram
Haha, I told ya it's confusing. :-P
>>> strip = re.compile("([^-_a-zA-Z0-9!@#%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")
>>> print('https://regexper.com/#' + urllib.parse.quote(strip.pattern))
https://regexper.com/#%28%5B%5E-_a-zA-Z0-9%21%40%23%25%26%3D%2C/%27%22%3B%3A%7E%60%5C%24%5C%5E%5C%2A%5C%28%5C%29%5C%2B%5C%5B%5C%5D%5C.%5C%7B%5C%7D%5C%7C%5C%3F%5C%3C%5C%3E%5C%5D%2B%7C%5B%5E%5Cs%5D%2B%29
But in general, I wouldn't recommend using regexper.com as a debugging tool, because their regexp engine is almost certainly not the same as the Python one.
>>> strip = re.compile("([^-_a-zA-Z0-9!@#%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)", re.DEBUG)
SUBPATTERN 1
MAX_REPEAT 1 MAXREPEAT
IN
NEGATE None
LITERAL 45
LITERAL 95
RANGE (97, 122)
RANGE (65, 90)
RANGE (48, 57)
LITERAL 33
LITERAL 64
LITERAL 35
LITERAL 37
LITERAL 38
LITERAL 61
LITERAL 44
LITERAL 47
LITERAL 39
LITERAL 34
LITERAL 59
LITERAL 58
LITERAL 126
LITERAL 96
LITERAL 36
LITERAL 94
LITERAL 42
LITERAL 40
LITERAL 41
LITERAL 43
LITERAL 91
LITERAL 93
LITERAL 46
LITERAL 123
LITERAL 125
LITERAL 124
LITERAL 63
LITERAL 60
LITERAL 62
LITERAL 93
LITERAL 43
LITERAL 124
LITERAL 91
LITERAL 94
CATEGORY CATEGORY_SPACE
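To make the tail of that dump easier to read, here is a tiny sketch that turns the last five LITERAL code points (the ones after `LITERAL 62`, i.e. after `>`) back into characters. The names `tail` and `decoded` are just for illustration.

```python
# Code points copied from the end of the re.DEBUG dump above,
# i.e. everything listed between ">" (62) and CATEGORY_SPACE.
tail = [93, 43, 124, 91, 94]

# Convert each code point back to the character it stands for.
decoded = "".join(chr(cp) for cp in tail)
print(decoded)  # ]+|[^
```

So the `]+|[^` that looks like "end of set, repeat, alternation, start of a new set" is listed as plain members of the same IN block, followed by `\s`.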
The link to the diagram is just for reference. I'm not using it to debug (what is there to debug?). You are the one that claims the regexp is confusing. I still don't understand what is "confusing" about it. The expression is valid, and I'm not sure how one would simplify it.
clint/textui/colored.py contains the following statement:
strip = re.compile("([^-_a-zA-Z0-9!@#%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")
It may seem that there are two sets of characters in the regexp: a very long one starting with [^-_a-z... and [^\s]. But actually there is only a single set of characters, with some characters (|, +, [, ] and ^) included twice.
This is super confusing.
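A rough sketch of how this can be checked from Python. The raw string below is my transcription of the pattern as the `re` module receives it (in the non-raw source string, `\\]` collapses to `\]`, an escaped bracket that does not close the set). `lookalike` and `sample` are hypothetical, made up only to contrast the behaviour with a real two-alternative pattern.

```python
import re

# In a normal string literal, "\\]" is a backslash followed by "]",
# which the regexp engine reads as an escaped, literal "]".
assert "\\]" == r"\]"

# The pattern value as the re module sees it (transcribed as a raw string).
strip = re.compile(
    r'''([^-_a-zA-Z0-9!@#%&=,/'";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\]+|[^\s]+)'''
)

# Hypothetical pattern with a genuine alternation, shortened to letters only.
lookalike = re.compile(r"([^a-z]+|[^\s]+)")

sample = "\x1b[31mred text\x1b[0m"

# Only characters outside the single big set (here: ESC) are removed.
print(repr(strip.sub("", sample)))          # '[31mred text[0m'

# A real "set | set" alternation would remove everything instead.
print(repr(lookalike.sub("", "red text")))  # ''
```

If the `|` were really an alternation, the second branch would match every non-whitespace run, so the substitution would wipe out the text. It does not, which agrees with the single IN block in the re.DEBUG output.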
Could you rewrite the regexp in a way that leaves no doubt what it's supposed to mean?
Using raw triple-quoted strings and eliminating unnecessary backslashes should help.
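For illustration, a possible rewrite along those lines. This is only a sketch that keeps the current behaviour; whether the current behaviour is what was intended is a separate question. `old`, `new` and `same_behaviour` are hypothetical names, and I dropped the capturing group on the assumption that it does not matter when the pattern is only used for substitution.

```python
import re

# Current pattern, transcribed as a raw string (what the re module receives).
old = re.compile(
    r'''([^-_a-zA-Z0-9!@#%&=,/'";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\]+|[^\s]+)'''
)

# Sketch of an equivalent single set: raw triple-quoted string, duplicates
# removed, backslashes kept only for "[", "]" and the \s class.
new = re.compile(r'''[^-_a-zA-Z0-9!@#%&=,/'";:~`$^*()+\[\].{}|?<>\s]+''')

def same_behaviour(limit=0x3000):
    """Check that both patterns accept exactly the same single characters."""
    return all(
        bool(old.fullmatch(chr(cp))) == bool(new.fullmatch(chr(cp)))
        for cp in range(limit)
    )

print(same_behaviour())
```

The per-character comparison is enough here because both patterns are just "one or more characters from a negated set", so they agree on strings whenever they agree on every single character.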
The dubious regexp was found using pydiatra.