This repository has been archived by the owner on Mar 31, 2024. It is now read-only.

Confusing regexp in .textui.colored.clean() #170

Open
jwilk opened this issue Jan 26, 2017 · 3 comments

Comments

jwilk commented Jan 26, 2017

clint/textui/colored.py contains the following statement:

strip = re.compile("([^-_a-zA-Z0-9!@#%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")

It may seem that there are two sets of characters in the regexp: a very long one starting with [^-_a-z... and [^\s].
But actually there is only a single set of characters, with some characters (|, +, [, ] and ^) included twice.
This is super confusing.
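
A minimal sketch of why the set never closes where it appears to, assuming the pattern reaches the regexp engine exactly as in the URL-decoded form posted later in this thread. The names PATTERN and strip are only for illustration. In a non-raw string literal, `\\]` is a single backslash followed by `]`, which the engine reads as an escaped bracket rather than the end of the set:

```python
import re

# In a normal (non-raw) Python string literal, "\\" is ONE backslash, so the
# literal's tail  \\]+|[^\s]+  reaches the regexp engine as  \]+|[^\s]+ :
# an escaped "]" that does not close the character set.
assert "\\]" == r"\]"

# The pattern as the regexp engine receives it, written as a raw string.
PATTERN = r'''([^-_a-zA-Z0-9!@#%&=,/'";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\]+|[^\s]+)'''
strip = re.compile(PATTERN)

# Under a two-alternative reading, "|" would match via [^\s]+ and " " would
# match via the long set. Both are members of the single negated set instead,
# so both are rejected.
print(strip.fullmatch("|"))   # None
print(strip.fullmatch(" "))   # None

# A backslash is NOT a member of the set (it was used up escaping "]"),
# so a lone backslash is accepted.
print(strip.fullmatch("\\") is not None)   # True
```

So the long list does not exclude the backslash, even though the trailing `\\` makes it look as if it does.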

Could you rewrite the regexp in a way that leaves no doubt what it's supposed to mean?
Using raw triple-quoted strings and eliminating unnecessary backslashes should help.
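
One possible shape for such a rewrite, as a sketch only: it preserves what the current pattern actually matches, and whether that is the intended behaviour is not established in this thread. The names ORIGINAL and CLEAR are hypothetical and do not come from clint.

```python
import re

# The pattern as the regexp engine currently receives it.
ORIGINAL = re.compile(
    r'''([^-_a-zA-Z0-9!@#%&=,/'";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\]+|[^\s]+)'''
)

# Hypothetical rewrite: one negated set, each member listed once,
# backslashes only for "[", "]" and the \s class.
CLEAR = re.compile(r'''([^-_a-zA-Z0-9!@#%&=,/'";:~`$^*()+\[\].{}|?<>\s]+)''')

# Both patterns are "one or more characters from a single set", so agreeing
# on every individual character means they match the same strings.
mismatches = [
    chr(cp) for cp in range(0x3000)
    if bool(ORIGINAL.fullmatch(chr(cp))) != bool(CLEAR.fullmatch(chr(cp)))
]
print(mismatches)   # []
```

If the original author meant two alternatives, or meant to exclude the backslash, the set would need to change rather than just its spelling.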

The dubious regexp was found using pydiatra.

jorng commented Jan 26, 2017

Not really sure where you're seeing duplicates... that is a unique list of symbols / characters. If you separate it out, they are all unique:

-
_
a-z
A-Z
0-9
!
@
#
%
&
=
,
/
'
\"
;
:
~
`
\$
\^
\*
\(
\)
\+
\[
\]
\.
\{
\}
\|
\?
\<
\>
\\

Maybe a diagram would make more sense: railroad diagram

jwilk commented Jan 26, 2017

Haha, I told ya it's confusing. :-P
Try this URL instead:

>>> import re, urllib.parse
>>> strip = re.compile("([^-_a-zA-Z0-9!@#%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")
>>> print('https://regexper.com/#' + urllib.parse.quote(strip.pattern))
https://regexper.com/#%28%5B%5E-_a-zA-Z0-9%21%40%23%25%26%3D%2C/%27%22%3B%3A%7E%60%5C%24%5C%5E%5C%2A%5C%28%5C%29%5C%2B%5C%5B%5C%5D%5C.%5C%7B%5C%7D%5C%7C%5C%3F%5C%3C%5C%3E%5C%5D%2B%7C%5B%5E%5Cs%5D%2B%29

But in general, I wouldn't recommend using regexper.com as a debugging tool, because their regexp engine is almost certainly not the same as Python's.
This is a more robust (although not very readable) Python regexp debugging method:

>>> strip = re.compile("([^-_a-zA-Z0-9!@#%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)", re.DEBUG)
SUBPATTERN 1
  MAX_REPEAT 1 MAXREPEAT
    IN
      NEGATE None
      LITERAL 45
      LITERAL 95
      RANGE (97, 122)
      RANGE (65, 90)
      RANGE (48, 57)
      LITERAL 33
      LITERAL 64
      LITERAL 35
      LITERAL 37
      LITERAL 38
      LITERAL 61
      LITERAL 44
      LITERAL 47
      LITERAL 39
      LITERAL 34
      LITERAL 59
      LITERAL 58
      LITERAL 126
      LITERAL 96
      LITERAL 36
      LITERAL 94
      LITERAL 42
      LITERAL 40
      LITERAL 41
      LITERAL 43
      LITERAL 91
      LITERAL 93
      LITERAL 46
      LITERAL 123
      LITERAL 125
      LITERAL 124
      LITERAL 63
      LITERAL 60
      LITERAL 62
      LITERAL 93
      LITERAL 43
      LITERAL 124
      LITERAL 91
      LITERAL 94
      CATEGORY CATEGORY_SPACE
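
The duplicates are easier to spot once the code points are turned back into characters. A small sketch that works from the dump as posted here, since re.DEBUG output can differ between Python versions; the LITERALS list is transcribed by hand from the lines above:

```python
from collections import Counter

# LITERAL code points copied, in order, from the re.DEBUG dump above
# (the three RANGE entries and CATEGORY_SPACE are left out).
LITERALS = [
    45, 95, 33, 64, 35, 37, 38, 61, 44, 47, 39, 34, 59, 58, 126, 96,
    36, 94, 42, 40, 41, 43, 91, 93, 46, 123, 125, 124, 63, 60, 62,
    93, 43, 124, 91, 94,
]

counts = Counter(LITERALS)

# Characters that appear more than once in the single character set.
duplicates = sorted(chr(cp) for cp, n in counts.items() if n > 1)
print(duplicates)    # ['+', '[', ']', '^', '|']

# Code point 92 is the backslash; it never appears as a set member.
print(92 in counts)  # False
```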

jorng commented Jan 26, 2017

The link to the diagram is just for reference. I'm not using it to debug (what is there to debug?). You are the one that claims the regexp is confusing.

I still don't understand what is "confusing" about it. The expression is valid, and I'm not sure how one would simplify it.
