-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
be more liberal with escape sequences in the parser #93
Comments
This doesn't look like a panic to me---it looks like the parser is correctly producing an error. (If the parser is panic'ing, then that is a very different kind of bug!) I think the parser is just being overly strict in the escape sequences that it allows. If you try using |
Sorry, I got my terminology mixed up - I meant that the parser returns an error when I expected it not to return an error.
Edit: Encountered a few more escapes and flags that are not getting through:
|
The parser is conservative with escape sequences. They only work if it's a defined escape sequence (like By that description, it does certainly seem like But we could be more liberal. For example, none of My inclination here is to emulate other common regex parsers. Are they liberal with escapes too? If so, then I think that's good enough reason for us to be too. (I don't have time to do a survey right this minute, but I suspect they are.) |
I don't know enough about the various regex implementations to give an opinion, but here is the behavior that I've encountered. As I mentioned I've been working on implementing a user-agent parser in Rust using the BrowserScope regexes found here: https://github.com/ua-parser/uap-core. The base regexes are used in a lot of implementations, across different languages (see the project list here: https://github.com/ua-parser), so that case might be a good base for getting an idea of what the common behavior is. Here are a few cases in which the crate's behavior differs from the behavior of the other regex implementations that parse the same input:
Other regex implementations accept the above usage. From skimming the Python regex syntax it does look like escaping is reserved for special characters only, including |
It probably never will be. I'm not aware of any performant linear time algorithm that supports arbitrary look-ahead/look-behind. (This crate guarantees linear time worst case performance.) I think making the parser a little more liberal is reasonable, but expecting all regex implementations to accept a common syntax is probably unrealistic. For example, capture group naming and the way flags are passed to regexes vary greatly in the wild in my experience. |
@kzvezdarov I'm kind of curious how the Go implementation is working, since it also doesn't support arbitrary lookahead: http://play.golang.org/p/KkJw-Y2HYh --- The implementation linked should panic if it sees such a regexp. |
@BurntSushi I see, I didn't think about the time complexity of look-ahead/behind. In any case I was referring mostly to common behavior when it comes to escaping sequences, not to having a general common syntax. |
One way to fix this, as mentioned, is to simply allow escaping of any character. If the character doesn't have special significance, then the escape is a no-op. The problem with that approach is that it prevents us from adding any new escape sequences to the syntax. An alternative that might satisfy all constraints is to pick some set of characters that are typically special among regex engines (e.g., |
Chiming in here with an example from Python. While testing my Python binding, I hit this: In [29]: rure.compile(ur'appdata\\local\\google\\update\\googleupdate\.exe\" /c')
---------------------------------------------------------------------------
RegexSyntaxError Traceback (most recent call last)
<ipython-input-29-68ebcfd850bc> in <module>()
----> 1 rure.compile(ur'appdata\\local\\google\\update\\googleupdate\.exe\" /c')
/home/dblewett/.virtualenvs/rf-pypy/site-packages/rure/__init__.py in compile(pattern, flags, **options)
14
15 def compile(pattern, flags=0, **options):
---> 16 return RegexObject(pattern, flags=flags, **options)
17
18
/home/dblewett/.virtualenvs/rf-pypy/site-packages/rure/regex.py in __init__(self, pattern, flags, **options)
47 self.rure_flags = self.rure_flags | rure_flag
48
---> 49 self._rure = Rure(self.pattern, flags=self.rure_flags, **options)
50
51 names = [name for name in self.capture_names()]
/home/dblewett/.virtualenvs/rf-pypy/site-packages/rure/lib.py in __init__(self, re, _pointer, flags, **options)
99 len(re),
100 flags,
--> 101 self._opts
102 )
103 else:
/home/dblewett/.virtualenvs/rf-pypy/site-packages/rure/lib.py in checked_call(fn, err, *args)
46 return res
47 elif msg.startswith(b'Error parsing regex'):
---> 48 raise exceptions.RegexSyntaxError(msg)
49 elif msg.startswith(b'Compiled regex exceeds size limit'):
50 raise exceptions.CompiledTooBigError(msg)
RegexSyntaxError: Error parsing regex near '.exe\" /c' at character offset 50: Unrecognized escape sequence: '\"'. In contrast, the Python stdlib accepts this: In [30]: re.compile(ur'appdata\\local\\google\\update\\googleupdate\.exe\" /c')
Out[30]: re.compile(ur'appdata\\local\\google\\update\\googleupdate\.exe\" /c') From its documentation:
|
The Python docs seem under specified, since @davidblewett Could you explain what you expect to happen? |
In this case, it would be sufficient for my purposes to allow escaping of However, I have a feeling there are several more instances in the codebase I'm testing that escapes characters that don't technically need to be. Python seems much more liberal on just ignoring escapes that don't change the semantics of the expression. |
While I might be OK with selectively escaping certain characters (even if they aren't special), I am definitely firmly opposed to a liberal policy. |
Yeah, I agree with that. It means auditing our current expressions for over-escaping but that isn't a huge deal. I will probably add a note to the Python binding mentioning the more strict rules. The error message is pretty clear about what needs to be fixed. |
OK, I'm going to close this. I understand that other regex engines are more liberal and therefore it might be initially more painful to port regexes because of over escaping, but there are two pretty big mitigating factors here:
Aesthetically, it also seems nice to reject superfluous escapes because... who wants more escapes than what they actually need? :P |
A bit of background: I'm trying to implement the BrowserScope user agent parser as seen here: https://github.com/ua-parser/uap-core/. The regular expression below can be found here: https://github.com/ua-parser/uap-core/blob/master/regexes.yaml#L17
The crate (0.1.38) encounters an error compiling the following regex:
with the following error:
To reproduce:
The text was updated successfully, but these errors were encountered: