be more liberal with escape sequences in the parser #93

kzvezdarov · 2015-06-21T15:50:22Z

A bit of background: I'm trying to implement the BrowserScope user agent parser as seen here: https://github.com/ua-parser/uap-core/. The regular expression below can be found here: https://github.com/ua-parser/uap-core/blob/master/regexes.yaml#L17

The crate (0.1.38) encounters an error compiling the following regex:

/((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|Tailsweep)[ \-](\d+)(?:\.(\d+)(?:\.(\d+))?)?

with the following error:

Syntax(Error { pos: 92, surround: "p)[ \\-](\\d", kind: UnrecognizedEscape('-') })'

To reproduce:

use regex::Regex;
let re = Regex::new(r"/((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|Tailsweep)[ \-](\d+)(?:\.(\d+)(?:\.(\d+))?)?").unwrap();

The text was updated successfully, but these errors were encountered:

BurntSushi · 2015-06-21T15:55:18Z

This doesn't look like a panic to me---it looks like the parser is correctly producing an error. (If the parser is panic'ing, then that is a very different kind of bug!)

I think the parser is just being overly strict in the escape sequences that it allows. If you try using [- ] instead, does that work?

kzvezdarov · 2015-06-21T16:19:57Z

Sorry, I got my terminology mixed up - I meant that the parser returns an error when I expected it not to return an error.

[- ] does work. Is there a way to relax the strictness?

Edit: Encountered a few more escapes and flags that are not getting through:

Unrecognized escape sequence: '\/'.
Unrecognized flag: '!'. (Allowed flags: i, s, m, U, x.)
Unrecognized escape sequence: '\ '.

BurntSushi · 2015-06-21T23:28:38Z

The parser is conservative with escape sequences. They only work if it's a defined escape sequence (like \b) or if they are escaping characters that are interpreted specially by the regex. See the source here for all the details.

By that description, it does certainly seem like - should be escapeable, since it has special significance in character classes.

But we could be more liberal. For example, none of !, / and ever have special significance in a regex (for this crate). But it still might be convenient to permit any character to be escaped, even if it doesn't have to be.

My inclination here is to emulate other common regex parsers. Are they liberal with escapes too? If so, then I think that's good enough reason for us to be too. (I don't have time to do a survey right this minute, but I suspect they are.)

kzvezdarov · 2015-06-22T00:52:37Z

I don't know enough about the various regex implementations to give an opinion, but here is the behavior that I've encountered.

As I mentioned I've been working on implementing a user-agent parser in Rust using the BrowserScope regexes found here: https://github.com/ua-parser/uap-core. The base regexes are used in a lot of implementations, across different languages (see the project list here: https://github.com/ua-parser), so that case might be a good base for getting an idea of what the common behavior is.

Here are a few cases in which the crate's behavior differs from the behavior of the other regex implementations that parse the same input:

Escaping -:
e.g. (Obigo)\-Browser near igo)\-Brow, character offset 8.
Escaping /:
e.g. (Sony)(?:BDP\/|\/)?([^ /;\)]+)[ /;\)] near :BDP\/|\/), character offset 13.
Does not recognize flag !:
e.g. Windows Phone [^;]+; (?:(?!IEMobile/).)*IEMobile/[^;\)]+[;\)] ?(?:ARM; ?Touch; ?|Touch; ?|WpsLondonTest; ?)?([^;]+); *([^;,\)]+) near (?:(?!IEMo, character offset 26.

Other regex implementations accept the above usage. From skimming the Python regex syntax it does look like escaping is reserved for special characters only, including -, /. Lastly, negating grouping flags (as in case 3) is not yet supported by the crate?

BurntSushi · 2015-06-22T01:32:37Z

Lastly, negating grouping flags (as in case 3) is not yet supported by the crate?

It probably never will be. I'm not aware of any performant linear time algorithm that supports arbitrary look-ahead/look-behind. (This crate guarantees linear time worst case performance.)

I think making the parser a little more liberal is reasonable, but expecting all regex implementations to accept a common syntax is probably unrealistic. For example, capture group naming and the way flags are passed to regexes vary greatly in the wild in my experience.

BurntSushi · 2015-06-22T01:39:23Z

@kzvezdarov I'm kind of curious how the Go implementation is working, since it also doesn't support arbitrary lookahead: http://play.golang.org/p/KkJw-Y2HYh --- The implementation linked should panic if it sees such a regexp.

kzvezdarov · 2015-06-22T02:56:20Z

@BurntSushi I see, I didn't think about the time complexity of look-ahead/behind. In any case I was referring mostly to common behavior when it comes to escaping sequences, not to having a general common syntax.
The Go implementation does not support arbitrary lookahead. Their submodule is using the 0.4 release from Nov 2014.

BurntSushi · 2016-03-13T19:07:03Z

One way to fix this, as mentioned, is to simply allow escaping of any character. If the character doesn't have special significance, then the escape is a no-op.

The problem with that approach is that it prevents us from adding any new escape sequences to the syntax.

An alternative that might satisfy all constraints is to pick some set of characters that are typically special among regex engines (e.g., ! would be one), and permit no-op escaping for those. Other escape sequences would be disallowed as they are today, which gives us some wiggle room for growth. (Hopefully not necessary, but leaving that door open seems wise.)

davidblewett · 2016-11-16T14:34:42Z

Chiming in here with an example from Python. While testing my Python binding, I hit this:

In [29]: rure.compile(ur'appdata\\local\\google\\update\\googleupdate\.exe\" /c')
---------------------------------------------------------------------------
RegexSyntaxError                          Traceback (most recent call last)
<ipython-input-29-68ebcfd850bc> in <module>()
----> 1 rure.compile(ur'appdata\\local\\google\\update\\googleupdate\.exe\" /c')

/home/dblewett/.virtualenvs/rf-pypy/site-packages/rure/__init__.py in compile(pattern, flags, **options)
     14 
     15 def compile(pattern, flags=0, **options):
---> 16     return RegexObject(pattern, flags=flags, **options)
     17 
     18 

/home/dblewett/.virtualenvs/rf-pypy/site-packages/rure/regex.py in __init__(self, pattern, flags, **options)
     47                     self.rure_flags = self.rure_flags | rure_flag
     48 
---> 49         self._rure = Rure(self.pattern, flags=self.rure_flags, **options)
     50 
     51         names = [name for name in self.capture_names()]

/home/dblewett/.virtualenvs/rf-pypy/site-packages/rure/lib.py in __init__(self, re, _pointer, flags, **options)
     99                 len(re),
    100                 flags,
--> 101                 self._opts
    102             )
    103         else:

/home/dblewett/.virtualenvs/rf-pypy/site-packages/rure/lib.py in checked_call(fn, err, *args)
     46         return res
     47     elif msg.startswith(b'Error parsing regex'):
---> 48         raise exceptions.RegexSyntaxError(msg)
     49     elif msg.startswith(b'Compiled regex exceeds size limit'):
     50         raise exceptions.CompiledTooBigError(msg)

RegexSyntaxError: Error parsing regex near '.exe\" /c' at character offset 50: Unrecognized escape sequence: '\"'.

In contrast, the Python stdlib accepts this:

    In [30]: re.compile(ur'appdata\\local\\google\\update\\googleupdate\.exe\" /c')
    Out[30]: re.compile(ur'appdata\\local\\google\\update\\googleupdate\.exe\" /c')

From its documentation:

'\'
Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.

If you’re not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn’t recognized by Python’s parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it’s highly recommended that you use raw strings for all but the simplest expressions.

BurntSushi · 2016-11-16T14:39:08Z

The Python docs seem under specified, since " is neither a special character nor a special sequence.

@davidblewett Could you explain what you expect to happen?

davidblewett · 2016-11-16T14:42:36Z

In this case, it would be sufficient for my purposes to allow escaping of ". I also just hit an escaped /, :, &, %, '.

However, I have a feeling there are several more instances in the codebase I'm testing that escapes characters that don't technically need to be. Python seems much more liberal on just ignoring escapes that don't change the semantics of the expression.

BurntSushi · 2016-11-16T15:51:09Z

Python seems much more liberal on just ignoring escapes that don't change the semantics of the expression.

While I might be OK with selectively escaping certain characters (even if they aren't special), I am definitely firmly opposed to a liberal policy.

davidblewett · 2016-11-16T16:22:05Z

While I might be OK with selectively escaping certain characters (even if they aren't special), I am definitely firmly opposed to a liberal policy.

Yeah, I agree with that. It means auditing our current expressions for over-escaping but that isn't a huge deal. I will probably add a note to the Python binding mentioning the more strict rules. The error message is pretty clear about what needs to be fixed.

BurntSushi · 2016-12-29T01:05:58Z

OK, I'm going to close this. I understand that other regex engines are more liberal and therefore it might be initially more painful to port regexes because of over escaping, but there are two pretty big mitigating factors here:

As @davidblewett said, the error message is clear and it's easy to fix.
Rejecting invalid escapes leaves room for future expansion.

Aesthetically, it also seems nice to reject superfluous escapes because... who wants more escapes than what they actually need? :P

kzvezdarov changed the title ~~Unexpected panic with UnrecognizedEscape('-')~~ Unexpected error with UnrecognizedEscape('-') Jun 21, 2015

BurntSushi changed the title ~~Unexpected error with UnrecognizedEscape('-')~~ be more liberal with escape sequences in the parser Jun 21, 2015

BurntSushi added enhancement question labels Jun 21, 2015

BurntSushi mentioned this issue Apr 13, 2016

Allow \<newline> to escape the newline away #107

Closed

BurntSushi closed this as completed Dec 29, 2016

KoviRobi mentioned this issue Feb 2, 2023

consider providing \< and \> #469

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

be more liberal with escape sequences in the parser #93

be more liberal with escape sequences in the parser #93

kzvezdarov commented Jun 21, 2015

BurntSushi commented Jun 21, 2015

kzvezdarov commented Jun 21, 2015

BurntSushi commented Jun 21, 2015

kzvezdarov commented Jun 22, 2015

BurntSushi commented Jun 22, 2015

BurntSushi commented Jun 22, 2015

kzvezdarov commented Jun 22, 2015

BurntSushi commented Mar 13, 2016

davidblewett commented Nov 16, 2016

BurntSushi commented Nov 16, 2016

davidblewett commented Nov 16, 2016 •

edited

Loading

BurntSushi commented Nov 16, 2016

davidblewett commented Nov 16, 2016

BurntSushi commented Dec 29, 2016

be more liberal with escape sequences in the parser #93

be more liberal with escape sequences in the parser #93

Comments

kzvezdarov commented Jun 21, 2015

BurntSushi commented Jun 21, 2015

kzvezdarov commented Jun 21, 2015

BurntSushi commented Jun 21, 2015

kzvezdarov commented Jun 22, 2015

BurntSushi commented Jun 22, 2015

BurntSushi commented Jun 22, 2015

kzvezdarov commented Jun 22, 2015

BurntSushi commented Mar 13, 2016

davidblewett commented Nov 16, 2016

BurntSushi commented Nov 16, 2016

davidblewett commented Nov 16, 2016 • edited Loading

BurntSushi commented Nov 16, 2016

davidblewett commented Nov 16, 2016

BurntSushi commented Dec 29, 2016

davidblewett commented Nov 16, 2016 •

edited

Loading