Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

be more liberal with escape sequences in the parser #93

Closed
kzvezdarov opened this issue Jun 21, 2015 · 14 comments
Closed

be more liberal with escape sequences in the parser #93

kzvezdarov opened this issue Jun 21, 2015 · 14 comments

Comments

@kzvezdarov
Copy link

A bit of background: I'm trying to implement the BrowserScope user agent parser as seen here: https://github.com/ua-parser/uap-core/. The regular expression below can be found here: https://github.com/ua-parser/uap-core/blob/master/regexes.yaml#L17

The crate (0.1.38) encounters an error compiling the following regex:

/((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|Tailsweep)[ \-](\d+)(?:\.(\d+)(?:\.(\d+))?)?

with the following error:

Syntax(Error { pos: 92, surround: "p)[ \\-](\\d", kind: UnrecognizedEscape('-') })'

To reproduce:

use regex::Regex;
let re = Regex::new(r"/((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|Tailsweep)[ \-](\d+)(?:\.(\d+)(?:\.(\d+))?)?").unwrap();
@BurntSushi
Copy link
Member

This doesn't look like a panic to me---it looks like the parser is correctly producing an error. (If the parser is panic'ing, then that is a very different kind of bug!)

I think the parser is just being overly strict in the escape sequences that it allows. If you try using [- ] instead, does that work?

@kzvezdarov kzvezdarov changed the title Unexpected panic with UnrecognizedEscape('-') Unexpected error with UnrecognizedEscape('-') Jun 21, 2015
@kzvezdarov
Copy link
Author

Sorry, I got my terminology mixed up - I meant that the parser returns an error when I expected it not to return an error.

[- ] does work. Is there a way to relax the strictness?

Edit: Encountered a few more escapes and flags that are not getting through:

Unrecognized escape sequence: '\/'.
Unrecognized flag: '!'. (Allowed flags: i, s, m, U, x.)
Unrecognized escape sequence: '\ '.

@BurntSushi BurntSushi changed the title Unexpected error with UnrecognizedEscape('-') be more liberal with escape sequences in the parser Jun 21, 2015
@BurntSushi
Copy link
Member

The parser is conservative with escape sequences. They only work if it's a defined escape sequence (like \b) or if they are escaping characters that are interpreted specially by the regex. See the source here for all the details.

By that description, it does certainly seem like - should be escapeable, since it has special significance in character classes.

But we could be more liberal. For example, none of !, / and ever have special significance in a regex (for this crate). But it still might be convenient to permit any character to be escaped, even if it doesn't have to be.

My inclination here is to emulate other common regex parsers. Are they liberal with escapes too? If so, then I think that's good enough reason for us to be too. (I don't have time to do a survey right this minute, but I suspect they are.)

@kzvezdarov
Copy link
Author

I don't know enough about the various regex implementations to give an opinion, but here is the behavior that I've encountered.

As I mentioned I've been working on implementing a user-agent parser in Rust using the BrowserScope regexes found here: https://github.com/ua-parser/uap-core. The base regexes are used in a lot of implementations, across different languages (see the project list here: https://github.com/ua-parser), so that case might be a good base for getting an idea of what the common behavior is.

Here are a few cases in which the crate's behavior differs from the behavior of the other regex implementations that parse the same input:

  1. Escaping -:
    e.g. (Obigo)\-Browser near igo)\-Brow, character offset 8.
  2. Escaping /:
    e.g. (Sony)(?:BDP\/|\/)?([^ /;\)]+)[ /;\)] near :BDP\/|\/), character offset 13.
  3. Does not recognize flag !:
    e.g. Windows Phone [^;]+; (?:(?!IEMobile/).)*IEMobile/[^;\)]+[;\)] ?(?:ARM; ?Touch; ?|Touch; ?|WpsLondonTest; ?)?([^;]+); *([^;,\)]+) near (?:(?!IEMo, character offset 26.

Other regex implementations accept the above usage. From skimming the Python regex syntax it does look like escaping is reserved for special characters only, including -, /. Lastly, negating grouping flags (as in case 3) is not yet supported by the crate?

@BurntSushi
Copy link
Member

Lastly, negating grouping flags (as in case 3) is not yet supported by the crate?

It probably never will be. I'm not aware of any performant linear time algorithm that supports arbitrary look-ahead/look-behind. (This crate guarantees linear time worst case performance.)

I think making the parser a little more liberal is reasonable, but expecting all regex implementations to accept a common syntax is probably unrealistic. For example, capture group naming and the way flags are passed to regexes vary greatly in the wild in my experience.

@BurntSushi
Copy link
Member

@kzvezdarov I'm kind of curious how the Go implementation is working, since it also doesn't support arbitrary lookahead: http://play.golang.org/p/KkJw-Y2HYh --- The implementation linked should panic if it sees such a regexp.

@kzvezdarov
Copy link
Author

@BurntSushi I see, I didn't think about the time complexity of look-ahead/behind. In any case I was referring mostly to common behavior when it comes to escaping sequences, not to having a general common syntax.
The Go implementation does not support arbitrary lookahead. Their submodule is using the 0.4 release from Nov 2014.

@BurntSushi
Copy link
Member

One way to fix this, as mentioned, is to simply allow escaping of any character. If the character doesn't have special significance, then the escape is a no-op.

The problem with that approach is that it prevents us from adding any new escape sequences to the syntax.

An alternative that might satisfy all constraints is to pick some set of characters that are typically special among regex engines (e.g., ! would be one), and permit no-op escaping for those. Other escape sequences would be disallowed as they are today, which gives us some wiggle room for growth. (Hopefully not necessary, but leaving that door open seems wise.)

@davidblewett
Copy link
Contributor

Chiming in here with an example from Python. While testing my Python binding, I hit this:

In [29]: rure.compile(ur'appdata\\local\\google\\update\\googleupdate\.exe\" /c')
---------------------------------------------------------------------------
RegexSyntaxError                          Traceback (most recent call last)
<ipython-input-29-68ebcfd850bc> in <module>()
----> 1 rure.compile(ur'appdata\\local\\google\\update\\googleupdate\.exe\" /c')

/home/dblewett/.virtualenvs/rf-pypy/site-packages/rure/__init__.py in compile(pattern, flags, **options)
     14 
     15 def compile(pattern, flags=0, **options):
---> 16     return RegexObject(pattern, flags=flags, **options)
     17 
     18 

/home/dblewett/.virtualenvs/rf-pypy/site-packages/rure/regex.py in __init__(self, pattern, flags, **options)
     47                     self.rure_flags = self.rure_flags | rure_flag
     48 
---> 49         self._rure = Rure(self.pattern, flags=self.rure_flags, **options)
     50 
     51         names = [name for name in self.capture_names()]

/home/dblewett/.virtualenvs/rf-pypy/site-packages/rure/lib.py in __init__(self, re, _pointer, flags, **options)
     99                 len(re),
    100                 flags,
--> 101                 self._opts
    102             )
    103         else:

/home/dblewett/.virtualenvs/rf-pypy/site-packages/rure/lib.py in checked_call(fn, err, *args)
     46         return res
     47     elif msg.startswith(b'Error parsing regex'):
---> 48         raise exceptions.RegexSyntaxError(msg)
     49     elif msg.startswith(b'Compiled regex exceeds size limit'):
     50         raise exceptions.CompiledTooBigError(msg)

RegexSyntaxError: Error parsing regex near '.exe\" /c' at character offset 50: Unrecognized escape sequence: '\"'.

In contrast, the Python stdlib accepts this:

    In [30]: re.compile(ur'appdata\\local\\google\\update\\googleupdate\.exe\" /c')
    Out[30]: re.compile(ur'appdata\\local\\google\\update\\googleupdate\.exe\" /c')

From its documentation:

'\'
Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.

If you’re not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn’t recognized by Python’s parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it’s highly recommended that you use raw strings for all but the simplest expressions.

@BurntSushi
Copy link
Member

The Python docs seem under specified, since " is neither a special character nor a special sequence.

@davidblewett Could you explain what you expect to happen?

@davidblewett
Copy link
Contributor

davidblewett commented Nov 16, 2016

In this case, it would be sufficient for my purposes to allow escaping of ". I also just hit an escaped /, :, &, %, '.

However, I have a feeling there are several more instances in the codebase I'm testing that escapes characters that don't technically need to be. Python seems much more liberal on just ignoring escapes that don't change the semantics of the expression.

@BurntSushi
Copy link
Member

Python seems much more liberal on just ignoring escapes that don't change the semantics of the expression.

While I might be OK with selectively escaping certain characters (even if they aren't special), I am definitely firmly opposed to a liberal policy.

@davidblewett
Copy link
Contributor

While I might be OK with selectively escaping certain characters (even if they aren't special), I am definitely firmly opposed to a liberal policy.

Yeah, I agree with that. It means auditing our current expressions for over-escaping but that isn't a huge deal. I will probably add a note to the Python binding mentioning the more strict rules. The error message is pretty clear about what needs to be fixed.

@BurntSushi
Copy link
Member

OK, I'm going to close this. I understand that other regex engines are more liberal and therefore it might be initially more painful to port regexes because of over escaping, but there are two pretty big mitigating factors here:

  1. As @davidblewett said, the error message is clear and it's easy to fix.
  2. Rejecting invalid escapes leaves room for future expansion.

Aesthetically, it also seems nice to reject superfluous escapes because... who wants more escapes than what they actually need? :P

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants