Rework regular expression parsing/compiling to be more lenient #74

svaarala · 2014-11-13T06:50:55Z

Duktape's regular expression parser/compiler conforms strictly to the E5/E5.1 regular expression syntax, and rejects many regular expressions that are leniently allowed by other engines. It's confusing to users that other engines parse several regexp forms that Duktape (technically correctly) rejects as a SyntaxError.

Examples of technically invalid regexps that work with most engines:

/^\_/: underscore escape is not allowed because underscore is an IdentifierPart, and identity escapes are not allowed for IdentifierPart characters.
/\$/: dollar escape is not allowed because the dollar sign is an IdentifierPart. Duktape already has a fix to allow a dollar escape because it's so common.
/\\{/: unescaped brace is not allowed (unless it is part of a quantifier like /x{5}/).
/[\2]/: non-zero decimal escape inside a character class is not allowed (E5 Section 15.10.2.19), but it is accepted by other engines.
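
To make the identity escape rule behind the first two examples concrete, here is a minimal Python sketch of the strict E5 check. It is not Duktape code: the function names are hypothetical, and Python's `unicodedata` tables are assumed to be a close enough stand-in for the Unicode categories that E5 refers to.

```python
import unicodedata

# Unicode general categories that make a character an E5 IdentifierPart:
# letters (Lu, Ll, Lt, Lm, Lo, Nl), combining marks (Mn, Mc),
# decimal digits (Nd) and connector punctuation (Pc).
_IDENTIFIER_PART_CATEGORIES = {
    "Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd", "Pc",
}
_ZWNJ = "\u200c"
_ZWJ = "\u200d"


def is_identifier_part(ch):
    """Approximate the E5 IdentifierPart production for one character."""
    if ch in ("$", "_", _ZWNJ, _ZWJ):
        return True
    return unicodedata.category(ch) in _IDENTIFIER_PART_CATEGORIES


def is_strict_identity_escape(ch):
    """True if a backslash followed by ch is a valid E5 IdentityEscape.

    This only answers the identity escape question; escapes with their own
    meaning (\\d, \\n, backreferences, ...) are handled by other productions.
    """
    return not is_identifier_part(ch)
```

Under this check `\{` and `\/` are valid identity escapes, while `\_` and `\$` are not, which matches the rejections listed above.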

Because these regexps are quite widely used, Duktape should probably parse regexps a bit more leniently. This change is non-trivial because lenient regexp parsing needs backtracking, which doesn't fit easily into the current regexp parser/compiler code. For instance, when parsing /{foo}/, Duktape will see the left curly brace and think it's parsing a quantifier. When the quantifier fails to parse, Duktape currently throws a SyntaxError. Instead, it should rewind and assume { is meant literally, i.e. parse the regexp as /\{foo\}/.
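
The rewind idea can be sketched roughly as follows. This is a simplified Python illustration, not Duktape's parser: `try_parse_quantifier`, `tokenize` and the `lenient` flag are hypothetical names, character classes and groups are ignored, and the sketch does not check that a quantifier actually has something to repeat.

```python
DIGITS = "0123456789"


def try_parse_quantifier(pattern, pos):
    """Try to parse {n}, {n,} or {n,m} at pattern[pos] == '{'.

    Returns (min, max, next_pos), with max None for an open-ended range,
    or None if the text is not a valid quantifier. The caller's position
    is never modified, so a failed attempt is trivially "rewound".
    """
    i = pos + 1
    start = i
    while i < len(pattern) and pattern[i] in DIGITS:
        i += 1
    if i == start:
        return None  # no minimum count, e.g. "{foo}" or "{,3}"
    lo = int(pattern[start:i])
    hi = lo
    if i < len(pattern) and pattern[i] == ",":
        i += 1
        start = i
        while i < len(pattern) and pattern[i] in DIGITS:
            i += 1
        hi = int(pattern[start:i]) if i > start else None
    if i < len(pattern) and pattern[i] == "}":
        return lo, hi, i + 1
    return None


def tokenize(pattern, lenient=True):
    """Split a pattern into escape, quantifier and literal char tokens."""
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            tokens.append(("escape", pattern[i + 1]))
            i += 2
            continue
        if ch == "{":
            parsed = try_parse_quantifier(pattern, i)
            if parsed is not None:
                lo, hi, i = parsed
                tokens.append(("quantifier", lo, hi))
                continue
            if not lenient:
                raise SyntaxError("invalid quantifier at index %d" % i)
            # Lenient mode: fall through and keep the brace as a literal.
        elif ch == "}" and not lenient:
            raise SyntaxError("unescaped '}' at index %d" % i)
        tokens.append(("char", ch))
        i += 1
    return tokens
```

With `lenient=True`, `tokenize("{foo}")` yields five literal characters, while `tokenize("x{5}")` still yields a quantifier. With `lenient=False` the first call raises a SyntaxError, which corresponds to the single lenient/compliant switch proposed in the task list below.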

Related issues with notes on various failing regexps:

Tasks:

Figure out what leniency is necessary for "real world" compatibility with other engines. The main goal is to avoid unnecessarily surprising users.
Make the necessary changes to Duktape for more lenient parsing. The changes should be made with code footprint in mind (not growing the footprint if possible). There should be an option to restore compliance, if possible. There's probably no need to have an option to enable/disable individual leniency cases, just a lenient/compliant option.
Update testcases.
Update internal documentation (doc/regexp.rst) to reflect the changes in the regexp algorithm.
Update external documentation. Document the non-compliant regexp forms allowed.


mitchblank · 2014-11-14T08:20:51Z

Not directly related, but just wanted to plant an idea in your head.

In my codebase I already have a regex engine available -- namely PCRE with some optimizations layered on top. Modern PCRE even supports things like JIT compilation.

I know that PCRE and Ecmascript regexes aren't identical, but PCRE is pretty close to being a superset, so if it were possible to bring my own regex engine I probably would.

When I was doing some investigations of mruby, that was one thing I really liked. The language's parser would find the regex, but it would then just hand off the string to a constructor for the ruby Regex class. By putting your own implementation of that class in the global scope it was very easy to seamlessly wire in your own engine.

I don't consider this a high-priority item at all, but if there was an API for me to pass in a few function pointers and slide in my own versions of duk__regexp_match_helper() and duk_regexp_create_instance() (and make the compiler just emit the raw regex into the bytecode) I would probably do so. Maybe others would find that useful as well, not sure.

svaarala · 2014-11-14T09:00:06Z

I agree that replacing the regexp engine would be very nice. The built-in engine in Duktape is optimized for compactness which leads to necessary compromises on other features (including performance). I think there's already a Ditz issue for this but it's been low priority.

But now that you mention it, it might make sense to keep Duktape's built-in engine strictly compliant, but make it easy to plug in external regexp engines. These can then be lenient and provide a superset of features when that makes sense.

As for the required hooks, another very straightforward approach is simply to replace the RegExp constructor entirely. RegExp literals need special handling because they're compiled during Ecmascript code compilation and only instantiated during execution. A few hooks would be needed for this.

mitchblank · 2014-11-14T09:36:55Z

Of course the compiler would still need to lex the regex enough to find its terminating '/' character so it would have to interpret the regex enough to see the backslash escapes etc. It just would skip doing the bytecode generation part.
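
As a rough illustration of how little lexing that involves, here is a Python sketch that only finds the end of a regexp literal and its flags. The function name and return shape are hypothetical, the flag scan uses `isalnum()` as an approximation of IdentifierPart, and details such as rejecting an empty body (`//` is a comment) are left out.

```python
LINE_TERMINATORS = "\n\r\u2028\u2029"


def scan_regexp_literal(source, start):
    """Scan a regexp literal beginning at source[start] == '/'.

    Returns (body, flags, end), where end is the index just past the flags.
    The body is returned raw, without being parsed as a regexp.
    """
    if source[start] != "/":
        raise ValueError("not at the start of a regexp literal")
    i = start + 1
    in_class = False
    while i < len(source):
        ch = source[i]
        if ch in LINE_TERMINATORS:
            break
        if ch == "\\":
            # A backslash takes the next character with it, so "\/" and
            # "\]" do not end the literal or the character class.
            if i + 1 >= len(source) or source[i + 1] in LINE_TERMINATORS:
                break
            i += 2
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        elif ch == "/" and not in_class:
            body = source[start + 1:i]
            j = i + 1
            while j < len(source) and (source[j].isalnum() or source[j] in "$_"):
                j += 1
            return body, source[i + 1:j], j
        i += 1
    raise SyntaxError("unterminated regexp literal")
```

Besides backslash escapes, the scanner has to track character classes, because a `/` inside `[...]` does not terminate the literal.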

svaarala · 2014-11-14T10:19:50Z

Yes, that's what I mean: the compiler would parse the regexp literals (and their flags) but would consult a hook to compile it.
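
A minimal sketch of such a hook, written in Python purely for illustration: `RegexpCompiler`, `builtin_compile` and `python_re_hook` are hypothetical names and not part of any Duktape API. Python's `re` module stands in for an external engine here, even though its syntax is not identical to Ecmascript regexps.

```python
import re


def builtin_compile(body, flags):
    """Placeholder for the built-in, strictly compliant regexp compiler."""
    return ("builtin", body, flags)


class RegexpCompiler:
    """Compiles already-lexed regexp literals, optionally through a hook."""

    def __init__(self, compile_hook=None):
        # compile_hook(body, flags) returns whatever compiled form the
        # external engine uses; None selects the built-in compiler.
        self.compile_hook = compile_hook

    def compile_literal(self, body, flags):
        if self.compile_hook is not None:
            return self.compile_hook(body, flags)
        return builtin_compile(body, flags)


def python_re_hook(body, flags):
    """Example hook that hands the raw pattern to Python's re module."""
    re_flags = 0
    if "i" in flags:
        re_flags |= re.IGNORECASE
    if "m" in flags:
        re_flags |= re.MULTILINE
    # The "g" flag has no compile-time equivalent in re and is ignored.
    return re.compile(body, re_flags)
```

For example, `RegexpCompiler(python_re_hook).compile_literal("a+b", "i")` returns a compiled `re` pattern, while `RegexpCompiler().compile_literal("a+b", "i")` goes through the built-in placeholder.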

svaarala · 2016-01-27T14:43:35Z

By the way, ES6 has optional support for literal braces in Section B.1.4:

http://www.ecma-international.org/ecma-262/6.0/#sec-regular-expressions-patterns

This doesn't extend to unquoted square brackets etc though.

fatcerberus · 2016-01-27T15:39:43Z

Nice, so we can say the curly brace support is part of "some features borrowed from Ecmascript E6" ;)

svaarala · 2016-01-27T18:26:33Z

Yes, and as a technical detail the config option shouldn't say "DUK_USE_NONSTD_..." but "DUK_USE_ES6_...".


Escape '}' in regular expression for better compatibility with other JavaScript interpreters. It seems like `Ducktype` is not able to properly parse `brace-expansion` due to the unescaped curly brace. See: svaarala/duktape#74

mathiasbynens · 2016-07-19T14:19:52Z

Another example: /]/ — see slevithan/xregexp#141. (Since portability is one of Duktape’s goals you’ll want to support this.)


svaarala mentioned this issue Nov 13, 2014

Regular expression parse error - /^\_/ and /\\{/g #69

Closed

svaarala mentioned this issue Nov 14, 2014

Make regular expression engine pluggable #77

Open

svaarala added compliance enhancement labels Nov 14, 2014

This was referenced Dec 6, 2014

Test CoffeeScript compiler judofyr/duktape.rb#2

Closed

Regexp parse error - [^[]] #86

Closed

svaarala added the realworld label Dec 6, 2014

This was referenced Feb 25, 2015

Regexp parse error - /[\0]/ #122

Closed

Escape literal [ in regexp jashkenas/coffeescript#3885

Merged

svaarala mentioned this issue Mar 20, 2015

Duktape chokes on literal braces in RegExp #142

Closed

svaarala mentioned this issue Apr 2, 2015

Configuring Regexp stack limits #157

Closed

svaarala mentioned this issue Oct 22, 2015

Error in RegExp #410

Closed

This was referenced Dec 22, 2015

Fix non ES5 compliant regexp mochajs/mocha#2021

Merged

Fix non ES5 compliant regexp chaijs/chai#577

Merged

crazyjul mentioned this issue Jan 5, 2016

Loosen rules for RegExp { } literals #513

Merged

myme added a commit to myme/brace-expansion that referenced this issue Feb 11, 2016

Escape RegExp brace

73dab50

See: svaarala/duktape#74

myme mentioned this issue Feb 11, 2016

Escape RegExp curly brace juliangruber/brace-expansion#16

Merged

amol- mentioned this issue Mar 1, 2016

Some compatibility fixes for working with babel 6 and some enhancements to the babel transpile runner amol-/dukpy#1

Merged

JulesAU added a commit to JulesAU/query-string that referenced this issue Feb 14, 2017

Fix non ES5 compliant regexp (allows use in Duktape; see svaarala/duk…

d7c6b39

…tape#74 )

JulesAU mentioned this issue Feb 14, 2017

Fix non ES5 compliant regexp sindresorhus/query-string#84

Merged
