Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement "regex" expression that returns an array of match groups or… #6228

Closed
wants to merge 4 commits into from
Closed

Implement "regex" expression that returns an array of match groups or… #6228

wants to merge 4 commits into from

Conversation

mkv123
Copy link

@mkv123 mkv123 commented Feb 25, 2018

… null if no match.

Implements a "regex" expression (#4089 ) which takes two or three string parameters: First is the regular expression to use, second (optional) is flags and third is the source string. Also added some basic tests for the regex feature.

As this returns the match groups, it works for both checking for matches or search and replace type of functionality (#4100 ) by using let, array access and concat operations.

As on failure this returns null, it works nicely with coalesce, in case you need to specify multiple different cases of regexes.

@anandthakker
Copy link
Contributor

Thanks for getting this started, @mkv123! Some initial thoughts/questions:

  • Once we resolve the question of exactly what RegExp synatx to support, we'll need to validate input against that subset.
  • Should the regexp itself be allowed to be an expression, or should we require it to be a literal value? I lean towards the latter, since that would allow us to precompile the regexp at parse time.
  • While the more general match-group-returning form could technically be used for things like boolean tests or search/replace, I think it would be nice to have specific expressions tailored for these very common, simple use cases.

cc @mapbox/gl

@mkv123
Copy link
Author

mkv123 commented Feb 26, 2018

Hi Anand and thanks for the input.

  1. "...exactly what RegExp synatx to support..." : Are there thoughts on what the best way to do this validation would be without implementing a full regex parser in javascript & native? Would documentation instead of enforcement be an possible approach? There are already documented differences between native and js.

  2. "should the regexp be an expression": While I don't see a tremendous need right now for the flexibility of allowing the regexp to be an expression, as a user I would find it a surprising limitation. Also many language implementations, including javascript engines, cache the compiled expressions internally, yielding practically no performance improvement from having the expression precompiled vs dynamic. Should regex compile time become a performance bottle neck, similar caching could be implemented in native. Imho, there should be a measurable benefit for adding an extra limitation, and I'm not entirely sure that would be the case here.

  3. "it would be nice to have specific expressions": How about "regex-search", "regex-match" and "regex-replace" to return boolean, array of matches and replaced strings respectively?

@1ec5
Copy link
Contributor

1ec5 commented Feb 27, 2018

Also many language implementations, including javascript engines, cache the compiled expressions internally, yielding practically no performance improvement from having the expression precompiled vs dynamic.

This is the case for ICU, which underlies the most obvious regular expression implementations that mbgl or the native SDKs would use.

@mkv123
Copy link
Author

mkv123 commented Feb 27, 2018

Updated the pull request to use "regex-match", "regex-test" and "regex-replace" for the three different uses. Added basic tests for them as well.

As regex replace is a string method in javascript and in ICU, I'm not entirely sure what would be the best order for the "regex-replace" arguments. I do think the regular expression string should be the first argument on a regex operation. Current order in implementation is expression, flags (optional), replacement string, original string, which happens to match the order for python's re.sub() for example.

I've looked at the regex standards and hunted for a nice comparison between them to establish what the differences in supported syntax are, but haven't really been able to find one.

For reference:

@1ec5
Copy link
Contributor

1ec5 commented Feb 27, 2018

I've looked at the regex standards and hunted for a nice comparison between them to establish what the differences in supported syntax are, but haven't really been able to find one.

This table on Wikipedia is a good starting point. For the purposes of map styling, I think the most notable differences are the lack of (fixed-width) lookbehind, named capture groups, or Unicode character classes in JavaScript regular expressions. For example, \w matches only [A-Za-z0-9_] in JavaScript but all analogous Unicode codepoints in ICU ([\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d]).

JavaScript also has a hack in which [A-Z] matches the Turkish İ.

/ref #4089 (comment)

@anandthakker
Copy link
Contributor

I think detecting and preventing the use of syntactic features that aren't supported by both ICU and JS is probably okay, but it's less clear to me what we can/should do about the differing behavior for things like \w that @1ec5 mentions in #6228 (comment).

In terms of API design, how about:

  • ["regexp", string, flag1, flag2, ...] (the flags being optional), produces a value of a new type regexp. E.g., ["regexp", "[a-z0-9]+", "i"]
  • Search with ["match", haystack: string, needle: string]: boolean or ["match", haystack: string, needle: regexp]: false | array<string> (so that either could be used as a boolean)
  • Replace with ["replace", src: string, query: string | regexp, replacement: string], with the replacement expression evaluated with a ["group", number] expression available. E.g.:
["replace",
    "philadelphia, pa",
    ["regexp", "(\w+),\s*(\w+)"],
    [ "concat", ["group", "0"], " ", ["upcase", ["group", "1"]] ]
]

@jfirebaugh
Copy link
Contributor

jfirebaugh commented Mar 12, 2018

match is already used, and in general, I think we ought to start namespacing related groups of functionality. How about a family of operators, "regexp-match", "regexp-search", "regexp-replace", etc.?

Of course, it's not clear exactly what each of those should do. There's a number of possible use cases, and different languages make differing choices about which are conveniently built in, or which names to use. I surveyed JS, C++ <regex>, and Racket (as a lispy representative):

  JS C++ Racket
Is there any match? RegExp.prototype.test Not built in regexp-match?
How many matches? /g, iterated RegExp.prototype.test Not built in Not built in
Does the entire string match? Not built in std::regex_match regexp-match-exact?
Get only position of first match String.prototype.search Not built in regexp-match-positions
Get only positions of all matches Not built in Not built in regexp-match-positions*
Get first match and captures String.prototype.match or RegExp.prototype.exec std::regex_search regexp-match
Get all matches and captures /g, iterated RegExp.prototype.exec std::regex_iterator regexp-match* (with match-select)
Split on matches String.prototype.split Not built in regexp-split
Replace first match String.prototype.replace std::regex_replace with std::regex_constants::format_first_only regexp-replace
Replace all matches /g, String.prototype.replace std::regex_replace regexp-replace*

Other random notes:

Note that a full-fidelity match result is array<string | null>, not simply array<string>. This reflects the fact that capture groups may not participate in the match at all; in JS, the result of /a|(b)/.exec("a") is ['a', undefined].

For replace, I'm not fond of "group", as it would require introducing regexp-specific state to the evaluation stack and we'd need to validate that no "group" expressions appear outside a "replace" context. I'd prefer to:

  • Introduce first-class functions, so the replacement can be represented functionally
  • Use the standard $n meta-syntax. (This is functionally the same as "group" but the implementation would be more localized.)

The first is preferred, since it's a generally useful feature, and we try to avoid meta-syntaxes.

@1ec5
Copy link
Contributor

1ec5 commented Mar 12, 2018

For completeness, here’s a corresponding table for the various regular expression–related APIs in Objective-C and Swift:

Regular expression APIs in Objective-C and Swift
NSRegularExpression NSString NSPredicate
Is there any match?
How many matches? -[NSRegularExpression numberOfMatchesInString:options:range:]
Does the entire string match? MATCHES
Get only position of first match -[NSRegularExpression rangeOfFirstMatchInString:options:range:] -[NSString rangeOfString:options:] with NSRegularExpressionSearch
Get only positions of all matches
Get first match and captures -[NSRegularExpression firstMatchInString:options:range:]
Get all matches and captures -[NSRegularExpression matchesInString:options:range:]
or -[NSRegularExpression enumerateMatchesInString:options:range:usingBlock:]
Split on matches
Replace first match -[NSRegularExpression replacementStringForResult:inString:offset:template:]
Replace all matches -[NSRegularExpression stringByReplacingMatchesInString:options:range:withTemplate:] -[NSString stringByReplacingOccurrencesOfString:withString:options:range:] (with NSRegularExpressionSearch)

@anderslarssonvbg
Copy link

I tried regex-test in a filter and it works very well, great work! Hope this will be added to a standard release asap.

One thing I'm missing is the possibility to make the regex case insensitive with the 'i' flag in RegExp()

@anandthakker
Copy link
Contributor

  • Introduce first-class functions, so the replacement can be represented functionally
  • Use the standard $n meta-syntax. (This is functionally the same as "group" but the implementation would be more localized.)

The first is preferred, since it's a generally useful feature, and we try to avoid meta-syntaxes.

I think we should hold off on first-class functions until we've seen one or two more concrete use cases where we'd want to use them. (I can hypothesize such use cases, but first class functions add enough complexity that I'd want to have some confidence that they're going to happen in practice.)

Agreed on ideally avoiding a meta-syntax (though at least this one would be a pretty well-established one).

I'm not fond of "group", as it would require introducing regexp-specific state to the evaluation stack and we'd need to validate that no "group" expressions appear outside a "replace" context.

A variation that would address the former, but not the latter, issue would be to have ["regexp-replace", pattern, haystack, needle] bind $0, $1, etc. as let variables in needle's evaluation stack.

@anderslarssonvbg
Copy link

anderslarssonvbg commented Mar 28, 2018

I have tried regex-test a bit more as a filter, and it works well when used as the only filter, e.g.:

filter: [ "all", ["regex-test", "VN-*", ["get", "property1"] ]

but when adding more than 1 filter, e.g:

filter: [ "all", ["regex-test", "value*", ["get", "property1"]], ["==", "property2", "value2"] ]

it breaks with the following error:

evented.js:121 Error: layers.id-b233f807-e131-4b3a-a15e-6d03a5346d40.filter[1][0]: expected one of [==, !=, >, >=, <, <=, in, !in, all, any, none, has, !has], "regex-test" found
at Function.module.exports.emitErrors (validate_style.js:12)
at t._validate (style.js:888)
at t.addLayer (style.js:534)
at e.addLayer (map.js:1171)
at eval (MapBase.vue?2507:92)
at new Promise ()
at new F (_export.js?90cd:35)
at VueComponent.addLayerAsync (MapBase.vue?2507:91)
at Object.boundFn [as addLayerAsync] (vue.esm.js?efeb:189)
at eval (LayerSettings.vue?4a1a:499)

SOLUTION:

Added

"regex-test": { "doc": "" }

in filter_operators in src/style-spec/reference/v8.json

@stdmn
Copy link

stdmn commented Nov 15, 2018

Any recent movement on this? It would be a very valuable addition.

@kkaefer
Copy link
Contributor

kkaefer commented Oct 15, 2019

I'm going to close this PR because it's stale. Let's continue the discussion in the original ticket #4089.

@kkaefer kkaefer closed this Oct 15, 2019
@anderslarssonvbg
Copy link

Too bad, it's a perfectly working solution (with my above addition) which I have to manually merge in to a forked version every time a new version of Mapbox is released.

@kkaefer
Copy link
Contributor

kkaefer commented Oct 15, 2019

@anderslarssonvbg please read this ticket, and #4089 (comment) for rationale.

@mapbox mapbox locked as resolved and limited conversation to collaborators Oct 15, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants