RegexSet: get matched strings. #352

zstewar1 · 2017-04-02T19:38:28Z

RegexSet can be used to build a tokenizer/lexer fairly easily by creating a set of regexes to match each token type:

let rs = RegexSet::new(&[
    r"^\s+", // 0
    r"^[+-]", // 1
    r"^[*/]", // 2
    r"^\(", // 3
    r"^\)", // 4
    r"^([0-9]*\.)?[0-9]+([eE][+-]?[0-9]+)?", // 5
    r"^[\p{Alphabetic}_][\d\p{Alphabetic}_]*", // 6
]);

Lexing is done by running the input string against the RegexSet and constructing a token based on which regex(es) matched (there might be multiple matches if identifiers and keywords are separate regexes, but selecting between conflicting matches is up to the developer of the lexer). However, RegexSet only supports checking if any regex matches or checking which regexes matched. For fixed-size matches, this is acceptable: if the [+-] token matches, the you know to extract just one character, for example. However, for a complex token type like a number (5) or identifier (6), the only way currently to extract the actual matched string would be to have a separate regex for just that token, and run it over the input a second time to find the match length.

This would be much simpler if RegexSet supported a mode where it could capture the complete match from each regex as a string slice or indexes into the input string. This doesn't require handling capture groups, just the complete match.

Another workaround is to try each regex in sequence, though this too is disadvantageous. The biggest issue is exactly the problem RegexSet is intended to solve: the string must be processed repeatedly, for each regex in the set. Another issue is that it makes matching certain kinds of overlapping regexes more difficult; for example, a keyword like and or nil might also be a valid identifier, so the lexer would have to first match them as a single regex and then disambiguate.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2017-04-02T19:45:19Z

Yes, it would be nice. If it were easy to do, then I would have done it. :-)

BurntSushi · 2017-04-02T19:45:27Z

See: #156 (comment)

BurntSushi · 2017-04-02T19:46:50Z

the only way currently to extract the actual matched string would be to have a separate regex for just that token, and run it over the input a second time to find the match length.

Indeed, that is the recommended path to take.

zstewar1 · 2017-04-02T20:24:21Z

Ok, so to check my understanding of the internals of [ND]FA matching:

It seems like it should be fairly easy to detect where the end of a match is, since you would always know what index into the string you are at when you hit a terminal symbol, and could update the terminal index for that regex to the current input index.

So if you have a matched: Vec<bool> the length of the number of regexes, whenever you hit a terminal symbol for some regex i, you would update matched[i] = true. Then that would seem to be extensible to matched_end: Vec<Option<usize>> where whenever you hit a terminal for regex i at index k in the source string, you would update matched[i] = Some(k).

Is there a detail I've missed that would prevent such a strategy from being able to report the endpoints of the matched strings?

If not, I guess then would the problem be that it's hard to make it work for finding the start of each match? I'm only interested in the whole match, not captures, and only at most one match per regex, but still, I don't know if you know when you've started on any particular component regex.

Notably, for the lexer case, all the regexes are start-anchored, so only the end of each match is relevant.

Though even if end-only anchored matching is possible, I can understand if that's overly-specialized behavior that this library does not wish to support.

BurntSushi · 2017-04-02T20:34:34Z

My too short answer to your thoughtful comment is: I don't know. I haven't dug around in RegexSet land in a while, so I simply cannot remember off the top of my head whether your idea works or not. Certainly, when I was looking into it, I was only considering the full range of a match and not just the ending locations. I cannot immediately think of a reason why your idea wouldn't work.

Though even if end-only anchored matching is possible, I can understand if that's overly-specialized behavior that this library does not wish to support.

If it can be done with a small amount of fuss, then I would be OK with adding it. (For the most part, this has been the life of RegexSet: it exists specifically because it can be bolted on to the existing regex engines with small but subtle modifications.)

In any case, there is precedent for returning match boundaries that are incomplete in various ways. See shortest_match for example.

zstewar1 · 2017-04-03T01:38:51Z

Looks like it would be doable, but would require nontrivial modification to how DFA is executed; didn't look at the NFA parts. Which regexes are matched is determined after execution by reverse lookup on the final state. An enum containing the index of the final position in the string is returned, but modification would be needed to determine which state each match ended in; but each state seems to only knows which of the regexes has matched already, it doesn't just record whether each regex was matched as it goes.

Even if such modifications could be made it might impact performance for the common case or require duplicating part of the code, so this is probably infeasible.

JustAPerson · 2019-08-07T21:28:47Z

Found this from google. Perhaps this could be mentioned in the docs for RegexSet?

I was also looking to create a lexer and would like this for the same reason. It's simple enough to work around by re-running the matching expressions.

BurntSushi · 2019-08-07T21:49:38Z

Could you please say why these docs (which were linked above) are insufficient? https://docs.rs/regex/0.2.1/regex/struct.RegexSet.html#limitations

JustAPerson · 2019-08-08T01:20:46Z

In honesty, I didn't see that. Perhaps explicitly mentioning there's no equivalent to Regex::find()/Regex::captures() with a bold hyper link would help lazy people like me.

Sorry for the confusion.

BurntSushi added the question label Apr 2, 2017

zstewar1 closed this as completed Apr 3, 2017

vsouleles mentioned this issue Jun 21, 2021

Consistent log and panic messages project-oak/oak#2119

Closed

BurntSushi mentioned this issue Dec 7, 2021

[Feature request]: match multiple regular expressions simultaneously obtaining capture groups #822

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RegexSet: get matched strings. #352

RegexSet: get matched strings. #352

zstewar1 commented Apr 2, 2017

BurntSushi commented Apr 2, 2017

BurntSushi commented Apr 2, 2017

BurntSushi commented Apr 2, 2017

zstewar1 commented Apr 2, 2017

BurntSushi commented Apr 2, 2017 •

edited

Loading

zstewar1 commented Apr 3, 2017

JustAPerson commented Aug 7, 2019

BurntSushi commented Aug 7, 2019

JustAPerson commented Aug 8, 2019

RegexSet: get matched strings. #352

RegexSet: get matched strings. #352

Comments

zstewar1 commented Apr 2, 2017

BurntSushi commented Apr 2, 2017

BurntSushi commented Apr 2, 2017

BurntSushi commented Apr 2, 2017

zstewar1 commented Apr 2, 2017

BurntSushi commented Apr 2, 2017 • edited Loading

zstewar1 commented Apr 3, 2017

JustAPerson commented Aug 7, 2019

BurntSushi commented Aug 7, 2019

JustAPerson commented Aug 8, 2019

BurntSushi commented Apr 2, 2017 •

edited

Loading