Use case: Lexing/Tokenization #445
Sorry, but I don't understand your question here. In any case, if for some reason you only want the highest priority regex to match (perhaps because the lower priority regexes can match a lot of text and would therefore be slow), then your solution of creating an alternation is the way to go, with the highest priority regexes appearing earliest in the alternation. Saying that RegexSet is inappropriate because you are trying to use it for a lexer/tokenizer doesn't really make sense, since that is exactly one of the principal use cases for a RegexSet. It's a confusing claim, so please provide more detail about why it isn't appropriate for the specific thing you're trying to accomplish.
Sorry if I was unclear. I think my concerns are the following:
If you ask for all possible matches, then the regex engine will do enough work to confirm or deny each match. If your tokens are reasonably exclusive, then the amount of extra work here may be negligible. I'd strongly recommend that you benchmark this and decide for yourself.
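One quick way to run that benchmark for your own token set, sketched here with Python's `re` and `timeit` rather than the Rust crate (the patterns and input are illustrative, not from the report):

```python
import re
import timeit

text = "foo bar42 baz " * 1000

# Checking every pattern independently does work proportional to the
# number of patterns, even when an earlier one has already matched...
patterns = [re.compile(p) for p in (r"[0-9]+", r"[a-z]+", r"\s+")]

def match_all():
    return [i for i, p in enumerate(patterns) if p.search(text)]

# ...while a single alternation stops at the first alternative that
# matches at the current position.
combined = re.compile(r"[0-9]+|[a-z]+|\s+")

def match_first():
    return combined.search(text)

print(timeit.timeit(match_all, number=100))
print(timeit.timeit(match_first, number=100))
```

Whether the difference matters depends entirely on how exclusive the token patterns are, which is why measuring on real input is the only reliable answer.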
Yes, as mentioned in the documentation.
Thank you for your answer.
Oh interesting! I see the problem now. My bad. Yes, this is definitely a documentation bug. Indices will be emitted in ascending order. I guess those docs belong on the
It's not like I set out to cripple
I fully understand; I run an open source project in my free time as well. I just wanted to know the intentions/roadmap in this area.
@kraigher OK, understood. The best timeline I can give you: "years, if ever."
I am looking at writing a lexer/tokenizer using this crate. It seems RegexSet is not quite what I need, since it matches all regexes in parallel. For a lexer/tokenizer I would want to specify regexes in priority order and avoid matching a lower priority regex if a higher priority regex has already matched. Consider the following simple example:
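The classic instance of this conflict is a keyword token versus a general identifier token that also matches it. A minimal illustration, sketched in Python's `re` since a set-style matcher is easy to simulate there (the token definitions are hypothetical, not the exact example from the report):

```python
import re

# Two hypothetical token patterns: both match the input "if".
KEYWORD = re.compile(r"if|else|while")
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

text = "if"

# A set-style matcher reports *every* pattern that matches, with no
# notion of priority -- both patterns fire here, and the caller has to
# disambiguate after the fact.
matching = [name for name, pat in [("KEYWORD", KEYWORD), ("IDENT", IDENT)]
            if pat.match(text)]
print(matching)  # ['KEYWORD', 'IDENT']
```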
So my question: is there anything the regex crate can do to make this use case easier and better performing? Currently I build one single normal regex using | (the or operator) and a named group (?P<name>...) for each token type, followed by a long list of if-statements checking captures.name("nameX"). Is there a better way to solve this use case using the regex crate?
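That alternation-with-named-groups approach can be sketched as follows, again using Python's `re` for a self-contained example (the token names and the `lastgroup`-based dispatch are illustrative, not the exact code being described):

```python
import re

# Alternation in priority order: earlier alternatives win on a tie,
# so KEYWORD is listed before IDENT. The \b keeps "ifx" from being
# split into a keyword plus a leftover.
TOKEN_RE = re.compile(
    r"(?P<KEYWORD>if|else|while)\b"
    r"|(?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<NUMBER>[0-9]+)"
    r"|(?P<WS>\s+)"
)

def tokenize(text):
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "WS":            # skip whitespace tokens
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("if x1 else 42"))
# [('KEYWORD', 'if'), ('IDENT', 'x1'), ('KEYWORD', 'else'), ('NUMBER', '42')]
```

The dispatch on the matched group name replaces the long chain of if-statements; whether this is faster than per-token checks is exactly the kind of thing worth benchmarking on real input.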