Use case: Lexing/Tokenization #445
Sorry, but I don't understand your question here. In any case, if for some reason you only want the highest priority regex to match (perhaps because the lower priority regexes can match a lot of text and would therefore be slow), then your solution of creating an alternation is the way to go, with the highest priority regexes appearing earliest in the alternation. Saying that RegexSet is inappropriate because you are trying to use it for a lexer/tokenizer doesn't really make sense, since that is exactly one of the principal use cases for a RegexSet. It's a confusing claim, so please provide more detail about why it isn't appropriate for the specific thing you're trying to accomplish.
Sorry if I was unclear. I think my concerns are the following:
If you ask for all possible matches, then the regex engine will do enough work to confirm or deny each match. If your tokens are reasonably exclusive, then the amount of extra work here may be negligible. I'd strongly recommend that you benchmark this and decide for yourself.
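One quick way to run that benchmark for your own token set, sketched here with Python's `re` and `timeit` rather than the Rust crate (the patterns and input are illustrative, not from the report):

```python
import re
import timeit

text = "foo bar42 baz " * 1000

# Checking every pattern independently does work proportional to the
# number of patterns, even when an earlier one has already matched...
patterns = [re.compile(p) for p in (r"[0-9]+", r"[a-z]+", r"\s+")]

def match_all():
    return [i for i, p in enumerate(patterns) if p.search(text)]

# ...while a single alternation stops at the first alternative that
# matches at the current position.
combined = re.compile(r"[0-9]+|[a-z]+|\s+")

def match_first():
    return combined.search(text)

print(timeit.timeit(match_all, number=100))
print(timeit.timeit(match_first, number=100))
```

Whether the difference matters depends entirely on how exclusive the token patterns are, which is why measuring on real input is the only reliable answer.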
Yes, as mentioned in the documentation.
Thank you for your answer.
Oh interesting! I see the problem now. My bad. Yes, this is definitely a documentation bug. Indices will be emitted in ascending order. I guess those docs belong on the
It's not like I set out to cripple
I fully understand; I run an open source project in my free time as well. I just wanted to know the intentions/roadmap in this area.
@kraigher OK, understood. The best timeline I can give you: "years, if ever."
I am looking at writing a lexer/tokenizer using this crate. It seems RegexSet is not quite what I need, since it matches all regexes in parallel. For a lexer/tokenizer I would want to specify regexes in priority order and avoid matching a lower priority regex if a higher priority regex has already matched. Consider the following simple example:
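The classic instance of this conflict is a keyword token versus a general identifier token that also matches it. A minimal illustration, sketched in Python's `re` since a set-style matcher is easy to simulate there (the token definitions are hypothetical, not the exact example from the report):

```python
import re

# Two hypothetical token patterns: both match the input "if".
KEYWORD = re.compile(r"if|else|while")
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

text = "if"

# A set-style matcher reports *every* pattern that matches, with no
# notion of priority -- both patterns fire here, and the caller has to
# disambiguate after the fact.
matching = [name for name, pat in [("KEYWORD", KEYWORD), ("IDENT", IDENT)]
            if pat.match(text)]
print(matching)  # ['KEYWORD', 'IDENT']
```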
So my question: is there anything the regex crate can do to make this use case easier and better performing? Currently I build one single normal regex using | (the or operator) and a named group (?P<name>...) for each token type, followed by a long list of if-statements checking captures.name("nameX"). Is there a better way to solve this use case using the regex crate?
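That alternation-with-named-groups approach can be sketched as follows, again using Python's `re` for a self-contained example (the token names and the `lastgroup`-based dispatch are illustrative, not the exact code being described):

```python
import re

# Alternation in priority order: earlier alternatives win on a tie,
# so KEYWORD is listed before IDENT. The \b keeps "ifx" from being
# split into a keyword plus a leftover.
TOKEN_RE = re.compile(
    r"(?P<KEYWORD>if|else|while)\b"
    r"|(?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<NUMBER>[0-9]+)"
    r"|(?P<WS>\s+)"
)

def tokenize(text):
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "WS":            # skip whitespace tokens
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("if x1 else 42"))
# [('KEYWORD', 'if'), ('IDENT', 'x1'), ('KEYWORD', 'else'), ('NUMBER', '42')]
```

The dispatch on the matched group name replaces the long chain of if-statements; whether this is faster than per-token checks is exactly the kind of thing worth benchmarking on real input.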