-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RegexSet does not benefit from perf-literal #865
Comments
Overall, I think you expect way too much of If you do have all literal patterns then I would encourage you to just use the
Which docs? I don't think there are any docs that discuss the Aho-Corasick optimization or give any guarantees to that effect. (I see that the Also, if you do add a Now, if all of your patterns aren't literals but they do all have meaningful literal prefixes, then that is another avenue that could be sped up by literal optimizations.
This crate just does not have the sophistication to handle this kind of workload. It may handle it a little better some day, but the fully general case is unlikely to improve much. You might instead consider Hyperscan, which is best in class for this domain. I'll leave this open though because there are some improvements that can be made here (including your suggestion to use Aho-Corasick when you just have a bunch of literals), but I'm blocking it on #656. |
Also, FWIW, the case of "lit1|lit2|...|litN" deferring to Aho-Corasick completely was an optimization I added a while back to try and make a common case of ripgrep faster. It was the end result of modifying the Aho-Corasick algorithm to support leftmost-first matching. (Without that, standard Aho-Corasick won't do the same thing as a regex engine, and thus you can't just use it in place of the regex engine so simply.) I did it in a pretty non-robust way, which is why it only works for a single pattern. A lot about this specific optimization could be improved, including applying it to more cases and to So overall there is a lot to rethink here and that's part of why I'm working on #656. Reasoning about these optimizations is difficult and the current internal infrastructure isn't set up to be composable at all. The idea with #656 is to lower the barriers of composition and hopefully give more room to experiment and implement optimizations like this. |
I thought there was a decent chance it wouldn't work going in, so I'm not entirely surprised.
That sort of intertwining is what I was hoping would happen, I think. My actual problem involves matching against a set of strings containing format specifiers, to determine which string was used and extract the value(s) formatted into them. Fortunately, the placeholders almost all represent numbers, or else one of a few hundred fixed strings. What I've ended up doing instead is breaking up my patterns into "literal fragments" (i.e. splitting them where the format specifiers are), feeding them into This seems to work, and the state size is small enough that I can use |
It sounds like the original problem here has been worked around. Otherwise, I'm not sure there's anything actionable here for the |
What version of regex are you using?
1.5.6
Describe the bug at a high level.
When building a
RegexSet
, with theperf-literal
feature enabled,aho-corasick
isn't used to match literals from different input patterns, even when all input patterns are literals.From the discussion in #822 I expected that a
RegexSet
would behave identically to aRegex
composed of the same patterns joined with|
, except for the caveats re: capture groups discussed there.What are the steps to reproduce the behavior?
What is the actual behavior?
Inspecting
re_set
andlone_re
in a debugger shows that in the former,ac == None
, while in the latter it'sSome(...)
.Further, while testing this, I learned that including a single non-literal pattern (in my case,
\d+
) in thesome_literals
list prevent it from being used at all. While that might be implied by the documentation, it surprised me.note:
I ended up at this cursed juncture while trying to understand why a
RegexSet
I'm building in another project was using around 1.2 GB of RAM to match 12k regexes consisting of around 700kb of literal text with about 10 thousand number-matching capture groups interspersed.The text was updated successfully, but these errors were encountered: