Matching inputs backwards #935
Replies: 4 comments 4 replies
-
Hiya! There is indeed an optimization for regexes that end with Lines 1470 to 1472 in 9330ea5 This is the logic that determines whether it should be used or not: Lines 1402 to 1425 in 9330ea5 Basically, the only way it doesn't get used is if at least one of the following is true:
And that's it, really. Otherwise, if there's a
There is no explicit API and probably never will be. The backwards bit is a necessary aspect of how the start of a match is discovered when doing a DFA search. But the backwards stuff is only implemented for the DFA, because the NFA regex engines don't need it. Since the DFA can't always be used, providing a backwards API requires implementing it for the NFA engines too, otherwise it would not be able to work in some cases. Aside from the implementation work needed, a backwards API is a pretty niche thing and not worth putting into a general purpose But... I am working on a lower level But it's still subject to the restrictions above. There aren't any backwards APIs for the NFA engines, just for the DFA engines. It's also worth mentioning that in the case of Also, AIUI, Hyperscan has start-of-match reporting. Does it just not work? |
Beta Was this translation helpful? Give feedback.
-
Thank you @BurntSushi! Very informative. I will take a closer look and ensure that Another hypothesis I have as to why some inputs end up spending ~80% of total time in regex matching is that certain patterns used in Nosey Parker end up producing many overlapping matches. Hyperscan has all-matches semantics rather than longest-match semantics, and so a pattern that ends with a quantifier is going to produce tons of candidate matches to be run through the second-stage matcher.
Hyperscan fails to compile several of the ~60 patterns currently in Nosey Parker when using its |
Beta Was this translation helpful? Give feedback.
-
If you can give me instructions for how to reproduce your benchmark where 80% of the time is in the regex crate, I might be able to give you stronger guidance. :) |
Beta Was this translation helpful? Give feedback.
-
@BurntSushi Is there any way to determine programmatically if the |
Beta Was this translation helpful? Give feedback.
-
Hi there!
I have a regex matching internals question and optimization opportunity.
tl;dr
Can the
regex
crate explicitly match inputs backward?Alternatively, does the crate's matching implementation implicitly optimize around
$
anchors in patterns to run the FSM backward?Is there some other way to achieve this?
Details
I'm building an application, Nosey Parker, that uses regex matching to detect misplaced secrets in textual data. It currently has a set of about 60 regex-based rules it uses for detection.
Nosey Parker uses a two-stage matching engine for performance. It first uses Hyperscan to search for all of its 60 regexes simultaneously. Hyperscan is very fast, but it doesn't support capture groups, and it doesn't reliably give start-of-match information. So after using Hyperscan for initial matching, Nosey Parker uses the
regex
crate to do a second stage of regex matching.In the second stage of regex matching, a modified version of the relevant pattern with a
$
anchor appended is used to match against the input, sliced to the end-of-match offset obtained from Hyperscan. This ensures that the match we get fromregex
ends at the same end-of-match that Hyperscan gave, but lets us get precise start-of-match information and capture groups.Though Nosey Parker is already quite fast compared to similar tools (it can scan 100GB of Linux Kernal history in less than 5 minutes on a laptop), I have been looking at some performance profiles, and it looks like the second-stage matching with
regex
accounts for something like 80% of total application runtime on some larger inputs.Clearly, the way Nosey Parker currently does its second-stage matching looks like it would be quadratic in the number of times it's run on an input: if Hyperscan reports "pattern
N
matches ending at offsetM
", the$
-anchored pattern is matched against the entireinput[..M]
. Naively, this looks like it would require scanningM
bytes, even if the length of actual matching content is much less thanM
.But is the
regex
matching engine able to use the fact that the pattern ends with$
to allow matching the input backward, thus potentially avoiding having to inspect allM
bytes of the input slice?If
regex
cannot do the kind of backwards-searching optimization I'm hoping for, is there a way of explicitly doing this? I have looked through a bunch of its code and it doesn't seem to be an already available API, to allow backward searching. I thought of implementing this transformation by building on top ofregex-syntax
, but before I go down that route I figured I would ask for suggestions.Is there some other way of reworking Nosey Parker's second stage matcher to avoid having to look at all
M
bytes of the input slice, aside from implementing backward matching support forregex
?Cheers,
Brad Larsen
Beta Was this translation helpful? Give feedback.
All reactions