Matching inputs backwards #935

bradlarsen · 2022-12-15T19:05:24Z

bradlarsen
Dec 15, 2022

Hi there!

I have a regex matching internals question and optimization opportunity.

tl;dr

Can the regex crate explicitly match inputs backward?

Alternatively, does the crate's matching implementation implicitly optimize around $ anchors in patterns to run the FSM backward?

Is there some other way to achieve this?

Details

I'm building an application, Nosey Parker, that uses regex matching to detect misplaced secrets in textual data. It currently has a set of about 60 regex-based rules it uses for detection.

Nosey Parker uses a two-stage matching engine for performance. It first uses Hyperscan to search for all of its 60 regexes simultaneously. Hyperscan is very fast, but it doesn't support capture groups, and it doesn't reliably give start-of-match information. So after using Hyperscan for initial matching, Nosey Parker uses the regex crate to do a second stage of regex matching.

In the second stage of regex matching, a modified version of the relevant pattern with a $ anchor appended is used to match against the input, sliced to the end-of-match offset obtained from Hyperscan. This ensures that the match we get from regex ends at the same end-of-match that Hyperscan gave, but lets us get precise start-of-match information and capture groups.
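The rewrite described above can be sketched with the standard library alone. The helper names `anchor_at_end` and `second_stage_input` are hypothetical and are not taken from Nosey Parker. Wrapping the pattern in a non-capturing group is an assumption added here so that `$` applies to the whole pattern, not only to the last alternative of something like `a|b`.

```rust
/// Wrap `pattern` in a non-capturing group and append `$`, so the anchor
/// applies to the whole pattern rather than only to its last alternative.
/// (Hypothetical helper; not taken from Nosey Parker.)
fn anchor_at_end(pattern: &str) -> String {
    format!("(?:{})$", pattern)
}

/// Slice the input to the end-of-match offset reported by the first stage.
/// Returns `None` if the offset is past the end of the input.
fn second_stage_input(input: &[u8], end_of_match: usize) -> Option<&[u8]> {
    input.get(..end_of_match)
}
```

The anchored string would then be compiled with `regex::bytes::Regex::new` and run with `captures` on the returned slice. A non-capturing group does not shift capture group indices.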

Though Nosey Parker is already quite fast compared to similar tools (it can scan 100 GB of Linux kernel history in less than 5 minutes on a laptop), I have been looking at some performance profiles, and it looks like the second-stage matching with regex accounts for something like 80% of total application runtime on some larger inputs.

The way Nosey Parker currently does its second-stage matching looks like it could be quadratic, with a cost that grows with both the number of Hyperscan reports and the input length: if Hyperscan reports "pattern N matches ending at offset M", the $-anchored pattern is matched against the entire input[..M]. Naively, this looks like it would require scanning M bytes for every report, even if the length of actual matching content is much less than M.
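To make the concern concrete, here is a rough cost model, not a measurement. It assumes a forward search inspects every byte of `input[..M]`, and that a backward search inspects each matched byte plus one byte to see the mismatch. Both function names are hypothetical.

```rust
/// Bytes inspected if every candidate is matched forward over `input[..m]`:
/// each reported end offset `m` costs `m` bytes in this model.
fn forward_cost(end_offsets: &[usize]) -> usize {
    end_offsets.iter().sum()
}

/// Bytes inspected if every candidate is matched backward from its end
/// offset and the scan stops once the match start is found: modeled as one
/// byte per matched byte plus one byte to observe the mismatch.
fn reverse_cost(match_lens: &[usize]) -> usize {
    match_lens.iter().map(|len| len + 1).sum()
}
```

Under this model, three reports at offsets 1000, 2000, and 3000 cost 6000 bytes going forward, while three 40-byte matches cost 123 bytes going backward.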

But is the regex matching engine able to use the fact that the pattern ends with $ to allow matching the input backward, thus potentially avoiding having to inspect all M bytes of the input slice?

If regex cannot do the kind of backwards-searching optimization I'm hoping for, is there a way of explicitly doing this? I have looked through a bunch of its code and it doesn't seem to be an already available API, to allow backward searching. I thought of implementing this transformation by building on top of regex-syntax, but before I go down that route I figured I would ask for suggestions.

Is there some other way of reworking Nosey Parker's second stage matcher to avoid having to look at all M bytes of the input slice, aside from implementing backward matching support for regex?

Cheers,
Brad Larsen

BurntSushi · 2022-12-15T19:23:35Z

BurntSushi
Dec 15, 2022
Maintainer

Hiya!

There is indeed an optimization for regexes that end with $. It's called DfaAnchoredReverse:

regex/src/exec.rs

Lines 1470 to 1472 in 9330ea5

    
    /// A reverse DFA search starting from the end of a haystack.
    #[cfg(feature = "perf-dfa")]
    DfaAnchoredReverse,

This is the logic that determines whether it should be used or not:

regex/src/exec.rs

Lines 1402 to 1425 in 9330ea5

    
    fn imp(ro: &ExecReadOnly) -> Option<MatchType> {
        if !dfa::can_exec(&ro.dfa) {
            return None;
        }
        // Regex sets require a slightly specialized path.
        if ro.res.len() >= 2 {
            return Some(MatchType::DfaMany);
        }
        // If the regex is anchored at the end but not the start, then
        // just match in reverse from the end of the haystack.
        if !ro.nfa.is_anchored_start && ro.nfa.is_anchored_end {
            return Some(MatchType::DfaAnchoredReverse);
        }
        #[cfg(feature = "perf-literal")]
        {
            // If there's a longish suffix literal, then it might be faster
            // to look for that first.
            if ro.should_suffix_scan() {
                return Some(MatchType::DfaSuffix);
            }
        }
        // Fall back to your garden variety forward searching lazy DFA.
        Some(MatchType::Dfa)
    }

Basically, the only way it doesn't get used is if at least one of the following is true:

- The DFA can't be used. This generally only occurs when you use Unicode word boundaries. So \b and not (?-u:\b). (Actually, even if you have a \b, the DfaAnchoredReverse optimization is still selected and attempted. But it may be defeated if non-ASCII is detected during the reverse search. In which case, the reverse search optimization is abandoned and a slow NFA simulation forward search is used.)
- You're using a RegexSet with more than one regex. (I presume you aren't here, since those can't handle capturing groups or match offsets in the first place.)

And that's it, really. Otherwise, if there's a $ and not a ^, then the DfaAnchoredReverse optimization gets applied. So maybe it isn't being applied in your case, and if not, it would be worth investigating why. If it is being applied, then perhaps there is some other reason why you're seeing a slowdown. Without the results of a profile or a micro-benchmark, it's hard to say.
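As a toy illustration of why an end-anchored reverse search can avoid touching most of the haystack, here is a hand-written backward scan for the single pattern `[0-9a-f]{min_len,}$`. It is not the crate's lazy DFA and the function name is hypothetical. It only shows that the scan stops at the first byte that cannot extend the match.

```rust
/// Toy stand-in for an end-anchored reverse search: finds the start of the
/// match of `[0-9a-f]{min_len,}$` by walking backward from the end of the
/// haystack. Also returns how many bytes were inspected.
fn reverse_find_start(haystack: &[u8], min_len: usize) -> (Option<usize>, usize) {
    let mut start = haystack.len();
    let mut inspected = 0;
    while start > 0 {
        inspected += 1;
        let b = haystack[start - 1];
        if !(b.is_ascii_digit() || (b'a'..=b'f').contains(&b)) {
            break; // first non-matching byte: the scan stops here
        }
        start -= 1;
    }
    let found = if haystack.len() - start >= min_len {
        Some(start)
    } else {
        None
    };
    (found, inspected)
}
```

For a haystack of 1000 spaces followed by `deadbeef`, this inspects 9 bytes rather than 1008.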

> If regex cannot do the kind of backwards-searching optimization I'm hoping for, is there a way of explicitly doing this? I have looked through a bunch of its code and it doesn't seem to be an already available API, to allow backward searching.

There is no explicit API and probably never will be. The backwards bit is a necessary aspect of how the start of a match is discovered when doing a DFA search. But the backwards stuff is only implemented for the DFA, because the NFA regex engines don't need it. Since the DFA can't always be used, providing a backwards API requires implementing it for the NFA engines too, otherwise it would not be able to work in some cases. Aside from the implementation work needed, a backwards API is a pretty niche thing and not worth putting into a general purpose regex crate.

But... I am working on a lower level regex-automata library that will provide backwards APIs. See for example: https://burntsushi.net/stuff/tmp-do-not-link-me/regex-automata/regex_automata/hybrid/dfa/struct.Config.html#example-reverse-automaton-to-find-start-of-match

But it's still subject to the restrictions above. There aren't any backwards APIs for the NFA engines, just for the DFA engines. It's also worth mentioning that in the case of regex-automata, you don't even need to add the $. You can specify the bounds of the search within a bigger slice and instruct the regex engine to start its search at a particular point and not go beyond a particular offset. This would also permit you to handle look-around assertions correctly. (Which you are, I'm guessing, probably not doing correctly right now. But it may not matter if none of your 60 regexes require it.)
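The look-around point can be illustrated without any regex engine. The sketch below assumes ASCII word-boundary semantics, as in `(?-u:\b)`, and uses hypothetical helper names. It shows that the same offset can be a word boundary in a slice cut at that offset and not a boundary in the full haystack, because the cut makes the offset look like end of text.

```rust
/// ASCII word byte, as used by `(?-u:\b)`.
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// ASCII word-boundary check at offset `at`, looking at the bytes on both
/// sides of `at` within `haystack`. Requires `at <= haystack.len()`.
fn is_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_word_byte(haystack[at]);
    before != after
}
```

With the haystack `key=abcdef` and an end offset of 7, the check on `haystack[..7]` reports a boundary, while the check on the full haystack at offset 7 does not, since `d` follows `c`.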

Also, AIUI, Hyperscan has start-of-match reporting. Does it just not work?

2 replies

bradlarsen Dec 15, 2022
Author

> But it's still subject to the restrictions above. There aren't any backwards APIs for the NFA engines, just for the DFA engines. It's also worth mentioning that in the case of regex-automata, you don't even need to add the $. You can specify the bounds of the search within a bigger slice and instruct the regex engine to start its search at a particular point and not go beyond a particular offset.

Would regex-automata ever be able to support capture groups? It sounds like maybe its onepass module supports groups, but is limited in the sort of patterns it supports?

> This would also permit you to handle look-around assertions correctly. (Which you are, I'm guessing, probably not doing correctly right now. But it may not matter if none of your 60 regexes require it.)

Hyperscan doesn't support arbitrary lookaround assertions, and so Nosey Parker doesn't use them.

BurntSushi Dec 15, 2022
Maintainer

You can't do capture groups with DFAs. And DFAs are the only thing that provide reverse searches. The NFAs could do reverse searches, but supporting capturing groups in the reverse search and guaranteeing you get the same results as a forward search seems pretty unlikely to me. So if you want capture groups, I'm pretty sure you are doomed to at least run a reverse search to find the start of the match and then another scan to find capturing groups.
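The two-pass shape described here (a reverse search for the match start, then a separate scan for capturing groups) can be sketched with a hand-written toy for the single pattern `([a-z]+)=([0-9]+)$`. This is an illustration only. It is not how the regex crate or regex-automata implement either pass, and both function names are hypothetical.

```rust
/// Pass 1 (toy reverse search): start offset of a match of
/// `([a-z]+)=([0-9]+)$` ending at `end`, found by walking backward.
fn find_start_rev(haystack: &[u8], end: usize) -> Option<usize> {
    let mut i = end;
    while i > 0 && haystack[i - 1].is_ascii_digit() {
        i -= 1;
    }
    // Need at least one digit, preceded by '='.
    if i == end || i == 0 || haystack[i - 1] != b'=' {
        return None;
    }
    let eq = i - 1;
    let mut start = eq;
    while start > 0 && haystack[start - 1].is_ascii_lowercase() {
        start -= 1;
    }
    // Need at least one lowercase letter before '='.
    if start == eq { None } else { Some(start) }
}

/// Pass 2 (forward scan over the matched span only): byte ranges of the
/// two groups, as `(start, end)` pairs.
fn captures_fwd(
    haystack: &[u8],
    start: usize,
    end: usize,
) -> Option<((usize, usize), (usize, usize))> {
    let eq = start + haystack[start..end].iter().position(|&b| b == b'=')?;
    Some(((start, eq), (eq + 1, end)))
}
```

The second pass only looks at `haystack[start..end]`, so its cost depends on the match length and not on the end offset.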

To be clear, regex-automata does support capturing groups, but only with its NFA engines and the specialized one-pass DFA. None of those support reverse searches. In theory I wouldn't be opposed to adding them, but also supporting capture groups giving the same semantics as a forward search seems very unlikely.

Even if you had that power, I'm not convinced it would help you enormously anyway. Resolving capturing groups is what costs a lot. Likely much more than finding the start of a match with a DFA.

Now if your regexes are one-pass then the regex crate (once I bring regex-automata in) will use the one-pass DFA if possible which should help things. But you probably have a ton of bounded repeats in your regexes which would disqualify them from one-pass.

bradlarsen · 2022-12-15T19:34:19Z

bradlarsen
Dec 15, 2022
Author

Thank you @BurntSushi! Very informative. I will take a closer look and ensure that DfaAnchoredReverse is indeed being used. I expect that it will be, since Nosey Parker disables unicode matching, and doesn't use RegexSet at all.

Another hypothesis I have as to why some inputs end up spending ~80% of total time in regex matching is that certain patterns used in Nosey Parker end up producing many overlapping matches. Hyperscan has all-matches semantics rather than longest-match semantics, and so a pattern that ends with a quantifier is going to produce tons of candidate matches to be run through the second-stage matcher.
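A toy model of this effect, assuming all-matches semantics where every position that completes a match is reported as a separate end offset. The pattern `[0-9a-f]{min_len,}` and the function name are illustrative and not taken from Nosey Parker's rules.

```rust
/// End offsets that an all-matches engine would report for the toy pattern
/// `[0-9a-f]{min_len,}`: every position where at least `min_len` hex bytes
/// have just been seen is a distinct match end.
fn all_match_ends(haystack: &[u8], min_len: usize) -> Vec<usize> {
    let mut ends = Vec::new();
    let mut run = 0;
    for (i, &b) in haystack.iter().enumerate() {
        if b.is_ascii_digit() || (b'a'..=b'f').contains(&b) {
            run += 1;
            if run >= min_len {
                ends.push(i + 1);
            }
        } else {
            run = 0;
        }
    }
    ends
}
```

For the haystack `x deadbeef!` with `min_len` of 4, a single 8-byte token yields five candidate end offsets (6 through 10), each of which would go through the second-stage matcher.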

> Also, AIUI, Hyperscan has start-of-match reporting. Does it just not work?

Hyperscan fails to compile several of the ~60 patterns currently in Nosey Parker when using its SOM_LEFTMOST option to try to get start-of-match info. It may well be possible in Nosey Parker to still compile many of the patterns with the start-of-match reporting enabled. I need to investigate this further.

0 replies

BurntSushi · 2022-12-15T19:56:18Z

BurntSushi
Dec 15, 2022
Maintainer

If you can give me instructions for how to reproduce your benchmark where 80% of the time is in the regex crate, I might be able to give you stronger guidance. :)

1 reply

bradlarsen Dec 15, 2022
Author

Thank you—I will have to get back to you on that!

bradlarsen · 2022-12-15T20:36:42Z

bradlarsen
Dec 15, 2022
Author

@BurntSushi Is there any way to determine programmatically if the DfaAnchoredReverse optimization has been applied to a compiled regex?

1 reply

BurntSushi Dec 15, 2022
Maintainer

No, definitely not. You'll want to patch the regex crate to print a log message or something.

But your profile will probably give you hints based on which functions are being called.
