why does RegexSet go slower as more patterns are added? #881
-
Hi, excellent crate. I just have a concrete question, because the behaviour is puzzling me.
The docs say: "If one has hundreds or thousands of regexes to match repeatedly (like a URL router for a complex web application or a user agent matcher), then a regex set can realize huge performance gains." I am writing because I observed a behaviour where performance improves up to a certain number of regexes in the `RegexSet`, and then drops. I am chunking a list of patterns, building one `RegexSet` per chunk, and I find that:
This makes no sense to me. It is a process with:
I was expecting better performance with larger chunk sizes, at the cost of more memory. Could there be something wrong with my design? Here is the method implemented:

```rust
fn regexset_match(&self, regexset_chunk_size: usize) -> BTreeMap<usize, HashSet<usize>> {
    let start = Instant::now();
    let texts_len: u64 = self.texts.len() as u64;
    let regex_vec_slices: Vec<&[String]> = self
        .regex_operator
        .preprocessed_valid_patterns
        .chunks(regexset_chunk_size)
        .collect();
    let mut n_total_matches = 0;
    let mut chunk_iteration = 0;
    let mut regex_matches_set: BTreeMap<usize, HashSet<usize>> = BTreeMap::new();
    for regex_list in regex_vec_slices {
        let my_regex_set = self.regex_config.build_regexset(regex_list);
        let pb = ProgressBar::new(texts_len);
        let mut n_matches: usize = 0;
        for (text_index, transcript) in self.texts.iter().enumerate() {
            pb.inc(1);
            let matches = my_regex_set.matches(transcript);
            // Note: `SetMatches::len()` returns the number of patterns in the
            // whole set, not the number that matched; count the iterator instead.
            n_matches += matches.iter().count();
            if matches.matched_any() {
                for match_key in matches {
                    // Map the chunk-local pattern index back to the global index.
                    let truly_regex_index = chunk_iteration + match_key;
                    regex_matches_set
                        .entry(truly_regex_index)
                        .or_default()
                        .insert(text_index);
                }
            }
        }
        pb.finish_with_message("done");
        n_total_matches += n_matches;
        chunk_iteration += regexset_chunk_size;
    }
    info!("Total number of matches in regex set: {}", n_total_matches);
    let match_duration = start.elapsed();
    info!(
        "Time elapsed on regexset match stage by chunks of {regexset_chunk_size} is: {match_duration:?}"
    );
    regex_matches_set
}
```
Version is `regex = "1.6"`.
Kind regards.
-
Without a reproduction I can't really answer your question. :-/ If you search the issue tracker for RegexSet, you'll likely see that its performance can be somewhat variable.

Without a reproduction, all I can do is say something very generic. Namely, like virtually anything that does optimizations, there are thresholds everywhere. Thresholds, for example, might dictate when one regex engine is used in place of another. Those thresholds are often determined through a combination of "what makes sense" and what has been empirically observed in at least some cases. But those thresholds may not be correct in all cases. Sometimes they can be tweaked, but it will often just make some other case slower.

Now, some thresholds are actually configurable in the public API. You haven't shared the code you're using to build the regex set and didn't mention them, so I assume you don't know about them. The most important one is …