-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RegexSet - enhancement #732
Comments
Sorry, but I don't understand what you're proposing. Can you make it more concrete in terms of what APIs you would add? |
API wise it basically one optional argument in the constructor. For example: RegexSet::new(&[
r"[a-z]+@[a-z]+\.(com|org|net)",
r"[a-z]+\.(com|org|net)",
OR_MATCH_MODE
]).unwrap(); The main point is changing the matching algorithm to use the one merged regex instead of the existing flow. |
A very basic draft of this, you can run with use regex::Regex;
use regex::RegexSet;
use std::fs::File;
use std::io::{self, BufRead};
use std::path::Path;
use std::time::Instant;
fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where
P: AsRef<Path>,
{
let file = File::open(filename)?;
Ok(io::BufReader::new(file).lines())
}
fn read_eme(filename: &str) -> Vec<String> {
let mut regexp_array = Vec::new();
if let Ok(lines) = read_lines(filename) {
for line in lines {
if let Ok(ip) = line {
let _re = Regex::new(&ip).unwrap();
regexp_array.push(ip.clone());
}
}
}
regexp_array
}
fn read_eme2(filename: &str) -> String {
let mut count = 0;
let mut regexp = String::new();
if let Ok(lines) = read_lines(filename) {
for line in lines {
if let Ok(ip) = line {
if count > 0 {
regexp.push('|');
}
regexp.push_str(&format!("(?P<GROUP_{}>{})", count, &ip));
count += 1;
}
}
}
regexp
}
fn run_regexp() {
let regexp_str = read_eme2("./patterns.eme");
println!("regexp_str: '{:?}'", regexp_str);
let regex = Regex::new(®exp_str).unwrap();
let mut count = 0;
let mut t0 = Instant::now();
if let Ok(lines) = read_lines("./test_100.log") {
for line in lines {
if count % 10000 == 0 {
let t1 = Instant::now();
println!("Line {:?} , {:?}, {:?}", &line, count, t1 - t0);
t0 = t1;
}
if let Ok(ip) = line {
for cap in regex.captures_iter(&ip) {
println!("cap: '{:?}', line: '{:?}'", cap, ip);
}
}
count += 1;
}
}
}
fn run_regexp_array() {
let regexp_array = read_eme("./patterns.eme");
let regexp_set = RegexSet::new(®exp_array).unwrap();
let mut count = 0;
let mut t0 = Instant::now();
if let Ok(lines) = read_lines("./test.log") {
for line in lines {
if count % 10000 == 0 {
let t1 = Instant::now();
println!("Line {:?} , {:?}, {:?}", &line, count, t1 - t0);
t0 = t1;
}
if let Ok(ip) = line {
let matches: Vec<_> = regexp_set.matches(&ip).into_iter().collect();
for m in matches {
println!("m: '{:?}', pattern: '{:?}', line: '{:?}'", m, regexp_array[m], ip);
}
}
count += 1;
}
}
}
fn main() {
run_regexp();
//run_regexp_array();
} |
I don't think API changes are the right path here. If there's a performance difference then we should fix that. I believe there are tickets for that. |
I agree about the API changes. Also found some related issue #259 |
OK, but it would probably be better to discuss what specifically you'll be doing in the PR if possible. Even if it's just a sketch. I may not merge it based purely on its design, for example. |
I think having to preprocess the capture group names on the caller side is probably the right path here. I don't think complicating the regex crate API is the right path here. |
I tried using
RegexSet
and I noticed that for my specific usage the performance was significantly slower than merging all my regex into one as inregex1|regex2|...|regexN
.I do know that when you combine it into one regex you lose the ability to get all matches without iterating it.
From my experience, in many applications that rely heavily on MANY regexp, for example a log processing, the match rate is usually very low (most messages are uninteresting info vs. error, as an example) thus this trade off is worth it.
Although it's not really a trade off, since when you find a matching line, you can run the existing RegexSet on it and get all the required info.
There is another caveat which can be resolved easily, named groups cannot be duplicated.
I workaround this before by taking any user input regex and doing two changes to it:
(?P<GROUP_NUM>regexp)
.(?P<Name>.*)
, it will be rewritten as (?P<GROUP_5>(?P<GROUP_5_NAME>.*))`.I suggest this as an extra mode to enhance the existing
RegexSet
, I'll be happy to work on implementing it if the maintainers think it's worthwhile.The text was updated successfully, but these errors were encountered: