Test Regex Scraped from crates.io #468

Closed
ethanpailes opened this issue Apr 23, 2018 · 6 comments

@ethanpailes
Contributor

As part of evaluating my master's thesis, I scraped crates.io for regexes and ran the resulting regexes through my compiler to see how many could be optimized. I don't think it would be too hard to clean up my scraping script a bit and then write a test which executes each of the regexes on a quickcheck-generated string with all (3? 5? 6? depends how you count them) backends. I'm not sure when I'll do this, but I wanted to leave a note here so that I don't forget.

@BurntSushi
Member

♥️

That would be interesting! One thing worth pointing out is that a quickcheck-generated string is very unlikely to produce a case that causes a match to happen. Instead, it would only test non-match agreement. Which still seems like a worthwhile thing!
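To make the non-match point concrete, here is a minimal, std-only sketch of differential testing on random input. Everything in it is hypothetical: `XorShift` stands in for quickcheck's generator, and `backend_std` / `backend_naive` stand in for two regex backends searching for a literal. It is not how the regex crate's backends work; it only illustrates the shape of the check.

```rust
/// Tiny xorshift64 generator (hypothetical stand-in for quickcheck).
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Random lowercase ASCII string with length in 0..=max_len.
    fn ascii_string(&mut self, max_len: u64) -> String {
        let len = self.next_u64() % (max_len + 1);
        (0..len)
            .map(|_| (b'a' + (self.next_u64() % 26) as u8) as char)
            .collect()
    }
}

/// Hypothetical "backend" 1: substring search from the standard library.
fn backend_std(needle: &str, haystack: &str) -> bool {
    haystack.contains(needle)
}

/// Hypothetical "backend" 2: naive window-by-window comparison.
fn backend_naive(needle: &str, haystack: &str) -> bool {
    let (n, h) = (needle.as_bytes(), haystack.as_bytes());
    if n.is_empty() {
        return true; // `windows(0)` would panic, and "" matches everywhere
    }
    h.windows(n.len()).any(|w| w == n)
}

/// Runs both backends on `trials` random inputs.
/// Returns (number of disagreements, number of inputs that matched).
fn differential_check(needle: &str, trials: u32, seed: u64) -> (u32, u32) {
    let mut rng = XorShift(seed);
    let (mut disagreements, mut matches) = (0, 0);
    for _ in 0..trials {
        let input = rng.ascii_string(16);
        let a = backend_std(needle, &input);
        let b = backend_naive(needle, &input);
        if a != b {
            disagreements += 1;
        }
        if a {
            matches += 1;
        }
    }
    (disagreements, matches)
}
```

Calling `differential_check("regex", 1000, 42)` and inspecting the second element of the result shows how rarely a random string matches even a five-character literal, which is why this style of test mostly exercises agreement on non-matches.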

@ethanpailes
Contributor Author

Hmm. Good point. One thing I remember reading in a note Russ Cox made about the testing approach for RE2 is that he wrote some code to construct a random matching string from a regex, so it might be worthwhile to give that a crack. The two sources of random input would probably do a pretty good job of testing both the positive and negative cases.
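A rough sketch of the idea of constructing a matching string from a regex, under heavy assumptions: the `Re` type below is a deliberately tiny, hypothetical AST (not the regex crate's own syntax types and not RE2's code), and `XorShift` is a stand-in random source. The generator walks the AST and picks one alternative or one repetition count at each choice point, so the output matches by construction.

```rust
/// Tiny xorshift64 generator (hypothetical). The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// A deliberately small regex AST (hypothetical, for illustration only).
enum Re {
    Lit(char),
    Concat(Vec<Re>),
    Alt(Vec<Re>),
    /// Repeat the inner expression between `min` and `max` times, inclusive.
    Repeat(Box<Re>, u32, u32),
}

/// Appends to `out` a string that the expression `re` matches.
fn generate(re: &Re, rng: &mut XorShift, out: &mut String) {
    match re {
        Re::Lit(c) => out.push(*c),
        Re::Concat(parts) => {
            for part in parts {
                generate(part, rng, out);
            }
        }
        Re::Alt(choices) => {
            // An empty alternation has nothing to pick; emit nothing.
            if !choices.is_empty() {
                let i = (rng.next_u64() % choices.len() as u64) as usize;
                generate(&choices[i], rng, out);
            }
        }
        Re::Repeat(inner, min, max) => {
            // If max < min, fall back to exactly `min` repetitions.
            let span = max.saturating_sub(*min) as u64 + 1;
            let n = *min as u64 + rng.next_u64() % span;
            for _ in 0..n {
                generate(inner, rng, out);
            }
        }
    }
}

/// Convenience wrapper: one random matching string for `re`.
fn sample(re: &Re, seed: u64) -> String {
    let mut out = String::new();
    generate(re, &mut XorShift(seed), &mut out);
    out
}
```

For the AST equivalent of `a(b|c){2,3}d`, every sample starts with `a`, ends with `d`, and has two or three characters from `{b, c}` in between. Unbounded repetitions such as `*` would need an arbitrary cap, which this sketch leaves to the caller through `max`.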

@BurntSushi
Member

Yup, that's another good avenue to try!

@ethanpailes
Contributor Author

After a first pass at this just using quickcheck to generate random input, I've turned up the following failing test cases.

extern crate regex;

#[test]
fn word_boundary_backtracking_default_mismatch() {
    use regex::internal::ExecBuilder;

    let backtrack_re = ExecBuilder::new(r"\b")
        .bounded_backtracking()
        .build()
        .map(|exec| exec.into_regex())
        .map_err(|err| format!("{}", err))
        .unwrap();

    let default_re = ExecBuilder::new(r"\b")
        .build()
        .map(|exec| exec.into_regex())
        .map_err(|err| format!("{}", err))
        .unwrap();

    let input = "䅅\\u{a0}";

    let fi1 = backtrack_re.find_iter(input);
    let fi2 = default_re.find_iter(input);
    for (m1, m2) in fi1.zip(fi2) {
        assert_eq!(m1, m2);
    }
}

#[test]
fn uppercut_s_backtracking_bytes_default_bytes_mismatch() {
    use regex::internal::ExecBuilder;

    let backtrack_bytes_re = ExecBuilder::new("^S")
        .bounded_backtracking()
        .only_utf8(false)
        .build()
        .map(|exec| exec.into_byte_regex())
        .map_err(|err| format!("{}", err))
        .unwrap();

    let default_bytes_re = ExecBuilder::new("^S")
        .only_utf8(false)
        .build()
        .map(|exec| exec.into_byte_regex())
        .map_err(|err| format!("{}", err))
        .unwrap();

    let input = vec![83, 83];

    let s1 = backtrack_bytes_re.split(&input);
    let s2 = default_bytes_re.split(&input);
    for (chunk1, chunk2) in s1.zip(s2) {
        assert_eq!(chunk1, chunk2);
    }
}

#[test]
fn unicode_lit_star_backtracking_utf8bytes_default_utf8bytes_mismatch() {
    use regex::internal::ExecBuilder;

    let backtrack_bytes_re = ExecBuilder::new(r"^(?u:\*)")
        .bounded_backtracking()
        .bytes(true)
        .build()
        .map(|exec| exec.into_regex())
        .map_err(|err| format!("{}", err))
        .unwrap();

    let default_bytes_re = ExecBuilder::new(r"^(?u:\*)")
        .bytes(true)
        .build()
        .map(|exec| exec.into_regex())
        .map_err(|err| format!("{}", err))
        .unwrap();

    let input = "**";

    let s1 = backtrack_bytes_re.split(input);
    let s2 = default_bytes_re.split(input);
    for (chunk1, chunk2) in s1.zip(s2) {
        assert_eq!(chunk1, chunk2);
    }
}

The last two look like they are probably dups.
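A side note on the assertions in the three tests above: `zip` stops as soon as either iterator is exhausted, so a pairwise comparison only catches mismatches within the common prefix. If one backend yields extra matches or chunks at the end, the loop would not notice. A minimal std-only sketch of the difference, using hypothetical helper names (`agree_zip`, `agree_strict`):

```rust
/// Pairwise comparison via `zip`, as in the tests above.
/// Surplus items on either side are silently ignored.
fn agree_zip<T: PartialEq>(
    a: impl Iterator<Item = T>,
    b: impl Iterator<Item = T>,
) -> bool {
    a.zip(b).all(|(x, y)| x == y)
}

/// Stricter comparison: same items in the same order AND the same length.
fn agree_strict<T: PartialEq>(
    a: impl Iterator<Item = T>,
    b: impl Iterator<Item = T>,
) -> bool {
    a.eq(b)
}
```

With `"a,b".split(',')` and `"a,b,c".split(',')` as inputs, `agree_zip` reports agreement while `agree_strict` does not. Collecting both sides into `Vec`s and using `assert_eq!` would give the same strictness plus a readable failure message.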

My work on this currently lives here, and I think it is basically ready for a PR except for a few config issues.

  1. Right now to run the checks I just added a new test binary and marked the entrypoint with #[test]. This is bad for a few different reasons.
    • The tests take a long time to run, and I'm not sure they should be in regular CI. If it is possible to have two different test profiles (one for regular CI and a more complete one for releases and major new features), that would be great.
    • There is a lot of work going on for one #[test], and we really should be seeing output on the screen to be able to monitor progress (--nocapture works for this when you are focusing on the test, but it would not fit into a bigger run of the test suite). Ideally, when you do cargo test it would just invoke a binary without capturing the output and then report that the suite failed if the exit code is non-zero. I think I once saw a way to do this in a Stack Overflow post, but my google-fu has failed me.
  2. The specific cases that I just pulled out should definitely be part of the regular test suite, but they are not dependent on the definition of the regex! or regex_set! test macro, so they should not be run for every different test config. I'm not sure where the best place to put them for that is. I've just stashed them in the test_crates_regex test that I made for now, but if we turn that off for CI, it is the wrong place to put them.
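On the "invoke a binary and check the exit code" wish in point 1: Cargo supports this through `harness = false` on a test target. With that setting the file's own `main` runs in place of the default test harness, its output goes straight to the terminal, and `cargo test` treats a non-zero exit code as a failure. A sketch of how such a runner might look; the target name, path, and the `run_suite` helper are hypothetical, and this is not necessarily what the linked branch ended up doing.

```rust
// Assumed Cargo.toml entry (target name and path are hypothetical):
//
//   [[test]]
//   name = "crates_regex"
//   path = "tests/crates_regex.rs"
//   harness = false

use std::io::Write;

/// Runs every check, printing progress as it goes.
/// Returns the number of failed checks.
fn run_suite<W: Write>(checks: &[(&str, fn() -> bool)], out: &mut W) -> usize {
    let mut failures = 0;
    for (i, (name, check)) in checks.iter().enumerate() {
        let ok = check();
        let status = if ok { "ok" } else { "FAILED" };
        writeln!(out, "[{}/{}] {} ... {}", i + 1, checks.len(), name, status)
            .expect("failed to write progress");
        if !ok {
            failures += 1;
        }
    }
    failures
}

// With `harness = false`, the entry point would look roughly like this:
//
//   fn main() {
//       let checks: &[(&str, fn() -> bool)] = &[/* ... */];
//       let failures = run_suite(checks, &mut std::io::stdout());
//       if failures > 0 {
//           std::process::exit(1); // non-zero exit makes `cargo test` fail
//       }
//   }
```

A separate target like this can also be skipped in regular CI by running `cargo test` with an explicit `--test` selection of the other targets, which speaks to the two-profile concern in the first bullet.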

@ethanpailes
Contributor Author

It may be worth looking at https://crates.io/crates/regex_generate when I get around to generating matching strings from a regex.

@BurntSushi
Member

I'm going to say that this is closed by the work done a while back in #472. If there's something else we should do, please feel free to file a new issue!
