Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NFA and DFA on match with ASCII word boundary #241

Closed
SeanRBurton opened this issue May 24, 2016 · 7 comments
Closed

NFA and DFA on match with ASCII word boundary #241

SeanRBurton opened this issue May 24, 2016 · 7 comments
Labels

Comments

@SeanRBurton
Copy link
Contributor

SeanRBurton commented May 24, 2016

Running "(?-u:\\B)" (i.e. NotWordBoundaryAscii) against "0\u{7ef5e}" gives the match span (2, 2) using the NFA and (5, 5) using the default match engine.

@SeanRBurton
Copy link
Contributor Author

SeanRBurton commented Jun 9, 2016

repro case:


extern crate regex;                                                             

use regex::internal::ExecBuilder;                                               

fn main() {                                                                     
    let res = "(?-u:\\B)";                                                      
    let re0 = ExecBuilder::new(&res).build().unwrap().into_byte_regex();        
    let re1 = ExecBuilder::new(&res).nfa().build().unwrap().into_byte_regex();  
    let s = "0\u{7ef5e}";                                                       
    println!("{:?}", re0.captures(&s.as_bytes()).unwrap().pos(0)); // Some((5, 5))
    println!("{:?}", re1.captures(&s.as_bytes()).unwrap().pos(0)); // Some((2, 2))
}  

@SeanRBurton
Copy link
Contributor Author

bump?

@BurntSushi
Copy link
Member

Haven't had a chance to look at it yet, sorry. It certainly looks like a bug!

@lukaslueg
Copy link
Contributor

Here are some examples (found by AFL, probably convoluted) which produce different results on ExecBuilder::new().build() and ExecBuild::new().nfa().build() when matching the same random sequence of bytes.

Z\xff\xff\x80A|^    // nfa() has three capture groups, normal has one
^|\x86\x85    // same as above
$\x84*    // nfa() has `.is_match()` as `false`, normal has it as `true`
(?P<a>\w.|\x64})*    // `find_iter()` differs: nfa() has `(0, 0), (1,1), (2,2) ...` while normal has `(0, 0), (1, 5), (6, 6) ...` among others

@BurntSushi BurntSushi changed the title Capture index bug. NFA and DFA on match with ASCII word boundary Jul 9, 2016
@BurntSushi
Copy link
Member

@lukaslueg Can you file a separate bug? Also, what was the haystack?

@BurntSushi
Copy link
Member

This one was a tough nut to crack. The positions reported by the NFA are trivially wrong because they don't fall on valid UTF-8 sequence boundaries. Because of that, the DFA results are indeed correct. If you used a bytes::Regex instead, then the semantics change completely, because the "not an ASCII word boundary" will now match between every pair of arbitrary bytes that doesn't correspond to a word boundary, even if they aren't valid ASCII/UTF-8.

It's not clear whether the bugs reported by @lukaslueg are related. I think I'll need at least the haystack text. (Some of the descriptions are confusing too. The first regex doesn't have any explicit capture groups, for example.)

BurntSushi added a commit that referenced this issue Jul 10, 2016
…red.

This commit fixes a bug where matching (?-u:\B) (that is, "not an ASCII
word boundary") in the NFA engines could produce match positions at invalid
UTF-8 sequence boundaries. The specific problem is that determining whether
(?-u:\B) matches or not relies on knowing whether we must report matches
only at UTF-8 boundaries, and this wasn't actually being taken into
account. (Instead, we prefer to enforce this invariant in the compiler, so
that the matching engines mostly don't have to care about it.) But of
course, the zero-width assertions are kind of a special case all around,
so we need to handle ASCII word boundaries differently depending on
whether we require valid UTF-8.

This bug was noticed because the DFA actually handles this correctly (by
encoding ASCII word boundaries into the state machine itself, which in turn
guarantees the valid UTF-8 invariant) while the NFAs don't, leading to an
inconsistency.

Fix #241.
@lukaslueg
Copy link
Contributor

I'll recompile everything with latest upstream and file new bugs should the issue reappear

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants