NFA and DFA on match with ASCII word boundary #241

SeanRBurton · 2016-05-24T17:34:51Z

Running "(?-u:\\B)" (i.e. NotWordBoundaryAscii) against "0\u{7ef5e}" gives the match span (2, 2) using the NFA and (5, 5) using the default match engine.


SeanRBurton · 2016-06-09T11:22:49Z

repro case:


extern crate regex;

use regex::internal::ExecBuilder;

fn main() {
    let res = "(?-u:\\B)";
    // re0 uses the default match engine; re1 forces the NFA.
    let re0 = ExecBuilder::new(&res).build().unwrap().into_byte_regex();
    let re1 = ExecBuilder::new(&res).nfa().build().unwrap().into_byte_regex();
    // '0' is 1 byte and U+7EF5E is 4 bytes in UTF-8, so the haystack is 5 bytes long.
    let s = "0\u{7ef5e}";
    println!("{:?}", re0.captures(&s.as_bytes()).unwrap().pos(0)); // Some((5, 5))
    println!("{:?}", re1.captures(&s.as_bytes()).unwrap().pos(0)); // Some((2, 2))
}

SeanRBurton · 2016-06-10T12:27:17Z

bump?

BurntSushi · 2016-06-10T12:28:07Z

Haven't had a chance to look at it yet, sorry. It certainly looks like a bug!

lukaslueg · 2016-06-30T18:18:13Z

Here are some examples (found by AFL, probably convoluted) which produce different results between ExecBuilder::new().build() and ExecBuilder::new().nfa().build() when matching the same random sequence of bytes.

Z\xff\xff\x80A|^    // nfa() has three capture groups, normal has one
^|\x86\x85    // same as above
$\x84*    // nfa() has `.is_match()` as `false`, normal has it as `true`
(?P<a>\w.|\x64})*    // `find_iter()` differs: nfa() has `(0, 0), (1,1), (2,2) ...` while normal has `(0, 0), (1, 5), (6, 6) ...` among others

BurntSushi · 2016-07-09T23:41:00Z

@lukaslueg Can you file a separate bug? Also, what was the haystack?

BurntSushi · 2016-07-10T02:16:10Z

This one was a tough nut to crack. The positions reported by the NFA are trivially wrong because they don't fall on valid UTF-8 sequence boundaries. Because of that, the DFA results are indeed correct. If you used a bytes::Regex instead, then the semantics change completely, because the "not an ASCII word boundary" will now match between every pair of arbitrary bytes that doesn't correspond to a word boundary, even if they aren't valid ASCII/UTF-8.
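To make the "valid UTF-8 sequence boundaries" point concrete, here is a minimal sketch that uses only the standard library (not the regex crate) to check the two reported spans against the haystack from the repro case. The helper name `span_on_char_boundaries` is made up for this illustration.

```rust
/// Hypothetical helper: true if both ends of `span` fall on UTF-8
/// code point boundaries of `s`.
fn span_on_char_boundaries(s: &str, span: (usize, usize)) -> bool {
    s.is_char_boundary(span.0) && s.is_char_boundary(span.1)
}

fn main() {
    // '0' occupies byte 0; U+7EF5E occupies bytes 1 through 4.
    let s = "0\u{7ef5e}";
    let reported = [("NFA", (2, 2)), ("default engine", (5, 5))];
    for &(engine, span) in reported.iter() {
        println!(
            "{} span {:?} on char boundaries: {}",
            engine,
            span,
            span_on_char_boundaries(s, span)
        );
    }
}
```

Offset 2 lands inside the four-byte encoding of U+7EF5E, so the NFA span fails the check, while offset 5 is the end of the string and passes.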

It's not clear whether the bugs reported by @lukaslueg are related. I think I'll need at least the haystack text. (Some of the descriptions are confusing too. The first regex doesn't have any explicit capture groups, for example.)

…red. This commit fixes a bug where matching (?-u:\B) (that is, "not an ASCII word boundary") in the NFA engines could produce match positions at invalid UTF-8 sequence boundaries. The specific problem is that determining whether (?-u:\B) matches or not relies on knowing whether we must report matches only at UTF-8 boundaries, and this wasn't actually being taken into account. (Instead, we prefer to enforce this invariant in the compiler, so that the matching engines mostly don't have to care about it.) But of course, the zero-width assertions are kind of a special case all around, so we need to handle ASCII word boundaries differently depending on whether we require valid UTF-8. This bug was noticed because the DFA actually handles this correctly (by encoding ASCII word boundaries into the state machine itself, which in turn guarantees the valid UTF-8 invariant) while the NFAs don't, leading to an inconsistency. Fix #241.
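The idea in the commit message can be modelled by hand. The sketch below is not the regex crate's implementation; it is an illustrative model, and the names `is_ascii_word_byte`, `not_ascii_word_boundary`, `first_match`, and the `only_utf8` flag are assumptions made for this example. It also assumes the haystack is valid UTF-8 whenever `only_utf8` is set.

```rust
/// ASCII word byte: [0-9A-Za-z_].
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Hypothetical model of `(?-u:\B)` at byte offset `at`.
/// When `only_utf8` is true, offsets that split a code point are rejected.
fn not_ascii_word_boundary(haystack: &[u8], at: usize, only_utf8: bool) -> bool {
    let before = at > 0 && is_ascii_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_ascii_word_byte(haystack[at]);
    if before != after {
        // This offset is a word boundary, so \B cannot match here.
        return false;
    }
    if only_utf8 {
        // A UTF-8 continuation byte has the bit pattern 10xxxxxx; an offset
        // pointing at one lies inside a code point.
        return at == haystack.len() || (haystack[at] & 0xC0) != 0x80;
    }
    true
}

/// Leftmost offset at which the assertion holds, if any.
fn first_match(haystack: &[u8], only_utf8: bool) -> Option<usize> {
    (0..=haystack.len()).find(|&at| not_ascii_word_boundary(haystack, at, only_utf8))
}

fn main() {
    let haystack = "0\u{7ef5e}".as_bytes();
    // Ignoring the UTF-8 requirement reproduces the NFA's answer.
    println!("bytes only: {:?}", first_match(haystack, false));
    // Honouring it reproduces the default engine's answer.
    println!("valid UTF-8 only: {:?}", first_match(haystack, true));
}
```

On the haystack from this issue, the model yields offset 2 when the UTF-8 requirement is ignored and offset 5 when it is honoured, which mirrors the two spans reported above.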

lukaslueg · 2016-07-11T18:25:57Z

I'll recompile everything with the latest upstream and file new bugs should the issue reappear.

BurntSushi added the bug label Jun 10, 2016

lukaslueg mentioned this issue Jun 28, 2016

apply AFL to regex #203

Closed

BurntSushi changed the title ~~Capture index bug.~~ NFA and DFA on match with ASCII word boundary Jul 9, 2016

BurntSushi mentioned this issue Jul 10, 2016

fix several small bugs found from fuzzing #262

Merged

BurntSushi closed this as completed in 84a2bf5 Jul 10, 2016

This was referenced Jul 11, 2016

More NotWordBoundaryAscii woes #264

Closed

NFA/DFA disagreement with EndText #265

Closed
