NFA and DFA on match with ASCII word boundary #241
Comments
repro case:
bump?
Haven't had a chance to look at it yet, sorry. It certainly looks like a bug!
Here are some examples (found by AFL, probably convoluted) which produce different results on
@lukaslueg Can you file a separate bug? Also, what was the haystack?
This one was a tough nut to crack. The positions reported by the NFA are trivially wrong because they don't fall on valid UTF-8 sequence boundaries. Because of that, the DFA results are indeed correct. If you used a …
It's not clear whether the bugs reported by @lukaslueg are related. I think I'll need at least the haystack text. (Some of the descriptions are confusing too. The first regex doesn't have any explicit capture groups, for example.)
…red. This commit fixes a bug where matching (?-u:\B) (that is, "not an ASCII word boundary") in the NFA engines could produce match positions at invalid UTF-8 sequence boundaries. The specific problem is that determining whether (?-u:\B) matches or not relies on knowing whether we must report matches only at UTF-8 boundaries, and this wasn't actually being taken into account. (Instead, we prefer to enforce this invariant in the compiler, so that the matching engines mostly don't have to care about it.) But of course, the zero-width assertions are kind of a special case all around, so we need to handle ASCII word boundaries differently depending on whether we require valid UTF-8. This bug was noticed because the DFA actually handles this correctly (by encoding ASCII word boundaries into the state machine itself, which in turn guarantees the valid UTF-8 invariant) while the NFAs don't, leading to an inconsistency. Fix #241.
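The interaction described in the commit message can be reproduced with a toy model. The sketch below is not the regex crate's implementation: `is_ascii_word_byte`, `not_ascii_word_boundary_at`, `first_match`, and the `utf8` flag are hypothetical names invented for illustration. It scans the haystack from the original report for the first position where an ASCII "not a word boundary" assertion holds, with and without the requirement that matches fall on UTF-8 boundaries.

```rust
/// ASCII word bytes: [0-9A-Za-z_]. (Hypothetical helper, not a regex crate API.)
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// True if position `at` is NOT an ASCII word boundary, i.e. the bytes on
/// both sides are either both word bytes or both non-word bytes. The start
/// and end of the haystack count as non-word.
fn not_ascii_word_boundary_at(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_ascii_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_ascii_word_byte(haystack[at]);
    before == after
}

/// First position where the assertion holds. When `utf8` is true, positions
/// inside a multi-byte UTF-8 sequence are skipped (an assumption standing in
/// for the "must report matches only at UTF-8 boundaries" requirement).
fn first_match(haystack: &str, utf8: bool) -> Option<usize> {
    (0..=haystack.len()).find(|&at| {
        (!utf8 || haystack.is_char_boundary(at))
            && not_ascii_word_boundary_at(haystack.as_bytes(), at)
    })
}

fn main() {
    let haystack = "0\u{7ef5e}";
    // Ignoring UTF-8 boundaries: the first hit lands inside the code point.
    println!("{:?}", first_match(haystack, false));
    // Respecting UTF-8 boundaries: the first hit is at the end of the haystack.
    println!("{:?}", first_match(haystack, true));
}
```

Under these assumptions the byte-oriented scan stops at offset 2, the span the NFA reported, while the UTF-8-aware scan stops at offset 5, the span reported by the default engine.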
I'll recompile everything with the latest upstream and file new bugs should the issue reappear.
Running "(?-u:\\B)" (i.e. NotWordBoundaryAscii) against "0\u{7ef5e}" gives the match span (2, 2) using the NFA and (5, 5) using the default match engine.
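To see which of the two spans can be valid for a `&str` haystack, it may help to list the byte offsets that fall on code point boundaries. This is a small standard-library sketch; `char_boundaries` is a hypothetical helper, not part of the regex crate.

```rust
/// Byte offsets in `haystack` that fall on UTF-8 code point boundaries,
/// including offset 0 and the end of the string.
fn char_boundaries(haystack: &str) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&i| haystack.is_char_boundary(i))
        .collect()
}

fn main() {
    let haystack = "0\u{7ef5e}";
    // '0' takes one byte; U+7EF5E lies above U+FFFF, so it takes four.
    println!("length in bytes: {}", haystack.len());
    println!("boundaries: {:?}", char_boundaries(haystack));
}
```

Offset 2 is not among the boundaries, so slicing the haystack at (2, 2) would panic, whereas offset 5 is the end of the string.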