-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix boundary handling in Regex auto-atomicity optimization #79088
Conversation
This fixes a regression that occurred between .NET Core 3.1 and .NET 5, when we added the auto-atomicity optimization. This optimization is based on the premise that if a loop is followed by something that can't possibly match anything the loop could give up when backtracking, then backtracking for that loop is wasted effort, and thus the loop can be made atomic. For example, given the expression `a*b`, there's nothing the `a*` loop could give up when backtracking that would also match `b`, thus this can be transformed into `(?>a*)b`. As part of this, we also factor in word boundaries, but we're too aggressive in our handling of them. If you have `a+\b`, there's nothing that `a+` could give up that would enable the `\b` to match. However, if you have `a*\b`, since `\b` is a zero-width assertion, it is actually possible for the `a*` to give up something (the empty match) that could match `\b`, yet we're erroneously still converting that to be `(?>a*)\b`. The fix is to constrain that part of the optimization to require a non-zero minimum bound on the loop in order to make it atomic. While it's simple to repro incorrect handling here, it's also rare to find in real-world use, as word boundary anchors are used to demarcate words, this issue really only affects cases of empty words, and it's unusual that someone would write an expression to try to identify empty words.
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsThis fixes a regression that occurred between .NET Core 3.1 and .NET 5, when we added the auto-atomicity optimization. This optimization is based on the premise that if a loop is followed by something that can't possibly match anything the loop could give up when backtracking, then backtracking for that loop is wasted effort, and thus the loop can be made atomic. For example, given the expression While it's simple to repro incorrect handling here, it's also rare to find in real-world use, as word boundary anchors are used to demarcate words, this issue really only affects cases of empty words, and it's unusual that someone would write an expression to try to identify empty words.
|
I'm trying to think if there is any other anchor case like this other than when the subsequent after the loop is a |
Thanks. Could you make the example more concrete? I'm not seeing how it'd be an issue. |
So my comment was mainly saying that your fix is very specific for the cases where the node following the loop is a word boundary, and I was wondering if there were any other anchors/cases where we should also avoid making the loop atomic if the loop could match 0 characters. I think I was wrong, My example was passing an empty string as input and using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I'll enable that test you added for NonBacktracking too in the PR I'm preparing.
This fixes a regression that occurred between .NET Core 3.1 and .NET 5, when we added the auto-atomicity optimization. This optimization is based on the premise that if a loop is followed by something that can't possibly match anything the loop could give up when backtracking, then backtracking for that loop is wasted effort, and thus the loop can be made atomic. For example, given the expression
a*b
, there's nothing thea*
loop could give up when backtracking that would also matchb
, thus this can be transformed into(?>a*)b
. As part of this, we also factor in word boundaries, but we're too aggressive in our handling of them. If you havea+\b
, there's nothing thata+
could give up that would enable the\b
to match. However, if you havea*\b
, since\b
is a zero-width assertion, it is actually possible for thea*
to give up something (the empty match) that could match\b
, yet we're erroneously still converting that to be(?>a*)\b
. The fix is to constrain that part of the optimization to require a non-zero minimum bound on the loop in order to make it atomic.While it's simple to repro incorrect handling here, it's also rare to find in real-world use, as word boundary anchors are used to demarcate words, this issue really only affects cases of empty words, and it's unusual that someone would write an expression to try to identify empty words.
cc: @olsaarik, @joperezr