Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix boundary handling in Regex auto-atomicity optimization #79088

Merged
merged 1 commit into from
Dec 1, 2022

Conversation

stephentoub
Copy link
Member

This fixes a regression that occurred between .NET Core 3.1 and .NET 5, when we added the auto-atomicity optimization. This optimization is based on the premise that if a loop is followed by something that can't possibly match anything the loop could give up when backtracking, then backtracking for that loop is wasted effort, and thus the loop can be made atomic. For example, given the expression a*b, there's nothing the a* loop could give up when backtracking that would also match b, thus this can be transformed into (?>a*)b. As part of this, we also factor in word boundaries, but we're too aggressive in our handling of them. If you have a+\b, there's nothing that a+ could give up that would enable the \b to match. However, if you have a*\b, since \b is a zero-width assertion, it is actually possible for the a* to give up something (the empty match) that could match \b, yet we're erroneously still converting that to be (?>a*)\b. The fix is to constrain that part of the optimization to require a non-zero minimum bound on the loop in order to make it atomic.

While it's simple to repro incorrect handling here, it's also rare to find in real-world use, as word boundary anchors are used to demarcate words, this issue really only affects cases of empty words, and it's unusual that someone would write an expression to try to identify empty words.

cc: @olsaarik, @joperezr

This fixes a regression that occurred between .NET Core 3.1 and .NET 5, when we added the auto-atomicity optimization.  This optimization is based on the premise that if a loop is followed by something that can't possibly match anything the loop could give up when backtracking, then backtracking for that loop is wasted effort, and thus the loop can be made atomic.  For example, given the expression `a*b`, there's nothing the `a*` loop could give up when backtracking that would also match `b`, thus this can be transformed into `(?>a*)b`.  As part of this, we also factor in word boundaries, but we're too aggressive in our handling of them.  If you have `a+\b`, there's nothing that `a+` could give up that would enable the `\b` to match.  However, if you have `a*\b`, since `\b` is a zero-width assertion, it is actually possible for the `a*` to give up something (the empty match) that could match `\b`, yet we're erroneously still converting that to be `(?>a*)\b`.  The fix is to constrain that part of the optimization to require a non-zero minimum bound on the loop in order to make it atomic.

While it's simple to repro incorrect handling here, it's also rare to find in real-world use, as word boundary anchors are used to demarcate words, this issue really only affects cases of empty words, and it's unusual that someone would write an expression to try to identify empty words.
@ghost
Copy link

ghost commented Dec 1, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This fixes a regression that occurred between .NET Core 3.1 and .NET 5, when we added the auto-atomicity optimization. This optimization is based on the premise that if a loop is followed by something that can't possibly match anything the loop could give up when backtracking, then backtracking for that loop is wasted effort, and thus the loop can be made atomic. For example, given the expression a*b, there's nothing the a* loop could give up when backtracking that would also match b, thus this can be transformed into (?>a*)b. As part of this, we also factor in word boundaries, but we're too aggressive in our handling of them. If you have a+\b, there's nothing that a+ could give up that would enable the \b to match. However, if you have a*\b, since \b is a zero-width assertion, it is actually possible for the a* to give up something (the empty match) that could match \b, yet we're erroneously still converting that to be (?>a*)\b. The fix is to constrain that part of the optimization to require a non-zero minimum bound on the loop in order to make it atomic.

While it's simple to repro incorrect handling here, it's also rare to find in real-world use, as word boundary anchors are used to demarcate words, this issue really only affects cases of empty words, and it's unusual that someone would write an expression to try to identify empty words.

cc: @olsaarik, @joperezr

Author: stephentoub
Assignees: stephentoub
Labels:

area-System.Text.RegularExpressions

Milestone: -

@joperezr
Copy link
Member

joperezr commented Dec 1, 2022

I'm trying to think if there is any other anchor case like this other than when the subsequent after the loop is a \b. For example, what about end of string like: a*\z? Wouldn't the atomic loop consume the end of string in a similar fashion?

@stephentoub
Copy link
Member Author

Wouldn't the atomic loop consume the end of string in a similar fashion?

Thanks. Could you make the example more concrete? I'm not seeing how it'd be an issue.

@joperezr
Copy link
Member

joperezr commented Dec 1, 2022

So my comment was mainly saying that your fix is very specific for the cases where the node following the loop is a word boundary, and I was wondering if there were any other anchors/cases where we should also avoid making the loop atomic if the loop could match 0 characters.

I think I was wrong, My example was passing an empty string as input and using a*\z as pattern, and I was mainly wondering if we could have a similar case where the fact that the loop is no longer backtracking could produce a failure in matching, but I was wrong.

Copy link
Contributor

@olsaarik olsaarik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I'll enable that test you added for NonBacktracking too in the PR I'm preparing.

@stephentoub stephentoub merged commit 41f57b7 into dotnet:main Dec 1, 2022
@stephentoub stephentoub deleted the regexautoloopboundary branch December 1, 2022 19:30
@ghost ghost locked as resolved and limited conversation to collaborators Jan 1, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants