Fix boundary handling in Regex auto-atomicity optimization #79088

stephentoub · 2022-12-01T14:14:16Z

This fixes a regression that occurred between .NET Core 3.1 and .NET 5, when we added the auto-atomicity optimization. This optimization is based on the premise that if a loop is followed by something that can't possibly match anything the loop could give up when backtracking, then backtracking for that loop is wasted effort, and thus the loop can be made atomic. For example, given the expression a*b, there's nothing the a* loop could give up when backtracking that would also match b, thus this can be transformed into (?>a*)b. As part of this, we also factor in word boundaries, but we're too aggressive in our handling of them. If you have a+\b, there's nothing that a+ could give up that would enable the \b to match. However, if you have a*\b, since \b is a zero-width assertion, it is actually possible for the a* to give up something (the empty match) that could match \b, yet we're erroneously still converting that to be (?>a*)\b. The fix is to constrain that part of the optimization to require a non-zero minimum bound on the loop in order to make it atomic.

While it's simple to repro incorrect handling here, it's also rare to find in real-world use, as word boundary anchors are used to demarcate words, this issue really only affects cases of empty words, and it's unusual that someone would write an expression to try to identify empty words.

cc: @olsaarik, @joperezr

This fixes a regression that occurred between .NET Core 3.1 and .NET 5, when we added the auto-atomicity optimization. This optimization is based on the premise that if a loop is followed by something that can't possibly match anything the loop could give up when backtracking, then backtracking for that loop is wasted effort, and thus the loop can be made atomic. For example, given the expression `a*b`, there's nothing the `a*` loop could give up when backtracking that would also match `b`, thus this can be transformed into `(?>a*)b`. As part of this, we also factor in word boundaries, but we're too aggressive in our handling of them. If you have `a+\b`, there's nothing that `a+` could give up that would enable the `\b` to match. However, if you have `a*\b`, since `\b` is a zero-width assertion, it is actually possible for the `a*` to give up something (the empty match) that could match `\b`, yet we're erroneously still converting that to be `(?>a*)\b`. The fix is to constrain that part of the optimization to require a non-zero minimum bound on the loop in order to make it atomic. While it's simple to repro incorrect handling here, it's also rare to find in real-world use, as word boundary anchors are used to demarcate words, this issue really only affects cases of empty words, and it's unusual that someone would write an expression to try to identify empty words.

ghost · 2022-12-01T14:14:31Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This fixes a regression that occurred between .NET Core 3.1 and .NET 5, when we added the auto-atomicity optimization. This optimization is based on the premise that if a loop is followed by something that can't possibly match anything the loop could give up when backtracking, then backtracking for that loop is wasted effort, and thus the loop can be made atomic. For example, given the expression a*b, there's nothing the a* loop could give up when backtracking that would also match b, thus this can be transformed into (?>a*)b. As part of this, we also factor in word boundaries, but we're too aggressive in our handling of them. If you have a+\b, there's nothing that a+ could give up that would enable the \b to match. However, if you have a*\b, since \b is a zero-width assertion, it is actually possible for the a* to give up something (the empty match) that could match \b, yet we're erroneously still converting that to be (?>a*)\b. The fix is to constrain that part of the optimization to require a non-zero minimum bound on the loop in order to make it atomic.

While it's simple to repro incorrect handling here, it's also rare to find in real-world use, as word boundary anchors are used to demarcate words, this issue really only affects cases of empty words, and it's unusual that someone would write an expression to try to identify empty words.

cc: @olsaarik, @joperezr

Author:	stephentoub
Assignees:	stephentoub
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

joperezr · 2022-12-01T18:45:21Z

I'm trying to think if there is any other anchor case like this other than when the subsequent after the loop is a \b. For example, what about end of string like: a*\z? Wouldn't the atomic loop consume the end of string in a similar fashion?

stephentoub · 2022-12-01T18:49:36Z

Wouldn't the atomic loop consume the end of string in a similar fashion?

Thanks. Could you make the example more concrete? I'm not seeing how it'd be an issue.

joperezr · 2022-12-01T18:53:55Z

So my comment was mainly saying that your fix is very specific for the cases where the node following the loop is a word boundary, and I was wondering if there were any other anchors/cases where we should also avoid making the loop atomic if the loop could match 0 characters.

I think I was wrong, My example was passing an empty string as input and using a*\z as pattern, and I was mainly wondering if we could have a similar case where the fact that the loop is no longer backtracking could produce a failure in matching, but I was wrong.

olsaarik

Looks good to me. I'll enable that test you added for NonBacktracking too in the PR I'm preparing.

ghost assigned stephentoub Dec 1, 2022

dotnet-issue-labeler bot added the area-System.Text.RegularExpressions label Dec 1, 2022

build-analysis bot mentioned this pull request Dec 1, 2022

CI build failure: Build MacCatalyst x64 Release AllSubsets_Mono - XcodeBuildApp task failed unexpectedly - Could not find System.Runtime.Tests.app #78778

Closed

joperezr approved these changes Dec 1, 2022

View reviewed changes

olsaarik approved these changes Dec 1, 2022

View reviewed changes

stephentoub merged commit 41f57b7 into dotnet:main Dec 1, 2022

stephentoub deleted the regexautoloopboundary branch December 1, 2022 19:30

ghost locked as resolved and limited conversation to collaborators Jan 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix boundary handling in Regex auto-atomicity optimization #79088

Fix boundary handling in Regex auto-atomicity optimization #79088

stephentoub commented Dec 1, 2022

ghost commented Dec 1, 2022

joperezr commented Dec 1, 2022

stephentoub commented Dec 1, 2022

joperezr commented Dec 1, 2022

olsaarik left a comment

Fix boundary handling in Regex auto-atomicity optimization #79088

Fix boundary handling in Regex auto-atomicity optimization #79088

Conversation

stephentoub commented Dec 1, 2022

ghost commented Dec 1, 2022

joperezr commented Dec 1, 2022

stephentoub commented Dec 1, 2022

joperezr commented Dec 1, 2022

olsaarik left a comment

Choose a reason for hiding this comment