Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StressTestDeepNestingOfConcat takes a lot longer on NonBacktracking than other engines #60645

Closed
stephentoub opened this issue Oct 19, 2021 · 12 comments

Comments

@stephentoub
Copy link
Member

Here's how long the various tests took with the different engines:

StressTestDeepNestingOfConcat(engine: NonBacktracking, pattern: \"[a-d]?[a-e]?[a-f]?[a-g]?[a-h]?\", anchor: \"$\", input: \"abcda\", pattern_repetition: 400, input_repetition: 4): 42.1850199
StressTestDeepNestingOfConcat(engine: NonBacktracking, pattern: \"[a-e]*\", anchor: \"$\", input: \"abcde\", pattern_repetition: 2000, input_repetition: 20): 22.9797034
StressTestDeepNestingOfConcat(engine: NonBacktracking, pattern: \"[a-z]\", anchor: \"\", input: \"abcde\", pattern_repetition: 2000, input_repetition: 400): 2.0449147
StressTestDeepNestingOfConcat(engine: Interpreter, pattern: \"[a-e]*\", anchor: \"$\", input: \"abcde\", pattern_repetition: 2000, input_repetition: 20): 0.8606951
StressTestDeepNestingOfConcat(engine: Compiled, pattern: \"[a-e]*\", anchor: \"$\", input: \"abcde\", pattern_repetition: 2000, input_repetition: 20): 0.6647422
StressTestDeepNestingOfConcat(engine: NonBacktracking, pattern: \"(a|A)\", anchor: \"\", input: \"aAaAa\", pattern_repetition: 2000, input_repetition: 400): 0.6331361
StressTestDeepNestingOfConcat(engine: Interpreter, pattern: \"[a-d]?[a-e]?[a-f]?[a-g]?[a-h]?\", anchor: \"$\", input: \"abcda\", pattern_repetition: 400, input_repetition: 4): 0.6089077
StressTestDeepNestingOfConcat(engine: Compiled, pattern: \"[a-d]?[a-e]?[a-f]?[a-g]?[a-h]?\", anchor: \"$\", input: \"abcda\", pattern_repetition: 400, input_repetition: 4): 0.6065388
StressTestDeepNestingOfConcat(engine: Interpreter, pattern: \"(a|A)\", anchor: \"\", input: \"aAaAa\", pattern_repetition: 2000, input_repetition: 400): 0.3276416
StressTestDeepNestingOfConcat(engine: Compiled, pattern: \"(a|A)\", anchor: \"\", input: \"aAaAa\", pattern_repetition: 2000, input_repetition: 400): 0.3066884
StressTestDeepNestingOfConcat(engine: Compiled, pattern: \"[a-z]\", anchor: \"\", input: \"abcde\", pattern_repetition: 2000, input_repetition: 400): 0.1756175
StressTestDeepNestingOfConcat(engine: Interpreter, pattern: \"[a-z]\", anchor: \"\", input: \"abcde\", pattern_repetition: 2000, input_repetition: 400): 0.1623067
@ghost
Copy link

ghost commented Oct 19, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Here's how long the various tests took with the different engines:

StressTestDeepNestingOfConcat(engine: NonBacktracking, pattern: \"[a-d]?[a-e]?[a-f]?[a-g]?[a-h]?\", anchor: \"$\", input: \"abcda\", pattern_repetition: 400, input_repetition: 4): 42.1850199
StressTestDeepNestingOfConcat(engine: NonBacktracking, pattern: \"[a-e]*\", anchor: \"$\", input: \"abcde\", pattern_repetition: 2000, input_repetition: 20): 22.9797034
StressTestDeepNestingOfConcat(engine: NonBacktracking, pattern: \"[a-z]\", anchor: \"\", input: \"abcde\", pattern_repetition: 2000, input_repetition: 400): 2.0449147
StressTestDeepNestingOfConcat(engine: Interpreter, pattern: \"[a-e]*\", anchor: \"$\", input: \"abcde\", pattern_repetition: 2000, input_repetition: 20): 0.8606951
StressTestDeepNestingOfConcat(engine: Compiled, pattern: \"[a-e]*\", anchor: \"$\", input: \"abcde\", pattern_repetition: 2000, input_repetition: 20): 0.6647422
StressTestDeepNestingOfConcat(engine: NonBacktracking, pattern: \"(a|A)\", anchor: \"\", input: \"aAaAa\", pattern_repetition: 2000, input_repetition: 400): 0.6331361
StressTestDeepNestingOfConcat(engine: Interpreter, pattern: \"[a-d]?[a-e]?[a-f]?[a-g]?[a-h]?\", anchor: \"$\", input: \"abcda\", pattern_repetition: 400, input_repetition: 4): 0.6089077
StressTestDeepNestingOfConcat(engine: Compiled, pattern: \"[a-d]?[a-e]?[a-f]?[a-g]?[a-h]?\", anchor: \"$\", input: \"abcda\", pattern_repetition: 400, input_repetition: 4): 0.6065388
StressTestDeepNestingOfConcat(engine: Interpreter, pattern: \"(a|A)\", anchor: \"\", input: \"aAaAa\", pattern_repetition: 2000, input_repetition: 400): 0.3276416
StressTestDeepNestingOfConcat(engine: Compiled, pattern: \"(a|A)\", anchor: \"\", input: \"aAaAa\", pattern_repetition: 2000, input_repetition: 400): 0.3066884
StressTestDeepNestingOfConcat(engine: Compiled, pattern: \"[a-z]\", anchor: \"\", input: \"abcde\", pattern_repetition: 2000, input_repetition: 400): 0.1756175
StressTestDeepNestingOfConcat(engine: Interpreter, pattern: \"[a-z]\", anchor: \"\", input: \"abcde\", pattern_repetition: 2000, input_repetition: 400): 0.1623067
Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: 7.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Oct 19, 2021
@stephentoub
Copy link
Member Author

cc: @veanes, @olsaarik

@veanes
Copy link
Contributor

veanes commented Oct 20, 2021

I'm suspecting large overhead due to overlapping derivatives in unions (regexes) that are created due to ?. Looking at the alternative SymbolicNFA for computing derivatives to see a comparison as to how much these large unions (internally represented by dictionaries and hash-sets of SymbolicRegexNode) contribute would be a way to see if we can leverage SymbolicNFA, this is something I'll investigate with @olsaarik . At this point this is speculative.

@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Oct 28, 2021
@stephentoub
Copy link
Member Author

stephentoub commented Apr 5, 2022

@veanes, I just tried this test again, and it's gotten way worse than before. After minutes of it running, I attached a debugger, and I see the builder's node count in the hundreds of millions:
image
Can you please take a look? It seems like something has gone off the rails.

@veanes
Copy link
Contributor

veanes commented Apr 6, 2022

I'm investigating it. Been debugging and stepping through the code paths.
The problem is related to CreateDerivative of nested Concat with captures and OrderedOr with
simulateBacktracking=true, that causes the blowup. Some of this control-flow is new to me. To fully understand it I may need to discuss it with @olsaarik, I'll keep investigating for now to pinpoint the problem more exactly.

@stephentoub
Copy link
Member Author

Thanks.

@veanes
Copy link
Contributor

veanes commented Apr 13, 2022

I was a bit stuck on this, but after a meeting yesterday with Olli, we figured out the cause.
The intermediate OrderedOr nodes are internalized as nodes in NFA mode (that must not happen) that causes the blowup and essentially defeats the purpose of the NFA mode.
I'm in the process of fixing this now.

@joperezr
Copy link
Member

joperezr commented May 4, 2022

@veanes was this fixed by #68307?

@veanes
Copy link
Contributor

veanes commented May 4, 2022

working on it still, not fully fixed yet, but in the process

@joperezr
Copy link
Member

joperezr commented May 4, 2022

Got it, that's fine I know that you are working on some related changes but I was just wondering if this particular one could be closed or not.

@veanes
Copy link
Contributor

veanes commented May 4, 2022

I think they would be closed, I am currently implementing and testing necessary simplification rules that keep the derivatives small, e.g. ones such as Y|XY --> X?Y
and more importantly a rule such as X{0,n}?Y | X^nXY --> X{0,n+1}?Y where X^n stands for n explicit repetitions of X (X and Y are arbitrary regexes here)
otherwise the blowup of alternations still cause a intermediate blowup (despite the earlier fixes) in the stress tests

@veanes
Copy link
Contributor

veanes commented Jun 5, 2022

The issue is now fixed, by #69530.

@veanes veanes closed this as completed Jun 5, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants