[API Proposal]: RegexOptions.Constrained #57891
Comments
Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions

Issue Details

Background and motivation

The .NET regex engine today is an NFA/backtracking-based engine. For .NET 7, we're exploring adding an opt-in DFA-based engine. To opt in, we'd use a new RegexOptions enum value.

API Proposal

```csharp
namespace System.Text.RegularExpressions
{
    public enum RegexOptions
    {
        ...
        Constrained = 0x0400
    }
}
```

Other name choices could include DFA, Deterministic, Safe, Predictable, ...

API Usage

```csharp
var r = new Regex(..., RegexOptions.Constrained);
```

Risks

It is opt-in, as the engine will almost certainly have some limitations on what patterns are supported (e.g. lookbehinds), an impact on capture semantics, etc.
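To make the risk concrete, here is a hedged sketch of what opting in might look like. The option name is the proposal's, not a shipped API, so this won't compile today, and the exact failure mode (an exception at construction time for an unsupported construct such as a lookbehind) is an assumption; the proposal only says such limitations are likely.

```csharp
using System;
using System.Text.RegularExpressions;

// A plain pattern: expected to work under the proposed constrained engine.
var digits = new Regex(@"\d+", RegexOptions.Constrained); // hypothetical option name

// A lookbehind, called out above as a likely unsupported construct.
// Presumably construction would fail rather than silently fall back.
try
{
    var price = new Regex(@"(?<=\$)\d+", RegexOptions.Constrained);
}
catch (NotSupportedException e) // assumed failure mode, not specified in the proposal
{
    Console.WriteLine($"Pattern rejected: {e.Message}");
}
```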
Do you envision an analyzer that could flag pattern literals that the constrained engine wouldn't support?
Yes, but we're not there yet. My expectation is an analyzer would also be aligned with a regex source generator (which will necessitate factoring out the parser into an internal reusable component) and wouldn't be specific to this particular option, flagging invalid option combinations and invalid patterns based on the supplied options.
I vote for the name "Deterministic".
@TonyValenti I don't like that option, because to me, it would imply that the default option is somehow nondeterministic, i.e. that it is not guaranteed to produce the same result every time. (Yes, it is implemented using a nondeterministic finite automaton, but that doesn't make the code itself nondeterministic.)
@svick good point! What is the advantage of a DFA vs NFA? Maybe it shouldn't be an option and the engine should pick after analyzing the regex.
The former can provide guaranteed non-exponential bounds on execution time.
For anything other than IsMatch, it needs to be explicit: not only is it limited in what patterns it can support, it also changes the execution model in a way that can influence observable results.
@stephentoub Do you imagine any sort of security guarantee that could be made here, e.g. that it's allowable to pass potentially hostile regex patterns as long as it's in constrained mode? If so, I imagine it would be interesting to expose an API "can this pattern be matched in constrained mode?".
@GrabYourPitchforks Related to that, I have often wanted a Regex.TryParse method. Today I have to wrap a new(...) inside of a try and catch the exception.
I think it'd be reasonable to have an IsValid(string, RegexOptions) method that took both a pattern and options and returned whether the combination is acceptable... that could mean the pattern is bad in general, the options themselves are invalid, or the pattern is invalid with those options.
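For reference, the try/catch workaround mentioned above can be wrapped up today. This TryCreate helper is a hypothetical name for illustration, not a proposed API; it works because both invalid patterns and invalid options surface as ArgumentException (or a subclass) from the constructor:

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexValidation
{
    // Hypothetical helper approximating the Regex.TryParse / IsValid ideas
    // discussed above: attempt construction and catch the failure.
    public static bool TryCreate(string pattern, RegexOptions options, out Regex? regex)
    {
        try
        {
            regex = new Regex(pattern, options);
            return true;
        }
        catch (ArgumentException) // invalid pattern, invalid options, or invalid combination
        {
            regex = null;
            return false;
        }
    }
}
```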
The reason I asked about hostile inputs specifically is that we have a bunch of customers (think web-based editors) who allow arbitrary users to enter custom regex patterns for searching text. Right now they have to use custom configuration for the regex through a combination of manual pattern inspection and relying on the default timeout. If we could give them something guaranteed safer along with an API to use it successfully, I'm sure they'd be interested.
```csharp
namespace System.Text.RegularExpressions
{
    public enum RegexOptions
    {
        Constrained = 0x0400
    }
}
```
(flipping back to api-ready-for-review so it's on our radar again)
Is it a correct takeaway that the future regex source generator will inherit the same limitations as the proposed DFA (no lookbehinds, no sub-expressions/captures), or would it be possible to fully simulate an NFA in the source generator? If the latter is possible, would there still be a reason (performance benefits, design simplicity?) for a consumer to prefer the DFA over the existing NFA or a future regex srcgen?
Call it RegexOptions.Optimized. Everyone will use it.
And if you don't use Optimized, I would suggest picking a name that conveys you are gaining a benefit (Optimized, Minimized, etc.) as opposed to losing something. Also, if there is a TryParse equivalent, the regex constructor could silently call it with the "Optimized" parameter. If it succeeds, it can use the optimized regex, and if not, it can fall back to the default.
No :) That's not what it is. Let's take a step back...

In the regex world, engines are generally divided into one of two camps, deemed NFA and DFA (while not necessarily strictly tied to the theoretical meaning of those terms in implementation). NFA-based engines process regular expressions closely aligned with how you might think of them working. Imagine a regex "abc|def|ghi". My brain at least thinks of that being processed as "try to match abc; if that fails, try to match def; if that fails, try to match ghi". That's basically what an NFA-based engine does, albeit with lots of possible optimizations on top. It tries one thing, and if that fails it backs up and tries the next, and if that fails it backs up and tries the next, etc. That backing up is called backtracking. NFA-based engines are able to track a lot of information about the current state of processing, which is why, for example, the .NET implementation is able to easily track all the captures for a capture group within a loop. They can be very efficient... as long as there's not a lot of backtracking involved. Once there is a lot of backtracking, the worst-case cost can blow up to exponential in the length of the input; this is often referred to as "excessive backtracking" or "catastrophic backtracking", and it often occurs when there are loops inside of loops (e.g. try using regex101.com to match such a pattern; see the sketch below).

NFA-based engines are called that because they're logically represented as non-deterministic finite automata, or NFAs: there's a node in a graph for each state of the match, with one or more transitions to other nodes based on the next character consumed from the input. For the expression above, each alternative would correspond to its own path of nodes through that graph. I'm waving my hands here, but that's the general gist.

So, .NET currently has an NFA-based engine, which works quite well (and, as of .NET 5, very competitively from a performance perspective). But because it does employ backtracking for some patterns, developers need to be careful not to allow arbitrary regex patterns to be supplied, or it could be a DoS vector (in particular if no timeout is supplied); and a developer who authors a pattern that could suffer from catastrophic backtracking similarly needs to be careful about accepting arbitrary inputs to that regex. For .NET 7, we've been working with Microsoft Research (MSR) on a new DFA-based implementation, and it's coming along quite well.
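A minimal sketch (not taken from the comment above) of the kind of loop-inside-a-loop pattern being described; `^(a+)+$` is the classic textbook example of catastrophic backtracking:

```csharp
using System;
using System.Text.RegularExpressions;

// "Loop inside a loop": the outer + and the inner + can split a run of 'a's
// in exponentially many ways, and a failing input forces the backtracking
// engine to try all of them before giving up.
var regex = new Regex("^(a+)+$", RegexOptions.None, TimeSpan.FromSeconds(1));

// 30 'a's followed by '!' can never match, but proving that takes
// on the order of 2^30 backtracking attempts.
string input = new string('a', 30) + "!";

try
{
    Console.WriteLine(regex.IsMatch(input));
}
catch (RegexMatchTimeoutException)
{
    Console.WriteLine("Timed out: catastrophic backtracking.");
}
```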
Now, to the specific questions asked on this thread...

That is a goal. The primary benefit of this mode is that it can place guarantees on worst-case execution time in the length of the input.
I'm not convinced of that yet. The caveats will include limits on what can be in the pattern (though we will hopefully be able to reduce those to a bare minimum), changes in results (if there are multiple ways a pattern could match input, you might get different answers with the new engine than you do with the current engine), changes in performance characteristics (it's very likely some not-unusual patterns will run slower on average), and changes in capture semantics (functionality like capturing all versions of a capture in a loop will not work).

If a regex is being exposed to untrusted patterns, yes. If a regex is being exposed to untrusted inputs, it would depend on the pattern. If everything is trusted, there's no real benefit to the new engine other than if it ends up running faster for a given expression; it might in some cases, especially if there's backtracking, and will likely be slower in others. Guidance here can evolve as we learn more about what exactly we'll ship and what its characteristics end up being.
No. The plan for the regex source generator is to do the same thing as RegexOptions.Compiled does, except emitting C# rather than IL. If you use the source generator with existing options, you'll get C# akin to the source you get today. If you use the source generator with the new RegexOptions.Constrained (or whatever we end up calling it), you'd get source generated for the DFA mode (though that hasn't been implemented yet for any form of compilation).
As noted, it's not just about what patterns it supports, but also about the behavior of execution. For IsMatch, we can under the covers switch between engines if that's deemed fruitful and safe. But for Match, Matches, Replace, Split, etc., those execution characteristics have observable behavioral impact, and it'd be a breaking change in many situations to switch engines. I also want to reiterate that this is not an "Optimized" mode, so we should not call it that ;-) It provides guarantees around worst-case behavior, and it might end up making some patterns run much faster on average (especially if there's backtracking involved, e.g. a pattern with lots of alternation and/or loops), but it can also consume significantly more memory and make some patterns run slower.
What is the code size of the new regex engine? Do we need to worry about making it naturally trimmable? The proposed API won't make it naturally trimmable.
Right now it's a few hundred K, though I expect that to come down some as common parts are refactored out, dead code from previous incarnations is eliminated, etc. As an option it'll be trimmable to the same extent RegexOptions.Compiled is trimmable today: if you don't pass any RegexOptions, we can trim it away, but if you do, we wouldn't be able to. I do expect it could be trimmed away if you instead use the source generator, just as I'd expect that for the current interpreter and in-memory compiler implementations; that should be a goal for the not-yet-written source generator, including adding new APIs if necessary to enable that. What do you suggest instead?
We can consider exposing the new regex engine via a new set of factory APIs instead of a flag. It would make it naturally trimmable. I am not sure whether it is worth it or what it would look like exactly.
Or as the basis for a new, lighter-weight regex API, more similar to e.g. the PCRE API, without the rich match/group objects (which in some cases the new engine can't support). This API could also be suitable for being fully span-based.
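Purely to illustrate the shape being floated in the last two comments, a hypothetical factory entry point might look like the following; the type and method names are invented here and were never part of the proposal:

```csharp
using System;

namespace System.Text.RegularExpressions
{
    // Hypothetical sketch only: a dedicated entry point for the new engine,
    // separate from the Regex(string, RegexOptions) constructor, so a trimmer
    // could drop the DFA engine whenever this type is never referenced.
    public static class NonBacktrackingRegex
    {
        public static Regex Create(string pattern) =>
            throw new NotImplementedException(); // sketch, not an implementation
    }
}
```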
The name "NonBacktracking" for the option came up in another discussion thread.
Yup, I'd be ok with NonBacktracking.
In .NET Core prior to Compiled being implemented, it was just a nop. Seems we could do the same thing here.
Yup, unless we anticipate an observable change in behavior (other than perf) - I assume we don't. BTW, right now Compiled effectively is used to mean "prioritize throughput and you can assume a JIT exists" and that would continue to be its meaning with the DFA engine. Conversely "prioritize throughput but you can't assume a JIT exists" would use the source generator.
Sounds good. I'll make RegexOptions.Compiled behave as a no-op in DFA (NonBacktracking) mode.
```csharp
namespace System.Text.RegularExpressions
{
    public enum RegexOptions
    {
        ...
        NonBacktracking = 0x0400
    }
}
```
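For completeness, a brief usage sketch of the approved shape (runnable only once the option ships; per the discussion above, combining it with Compiled is expected to be a no-op rather than an error):

```csharp
using System;
using System.Text.RegularExpressions;

// Opt in to the non-backtracking (DFA-based) engine.
var r = new Regex("abc|def|ghi", RegexOptions.NonBacktracking);
Console.WriteLine(r.IsMatch("xxdefxx")); // True

// Compiled combined with NonBacktracking is expected to be a no-op, not an error.
var r2 = new Regex("abc|def|ghi", RegexOptions.NonBacktracking | RegexOptions.Compiled);
```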