Augment Regex extensibility point for better perf and span-based matching #59629
Comments
Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
Issue Details
Background and motivation
There are several related Regex efforts planned for .NET 7:
On top of that, there are several long-standing issues to be addressed:
Each of these brings with it some difficulties that, altogether, suggest we augment how Regex supports extensibility.
Now, there are a few relevant issues here:
These issues could all be addressed internally to the library, except that doing so can’t support the source generator, as doing so would require internal APIs that the code it generates can’t see. We thus need just enough new surface area to support this. We also want to address this in the same release as shipping the source generator initially; otherwise, any implementations generated based on .NET 7 would lack the support necessary to enable these things, and all such libraries would need to be recompiled to benefit, e.g. we could make IsMatch(ReadOnlySpan<char>) work without recompilation, but it would involve converting the span to a string.
This issue mostly replaces #23602. However, that issue proposes additional span-based Split, Replace, and {Un}Escape methods, all of which could be built once this functionality is in place.
API Proposal
This is a work-in-progress. We will need to prototype the APIs to confirm they’re both necessary and sufficient, and augment the proposal until they are. We should:
namespace System.Text.RegularExpressions
{
public class RegexRunner
{
+ protected virtual void Scan(ReadOnlySpan<char> text);
}
}
The base implementation would effectively contain the existing bumpalong loop that invokes both FindFirstChar and Go.
namespace System.Text.RegularExpressions
{
public class RegexRunner
{
+ protected bool quick; // or `bool isMatch`
+ protected int timeoutMilliseconds;
+ protected int previousMatchLength;
}
}
These are all inputs today to the internal
namespace System.Text.RegularExpressions
{
public class Regex
{
+ public bool IsMatch(ReadOnlySpan<char> input);
+ public static bool IsMatch(ReadOnlySpan<char> input, string pattern);
+ public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+ public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);
}
}
namespace System.Text.RegularExpressions
{
public class Regex
{
+ public MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input);
+ public static MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input, string pattern);
+ public static MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+ public static MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);
+ public ref struct MatchPositionEnumerator
+ {
+ public MatchPositionEnumerator GetEnumerator();
+ public bool MoveNext();
+ public Range Current { get; } // could also add a new struct for the return type here rather than System.Range
+ }
}
}
API Usage
bool matched = Regex.IsMatch(spanInput, "pattern");
...
foreach (Range r in Regex.Enumerate(spanInput, "pattern"))
{
    Process(spanInput[r]);
}
Risks
No response
|
@jeffhandley, this would be a good issue for someone as a way to ramp up on S.T.RE while also accomplishing several impactful results. It could be split into a handful of PRs, each moving the ball forward, e.g. PRs:
At this point, the internal implementations should be able to fully support span inputs (which can't be passed in yet), and it should be trivial to prototype the public surface area in order to confirm it works and get it approved. At which point...
|
Small question about the new static APIs: should they add several overloads (to stay consistent with the old API) or would it be better to add just one overload with optional parameters (to make the API simpler and easier to understand)? |
TimeSpans can't be optional, and the TimeSpan matchTimeout should be at the end to be consistent with all the other methods; it could be made nullable, but that then introduces multiple values for infinite, which I don't love. Combine that with a desire for consistency across all the APIs, and my preference is to use multiple overloads instead of optional parameters here. But it's something that can be discussed in API review. |
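To make the trade-off concrete, here is a rough sketch of what a single optional-parameter overload might look like. `IsMatchOptional` and `OptionalTimeoutSketch` are hypothetical names used only for illustration, not proposed API, and the sketch assumes .NET 7 or later, where the span-based `Regex.IsMatch` overloads exist. C# requires default parameter values to be compile-time constants, so `Regex.InfiniteMatchTimeout` (a static readonly field) cannot be the default; the sketch falls back to `TimeSpan?`, which is where the second "infinite" value comes from.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionalTimeoutSketch
{
    // Hypothetical single-overload shape. Mapping null to
    // Regex.InfiniteMatchTimeout is an assumption made for illustration:
    // callers now have two different ways of saying "no timeout".
    public static bool IsMatchOptional(
        ReadOnlySpan<char> input,
        string pattern,
        RegexOptions options = RegexOptions.None,
        TimeSpan? matchTimeout = null)
    {
        return Regex.IsMatch(input, pattern, options, matchTimeout ?? Regex.InfiniteMatchTimeout);
    }
}
```

With this shape, `IsMatchOptional(span, @"\d+")` and `IsMatchOptional(span, @"\d+", matchTimeout: Regex.InfiniteMatchTimeout)` behave the same, which is the ambiguity the comment above objects to.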
Regarding this API suggestion, I suppose you meant |
Yes. (I left out the internal part as it's an implementation detail and not part of the public signature.) |
Another thing that would need to be exposed in order for this to work (with the source generator) I believe is something in The additional proposed field would be something like:
namespace System.Text.RegularExpressions
{
public class RegexRunner
{
+ protected Match? scanMatchResult;
}
} |
@joperezr, can you help me understand why that's needed? You may very well be right, but I'm not seeing it. For example, SymbolicRegexRunner.Go is essentially the new Scan (albeit without the span argument) and doesn't need such a field: Line 90 in 1996f3d
It just uses Capture, and then the caller can see whether a match occurred by checking the state set by Capture. |
Looks like I got confused a bit and I was wrong about this assumption. That said, I did see that runmatch is not always set back to null after running Scan, for example, when Lines 190 to 194 in 1996f3d
As well as when using Split/Replace: Lines 308 to 313 in 1996f3d
In those cases, runmatch will be kept alive and reused and simply just reset to the right positions using: Lines 452 to 462 in 1996f3d
This may mean that it is ok for us to reuse since it will never be handed out to caller, so I will continue the prototype to see if it can be reused or if we would need the additional protected field in order to not break an existing scenario. |
I have updated the top level proposal and factored out the |
Looks good as proposed
namespace System.Text.RegularExpressions
{
public class RegexRunner
{
- protected abstract bool FindFirstChar();
- protected abstract void Go();
- protected abstract void InitTrackCount();
+ protected virtual bool FindFirstChar() => throw new NotImplementedException(); //default implementation
+ protected virtual void Go() => throw new NotImplementedException(); //default implementation
+ protected virtual void InitTrackCount() => return 0; //default implementation
+ protected virtual void Scan(ReadOnlySpan<char> text);
}
public class Regex
{
+ public bool IsMatch(ReadOnlySpan<char> input);
+ public static bool IsMatch(ReadOnlySpan<char> input, string pattern);
+ public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+ public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);
}
} |
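The relationship between the virtual `FindFirstChar`/`Go` defaults and the new `Scan` method in this API shape can be sketched with a small self-contained model. Everything below is hypothetical: `ToyRunner` and its members are stand-ins and are not the real `RegexRunner`, and the compat check compares span contents where the proposal only says the span is compared against `runtext`. The sketch shows an old-style runner that overrides only `FindFirstChar`/`Go` and is driven by the base bumpalong loop, next to a new-style runner that overrides only `Scan`.

```csharp
using System;

// Hypothetical stand-in for the runner base class; not the real RegexRunner.
public abstract class ToyRunner
{
    protected string Text = "";      // plays the role of runtext
    protected int Pos;               // plays the role of runtextpos
    protected int MatchStart = -1;   // set by Go or Scan when a match is found
    protected int MatchLength;

    // Virtual rather than abstract: runners that override Scan never call these.
    protected virtual bool FindFirstChar() => throw new NotImplementedException();
    protected virtual void Go() => throw new NotImplementedException();

    // Compat path: the classic bumpalong loop over FindFirstChar/Go.
    protected virtual void Scan(ReadOnlySpan<char> text)
    {
        // Materialize a string only when the span differs from the cached text.
        if (!text.SequenceEqual(Text.AsSpan())) Text = text.ToString();

        while (FindFirstChar())      // find a position where a match could start
        {
            Go();                    // try to match at that position
            if (MatchStart >= 0 || Pos >= Text.Length) return;
            Pos++;                   // no match: bump along one char and retry
        }
    }

    public bool TryMatch(ReadOnlySpan<char> text, out int start, out int length)
    {
        Pos = 0; MatchStart = -1; MatchLength = 0;
        Scan(text);
        start = MatchStart; length = MatchLength;
        return MatchStart >= 0;
    }
}

// Old-style runner: overrides only FindFirstChar/Go; matches two consecutive digits.
public sealed class LegacyTwoDigitRunner : ToyRunner
{
    protected override bool FindFirstChar()
    {
        while (Pos < Text.Length && !char.IsDigit(Text[Pos])) Pos++;
        return Pos < Text.Length;
    }

    protected override void Go()
    {
        if (Pos + 1 < Text.Length && char.IsDigit(Text[Pos + 1])) { MatchStart = Pos; MatchLength = 2; }
    }
}

// New-style runner: overrides only Scan and works on the span directly.
public sealed class SpanTwoDigitRunner : ToyRunner
{
    protected override void Scan(ReadOnlySpan<char> text)
    {
        for (int i = 0; i + 1 < text.Length; i++)
        {
            if (char.IsDigit(text[i]) && char.IsDigit(text[i + 1])) { MatchStart = i; MatchLength = 2; return; }
        }
    }
}
```

Both runners report the same match for the same input, but only the legacy one pays for `text.ToString()`, which mirrors the point that the base implementation exists purely for compatibility.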
Background and motivation
There are several related Regex efforts planned for .NET 7:
On top of that, there are several long-standing issues to be addressed:
ReadOnlySpan<char> inputs in addition to string inputs.
Match objects (and supporting data structures) for every match (on my 64-bit machine, a successful Match(string) with no subcaptures allocates three objects for a total of ~170b).
Each of these brings with it some difficulties that, altogether, suggest we augment how Regex supports extensibility.
Regex enables compilation of regular expressions via an extensibility point (such compilation includes in-memory with RegexOptions.Compiled, Regex.CompileToAssembly on .NET Framework, and now the RegexGenerator in .NET 7). Regex exposes several protected fields (!), one of which is of type RegexRunnerFactory. When the public APIs on Regex need to perform a match, they do so via a RegexRunner, and if no RegexRunner is available (e.g. we’ve not performed a match yet), that RegexRunnerFactory is used to create one. RegexRunner then has several abstract methods, the most important ones being bool FindFirstChar() and void Go(). Thus, when we compile a regex, we need to produce a RegexRunner-derived type that overrides FindFirstChar/Go, a RegexRunnerFactory-derived type that overrides CreateInstance(), and a Regex-derived type that sets the protected factory field to an instance of this RegexRunnerFactory. This is exactly what the source generator spits out for a given use of [RegexGenerator(…)].
Now, there are a few relevant issues here:
RegexRunner, and as FindFirstChar/Go are parameterless, we don’t have any way to pass a span into them. Thus, we currently lack a means for the code generated in overrides of these to operate on ReadOnlySpan<char>. To match an input ReadOnlySpan<char>, we could convert it into a string, but obviously that defeats the point.
FindFirstChar and Go was likely predicated on the notion that there was no benefit to combining them because the employed matching algorithm logically splits them in this way: find a place that could match, try to match, find the next place, try to match, etc. But that is not how a DFA-based engine operates, where every character moves to the next state, whether that character is ultimately part of the match or not. This split is thus very artificial for a DFA-based matcher, which essentially has all of the logic in Go, with FindFirstChar effectively being a nop.
These issues could all be addressed internally to the library, except that doing so can’t support the source generator, as doing so would require internal APIs that the code it generates can’t see. We thus need just enough new surface area to support this. We also want to address this in the same release as shipping the source generator initially; otherwise, any implementations generated based on .NET 7 would lack the support necessary to enable these things, and all such libraries would need to be recompiled to benefit, e.g. we could make IsMatch(ReadOnlySpan<char>) work without recompilation, but it would involve converting the span to a string.
This issue mostly replaces #23602. However, that issue proposes additional span-based Split, Replace, and {Un}Escape methods, all of which could be built once this functionality is in place.
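The point about DFA-based matching can be illustrated with a toy automaton. This is only a sketch under stated assumptions: `ToyDfa` and `ContainsAb` are hypothetical names, the automaton is hand-written for the single question "does the input contain the substring ab", and it has nothing to do with how the real NonBacktracking engine is implemented. What it shows is that every character drives a state transition, so there is no natural place to separate "find a candidate start" from "try to match".

```csharp
using System;

public static class ToyDfa
{
    // Hypothetical three-state DFA for "input contains the substring ab":
    //   state 0: nothing useful seen yet
    //   state 1: the previous character was 'a'
    //   state 2: "ab" has been seen (accepting)
    public static bool ContainsAb(ReadOnlySpan<char> text)
    {
        int state = 0;
        foreach (char c in text)
        {
            // Every character advances the automaton, whether or not it
            // ends up being part of the match.
            state = state switch
            {
                0 => c == 'a' ? 1 : 0,
                1 => c == 'b' ? 2 : (c == 'a' ? 1 : 0),
                _ => 2,
            };
            if (state == 2) return true;
        }
        return false;
    }
}
```

The whole match is a single pass over the span, which is why a combined span-taking entry point suits this style of engine better than a FindFirstChar/Go pair.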
API Proposal
We should:
namespace System.Text.RegularExpressions
{
public class RegexRunner
{
+ protected virtual void Scan(ReadOnlySpan<char> text);
}
}
The base implementation would effectively contain the existing bumpalong loop that invokes both FindFirstChar and Go. It would compare the provided span against runtext and set runtext to span.ToString() only if they differ (as noted below, this base implementation will rarely be used and will be there purely for compat).
Override this Scan method on all our implementations and in the source generator-generated code: the only time you’d get the base implementation would be if you were using an old .NET Framework-generated CompileToAssembly assembly, and the only time it would lead to inefficiencies is if you then used those in combination with the new span-based public APIs. The implementations of FindFirstChar/Go for the interpreter, RegexOptions.Compiled, and source generator would all be combined into this single method. NonBacktracking effectively already has these combined.
Make the existing FindFirstChar/Go/InitTrackCount methods virtual instead of abstract. The first two would throw NotImplementedExceptions; the third would return 0. None of our implementations would need to override these methods anymore, just Scan.
IsMatch methods to Regex (these are the same set of overloads as for string, but with a ReadOnlySpan<char> input):
API Usage
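A minimal usage sketch for the span-based IsMatch overloads, assuming .NET 7 or later where they are available. The class and method names (`SpanIsMatchUsage`, `SliceHasNumber`, `IsSingleWord`) and the patterns are made up for illustration; only the `Regex.IsMatch` calls correspond to the overloads listed in the proposal.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanIsMatchUsage
{
    // Reusable instance with an explicit timeout; the pattern is illustrative.
    private static readonly Regex s_word =
        new Regex(@"^\w+$", RegexOptions.CultureInvariant, TimeSpan.FromSeconds(1));

    // Static overload: test a slice of a larger string without
    // allocating a substring for it first.
    public static bool SliceHasNumber(string text, int start, int length)
    {
        ReadOnlySpan<char> slice = text.AsSpan(start, length);
        return Regex.IsMatch(slice, @"\d+");
    }

    // Instance overload taking a span directly.
    public static bool IsSingleWord(ReadOnlySpan<char> input) => s_word.IsMatch(input);
}
```

Note that anchors such as ^ and $ apply to the span that is passed in, so slicing the input changes what they match against.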
Risks
No response