[API Proposal]: Add Regex.Enumerate(ReadOnlySpan<char>) which is allocation free #65011
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
Issue Details
Background and motivation
This issue breaks out one part of the original proposal: #59629. We have an ongoing Regex investment for .NET 7 that adds span-based matching APIs which can operate without allocating. The ones being worked on in #59629 cover only the IsMatch overloads, which are APIs where the caller only cares whether the input matches and does not want the Match object back. We can't really provide APIs that work over ReadOnlySpan<char>, are alloc-free, and return a Match object, since Match is a class and can't hold a span as a field.
API Proposal
namespace System.Text.RegularExpressions
{
    public class Regex
    {
+       public MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input);
+       public static MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input, string pattern);
+       public static MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+       public static MatchPositionEnumerator Enumerate(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);
+       public ref struct MatchPositionEnumerator
+       {
+           public MatchPositionEnumerator GetEnumerator();
+           public bool MoveNext();
+           public Range Current { get; } // could also add a new struct for the return type here rather than System.Range
+       }
    }
}
API Usage
foreach (Range r in Regex.Enumerate(spanInput, "pattern"))
{
    Process(spanInput[r]);
}
Alternative Designs
One of the topics for discussion here is that the Enumerate method won't give access to capture data …
Risks
We want to make sure that by introducing this new API, we don't create confusion about whether it should be used over an existing API like Matches …
I have done some experimentation on this, and have also discussed it offline with both @stephentoub and @bartonjs. Based on this, I have made a few changes to my prototype and will update the proposal above accordingly. Here are the main points:
I will update the main description to apply the above suggestions.
For Capture support, we can later propose something like:
namespace System.Text.RegularExpressions
{
    public readonly ref struct ValueMatch
    {
        public Range Range { get { throw null; } }
+       public Match? Match { get { throw null; } }
    }
    public class Regex : ISerializable
    {
+       public ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, bool includeCaptures) { throw null; }
+       public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, bool includeCaptures, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern) { throw null; }
+       public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, bool includeCaptures, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern, RegexOptions options) { throw null; }
+       public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, bool includeCaptures, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern, RegexOptions options, TimeSpan matchTimeout) { throw null; }
    }
}
We broke Range up into Index and Length to better match the current Match type. We can easily add a Range property later if it's desired. The …
namespace System.Text.RegularExpressions
{
    // NOTE: This is being approved as a ref struct because we have concerns that it will need it with
    // a more complete implementation of ValueMatch.
    // If the ref-ness can safely be removed, we'd rather this was a simple (non-ref) struct.
    public readonly ref struct ValueMatch
    {
        public int Index { get; }
        public int Length { get; }
    }
    public class Regex : ISerializable
    {
        public ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input) { throw null; }
        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern) { throw null; }
        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex, "options")] string pattern, RegexOptions options) { throw null; }
        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex, "options")] string pattern, RegexOptions options, TimeSpan matchTimeout) { throw null; }
        public ref struct ValueMatchEnumerator
        {
            public readonly ValueMatchEnumerator GetEnumerator() { throw null; }
            public bool MoveNext() { throw null; }
            public readonly ValueMatch Current { get { throw null; } }
        }
    }
}
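To make the approved shape concrete, here is a minimal sketch of how a caller might consume Index and Length. It assumes .NET 7 or later, where EnumerateMatches is available with this surface; the helper name LongestMatch, the sample input, and the pattern are illustrative only and not part of the proposal.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the position of the longest match, or (-1, 0) if none.
// Nothing is allocated per match; only Index and Length are read from each ValueMatch.
static (int Index, int Length) LongestMatch(ReadOnlySpan<char> input, string pattern)
{
    (int Index, int Length) best = (-1, 0);
    foreach (ValueMatch m in Regex.EnumerateMatches(input, pattern))
    {
        if (m.Length > best.Length)
        {
            best = (m.Index, m.Length);
        }
    }
    return best;
}

ReadOnlySpan<char> text = "a1 bb22 ccc333";
var longest = LongestMatch(text, @"\d+");

// Slice the original span to recover the matched text; ToString() is only for printing.
Console.WriteLine(text.Slice(longest.Index, longest.Length).ToString()); // prints 333
```

Because ValueMatch carries no reference to the input, the caller slices the original span itself, which is what keeps the enumerator free of per-match objects.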
This issue is breaking down one part of this original proposal: #59629
cc: @stephentoub
Background and motivation
We have an ongoing Regex investment for .NET 7 that adds span-based matching APIs which can operate without allocating. The ones being worked on in #59629 cover only the IsMatch overloads, which are APIs where the caller only cares whether the input matches and does not want the Match object back. We can't really provide APIs that work over ReadOnlySpan<char>, are alloc-free, and return a Match object, since the Match object holds the string that matched along with the captures, and because Match is a class, it can't hold a span as a field.
Enumerate would not permit access to the full list of groups and captures, just the index/offset of the top-level capture, but in exchange these Enumerate methods can become amortized zero-alloc: the enumerator is a ref struct, no objects are yielded, the input is a span, and the matching engine can reuse the internal Match object (and its supporting arrays), just as is done today with IsMatch to make it amortized zero-alloc. If someone still needs the full details, they can fall back to using strings to begin with and either the existing Match or Matches, or (for some patterns, e.g. ones that don't have anchors, lookaheads, or lookbehinds that might reach beyond the matching boundaries) re-run the engine with Match(matchString) on just the string representing the area of the input that matched. (The trouble with adding Regex.Match/Matches overloads for spans is that the Match and MatchCollection types can't store a Span; thus various surface area on these types, like NextMatch, couldn't function with spans. If we were to accept that, we could add span-based methods for those as well, but it would likely be confusing and inconsistent.)
API Proposal
API Usage
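A minimal sketch of the two-pass fallback described in the motivation: enumerate positions over the span, then re-run the regex on just the matched slice when groups are needed. It is written against the EnumerateMatches/ValueMatch shape that appears in the review comment in this thread (assuming .NET 7 or later), not the Enumerate/Range shape proposed here. The helper name SumValues, the pattern, and the input are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: sums the numeric value of every key=value pair in a line.
// Pass 1 finds match positions over the span without yielding objects;
// pass 2 re-runs the regex on the matched slice alone to read its groups.
static int SumValues(ReadOnlySpan<char> line)
{
    // Illustrative pattern with no anchors or lookarounds, so re-matching
    // the isolated slice yields the same match as in the full input.
    var keyValue = new Regex(@"(\w+)=(\d+)");
    int sum = 0;
    foreach (ValueMatch m in keyValue.EnumerateMatches(line))
    {
        // Allocations happen only here: a string for the slice, plus the Match and its groups.
        string matched = line.Slice(m.Index, m.Length).ToString();
        Match full = keyValue.Match(matched);
        sum += int.Parse(full.Groups[2].Value);
    }
    return sum;
}

Console.WriteLine(SumValues("width=640 height=480 depth=x")); // prints 1120
```

The second pass is only safe for patterns whose result does not depend on text outside the matched region, which is the caveat the motivation raises about anchors and lookarounds.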
Alternative Designs
One of the topics for discussion here is that the Enumerate method won't give access to capture data, which might confuse consumers: it is not obvious that the intent of this method is to be allocation free, so consumers might expect to get a Match enumerable back. One idea suggested by @stephentoub is that we could also provide ref struct versions of Match, Group, and Capture that reference the ReadOnlySpan<char>, and Enumerate would return those instead, making the return value more intuitive and closer to what consumers expect.
Risks
We want to make sure that by introducing this new API, we don't create confusion about whether it should be used over an existing API like Matches, which already returns an enumerable of Match objects. We have to make it obvious when one should be used over the other.
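One way to illustrate the distinction is to run both APIs side by side. The sketch below assumes .NET 7 or later and uses the EnumerateMatches name from the review comment in this thread; the helper CountMatches, the pattern, and the input are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "2021-11-01 and 2022-02-08";
string datePattern = @"(\d{4})-(\d{2})-(\d{2})";

// Existing API: requires a string and allocates a Match per result,
// but exposes groups and captures.
foreach (Match m in Regex.Matches(text, datePattern))
{
    Console.WriteLine($"year {m.Groups[1].Value} at index {m.Index}");
}

// Span-based API: accepts a span and yields only Index and Length,
// so it suits callers that need positions but not capture data.
static int CountMatches(ReadOnlySpan<char> input, string regexPattern)
{
    int count = 0;
    foreach (ValueMatch m in Regex.EnumerateMatches(input, regexPattern))
    {
        count++;
    }
    return count;
}

Console.WriteLine(CountMatches(text, datePattern)); // prints 2
```

A rough rule of thumb under these assumptions: reach for Matches when group or capture data is needed, and for the span-based enumerator when only match positions matter or the input is already a span.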