Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RegexGenerator support for RegexOptions.NonBacktracking #61903

Open
stephentoub opened this issue Nov 22, 2021 · 9 comments
Open

Add RegexGenerator support for RegexOptions.NonBacktracking #61903

stephentoub opened this issue Nov 22, 2021 · 9 comments

Comments

@stephentoub
Copy link
Member

Currently, if you specify RegexOptions.NonBacktracking with the Regex source generator, e.g.

[RegexGenerator("abc", RegexOptions.NonBacktracking)]
private static partial Regex Example();

you get code like this:

    private static partial Regex Example() => GeneratedRegex_Example_A6B1D425.Instance;
    
    private sealed class GeneratedRegex_Example_A6B1D425 : Regex
    {
        public static Regex Instance { get; } = new Regex("abc", RegexOptions.NonBacktracking);
    }

In other words, you just get a cached instance of constructing Regex rather than having the implementation actually output in the source. That's because we currently special-case this flag:

/// <summary>Gets whether a given regular expression method is supported by the code generator.</summary>
private static bool SupportsCustomCodeGeneration(RegexMethod rm) =>
// The generator doesn't currently know how to emit code for NonBacktracking.
(rm.Options & RegexOptions.NonBacktracking) == 0;

due to it not yet being implemented:
/// <summary>Emits the body of a Go method supporting RegexOptions.NonBacktracking.</summary>
private static void EmitNonBacktrackingGo(IndentedTextWriter writer, RegexMethod rm, string id)
{
// TODO: Implement this and remove SupportsCustomCodeGeneration.
}

We should implement it 😄, at least for a subset of expressions, e.g. ones where we can fully construct a reasonably-sized DFA graph at build time and emit an executable representation of it into the source.

cc: @olsaarik, @veanes, @joperezr, @danmoseley

@stephentoub stephentoub added this to the 7.0.0 milestone Nov 22, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Nov 22, 2021
@ghost
Copy link

ghost commented Nov 22, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Currently, if you specify RegexOptions.NonBacktracking with the Regex source generator, e.g.

[RegexGenerator("abc", RegexOptions.NonBacktracking)]
private static partial Regex Example();

you get code like this:

    private static partial Regex Example() => GeneratedRegex_Example_A6B1D425.Instance;
    
    private sealed class GeneratedRegex_Example_A6B1D425 : Regex
    {
        public static Regex Instance { get; } = new Regex("abc", RegexOptions.NonBacktracking);
    }

In other words, you just get a cached instance of constructing Regex rather than having the implementation actually output in the source. That's because we currently special-case this flag:

/// <summary>Gets whether a given regular expression method is supported by the code generator.</summary>
private static bool SupportsCustomCodeGeneration(RegexMethod rm) =>
// The generator doesn't currently know how to emit code for NonBacktracking.
(rm.Options & RegexOptions.NonBacktracking) == 0;

due to it not yet being implemented:
/// <summary>Emits the body of a Go method supporting RegexOptions.NonBacktracking.</summary>
private static void EmitNonBacktrackingGo(IndentedTextWriter writer, RegexMethod rm, string id)
{
// TODO: Implement this and remove SupportsCustomCodeGeneration.
}

We should implement it 😄, at least for a subset of expressions, e.g. ones where we can fully construct a reasonably-sized DFA graph at build time and emit an executable representation of it into the source.

cc: @olsaarik, @veanes, @joperezr, @danmoseley

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: 7.0.0

@joperezr joperezr removed the untriaged New issue has not been triaged by the area owner label Dec 15, 2021
@stephentoub stephentoub modified the milestones: 7.0.0, Future Mar 11, 2022
@LennoxP90
Copy link

has any headway happened on this subject?

@stephentoub
Copy link
Member Author

has any headway happened on this subject?

No

@danmoseley
Copy link
Member

@LennoxP90 can you share what motivates your interest?

@LennoxP90
Copy link

just the non backtracking feature so we can have the best of both worlds fast startup and non backtracking support

@danmoseley
Copy link
Member

@LennoxP90 just curious what is your use case for non backtracking - defense in depth against flawed patterns, use of untrusted patterns, better perf in particular case, etc..

@danmoseley
Copy link
Member

(Your post makes complete sense of course just interested to learn more about users of this engine)

@LennoxP90
Copy link

i may not need it but i have a regex that determines if an input is a link or not here is my regex:
@"\b(?:https?://)\S+\b"
I just blanketly set all my regex to use non backtracking for a possible optimization maybe I am using the feature incorrectly I do not know.

@joperezr
Copy link
Member

joperezr commented Feb 4, 2023

The RegexOptions.NonBacktracking engine's main use case, is to ensure performance that it is linear to the length of the input, which is something that people care about in order to prevent catastrophic backtracking, also known as a ReDOS attack. In other words, there are patterns that have some characteristics which makes them prone to be exploited by specially crafted short inputs which can make a backtracking engine to loop excessively to the point it may crash the process. A non-backtracking engine on the other hand, won't loop in cycles the same way, so the only way to have one of those engines do more work, is literally by having a super large input.

That said, for the vast majority of cases a backtracking engine will be much faster into looking for a match (especially for the cases when a match exists) which is why most people that care about performance and know what their pattern is at compile time should opt for the source generated or compiled engine. For your specific pattern called above, there are only two loops, the optional s in https (which will be made an atomic loop so this makes this loop no longer relevant), and the \S+ part. One potential thing to consider is to remove the \b at the end, as the \S+ will already be a greedy loop and match as many non-whitespace characters as possible until it reaches a boundary. Another option is to mark that part as atomic loop so you prevent backtracking like (?>\S+). In any case, your pattern is not very likely to hit catastrophic backtracking (unless I'm missing something obvious) so if you want an optimized performance, you'd probably want to use the Source generated engine instead of RegexOptions.NonBacktracking.

I believe this is why @danmoseley above asked if your use case involved running unkown or untrusted patterns, which is a very valid reason why to use RegexOptions.NonBacktracking and sacrifice a bit of perf for cases there is a match, for the advantage of reducing the likelihood of having a ReDOS attack. Our colleague @stephentoub wrote a very good explanation of all of this and a description of the NonBacktracking feature in this blog post, I'd highly recommend checking it out 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants