Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex code optimization candidates #25043

Closed
4 of 7 tasks
ViktorHofer opened this issue Feb 14, 2018 · 16 comments
Closed
4 of 7 tasks

Regex code optimization candidates #25043

ViktorHofer opened this issue Feb 14, 2018 · 16 comments
Assignees
Labels
area-System.Text.RegularExpressions enhancement Product code improvement that does NOT require public API changes/additions help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@ViktorHofer
Copy link
Member

The following items are potential candidates for code optimization which could result in an increased performance / reduced allocation. Based on perfview data and manual code review.

Time

Allocations

In regex-redux there are 3 huge sets of allocations that dominate the rest

  • There is a cascade of replacements on a 50M character string, which creates 500MB of string garbage, presumably in LOH
    image
  • Its input file is read into a 100MB string. (double byte - 50MB file)
  • There are 100MB of StringBuilder allocations.

Could we use spans or pooled buffers or otherwise avoid these huge temporary strings?

Help is appreciated. If someone wants to collaborate I can share my perfview zip. I will extend this list over time.

FYI @stephentoub @jkotas @joshfree @vancem

@danmoseley
Copy link
Member

@dnickless I believe you mentioned in the past you were interested in perf optimizations?

@dnickless
Copy link
Contributor

Thanks, @danmosemsft, for pinging me. I don't have an awful lot of time these days so I can't promise anything but I will do my best.

@danmoseley
Copy link
Member

Although RTL is a global option only, it's used internally to implement lookbehind assertions. So it is a dynamic value. Updated text above.

@danmoseley
Copy link
Member

For the stringmatch case, I tried the diff below and it made perf_regex.match a few % slower:

--- a/src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs
+++ b/src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs
@@ -281,7 +281,7 @@ namespace System.Text.RegularExpressions
             else
             {
                 while (c != 0)
-                    if (str[--c] != _culture.TextInfo.ToLower(runtext[--pos]))
+                    if (str[--c] != runtext[--pos] && str[c] != _culture.TextInfo.ToLower(runtext[pos]))
                         return false;
             }

Still, @tarekgh it's a shame that TextInfo.ToLower(char) allocates a string (albeit 1 character string) even when the output is the same character.

@tarekgh
Copy link
Member

tarekgh commented Apr 2, 2018

@danmosemsft

@stephentoub talked to me this morning regarding the exact same issue and he mentioned going to look at it. I can help if @stephentoub don't have cycles for that.

@stephentoub
Copy link
Member

I have a fix I'm perf testing.

@danmoseley
Copy link
Member

danmoseley commented Apr 3, 2018

@stephentoub 's fix does not help ToLower(char). Perhaps stringmatch is as good as it can be right now.

@stephentoub
Copy link
Member

@stephentoub fix does not help ToLower(char)

That's because TextInfo.ToLower(char) already wasn't allocating 😄

@danmoseley
Copy link
Member

That's because TextInfo.ToLower(char) already wasn't allocating

I meant in wall-clock.

@Wraith2
Copy link
Contributor

Wraith2 commented Apr 6, 2018

I've been investigating the EnsureStorage function. While it seems to be possible to rewrite it to use stackalloc spans I'm not sure if the benefit is going to be worth the effort. Altering stack and track (at minimum, possibly crawl as well) to be span requires all the associated functions (push, pop, peek and overloads for all of them) to change to take the span parameters which is going to add a lot of extra parameter passing into the Go method.

It will also cause a public api change because the method is protected it will need to change from EnsureStorage() to EnsureStorage(ref Span stack, ref Span track).

At the moment that would only save allocation of 32+16 bytes at the default minimum sizes. I'm not sure what the maximum sensible stack allocation size would be before switching back to heap allocation but i can't see it making a lot of difference unless you're memory constrained and running quickly through matching in loops, which doesn't seems sensible in the first place.

@danmoseley
Copy link
Member

@Wraith2, @ViktorHofer can show you his private branch with his spanification work which may be interesting to you

@Wraith2
Copy link
Contributor

Wraith2 commented Apr 7, 2018

That may be instructive, I'd like to see that. I'm not sure how it would be possible to maintain the contract in a functional state while moving to spans for the initial storage.

@ViktorHofer
Copy link
Member Author

This is my private branch with the ongoing Spanification work: ViktorHofer/corefx#1

I don't think that a lot of time is really spent in EnsureStorage as the runner with its buffer is cached across its Regex instance. Only the first time when the buffer is empty or when the buffer is too small the DoubleStack and DoubleTrack helpers will be called. Its definitely problematic that the buffers and the EnsureStorage function is exposed (protected). In my branch you can see that I forked the RegexInterpreter away from the RegexRunner so that I'm able to modernize it without touching the exposed parts.

@Wraith2
Copy link
Contributor

Wraith2 commented May 28, 2018

Your branch is good progress towards modernising the entire assembly. I don't think there is any specific element in the current code that can be optimized without having an unfortunate effect elsewhere. In particular the tight coupling of the internals of the interpreter/runner and the use of private fields and methods through reflection make it very difficult to work with. I'm not sure this should be marked as up for grabs anymore.

@karelz
Copy link
Member

karelz commented May 29, 2018

@ViktorHofer can you please check if it is really appropriate for Hackathon? If yes, please unassign yourself & Dan + add comment about next steps.

@ViktorHofer
Copy link
Member Author

Indeed, this issue requires a lot of knowledge about the code base. Seems right to not mark it as a Hackathon candidate. I will still leave the up-for-grabs label as I would be super happy if community members find further optimization candidates.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 18, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.RegularExpressions enhancement Product code improvement that does NOT require public API changes/additions help wanted [up-for-grabs] Good issue for external contributors
Projects
None yet
Development

No branches or pull requests

8 participants