-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Regex.Replace Span API & optimize Regex-Redux benchmark #25097
Comments
Current allocations are about 100MB initial string, 500MB intermediate strings, and about 100MB String builders (already pooled). The scheme above should cut this down to about 2 X100MB. |
Tricky part looks like what to do with Regex.Match, with Regex.Replace uses, as the string tendrils go pretty deep. I almost think it might be easiest to try the refactoring to use Span/Memory and see how far it can be done. It would likely remove strings from most of the insides of S.T.RE. That would help inform what APi is appropriate/possible. |
The current ArrayPool implementation does not pool buffers this big. |
What about making the pool size configurable by a setting switch in AppDomain.Current.SetData? |
I don't understand the proposal... What does this return? It's allocating a result rather than writing into a destination span? Is it's built on top of match, how are you going to deal with it creating heap-based objects that store the input, which would now include a span? Pin the span(s) and pass pointers around? What are the specific allocations being avoided here? |
I updated my initial post to provide a better understanding. PTAL.
That's the tricky part! |
Do you have an estimate on how much faster is the benchmark going to run if you get rid of garbage? |
char[] buffer1Arr = ArrayPool<char>.Shared.Rent(InputFileSize);
char[] buffer2Arr = ArrayPool<char>.Shared.Rent(InputFileSize); Pooling is only worth it if there is something running a loop. Does the benchmark run in a loop? It does not look like it to me. |
GC Heap Analyzer shows 825ms was paused in GC. The benchmark currently takes ~15 seconds. |
There is ~800ms on ~15 sec scenario spent in GC. I would imagine some part of that. |
The benchmark does five Regex.Replace calls subsequently (a chain of pattern replacements). Even though it's currently not designed as, this is the loop scenario. |
That's a streaming scenario, given that the dataset is 50MB. What you would really want to make it go really fast is a streaming RegEx that you pass a stream in and it gives you a stream back. |
@jkotas interesting, how would a streaming API look? Note that the first replace removes newlines, so all subsequent regexes must operate on a single logical buffer. It would need a clever engine indeed to match against fragments of the buffer such that it had the same result. BTW, a .Replace that accepts a ROS as input seems uncontroversial - that would eliminate the string and StringBuilder involved in reading from the file..? |
Using the new shiny pipelines, it can be something like this:
This API would give you a great control over scheduling the 5 chained regex replace operations. E.g. it allows you to set it up such that second thread will start processing the second replace as soon as the first thread running the first replace starts producing output. |
Interesting. The regex work itself, in the general case, would need the entire input buffer to start work, |
The point of streaming regex engine is that it does not need to have the whole buffer available all the time. I understand that there may be non-trivial changes require in the current corefx regex engine to make it compatible streaming. BTW: Anybody happens to know whether there are good streaming engines out there? I would thing that grep or find/replace in Visual Studio may be implemented as streaming. |
PCRE has partial matching as does Boost (here). The idea seems to be if it says partial match, you feed it another block, and ask it to try again at the start position of the partial match. I could imagine that we could do such a thing - possibly substantial work. We would want to actually consume a stream, rather than just flag a partial match and let the caller deal with ti. |
A first implementation of a regex over stream could be done that was efficient for the case where there are reasonable length input lines and RegexOptions.Multiline is off (the default). It would read a line at a time from the stream. Of course that's fairly trivial to achieve with the current API. The problem with regex-redux is everything is one huge line. |
@powercode in #23602 you mentioned
Can you clarify what API you feel would improve that? Regex.Match creates strings, as you know - in the Match objects. Is merely accepting a Span still a win? |
Since our use case is the If there can be less allocations within the Match function, that is a bonus. So my scenario would benefit the most from having a Span for the match input. I haven't worked with the Span APIs yet, so I feel a bit reluctant to provide suggestions for API design, but as long as the scenario of matching every line in a file is considered, I am happy. By the way, PowerShell is currently lacking a |
One of the best streaming, high performance regex libraries is Hyperscan by Intel, although, it does not support complete regex syntax. It is licensed under BSD 3 clause license. |
Thank you all for your feedback. I will close this work item in favor of https://github.com/dotnet/corefx/issues/24145 where I'm currently drafting an API proposal for Span overloads on the existing API set. |
Rationale
Today we have the ability to replace certain parts of a string identified by a regex pattern by calling
Regex.Replace(string input, string pattern, string replacement)
. This does work well for small input strings but if you chain multiple Replace calls together and pass a large input string to it the number of string allocations are immensely high.Example
In the regex-redux benchmark there are 3 huge sets of allocations that dominate the rest.
API Proposal
Notes
This API would help us remove unnecessary string allocations in the regex-redux benchmark (http://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexredux&lang=csharpcore&id=9).
The Regex.Replace implementation uses Match internally to check the pattern on the input string. Can we also change the Match code to operate on Spans or do we need to allocate a string in that step?
Sample Usage
Relates to https://github.com/dotnet/corefx/issues/27124
cc @stephentoub @danmosemsft @vancem @jkotas @jfree
The text was updated successfully, but these errors were encountered: