Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex parser nits #88558

Merged
merged 20 commits into from
Jul 14, 2023
Merged

Regex parser nits #88558

merged 20 commits into from
Jul 14, 2023

Conversation

danmoseley
Copy link
Member

@danmoseley danmoseley commented Jul 9, 2023

trivial changes.

  1. I ran the real world patterns corpus through the parser and didn't notice anything that could be obviously improved except for one dictionary which was almost always resizing. Set size to eliminate resizing in 90% of cases.
  2. Inlined some more low value helpers in RegexParser where I think it makes it clearer.
  3. Marked methods readonly where our analyzer instructed it.
  4. Some use of IndexOf() per feedback in Inline more helpers in RegexParser #88392

@ghost
Copy link

ghost commented Jul 9, 2023

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

trivial changes.

  1. I ran the real world patterns corpus through the parser and didn't notice anything that could be obviously improved except for one dictionary which was almost always resizing. Set size to eliminate resizing in 90% of cases.
  2. Inlined some more low value helpers in RegexParser where I think it makes it clearer.
  3. Marked methods readonly where our analyzer instructed it.
Author: danmoseley
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@@ -1158,8 +1158,6 @@ private void ScanBlank()
/// <summary>Scans \-style backreferences and character escapes</summary>
private RegexNode? ScanBasicBackslash(bool scanOnly)
{
Debug.Assert(_pos < _pattern.Length, "The current reading position must not be at the end of the pattern");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed these asserts because there'll be a null ref a couple lines down in each case; we don't put asserts in front of every possible null ref.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understood the comment, but it looks to me that these are asserting The current reading position not a null ref.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have said IndexOutOfRangeException

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't put asserts in front of every possible null ref.

No, but we do put asserts in places that are meant to act as preconditions / contracts. These asserts make it really easy for a maintainer to see that these methods should only be called when there's pattern remaining. I don't think removing them was valuable.

@danmoseley danmoseley closed this Jul 10, 2023
@danmoseley danmoseley reopened this Jul 10, 2023
@danmoseley
Copy link
Member Author

Test failures unrelated

@danmoseley danmoseley merged commit 24f5cbb into dotnet:main Jul 14, 2023
@danmoseley danmoseley deleted the regex.presizedict branch July 17, 2023 03:06
@stephentoub
Copy link
Member

I ran the real world patterns corpus through the parser and didn't notice anything that could be obviously improved except for one dictionary which was almost always resizing. Set size to eliminate resizing in 90% of cases.

I'm not seeing it... which line is the change to presize the dictionary?

@danmoseley
Copy link
Member Author

I'm not seeing it... which line is the change to presize the dictionary?

Thanks for spotting that -- it got lost in a rebase, apparently:
https://github.com/dotnet/runtime/compare/main...danmoseley:regex.presize?expand=1

@stephentoub
Copy link
Member

Thanks for spotting that -- it got lost in a rebase, apparently: https://github.com/dotnet/runtime/compare/main...danmoseley:regex.presize?expand=1

_stringTable = new Dictionary<string, int>(15); // avoids resize for 90% of real world patterns

How much of an overallocation is it for those? i.e. we could probably avoid it for 100% of the real-world patterns if we made it new Dictionary<string, int>(1_000_000), but that's obviously a bad answer.

Said another way, what does the histogram look like for number of strings involved in the various patterns?

@danmoseley
Copy link
Member Author

danmoseley commented Jul 17, 2023

https://gist.github.com/danmoseley/eeb38412d74c3eb22bc69472940b1f95

image

Note horizontal scale starts skipping around 60

@danmoseley
Copy link
Member Author

For something like this, it would be nice if the dictionary could start with on stack buffers.

@danmoseley
Copy link
Member Author

If the dictionary starts at zero, resizes when full to the next prime at least 2x the size, then it will have sizes 0, 3, 7, 17, 37, 89

61% would fit in 3, 80% in 7 and 96% in 17. Only 5% are zero and would not allocate at all.

0 12647 0.05 3
1 62735 0.31 3
2 43475 0.50 3
3 27852 0.61 3
4 17408 0.69 7
5 11870 0.74 7
6 8788 0.77 7
7 7623 0.80 7
8 7035 0.83 17
9 6381 0.86 17
10 5069 0.88 17
11 4178 0.90 17
12 3397 0.91 17
13 3023 0.93 17
14 2840 0.94 17
15 2243 0.95 17
16 2103 0.96 17
17 1800 0.96 17
18 1169 0.97 37
19 1360 0.97 37
20 1157 0.98 37
21 925 0.98 37
22 828 0.99 37
23 468 0.99 37
24 425 0.99 37
25 314 0.99 37
26 264 0.99 37
27 240 0.99 37
28 293 0.99 37
29 188 0.99 37
30 38 0.99 37
31 86 1.00 37

@stephentoub
Copy link
Member

61% would fit in 3, 80% in 7 and 96% in 17. Only 5% are zero and would not allocate at all.

Right. So presizing to 15 would overallocate for the majority.

@danmoseley
Copy link
Member Author

I guess it depends how you trade off a small transient allocation with a small CPU cost of copying. I do not think this is a hot path when aggregated with actually running the interpreter. But it seemed that we have some hard data here and it was easy to avoid that CPU cost in 90% of cases. I don't mind either way.

@stephentoub
Copy link
Member

Thanks.

But it seemed that we have some hard data here and it was easy to avoid that CPU cost in 90% of cases.

There's CPU cost associated with overallocation as well.

I don't think presizing it to 15 is the right tradeoff.

@danmoseley danmoseley changed the title Presize dictionary in regex interpreter Regex parser nits Jul 17, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Aug 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants