Regex parser nits #88558

danmoseley · 2023-07-09T23:00:51Z

Trivial changes.

I ran the real-world patterns corpus through the parser and didn't notice anything that could obviously be improved, except for one dictionary that was almost always resizing. I set an initial size to eliminate resizing in 90% of cases.
Inlined some more low-value helpers in RegexParser where I think it makes the code clearer.
Marked methods readonly where our analyzer suggested it.
Used IndexOf() in a few places, per feedback on "Inline more helpers in RegexParser" (#88392).
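
For readers unfamiliar with the IndexOf() item, here is a minimal sketch of the kind of change it describes: a hand-written scan loop replaced by a span-based IndexOf call. The class and method names are hypothetical and the code is not taken from RegexParser; it only assumes the parser is looking for a single character such as '$'.

```csharp
using System;

// Hypothetical illustration; not the actual RegexParser code.
public static class ScanSketch
{
    // Manual loop: advance from `start` until '$' or the end of the pattern.
    public static int FindDollarLoop(string pattern, int start)
    {
        int pos = start;
        while (pos < pattern.Length && pattern[pos] != '$')
        {
            pos++;
        }
        return pos;
    }

    // Same result using IndexOf over the remaining span.
    public static int FindDollarIndexOf(string pattern, int start)
    {
        int offset = pattern.AsSpan(start).IndexOf('$');
        // IndexOf returns -1 when '$' is absent; map that to "end of pattern".
        return offset < 0 ? pattern.Length : start + offset;
    }
}
```

Both methods return the index of the next '$' at or after `start`, or the pattern length when there is none.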

ghost · 2023-07-09T23:01:16Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

danmoseley · 2023-07-10T01:30:48Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexParser.cs

@@ -1158,8 +1158,6 @@ private void ScanBlank()
        /// <summary>Scans \-style backreferences and character escapes</summary>
        private RegexNode? ScanBasicBackslash(bool scanOnly)
        {
-            Debug.Assert(_pos < _pattern.Length, "The current reading position must not be at the end of the pattern");


Removed these asserts because there'll be a null ref a couple of lines down in each case; we don't put asserts in front of every possible null ref.

Not sure I understood the comment, but it looks to me like these are asserting "The current reading position must not be at the end of the pattern", not guarding against a null ref.

I should have said IndexOutOfRangeException.

> we don't put asserts in front of every possible null ref.

No, but we do put asserts in places that are meant to act as preconditions / contracts. These asserts make it really easy for a maintainer to see that these methods should only be called when there's pattern remaining. I don't think removing them was valuable.
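
To make the disagreement concrete, here is a small sketch of the two styles being compared. `TinyScanner` and its methods are hypothetical stand-ins, not the real RegexParser; only the assert message is taken from the diff above.

```csharp
using System;
using System.Diagnostics;

// Hypothetical, simplified stand-in for a parser; not the real RegexParser.
public sealed class TinyScanner
{
    private readonly string _pattern;
    private int _pos;

    public TinyScanner(string pattern, int pos = 0)
    {
        _pattern = pattern;
        _pos = pos;
    }

    // With the assert: the precondition is stated at the top of the method,
    // so a maintainer can see the contract without reading the body.
    public char ScanWithContract()
    {
        Debug.Assert(_pos < _pattern.Length, "The current reading position must not be at the end of the pattern");
        return _pattern[_pos++];
    }

    // Without the assert: a caller that violates the contract still fails,
    // but only via IndexOutOfRangeException from the indexer.
    public char ScanWithoutContract() => _pattern[_pos++];
}
```

Both versions fail when called at the end of the pattern; the difference is whether the expectation is documented in the code or only implied by the indexer.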

danmoseley · 2023-07-14T12:28:07Z

Test failures are unrelated.

stephentoub · 2023-07-17T13:34:58Z

> I ran the real world patterns corpus through the parser and didn't notice anything that could be obviously improved except for one dictionary which was almost always resizing. Set size to eliminate resizing in 90% of cases.

I'm not seeing it... which line is the change to presize the dictionary?

danmoseley · 2023-07-17T17:24:14Z

> I'm not seeing it... which line is the change to presize the dictionary?

Thanks for spotting that -- it got lost in a rebase, apparently:
https://github.com/dotnet/runtime/compare/main...danmoseley:regex.presize?expand=1

stephentoub · 2023-07-17T19:08:50Z

> Thanks for spotting that -- it got lost in a rebase, apparently: https://github.com/dotnet/runtime/compare/main...danmoseley:regex.presize?expand=1

_stringTable = new Dictionary<string, int>(15); // avoids resize for 90% of real world patterns

How much of an overallocation is it for those? i.e. we could probably avoid it for 100% of the real-world patterns if we made it new Dictionary<string, int>(1_000_000), but that's obviously a bad answer.

Said another way, what does the histogram look like for number of strings involved in the various patterns?

danmoseley · 2023-07-17T20:07:20Z

https://gist.github.com/danmoseley/eeb38412d74c3eb22bc69472940b1f95

Note that the horizontal scale starts skipping values around 60.

danmoseley · 2023-07-17T20:08:48Z

For something like this, it would be nice if the dictionary could start with on-stack buffers.

danmoseley · 2023-07-17T20:16:22Z

If the dictionary starts at zero and, when full, resizes to the next prime at least 2x its size, then it will have sizes 0, 3, 7, 17, 37, 89.

61% would fit in 3, 80% in 7 and 96% in 17. Only 5% are zero and would not allocate at all.

Strings	Patterns	Cumulative fraction	Capacity needed
0	12647	0.05	3
1	62735	0.31	3
2	43475	0.50	3
3	27852	0.61	3
4	17408	0.69	7
5	11870	0.74	7
6	8788	0.77	7
7	7623	0.80	7
8	7035	0.83	17
9	6381	0.86	17
10	5069	0.88	17
11	4178	0.90	17
12	3397	0.91	17
13	3023	0.93	17
14	2840	0.94	17
15	2243	0.95	17
16	2103	0.96	17
17	1800	0.96	17
18	1169	0.97	37
19	1360	0.97	37
20	1157	0.98	37
21	925	0.98	37
22	828	0.99	37
23	468	0.99	37
24	425	0.99	37
25	314	0.99	37
26	264	0.99	37
27	240	0.99	37
28	293	0.99	37
29	188	0.99	37
30	38	0.99	37
31	86	1.00	37
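
The trade-off behind these numbers can be sketched in code. The capacity steps below are copied from the comment above rather than derived from the runtime source, and `GrowthSketch` and its methods are hypothetical names. The example also assumes that a requested capacity of 15 rounds up to 17, the next size in that sequence.

```csharp
using System;

// Hypothetical model of the growth steps quoted in the thread.
public static class GrowthSketch
{
    // Capacity steps exactly as quoted above; counts beyond 89 are out of scope.
    private static readonly int[] Steps = { 0, 3, 7, 17, 37, 89 };

    // Smallest quoted capacity that holds `count` entries, or -1 if none does.
    public static int CapacityNeeded(int count)
    {
        foreach (int step in Steps)
        {
            if (count <= step) return step;
        }
        return -1;
    }

    // Number of backing-store allocations needed to hold `count` entries
    // when the dictionary starts at `startCapacity` (0 means not yet allocated).
    public static int Allocations(int count, int startCapacity)
    {
        int allocations = startCapacity > 0 ? 1 : 0;
        int capacity = startCapacity;
        foreach (int step in Steps)
        {
            if (capacity >= count) break;
            if (step > capacity)
            {
                capacity = step;
                allocations++;
            }
        }
        return allocations;
    }
}
```

Under this model, a pattern with 2 strings allocates once either way, but gets 3 slots when starting from zero and 17 slots when presized. A pattern with 10 strings allocates three times (3, 7, 17) from zero and once when presized. Presizing therefore saves copies for the larger patterns at the cost of extra space for the 61% that would fit in 3.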

stephentoub · 2023-07-17T20:17:03Z

> 61% would fit in 3, 80% in 7 and 96% in 17. Only 5% are zero and would not allocate at all.

Right. So presizing to 15 would overallocate for the majority.

danmoseley · 2023-07-17T20:22:29Z

I guess it depends how you trade off a small transient allocation with a small CPU cost of copying. I do not think this is a hot path when aggregated with actually running the interpreter. But it seemed that we have some hard data here and it was easy to avoid that CPU cost in 90% of cases. I don't mind either way.

stephentoub · 2023-07-17T20:25:52Z

Thanks.

> But it seemed that we have some hard data here and it was easy to avoid that CPU cost in 90% of cases.

There's CPU cost associated with overallocation as well.

I don't think presizing it to 15 is the right tradeoff.

dotnet-issue-labeler bot added the area-System.Text.RegularExpressions label Jul 9, 2023

ghost assigned danmoseley Jul 9, 2023

This was referenced Jul 10, 2023

Intermittent build failure in AfterSourceBuild: "Could not write state file" #76488

Open

LibraryImportGenerator.Unit.Tests crashed in CI #87951

Closed

danmoseley force-pushed the regex.presizedict branch from 446dc42 to 87951a3 Compare July 10, 2023 01:57

danmoseley added 15 commits July 9, 2023 20:59

inline stack

378c0b0

Inline CaptureSlotFromName()

81a1353

Inline HexDigit(c)

014ea9a

lint

d5608a4

inline OptionFromCode

af5ded3

inline IsCaptureName()

03ae945

build break

0268658

IndexOf in ScanReplacement()

1a13b44

Remove unnecessary clause

9f63918

Inline AddConcatenate(bool lazy, int min, int max)

1c1c49f

Change to IndexOf in 2 places

c65534f

Use StartsWith in 1 place

53a1fbe

remove low value assert

7883ae2

low value asserts

fbdad9b

minor simplify

5251fa4

danmoseley force-pushed the regex.presizedict branch from 87951a3 to 5251fa4 Compare July 10, 2023 02:01

Inline AddUnitToConcatenate()

5404f7f

danmoseley closed this Jul 10, 2023

revert starstwith

70bdb4f

danmoseley reopened this Jul 10, 2023

Update RegexParser.cs

550b255

build-analysis bot mentioned this pull request Jul 10, 2023

System.Text.RegularExpressions.Tests.AttRegexTests test fails in CI #75808

Closed

Update RegexParser.cs

7271a68

build-analysis bot mentioned this pull request Jul 14, 2023

Failed USB connection via port 54050, error 61, in tvOS arm64 Release AllSubsets_Mono #82637

Open

danmoseley merged commit 24f5cbb into dotnet:main Jul 14, 2023