.NET 8 Per-Preview Performance report on WASM, Mono AOT, and Interpreter #84302

kotlarmilos · 2023-04-04T15:04:52Z

This report provides an overview of the major performance improvements and regressions in WASM, Mono AOT, and Interpreter across the .NET 8 preview releases. It focuses on notable improvements and regressions that are either being addressed or under investigation; these are tracked separately. Reports #77490 and #79288 track active speed and size regressions, respectively.

A full benchmark report will be available in a form similar to #79245 and https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7/ when .NET 8 is released.

Setup

According to https://github.com/dotnet/perf-autofiling-issues, the following configurations are used.

Operating System	Architecture	Processor Name
macOS 13.0	Arm64	Apple M1
Ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz

More details on .NET performance benchmarking are available at https://github.com/dotnet/performance.

Preview 7

The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Mono AOT compiler

The performance regressions and improvements are analyzed separately in #89238.

Mono Interpreter

The following sections present improvements and regressions introduced in the Interpreter in Preview 7.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 7.

Name	Baseline Value	Compare Value	% Difference
PerfLabTests.EnumPerf.EnumEquals	646.25	229.29	-64.52
System.Tests.Perf_Enum.ToString_NonFlags_Small(value: TopDirectoryOnly)	633.28	235.90	-62.74
System.Tests.Perf_Enum.ToString_Format_Flags_Large(value: All, format: "g")	667.24	271.04
System.Reflection.Attributes.IsDefinedClassHitInherit	1315.59	562.93	-57.21
System.Reflection.Activator<EmptyStruct>.CreateInstanceGeneric	721.39	330.82	-54.14
System.Numerics.Tests.Perf_Vector4.SubtractOperatorBenchmark	20.82	9.59	-53.92
System.Reflection.Invoke.Method0_NoParms	853.86	399.59	-53.20
System.Numerics.Tests.Perf_Matrix4x4.CreateRotationZBenchmark	78.54	40.02	-49.03
System.Reflection.Attributes.IsDefinedMethodBaseMissInherit	2512.81	1431.26	-43.04
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByScalarBenchmark	183.31	106.83	-41.71
System.Tests.Perf_Enum.InterpolateIntoStringBuilder_Flags(value: 32)	7501.15	4383.76	-41.55
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark	189.92	111.79	-41.13
System.IO.Tests.Perf_RandomAccess.ReadScatter(fileSize: 1048576, buffersSize: 16384, options: None)	400115.22
System.Numerics.Tests.Perf_Matrix4x4.CreateRotationXWithCenterBenchmark	90.04	60.34	-32.98
System.Globalization.Tests.StringSearch.IsSuffix_DifferentLastChar(Options: (en-US, IgnoreCase, True))	1024.28
System.Tests.Perf_Enum.StringFormat(value: Red, Green)	7002.80	4942.10
System.Tests.Perf_Enum.ToString_Flags(value: Red, Orange, Yellow, Green
System.Numerics.Tests.Perf_VectorOf<Byte>.AddBenchmark	11.28	8.19	-27.44
System.Numerics.Tests.Perf_Vector4.DivideByScalarBenchmark	30.25	21.97	-27.36
System.Numerics.Tests.Perf_Vector2.EqualsBenchmark	35.85	27.68	-22.78
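
The % Difference column in these tables appears to be the relative change of the compare value against the baseline. The following is a minimal sketch under that assumption; the formula is inferred from the rows above rather than stated by the benchmark tooling, and `percent_difference` is a hypothetical helper name. Because the displayed values are rounded to two decimals, a recomputed percentage can differ slightly from the reported one.

```python
def percent_difference(baseline: float, compare: float) -> float:
    """Relative change of compare against baseline, in percent (assumed formula)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (compare - baseline) / baseline * 100.0


# (name, baseline value, compare value, reported % difference) copied from the table above.
rows = [
    ("PerfLabTests.EnumPerf.EnumEquals", 646.25, 229.29, -64.52),
    ("System.Reflection.Attributes.IsDefinedClassHitInherit", 1315.59, 562.93, -57.21),
]

for name, baseline, compare, reported in rows:
    computed = percent_difference(baseline, compare)
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f}")
```

A negative value means the compare run was faster than the baseline (an improvement); a positive value means it was slower (a regression).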

Vectorization of Vector4 in #87822 improved over 100 microbenchmarks in dotnet/perf-autofiling-issues#19758 and dotnet/perf-autofiling-issues#19760.

A fix to the empty-partition path in Enumerable.Select in #88425 improved the EmptyTakeSelectToArray microbenchmarks, as reported in dotnet/perf-autofiling-issues#19761.

Faster BigInteger operators +, -, and * for trivial cases in #84733 improved some BigInteger microbenchmarks in dotnet/perf-autofiling-issues#19762.

Precomputing the CallInfo structure in #88369 improved about 200 microbenchmarks.

The BCL change #86287 and vectorization of Vector128 in #88064 improved a dozen Equals microbenchmarks.

Regressions

Here is a list of top 20 regressed microbenchmarks in Preview 7.

Name	Baseline Value	Compare Value	% Difference
System.Collections.CtorFromCollection<String>.FrozenDictionary(Size: 512)	44266.49	396363.53	795.40
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.EqualsAllBenchmark	6.90	9.58	38.82
Microsoft.Extensions.DependencyInjection.TimeToFirstService.Scoped(Mode: "Expressions")	49567.25	65031.35	31.19
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.BitwiseOrOperatorBenchmark	9.62	12.45	29.41
System.Numerics.Tests.Perf_VectorOf<SByte>.OnesComplementOperatorBenchmark	6.04	7.80	29.23
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.AllBitsSetBenchmark	2.04	2.61	28.32
System.Tests.Perf_GC<Byte>.NewOperator_Array(length: 10000)	4495.94	5733.46	27.52
System.Memory.Span<Char>.SequenceEqual(Size: 33)	85.83	108.56	26.49
System.Numerics.Tests.Perf_VectorOf<Single>.AddOperatorBenchmark	7.67	9.58	24.98
Microsoft.Extensions.DependencyInjection.TimeToFirstService.Scoped(Mode: "ILEmit")	49928.88	62377.01	24.93
System.Memory.Constructors<String>.SpanFromArray	15.59	19.40	24.46
Microsoft.Extensions.DependencyInjection.ScopeValidation.TransientWithScopeValidation	1815.08	2227.85	22.74
System.Numerics.Tests.Perf_VectorOf<Int64>.EqualityOperatorBenchmark	6.56	7.77	18.48
System.IO.Tests.Perf_File.CopyToOverwrite(size: 4096)	47118.52	55507.12	17.80
System.Tests.Perf_Decimal.TryParse(value: "123456.789")	895.48	1023.98	14.34
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.AllBitsSetBenchmark	1.48	1.69	14.11
System.Numerics.Tests.Perf_VectorOf<UInt16>.AndNotBenchmark	9.16	10.44	13.96
System.Memory.Span<Byte>.IndexOfValue(Size: 33)	58.20	65.95	13.31
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.BitwiseOrOperatorBenchmark	7.62	8.61	12.96
System.Tests.Perf_Int32.ParseSpan(value: "2147483647")	206.91	233.69	12.94

Preview 6

The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Mono AOT WASM

The following sections present improvements and regressions introduced in Mono AOT WASM in Preview 6.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 6.

Name	Baseline Value	Compare Value	% Difference
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark	0.38	0.00	-100
System.Numerics.Tests.Perf_Quaternion.NegationOperatorBenchmark	1.87	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.CountBenchmark	0.34	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark	0.22	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.InequalityOperatorBenchmark	0.97	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.CountBenchmark	0.29	0.00	-100
System.Tests.Perf_Enum.HasFlag	1.35	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualityOperatorBenchmark	2.28	0.01	<
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark	0.22	0.00	-99.57
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.GreaterThanAllBenchmark	2.50	0.02	-99.35
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.UnaryNegateOperatorBenchmark	85.94	2.58	-97.00
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.UnaryNegateOperatorBenchmark	85.93	4.27	-95.02
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.UnaryNegateOperatorBenchmark	85.94	4.30	-94.99
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.UnaryNegateOperatorBenchmark	85.93	4.35	-94.94
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.LessThanOrEqualBenchmark	2.91	0.26	-91.04
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualityOperatorBenchmark	2.26	0.25	-88.80
System.Numerics.Tests.Perf_Vector3.UnitZBenchmark	3.84	0.54	-85.93
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.BitwiseAndBenchmark	4.07	0.69	-83.07
System.Runtime.Intrinsics.Tests.Perf_Vector128.FloorFloatBenchmark	20.82	3.59	-82.73
System.Net.Primitives.Tests.IPAddressPerformanceTests.TryWriteBytes(address: 1020:3040:5060:7080:9010:1112:1314:1516)	78.86	13.78	-82.52

Regressions

Here is a list of top 20 regressed microbenchmarks in Preview 6.

Name	Baseline Value	Compare Value	% Difference
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.CountBenchmark	0.00	0.14	26004.19
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.CountBenchmark	0.00	0.07	12106.45
System.Numerics.Tests.Perf_VectorOf<Double>.CountBenchmark	0.09	3.36	3767.73
System.Numerics.Tests.Perf_VectorOf<Single>.CountBenchmark	0.00	0.06	2106.86
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.AllBitsSetBenchmark	1.95	10.77	452.08
System.Numerics.Tests.Perf_VectorOf<Single>.CountBenchmark	0.00	0.01	405.57
System.Numerics.Tests.Perf_VectorOf<UInt16>.MaxBenchmark	0.75	3.50	365.24
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.DotBenchmark	0.87	3.58	312.42
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanOrEqualBenchmark	0.92	3.67	300.46
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanOrEqualBenchmark	0.92	3.55	286.90
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.DotBenchmark	0.78	2.61	236.42
System.Numerics.Tests.Perf_VectorOf<SByte>.OnesComplementOperatorBenchmark	0.75	2.51	236.33
System.Numerics.Tests.Perf_VectorOf<SByte>.BitwiseOrBenchmark	2.62	8.52	225.70
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.ZeroBenchmark	2.00	5.96	198.55
System.Numerics.Tests.Perf_VectorOf<Int64>.ZeroBenchmark	1.98	5.88	196.21
System.Numerics.Tests.Perf_VectorOf<UInt16>.MultiplyBenchmark	3.10	9.12	194.26
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualsBenchmark	0.98	2.75	180.71
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualsBenchmark	0.98	2.69	174.16
System.Numerics.Tests.Perf_VectorOf<SByte>.UnaryNegateOperatorBenchmark	1.08	2.80	159.06
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MinBenchmark	2.70	6.92	156.32
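
Several rows above show a baseline of 0.00 next to a very large % Difference. One plausible reading, assuming the percentage is computed as (compare - baseline) / baseline × 100 on unrounded measurements, is that the baseline is a small non-zero value that only displays as 0.00 after rounding. The sketch below inverts that assumed formula to estimate the underlying baseline; `implied_baseline` is a hypothetical helper, and the result is only approximate because the compare value is itself rounded.

```python
def implied_baseline(compare: float, percent_diff: float) -> float:
    """Invert percent_diff = (compare - baseline) / baseline * 100 for baseline."""
    factor = 1.0 + percent_diff / 100.0
    if factor == 0:
        raise ValueError("a -100% difference leaves the baseline undetermined")
    return compare / factor


# Perf_Vector128Of<Single>.CountBenchmark rows from the table above:
# (compare value, reported % difference).
for compare, pct in [(0.14, 26004.19), (0.07, 12106.45)]:
    estimate = implied_baseline(compare, pct)
    print(f"compare={compare:.2f}, reported={pct:.2f}% -> baseline ~ {estimate:.5f}")
```

Both estimates come out well below 0.005, which would round to 0.00 at two decimals. Percentages on such near-zero measurements mostly reflect noise, so these rows deserve less weight than changes on benchmarks with larger absolute values.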

Mono AOT compiler

The performance regressions and improvements are analyzed separately in #89238.

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 6.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 6.

Name	Baseline Value	Compare Value	% Difference
System.Numerics.Tests.Perf_VectorOf<Double>.CountBenchmark	0.00	0.00	-100
System.Numerics.Tests.Perf_VectorOf<Int32>.CountBenchmark	0.02	0.00	-100
System.Numerics.Tests.Perf_VectorOf<UInt32>.CountBenchmark	0.00	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.CountBenchmark	0.40	0.00	-100
System.Numerics.Tests.Perf_VectorOf<SByte>.OneBenchmark	76.06	1.57	-97.93
System.Numerics.Tests.Perf_VectorOf<Byte>.OneBenchmark	76.01	1.87	-97.53
System.Numerics.Tests.Perf_VectorOf<SByte>.NegateBenchmark	221.32	6.26	-97.16
System.Numerics.Tests.Perf_VectorOf<SByte>.UnaryNegateOperatorBenchmark	221.61	6.27	-97.16
System.Numerics.Tests.Perf_VectorOf<Byte>.UnaryNegateOperatorBenchmark	214.44	6.20	-97.10
System.Numerics.Tests.Perf_VectorOf<Byte>.NegateBenchmark	214.55	6.37	-97.02
System.Numerics.Tests.Perf_VectorOf<SByte>.SubtractBenchmark	231.29	7.90	-96.58
System.Numerics.Tests.Perf_VectorOf<SByte>.SubtractionOperatorBenchmark	221.04	7.90	-96.42
System.Numerics.Tests.Perf_VectorOf<UInt16>.OneBenchmark	50.92	1.83	-96.41
System.Numerics.Tests.Perf_VectorOf<Byte>.AddBenchmark	216.21	7.83	-96.37
System.Numerics.Tests.Perf_VectorOf<Byte>.SubtractBenchmark	214.79	7.79	-96.37
System.Numerics.Tests.Perf_VectorOf<Byte>.SubtractionOperatorBenchmark	215.60	7.92	-96.32
System.Numerics.Tests.Perf_VectorOf<SByte>.MultiplyOperatorBenchmark	225.86	8.35	-96.30
System.Numerics.Tests.Perf_VectorOf<Byte>.AddOperatorBenchmark	209.41	7.95	-96.20
System.Numerics.Tests.Perf_VectorOf<SByte>.MultiplyBenchmark	217.21	8.39	-96.13
System.Numerics.Tests.Perf_VectorOf<SByte>.AddOperatorBenchmark	214.44	8.33	-96.11

Vectorization of Vector<T> operators in dotnet/perf-autofiling-issues#18537 improved over 200 microbenchmarks.

Changes in #87219 introduced Math.BigMul in the NextUInt64 random method and improved several microbenchmarks reported in dotnet/perf-autofiling-issues#18690.

About 120 microbenchmarks were improved in dotnet/perf-autofiling-issues#19027, potentially by #87555 or by other interpreter and BCL changes.

Frozen dictionary creation is improved by 72% in #87510.

Regressions

Here is a list of top 20 regressed microbenchmarks in Preview 6.

Name	Baseline Value	Compare Value	% Difference
System.Numerics.Tests.Perf_VectorOf<Int64>.CountBenchmark	0.01	0.23	2775.54
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark	0.01	0.17	2177.17
System.Numerics.Tests.Perf_VectorOf<UInt16>.ZeroBenchmark	2.24	4.95	121.29
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualityOperatorBenchmark	7.65	16.63	117.46
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.OnesComplementOperatorBenchmark	3.03	6.11	101.75
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark	0.04	0.08	86.25
System.Numerics.Tests.Perf_VectorOf<UInt64>.GreaterThanAllBenchmark	18.37	33.12	80.26
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get_EnumerateHeaders_Validated(ssl: True, chunkedResponse: False, responseLength: 100000)	2230622.93	3965252.94	77.76
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.CountBenchmark	0.12	0.20	69.81
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get(ssl: True, chunkedResponse: False, responseLength: 100000)	2181340.94	3635706.61	66.67
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanOrEqualAnyBenchmark	18.27	30.07	64.56
System.Numerics.Tests.Perf_Vector4.ZeroBenchmark	1.36	2.10	55.23
HardwareIntrinsics.RayTracer.SoA.Render	1.15	1.76	52.81
System.Numerics.Tests.Perf_Vector2.DivideByScalarBenchmark	13.77	20.17	46.46
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get(ssl: True, chunkedResponse: True, responseLength: 100000)	2621801.93	3807493.79	45.22
System.Runtime.Intrinsics.Tests.Perf_Vector128.ConvertDoubleToLongBenchmark	64.48	89.74	39.17
System.Linq.Tests.Perf_Enumerable.WhereSingleOrDefault_LastElementMatches(input: Array)	2714.67	3708.23	36.59
System.Memory.Constructors_ValueTypesOnly<Byte>.SpanFromPointerLength	6.95	9.47	36.28
Span.IndexerBench.CoveredIndex3(length: 1024)	16595.22	22106.92	33.21
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False)	867.68	1154.02	33.00

Preview 5

Preview 5 introduced a number of improvements worth calling out individually. The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Mono AOT compiler

The performance regressions and improvements are analyzed separately in #89238.

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 5.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 5.

Name	Baseline Value	Compare Value	% Difference
System.Numerics.Tests.Perf_VectorOf<Single>.CountBenchmark	0.18	0.00	-100
System.Numerics.Tests.Perf_VectorOf<UInt16>.CountBenchmark	0.10	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.CountBenchmark	0.01	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.CountBenchmark	0.03	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.CountBenchmark	1.12	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.CountBenchmark	0.22	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.CountBenchmark	0.08	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.CountBenchmark	0.48	0.00	-99.74
System.Numerics.Tests.Perf_VectorOf<UInt32>.CountBenchmark	0.14	0.00	-99.30
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.CountBenchmark	2.36	0.12	-95.07
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.DivideBenchmark	127.11	7.82	-93.85
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MultiplyOperatorBenchmark	123.89	7.68	-93.80
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MultiplyBenchmark	126.45	7.94	-93.71
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MultiplyOperatorBenchmark	125.08	7.87	-93.70
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.DivisionOperatorBenchmark	123.79	7.83	-93.67
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.DivideBenchmark	126.19	8.05	-93.62
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MultiplyBenchmark	127.05	8.23	-93.52
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.DivisionOperatorBenchmark	123.95	8.22	-93.37
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark	0.06	0.01	-86.49
System.Collections.Tests.Perf_Dictionary.ContainsValue(Items: 3000)	483385521.57	66414495.75	-86.26

Vectorization of IndexOf in #85437 improved System.Text.RegularExpressions microbenchmarks reported in dotnet/perf-autofiling-issues#17517. Addition of Vector128 and PackedSimd in #82773 improved about 70 microbenchmarks reported in dotnet/perf-autofiling-issues#17563 and dotnet/perf-autofiling-issues#17819.

Change in Plane and Quaternion improved several microbenchmarks in dotnet/perf-autofiling-issues#18043.

The change in #85528 addressed performance problems with code like EqualityComparer<T>.Default.Equals(), which improved over 200 microbenchmarks reported in dotnet/perf-autofiling-issues#18349. Implementation of the float32 Vector128.Equals intrinsic improved System.Numerics.Tests microbenchmarks.

Regressions

Here is a list of top 20 regressed microbenchmarks in Preview 5.

Name	Baseline Value	Compare Value	% Difference
System.Numerics.Tests.Perf_Vector2.ZeroBenchmark	0.03	1.05	3076.49
System.Numerics.Tests.Perf_VectorOf<Double>.ZeroBenchmark	2.96	9.10	207.86
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.BitwiseOrOperatorBenchmark	8.51	21.64	154.37
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.GreaterThanOrEqualAnyBenchmark	24.29	47.23	94.44
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.InequalityOperatorBenchmark	3.94	7.15	81.24
System.Numerics.Tests.Perf_Plane.CreateFromVerticesBenchmark	76.92	132.40	72.12
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.ConditionalSelectBenchmark	11.14	17.45	56.64
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: False)	1877.78	2918.99	55.44
System.Diagnostics.Perf_Process.StartAndWaitForExit	1286337.51	1968645.19	53.04
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanAllBenchmark	24.23	36.78	51.79
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.ZeroBenchmark	2.99	4.47	49.41
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.SubtractionOperatorBenchmark	7.62	11.13	45.99
System.Memory.Span<Char>.Reverse(Size: 512)	789.89	1116.00	41.28
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: False)	1963.38	2745.38	39.82
System.Numerics.Tests.Perf_VectorOf<Single>.LessThanAllBenchmark	59.72	82.75	38.57
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualityOperatorBenchmark	27.40	37.64	37.35
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (, None, False))	6382.39	8678.93	35.98
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.OnesComplementBenchmark	6.38	8.61	34.98
System.Numerics.Tests.Perf_VectorOf<Int64>.ZeroBenchmark	2.81	3.78	34.72
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanOrEqualAllBenchmark	26.61	35.79	34.51

Preview 4

Preview 4 introduced a number of improvements worth calling out individually. The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Mono AOT compiler

The following sections present improvements and regressions introduced in the Mono AOT compiler in Preview 4.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 4.

Name	Baseline Value	Compare Value	% Difference
System.Numerics.Tests.Perf_VectorOf<SByte>.CountBenchmark	0.01	0.00	-100
System.Numerics.Tests.Perf_VectorOf<UInt16>.CountBenchmark	0.01	0.00	-100
System.Numerics.Tests.Perf_VectorOf<UInt32>.CountBenchmark	0.01	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.CountBenchmark	0.01	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark	0.01	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.CountBenchmark	0.01	0.00	-100
System.Tests.Perf_DateTime.ToString(format: "s")	417.41	103.88	-75.11
System.Tests.Perf_DateTimeOffset.ToString(format: "s")	431.57	114.37	-73.49
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 100000)	25903.87	7803.06	-69.87
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 10000)	25653.57	7923.08	-69.11
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 10000000)	24916.24	7700.13	-69.09
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 1000000)	25328.88	7962.83	-68.56
System.Collections.Tests.Add_Remove_SteadyState<Int32>.Queue(Count: 512)	18.37	8.31	-54.78
System.Threading.Tests.Perf_Volatile.Read_double	0.26	0.12	-53.92
System.Numerics.Tests.Perf_VectorOf<Byte>.ZeroBenchmark	5.66	2.67	-52.77
System.Net.Primitives.Tests.IPAddressPerformanceTests.TryFormat(address: 1020:3040:5060:7080:9010:1112:1314:1516)	243.27	128.93	-46.99
System.Numerics.Tests.Perf_Vector3.DistanceSquaredBenchmark	16.92	9.15	-45.90
System.Numerics.Tests.Perf_Vector3.DistanceBenchmark	23.13	13.70	-40.79
PerfLabTests.EnumPerf.ObjectGetType	0.03	0.02	-38.31
System.Numerics.Tests.Perf_Vector3.DivideByVector3OperatorBenchmark	17.44	10.91	-37.47

BCL changes in #84210 and #84210 improved Guid.Parse and vectorized all sets in Regex, as reported in dotnet/perf-autofiling-issues#15183 and dotnet/perf-autofiling-issues#15177.

Implementation of fast path for mini_init_method_rgctx in #84226 improved over 50 microbenchmarks reported in dotnet/perf-autofiling-issues#15717, dotnet/perf-autofiling-issues#15796, and dotnet/perf-autofiling-issues#15799.

Intrinsics get_Count and get_AllBitsSet on arm64 improved around 400 microbenchmarks, as reported in dotnet/perf-autofiling-issues#15800, dotnet/perf-autofiling-issues#15718, and dotnet/perf-autofiling-issues#15797.

Allowing inlining of methods that contain constructor calls and intrinsifying additional calls to Type:op_Equality improved over 100 microbenchmarks reported in dotnet/perf-autofiling-issues#16371 and dotnet/perf-autofiling-issues#16509.

V128 SIMD intrinsics on Arm64 across all codegen engines in #84289 improved over 400 microbenchmarks reported in dotnet/perf-autofiling-issues#16460, dotnet/perf-autofiling-issues#16621, and dotnet/perf-autofiling-issues#16660. Adding Vector128.ConvertXX and Vector128.Create as intrinsics on arm64 improved 48 microbenchmarks reported in dotnet/perf-autofiling-issues#17314 and in dotnet/perf-autofiling-issues#17315.

Making Guid.HexsToChars aggressively inlined in #85322 improved a couple of microbenchmarks.

Regressions

Here is a list of top 20 regressed microbenchmarks in Preview 4.

Name	Baseline Value	Compare Value	% Difference
System.Tests.Perf_String.Substring_IntInt(s: "dzsdzsDDZSDZSDZSddsz", i1: 7, i2: 4)	23.92	42.38	77.13
System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0)	14.05	23.66	68.37
System.Buffers.Text.Tests.Utf8FormatterTests.FormatterInt32(value: 4)	13.98	22.92	64.00
Benchstone.BenchI.IniArray.Test	186909527.87	304502098.85	62.91
Span.IndexerBench.Ref(length: 1024)	686.54	1110.42	61.74
System.Tests.Perf_Int64.TryParse(value: "9223372036854775807")	58.15	93.40	60.60
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.DivideBenchmark	23.30	37.16	59.44
System.Tests.Perf_Int64.TryParse(value: "-9223372036854775808")	59.06	93.58	58.45
System.Tests.Perf_Int64.TryParseSpan(value: "9223372036854775807")	59.71	93.89	57.26
System.Buffers.Binary.Tests.BinaryReadAndWriteTests.MeasureReverseUsingNtoH	1432.42	2191.50	52.99
System.Tests.Perf_Int64.TryParseSpan(value: "-9223372036854775808")	61.80	94.18	52.39
System.Threading.Tests.Perf_Volatile.Write_double	0.23	0.35	52.13
System.Numerics.Tests.Perf_VectorOf<Int32>.EqualsBenchmark	0.81	1.23	50.47
System.Tests.Perf_String.Trim(s: "Test ")	76.12	113.79	49.48
System.Tests.Perf_UInt16.Parse(value: "12345")	35.63	52.72	47.98
System.Tests.Perf_Int64.Parse(value: "-9223372036854775808")	62.30	91.72	47.22
System.Tests.Perf_UInt64.Parse(value: "18446744073709551615")	70.51	103.27	46.44
System.Tests.Perf_Int64.Parse(value: "9223372036854775807")	61.62	90.17	46.34
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.SumBenchmark	2.76	3.99	44.34
System.Collections.Tests.Perf_BitArray.BitArrayGet(Size: 512)	8039.61	11602.79	44.32

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 4.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 4.

Name	Baseline Value	Compare Value	% Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.CountBenchmark	0.00	0.00	-100
System.Numerics.Tests.Perf_VectorOf<Int16>.CountBenchmark	0.18	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.CountBenchmark	0.16	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.CountBenchmark	1.29	0.00	-100
System.Numerics.Tests.Perf_VectorOf<SByte>.CountBenchmark	0.20	0.00	-99.20
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.CountBenchmark	0.07	0.00	-95.73
System.Tests.Perf_DateTime.ToString(format: "s")	2233.23	281.76	-87.38
System.Text.Json.Serialization.Tests.ColdStartSerialization<SimpleStructWithProperties>.NewJsonSerializerContext	185975.98	28969.63	-84.42
System.Tests.Perf_DateTimeOffset.ToString(format: "s")	2311.74	385.39	-83.32
System.Numerics.Tests.Perf_VectorOf<Int32>.CountBenchmark	0.44	0.10	-77.43
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 10000000)	45039.52	12494.67	-72.25
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 10000)	44649.63	12502.98	-71.99
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 1000000)	45124.15	13007.76	-71.17
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 100000)	44604.36	13258.02	-70.27
System.Reflection.Invoke.Ctor0_NoParams	393.98	123.35	-68.69
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.CountBenchmark	0.00	0.00	-68.38
System.Tests.Perf_DateTimeOffset.ToString(format: null)	6639.43	2509.03	-62.21
System.Reflection.Activator<EmptyClass>.CreateInstanceGeneric	575.27	221.73	-61.45
System.Tests.Perf_DateTimeOffset.ToString(value: 12/30/2017 3:45:22 AM -08:00)	6959.23	2746.69	-60.53
System.Memory.ReadOnlySpan.Trim(input: "")	49.19	19.80	-59.73

Implementation of IUtf8SpanFormattable in #84469 caused both improvements and regressions, as reported in dotnet/perf-autofiling-issues#15630 and dotnet/perf-autofiling-issues#15626. DateTime{Offset} formatting changes improved about 120 microbenchmarks reported in dotnet/perf-autofiling-issues#17009. PR #85288 improved about 30 microbenchmarks reported in dotnet/perf-autofiling-issues#17245. Handling Utf8Formatter.TryFormat and then delegating to the relevant helpers in #85277 improved about 30 microbenchmarks.

Regressions

Here is a list of top 20 regressed microbenchmarks in Preview 4.

Name	Baseline Value	Compare Value	% Difference
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.CountBenchmark	0.00	0.23	9893.94
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark	0.02	0.75	4216.78
System.Numerics.Tests.Perf_VectorOf<UInt32>.CountBenchmark	0.00	0.12	3988.20
Microsoft.Extensions.DependencyInjection.ActivatorUtilitiesBenchmark.Factory	276.60	852.40	208.17
System.Numerics.Tests.Perf_VectorOf<UInt64>.AbsBenchmark	2.32	4.51	94.06
System.Numerics.Tests.Perf_VectorOf<UInt16>.AbsBenchmark	2.37	4.34	83.29
System.Numerics.Tests.Perf_Vector2.ZeroBenchmark	0.44	0.78	78.01
System.Memory.Constructors<Byte>.ArrayAsSpan	12.20	21.63	77.34
Microsoft.Extensions.Primitives.Performance.StringValuesBenchmark.Indexer_FirstElement_String	8.60	14.85	72.68
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get(ssl: True, chunkedResponse: False, responseLength: 100000)	1903905.78	3227992.49	69.54
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.OnesComplementBenchmark	6.62	10.83	63.43
System.Buffers.Text.Tests.Utf8FormatterTests.FormatterDecimal(value: 123456.789)	491.42	801.06	63.00
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.OnesComplementOperatorBenchmark	6.29	10.12	60.75
Microsoft.AspNetCore.Server.Kestrel.Performance.PipeThroughputBenchmark.Parse_ParallelAsync(Length: 4096, Chunks: 1)	8112.10	12805.61	57.85
System.Memory.Constructors<Byte>.MemoryMarshalCreateReadOnlySpan	7.75	12.19	57.15
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.CountBenchmark	0.12	0.19	54.21
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.BitwiseAndBenchmark	8.47	12.73	50.32
System.Numerics.Tests.Constructor.ConstructorBenchmark_Int16	29.48	43.17	46.45
System.Numerics.Tests.Perf_VectorOf<UInt16>.InequalityOperatorBenchmark	19.53	27.98	43.23
System.Numerics.Tests.Perf_VectorOf<UInt64>.BitwiseOrBenchmark	39.39	55.74	41.51

Preview 3

The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Mono AOT compiler

The following sections present improvements and regressions introduced in the Mono AOT compiler in Preview 3.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 3.

Name	Baseline Value	Compare Value	% Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.CountBenchmark	0.01	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.CountBenchmark	0.01	0.00	-100
System.Tests.Perf_Boolean.ToString(value: True)	0.23	0.00	-100
System.Numerics.Tests.Perf_Vector4.EqualityOperatorBenchmark	1.96	0.80	-59.04
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.SumBenchmark	6.65	3.26	-50.93
System.Numerics.Tests.Perf_Vector4.InequalityOperatorBenchmark	1.39	0.74	-46.53
System.Tests.Perf_Enum.HasFlag	0.23	0.13	-44.47
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_uint	1096.23	667.83	-39.07
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_ulong	1102.75	746.09	-32.34
System.Numerics.Tests.Perf_BitOperations.Log2_ulong	1320.59	895.14	-32.21
System.Tests.Perf_String.IndexerCheckLengthHoisting	88.84	60.29	-32.13
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanOrEqualAllBenchmark	4.44	3.03	-31.65
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.SumBenchmark	4.02	2.76	-31.25
System.Numerics.Tests.Perf_VectorOf<SByte>.MinBenchmark	48.27	33.34	-30.93
Inlining.InlineGCStruct.WithFormat	2.86	1.99	-30.52
PerfLabTests.CastingPerf.ObjScalarValueType	108762.72	76497.64	-29.66
System.Numerics.Tests.Perf_VectorOf<Byte>.InequalityOperatorBenchmark	0.55	0.39	-29.07
Microsoft.Extensions.Primitives.StringSegmentBenchmark.Equals_Object_Invalid	2.86	2.04	-28.66
System.Numerics.Tests.Perf_VectorOf<UInt64>.EqualityOperatorBenchmark	0.52	0.37	-28.49
System.Numerics.Tests.Perf_VectorOf<UInt64>.InequalityOperatorBenchmark	0.62	0.45	-28.32

The most improved benchmark grouping is System.Numerics, as outlined in dotnet/perf-autofiling-issues#14023, dotnet/perf-autofiling-issues#14224, dotnet/perf-autofiling-issues#14573, and dotnet/perf-autofiling-issues#14322. The changes implemented in #82420, #83337, and #83094 introduced Arm64 SIMD operations and improved about 1000 microbenchmarks.
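
The report does not state how the % Difference column of the Preview 3 tables is derived. A minimal sketch, assuming it is the relative change from the baseline to the compare value (so a negative number means the compare run was faster), reproduces the printed figures to within a fraction of a percent. The function name `percent_difference` is hypothetical, and the small deviations are expected because the table values are already rounded to two decimals.

```python
def percent_difference(baseline: float, compare: float) -> float:
    """Relative change from baseline to compare, in percent.

    Assumed convention: negative means the compare run took less time
    (an improvement); positive means it took more time (a regression).
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (compare - baseline) / baseline * 100.0


# Baseline and compare values as printed in the table (already rounded).
rows = [
    ("Perf_Vector4.EqualityOperatorBenchmark", 1.96, 0.80),
    ("Perf_BitOperations.LeadingZeroCount_uint", 1096.23, 667.83),
    ("Perf_String.IndexerCheckLengthHoisting", 88.84, 60.29),
]

for name, baseline, compare in rows:
    print(f"{name}: {percent_difference(baseline, compare):.2f}%")
```

Rows whose baseline is printed as 0.00 or 0.01 cannot be recomputed this way, because almost all of the significant digits are lost in rounding.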

Regressions

Here is a list of top 20 regressed microbenchmarks in Preview 3.

Name	Baseline Value	Compare Value	% Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.ZeroBenchmark	2.65	5.66	113.78
System.Numerics.Tests.Perf_BitOperations.Log2_uint	791.53	1539.09	94.44
System.Collections.Tests.Add_Remove_SteadyState<Int32>.Queue(Count: 512)	9.64	18.37	90.64
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_HeavyEscaping(NumberOfBytes: 1000)	2769.97	5142.05	85.63
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_NoEscaping(NumberOfBytes: 1000)	2771.03	5139.62	85.47
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_HeavyEscaping(NumberOfBytes: 100)	377.30	646.53	71.35
System.Numerics.Tests.Perf_BitOperations.PopCount_uint	668.42	1104.04	65.17
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_NoEscaping(NumberOfBytes: 100)	377.61	598.53	58.50
System.Threading.Tests.Perf_Volatile.Read_double	0.16	0.26	57.96
System.Memory.Span<Char>.Reverse(Size: 512)	258.69	407.47	57.51
PerfLabTests.LowLevelPerf.StructWithInterfaceInterfaceMethod	154024.04	239168.34	55.27
System.Text.Json.Tests.Perf_Segment.ReadSingleSegmentSequenceByN(numberOfBytes: 8192, TestCase: Json4KB)	13635.35	20935.97	53.54
System.Text.Json.Tests.Perf_Reader.ReadSpanEmptyLoop(IsDataCompact: True, TestCase: Json4KB)	10415.86	15732.85	51.04
System.Text.Json.Tests.Perf_Reader.ReadSingleSpanSequenceEmptyLoop(IsDataCompact: True, TestCase: Json4KB)	10436.16	15712.23	50.55
System.Numerics.Tests.Perf_VectorOf<Int32>.EqualityOperatorBenchmark	0.24	0.36	50.01
System.Collections.IndexerSetReverse.Array(Size: 512)	456.86	681.13	49.08
System.Collections.IndexerSet<Int32>.Span(Size: 512)	458.27	682.26	48.87
System.Numerics.Tests.Perf_VectorOf<Int64>.EqualityOperatorBenchmark	0.27	0.40	48.57
System.Numerics.Tests.Perf_BitOperations.PopCount_ulong	745.13	1102.84	48.00
System.Text.Json.Tests.Perf_Reader.ReadReturnBytes(IsDataCompact: False, TestCase: Json40KB)	158074.36	231420.75	46.39

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 3.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 3.

Name	Baseline Value	Compare Value	% Difference
System.Numerics.Tests.Perf_VectorOf<Single>.CountBenchmark	0.16	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.CountBenchmark	0.01	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.CountBenchmark	0.11	0.00	-100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.CountBenchmark	0.43	0.00	-100
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql1fehvs91yzkt9xam7ahjbhvpd9edll13ab46i74ktwwgkgbi792e5gkuuzevo5qm8qt83edag7zovoe686gmtw730kms2i5xgji4xcp25287q68fvhwszd3mszht2uh7bchlgkj5qnq1x9m4lg7vwn8cq5l756akua6oyx9k71bmxbysnmhvxvlxde4k9maumfgxd8gxhxx4mwpph2ttyox9zilt3ylv1q9s4bopfuoa8qlrzodg2q67sh85wx4slcd6w7ufnendaxai633ove2ktbaxdt2sz6y6mo42473xd274gz833p6hj3mu77c4m4od9e5s8btxleh0efqnu9zj9rwtbk5758lio35b3q426j5fwwq1qyknfedrsmqyfw1m38mkkotdf7n0vr6p3erhy8dkzntr9fwjrslxjgrbegih0n6bpb5bfuy55bu65ce9kejcfifxwpcs05umrsb8kvd64q2iwugbbi7vd35g5ho0rff9rhombgzzaniyq7bbjbqr88jyw4ccgnoyl31of3a5thv0vg08gnrqzxas800hewtw8tnwgw5pav81ntdpdd62689x3iqpc317y82b3e2trbpdzieoxldaz009tz37gqmh4bdp1bv9lnl5s58udb11z0h7i2sdl5nbyhjyfzxwzezmp4qx0i3eyvsd3fg8sryq9jhlvkonnfcvb4snl4mcbimdzg49tzdhqjmfxfcq3p1st6b9x2xyevo17evpqp4yc4f2rm0f26ivr3t2f5m0boc44vituxaovcqy1jrkcs6im2kdu3jvcexx2k76egve63aon5a6nbxss4rcke90npmqp35qluf571ms160y2nhaqef835wah41qru8tauu362v0r8konl8", oldChar: 'b', newChar: '+')	99861.87	2074.68	-97.92
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.CountBenchmark	2.79	0.07	-97.41
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.UnaryNegateOperatorBenchmark	234.80	6.26	-97.33
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.UnaryNegateOperatorBenchmark	246.33	6.63	-97.30
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.NegateBenchmark	235.81	6.49	-97.24
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.NegateBenchmark	235.54	6.56	-97.21
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark	3.10	0.09	-97.00
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanBenchmark	273.32	8.63	-96.84
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanBenchmark	273.20	8.91	-96.73
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualsStaticBenchmark	273.84	9.19	-96.64
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.SubtractBenchmark	247.26	8.65	-96.50
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanBenchmark	250.97	8.85	-96.47
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.SubtractBenchmark	244.27	8.76	-96.41
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MultiplyOperatorBenchmark	249.17	8.97	-96.40
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.AddBenchmark	238.40	8.67	-96.36
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.AddOperatorBenchmark	236.35	8.68	-96.32

The most improved benchmark groupings are System.Buffers, System.Collections, System.Memory, and System.Text, as outlined in dotnet/perf-autofiling-issues#14324, dotnet/perf-autofiling-issues#14325, dotnet/perf-autofiling-issues#14326, dotnet/perf-autofiling-issues#14355, dotnet/perf-autofiling-issues#14359, and dotnet/perf-autofiling-issues#14361. The changes implemented in #83498 and #83490 increased the inlining length limit from 20 to 30 and implemented shr.un.imm, which together improved over 1000 microbenchmarks.
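
For readers unfamiliar with the opcode, the following sketch illustrates the semantics of the IL `shr.un` instruction (an unsigned, zero-filling right shift) next to the sign-extending `shr`, on 32-bit values. It models only the arithmetic and is not the interpreter's implementation. The `.imm` suffix is read here as "shift count encoded as an immediate", which is an assumption based on the opcode name; the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_int32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def shr(value: int, count: int) -> int:
    """Arithmetic shift: the sign bit is copied into the vacated bits."""
    return to_int32(to_int32(value) >> count)


def shr_un(value: int, count: int) -> int:
    """Logical shift: zeros are shifted in from the left."""
    return to_int32((value & MASK32) >> count)


# Shift counts are assumed to be in the range 0..31.
x = -16
print(shr(x, 2))     # -4
print(shr_un(x, 2))  # 1073741820
```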

The change "Add vector horizontal sums on Arm64" (#83675) improved about 20 microbenchmarks, as detailed in dotnet/perf-autofiling-issues#14531.

Changes in #83512 caused both improvements and regressions as reported in dotnet/perf-autofiling-issues#15008 and dotnet/perf-autofiling-issues#15154.

Regressions

Here is a list of top 20 regressed microbenchmarks in Preview 3.

Name	Baseline Value	Compare Value	% Difference
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.CountBenchmark	0.00	0.12	661187.94
System.Numerics.Tests.Perf_VectorOf<Int16>.CountBenchmark	0.01	0.18	2061.26
System.Numerics.Tests.Perf_Vector3.EqualsBenchmark	23.78	443.27	1764.35
System.Numerics.Tests.Perf_Vector4.EqualsBenchmark	24.01	406.03	1590.83
System.Numerics.Tests.Perf_Vector2.EqualsBenchmark	33.71	435.39	1191.71
System.Numerics.Tests.Perf_Matrix3x2.EqualsBenchmark	162.13	1346.77	730.69
System.Numerics.Tests.Perf_Plane.EqualsBenchmark	57.84	411.46	611.36
System.Numerics.Tests.Perf_Quaternion.EqualsBenchmark	80.35	436.94	443.80
System.Numerics.Tests.Perf_VectorOf<SByte>.CountBenchmark	0.04	0.20	431.24
System.Numerics.Tests.Perf_Matrix4x4.EqualsBenchmark	376.19	1808.21	380.66
System.Numerics.Tests.Perf_Vector4.ZeroBenchmark	0.99	2.52	154.02
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualsBenchmark	124.90	305.09	144.27
System.Numerics.Tests.Perf_VectorOf<Int32>.CountBenchmark	0.19	0.44	127.07
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualsBenchmark	191.86	410.58	113.99
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualsBenchmark	199.71	410.56	105.57
System.Threading.Tests.Perf_Thread.CurrentThread	3.50	6.37	81.95
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get_EnumerateHeaders_Unvalidated(ssl: True, chunkedResponse: True, responseLength: 100000)	1951914.28	3529445.53	80.81
System.Text.Json.Serialization.Tests.ReadJson<BinaryData>.DeserializeFromReader(Mode: SourceGen)	33011.31	59326.04	79.71
System.Globalization.Tests.StringSearch.IsSuffix_DifferentLastChar(Options: (en-US, OrdinalIgnoreCase, False))	913.26	1618.90	77.26
System.Text.Json.Serialization.Tests.ReadJson<BinaryData>.DeserializeFromReader(Mode: Reflection)	32968.66	58440.45	77.26
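
The very large percentages at the top of this table come from baselines that round to 0.00 or 0.01. As a rough sketch, assuming the % Difference column is the relative change (compare - baseline) / baseline * 100 (the report does not state the formula), the unrounded baseline can be estimated from the printed compare value and percentage. The helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(compare: float, pct_difference: float) -> float:
    """Solve pct = (compare - baseline) / baseline * 100 for baseline."""
    return compare / (1.0 + pct_difference / 100.0)


# Perf_Vector128Of<SByte>.CountBenchmark: baseline printed as 0.00.
tiny = implied_baseline(0.12, 661187.94)

# Perf_Vector3.EqualsBenchmark: baseline printed as 23.78 (sanity check).
normal = implied_baseline(443.27, 1764.35)

print(f"{tiny:.6f}")
print(f"{normal:.2f}")
```

Under that assumption the first baseline is on the order of a few hundred-thousandths of the table's unit, so the absolute change is tiny even though the percentage is enormous. Such rows are better read by their absolute values than by their percentages.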

Preview 2

A number of improvements introduced in Preview 2 are worth calling out individually. The following section presents only the major improvements, with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Mono AOT compiler

The following sections present improvements and regressions introduced in the Mono AOT compiler in Preview 2.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 2. Full report available here.

Name	Baseline Value	Compare Value	Difference	% Difference
System.Collections.Concurrent.Count<Int32>.Dictionary(Size: 512)	34.07 μs	310.43 ns	-33756.76 ns	99%
System.Collections.Concurrent.Count<String>.Dictionary(Size: 512)	17.32 μs	314.25 ns	-17007.28 ns	98%
System.Tests.Perf_Decimal.Floor	81.17 ns	16.81 ns	-64.36 ns	79%
System.Tests.Perf_Decimal.Round	82.24 ns	18.69 ns	-63.55 ns	77%
System.Tests.Perf_UInt32.TryFormat(value: 0)	78.23 ns	20.05 ns	-58.18 ns	74%
System.Tests.Perf_Int32.TryFormat(value: 4)	78.02 ns	20.47 ns	-57.55 ns	74%
System.Collections.TryGetValueFalse<String, String>.ConcurrentDictionary(Size: 512)	44.69 μs	12.92 μs	-31.77 μs	71%
System.Tests.Perf_Decimal.Divide	346.08 ns	102.16 ns	-243.92 ns	70%
System.Collections.ContainsKeyFalse<String, String>.ConcurrentDictionary(Size: 512)	45.29 μs	13.50 μs	-31.79 μs	70%
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_HeavyEscaping(NumberOfBytes: 1000)	8.93 μs	2.77 μs	-6.16 μs	69%
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_NoEscaping(NumberOfBytes: 1000)	8.83 μs	2.77 μs	-6.06 μs	69%
System.Tests.Perf_UInt64.TryFormat(value: 0)	84.40 ns	26.53 ns	-57.87 ns	69%
System.Tests.Perf_Byte.ToString(value: 255)	91.65 ns	29.95 ns	-61.69 ns	67%
System.Tests.Perf_Version.TryFormat3	265.42 ns	88.04 ns	-177.38 ns	67%
System.Tests.Perf_Version.TryFormat4	345.05 ns	115.05 ns	-230.00 ns	67%
System.Collections.TryGetValueTrue<String, String>.ConcurrentDictionary(Size: 512)	49.50 μs	16.53 μs	-32.97 μs	67%
System.Tests.Perf_Version.TryFormat2	176.63 ns	59.61 ns	-117.02 ns	66%
System.Collections.ContainsKeyTrue<String, String>.ConcurrentDictionary(Size: 512)	50.43 μs	17.54 μs	-32.89 μs	65%
LinqBenchmarks.Where01ForX	1.57 secs	548.00 ms	-1022.61 ms	65%
LinqBenchmarks.Where01LinqMethodX	1.68 secs	588.39 ms	-1095.38 ms	65%
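
The Preview 2 and Preview 1 tables mix units (ns, μs, ms, secs) and use the opposite sign convention from the Preview 3 tables: improvements are shown as positive percentages and regressions as negative ones. A minimal sketch for recomputing the % Difference column from the printed cells, assuming it is the share of the baseline time that was saved; the helper names are hypothetical.

```python
# Conversion factors to nanoseconds for the unit suffixes used in the tables.
# Both the Greek letter mu and the micro sign are accepted.
UNIT_TO_NS = {"ns": 1.0, "μs": 1e3, "µs": 1e3, "ms": 1e6, "secs": 1e9}


def to_ns(cell: str) -> float:
    """Parse a table cell such as '34.07 μs' into nanoseconds."""
    number, unit = cell.split()
    return float(number) * UNIT_TO_NS[unit]


def improvement_percent(baseline: str, compare: str) -> float:
    """Share of the baseline time that was saved; negative for a regression."""
    base_ns = to_ns(baseline)
    return (base_ns - to_ns(compare)) / base_ns * 100.0


print(round(improvement_percent("34.07 μs", "310.43 ns")))  # 99
print(round(improvement_percent("81.17 ns", "16.81 ns")))   # 79
print(round(improvement_percent("10.17 ns", "28.84 ns")))   # -184
```

The first two calls use rows from the improvements table above; the third uses the first row of the Preview 2 regressions table. The Difference column cannot be reproduced exactly from the printed cells, since those are rounded.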

The most improved benchmark groupings are System.Collections, System.Decimal, System.Int, and System.Text, as outlined in dotnet/perf-autofiling-issues#12996, dotnet/perf-autofiling-issues#13006, dotnet/perf-autofiling-issues#13217, and dotnet/perf-autofiling-issues#13264. The changes implemented in #81695 intrinsified RuntimeHelpers.CreateSpan<T>, which is widely used in the BCL, and replaced the icall performance path.

Arm64 SIMD operations implemented in #83094 and #82420 improved over 1000 microbenchmarks according to the dotnet/perf-autofiling-issues#13808, dotnet/perf-autofiling-issues#13807, dotnet/perf-autofiling-issues#14023, and dotnet/perf-autofiling-issues#13990.

The System.Collections benchmark grouping has been improved by the changes made in #81902, as outlined in dotnet/perf-autofiling-issues#13220. The changes added support for v128 constants and improved performance in about 75 microbenchmarks.

The System.Text benchmark grouping has been improved by the addition of S.R.I Vectors in JsonReaderHelper, introduced in #81758 and outlined in dotnet/perf-autofiling-issues#12993. Furthermore, the improved handling of the ldtoken+ltoken+Type::op_Equal optimization implemented in #81277 has significantly improved the System.Text benchmark grouping, as detailed in dotnet/perf-autofiling-issues#12313.

The changes introduced in #81306, which removed types deriving from JsonTypeInfo<T>, have had a positive impact on both the System.Numerics and System.Collections benchmark groupings, as reported in dotnet/perf-autofiling-issues#12488 and dotnet/perf-autofiling-issues#12550.

All of the changes mentioned above are speed-related improvements of microbenchmarks. There was also a significant size improvement on WASM and iOS from enabling deduplication of generics. Issue #80419 contains references to changes that reduced size on disk (SOD) by about 11% and 3% respectively.

Regressions

Here is a list of the top 20 microbenchmark regressions in Preview 2. Full report available here.

Name	Baseline Value	Compare Value	Difference	% Difference
System.Tests.Perf_Random.Next_long_unseeded	10.17 ns	28.84 ns	18.67 ns	-184%
System.Numerics.Tests.Perf_Vector4.EqualityOperatorBenchmark	0.79 ns	1.96 ns	1.17 ns	-148%
System.Numerics.Tests.Perf_Vector3.TransformByMatrix4x4Benchmark	60.14 ns	140.30 ns	80.17 ns	-133%
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark	60.73 ns	132.19 ns	71.46 ns	-118%
System.Numerics.Tests.Perf_Vector4.TransformVector3ByMatrix4x4Benchmark	62.72 ns	131.48 ns	68.76 ns	-110%
System.Numerics.Tests.Perf_Vector4.TransformByMatrix4x4Benchmark	63.09 ns	131.10 ns	68.00 ns	-108%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix4x4Benchmark	56.47 ns	112.12 ns	55.65 ns	-99%
System.Numerics.Tests.Perf_Quaternion.LengthSquaredBenchmark	7.76 ns	14.35 ns	6.59 ns	-85%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix4x4Benchmark	56.66 ns	103.10 ns	46.44 ns	-82%
System.Numerics.Tests.Perf_Vector4.TransformVector2ByMatrix4x4Benchmark	61.08 ns	103.66 ns	42.58 ns	-70%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix3x2Benchmark	20.85 ns	35.00 ns	14.15 ns	-68%
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_uint	667.85 ns	1.10 μs	428.39 ns	-64%
System.Tests.Perf_Random.Next_long_long_unseeded	14.28 ns	22.44 ns	8.15 ns	-57%
System.Numerics.Tests.Perf_Quaternion.ConjugateBenchmark	18.32 ns	28.76 ns	10.44 ns	-57%
System.Numerics.Tests.Perf_Quaternion.InverseBenchmark	26.70 ns	41.60 ns	14.89 ns	-56%
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark	13.45 ns	20.35 ns	6.90 ns	-51%
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_ulong	745.74 ns	1.10 μs	357.01 ns	-48%
System.Numerics.Tests.Perf_BitOperations.Log2_ulong	894.61 ns	1.32 μs	425.98 ns	-48%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix3x2Benchmark	21.03 ns	30.87 ns	9.85 ns	-47%
System.Numerics.Tests.Perf_Vector3.ReflectBenchmark	37.23 ns	54.13 ns	16.90 ns	-45%

Here is a list of ongoing regressions in the Preview 2 snapshot, each with a short description.

Issue report	Description
dotnet/perf-autofiling-issues#12546	Quaternion and Plane SIMD intrinsics
dotnet/perf-autofiling-issues#12957	Improve `ConcurrentDictionary` performance for strings
dotnet/perf-autofiling-issues#12660	Improved codegen of the vector accelerated `System.Numerics.*` types
dotnet/perf-autofiling-issues#13187	Implementation of Lemire's nearly divisionless method
dotnet/perf-autofiling-issues#13500	Use of `Array.Reverse<T>` in `ImmutableArray<T>.Builder.Reverse`

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 2.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 2. Full report available here.

Name	Baseline Value	Compare Value	Difference	% Difference
System.Collections.Concurrent.Count<Int32>.Dictionary(Size: 512)	140.03 μs	1.76 μs	-138.26 μs	99%
System.Collections.Concurrent.Count<String>.Dictionary(Size: 512)	136.03 μs	1.86 μs	-134.17 μs	99%
System.Threading.Tests.Perf_Interlocked.CompareExchange_long	37.56 ns	6.66 ns	-30.90 ns	82%
System.Threading.Tests.Perf_Interlocked.CompareExchange_int	34.18 ns	8.33 ns	-25.85 ns	76%
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False)	3.81 μs	1.09 μs	-2.72 μs	71%
System.Numerics.Tests.Perf_Vector4.ZeroBenchmark	3.21 ns	0.99 ns	-2.22 ns	69%
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False)	3.42 μs	1.06 μs	-2.36 μs	69%
System.Tests.Perf_Decimal.Floor	175.25 ns	65.77 ns	-109.48 ns	62%
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark	63.64 ns	24.08 ns	-39.56 ns	62%
System.Numerics.Tests.Perf_Quaternion.InequalityOperatorBenchmark	89.74 ns	34.82 ns	-54.93 ns	61%
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: False)	4.34 μs	1.70 μs	-2.64 μs	61%
System.Tests.Perf_Decimal.Round	191.52 ns	75.77 ns	-115.76 ns	60%
System.Numerics.Tests.Perf_Quaternion.DotBenchmark	77.60 ns	31.33 ns	-46.27 ns	60%
System.Numerics.Tests.Perf_Quaternion.DivideBenchmark	88.55 ns	36.47 ns	-52.07 ns	59%
System.Tests.Perf_Random.Next_int_int_unseeded	154.47 ns	65.37 ns	-89.11 ns	58%
System.Numerics.Tests.Perf_Quaternion.IsIdentityBenchmark	81.52 ns	35.06 ns	-46.46 ns	57%
System.Numerics.Tests.Perf_Quaternion.SubtractionOperatorBenchmark	83.75 ns	36.09 ns	-47.67 ns	57%
System.Numerics.Tests.Perf_Quaternion.SubtractBenchmark	84.49 ns	36.50 ns	-47.99 ns	57%
System.Collections.CtorFromCollection<Int32>.ConcurrentDictionary(Size: 512)	461.77 μs	200.10 μs	-261.67 μs	57%
System.Tests.Perf_UInt64.TryFormat(value: 0)	250.12 ns	109.72 ns	-140.40 ns	56%

The most improved benchmark groupings are System.Collections, System.Numerics, and System.Decimal, as outlined in dotnet/perf-autofiling-issues#12504, dotnet/perf-autofiling-issues#12544, dotnet/perf-autofiling-issues#13303, dotnet/perf-autofiling-issues#13247, dotnet/perf-autofiling-issues#13752, dotnet/perf-autofiling-issues#13761, and dotnet/perf-autofiling-issues#12744. Together, the changes in #81335, which intrinsified System.Numerics.* types, in #82093, which intrinsified CreateSpan, and in #81782, which introduced common Vector128 SIMD operations widely used in the BCL, improved over 1000 microbenchmarks.

The implementation of synch block fast paths created a regression in the Mono AOT compiler (#81380), but led to an improvement of about 100 microbenchmarks in the Mono Interpreter, as detailed in dotnet/perf-autofiling-issues#13245.

As in the AOT compiler, the changes introduced in #81306, which removed types deriving from JsonTypeInfo<T>, improved several microbenchmarks in the Mono Interpreter. The change "Improve ConcurrentDictionary performance for strings" (#81557) led to the improvements reported in dotnet/perf-autofiling-issues#13003. Code refactoring also led to several improvements, presented in dotnet/perf-autofiling-issues#12301.

Regressions

Here is a list of the top 20 microbenchmark regressions in Preview 2. Full report available here.

Name	Baseline Value	Compare Value	Difference	% Difference
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark	0.06 ns	3.10 ns	3.04 ns	-5,059%
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.CountBenchmark	0.36 ns	1.75 ns	1.39 ns	-391%
System.Collections.TryAddDefaultSize<String>.ConcurrentDictionary(Count: 512)	297.96 μs	574.34 μs	276.38 μs	-93%
System.Numerics.Tests.Perf_Vector2.UnitYBenchmark	7.38 ns	13.69 ns	6.31 ns	-85%
HardwareIntrinsics.RayTracer.SoA.Render	2.41 ns	4.38 ns	1.97 ns	-82%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix3x2Benchmark	48.06 ns	86.28 ns	38.22 ns	-80%
System.IO.Compression.Brotli.Compress_WithoutState(level: Fastest, file: "TestDocument.pdf")	291.36 μs	522.83 μs	231.47 μs	-79%
System.IO.Compression.Brotli.Compress_WithState(level: Fastest, file: "TestDocument.pdf")	296.93 μs	525.99 μs	229.06 μs	-77%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix3x2Benchmark	44.65 ns	75.61 ns	30.96 ns	-69%
System.Memory.Constructors_ValueTypesOnly<Byte>.ReadOnlyFromPointerLength	6.33 ns	10.49 ns	4.16 ns	-66%
PerfLabTests.EnumPerf.ObjectGetTypeNoBoxing	3.87 ns	6.20 ns	2.32 ns	-60%
System.Numerics.Tests.Perf_Vector3.SquareRootBenchmark	23.34 ns	37.02 ns	13.68 ns	-59%
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark	124.53 ns	196.66 ns	72.12 ns	-58%
System.Diagnostics.Perf_Process.StartAndWaitForExit	871.51 μs	1.35 ms	474.57 μs	-54%
System.Numerics.Tests.Perf_Vector3.TransformByMatrix4x4Benchmark	144.68 ns	217.99 ns	73.31 ns	-51%
System.Collections.AddGivenSize<String>.List(Size: 512)	12.21 μs	18.32 μs	6.11 μs	-50%
System.IO.Tests.BinaryWriterExtendedTests.WriteAsciiCharArray(StringLengthInChars: 2000000)	8.14 ms	12.20 ms	4.06 ms	-50%
System.Numerics.Tests.Perf_VectorOf<Int32>.ZeroBenchmark	3.20 ns	4.80 ns	1.59 ns	-50%
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True)	5.73 μs	8.56 μs	2.83 μs	-49%
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True)	5.62 μs	8.37 μs	2.75 μs	-49%

Here is a list of ongoing regressions in the Preview 2 snapshot, each with a short description.

Issue report	Description
dotnet/perf-autofiling-issues#12707	use of not implemented Vector operations
dotnet/perf-autofiling-issues#13747	Intrinsified common `Vector128` operations

Preview 1

This section presents an overview of the major performance improvements and regressions in the Mono Interpreter in .NET 8 Preview 1.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 1.

Name	Baseline Value	Compare Value	Difference	% Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanAnyBenchmark	292.17 ns	18.88 ns	-273.29 ns	94%
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanOrEqualAnyBenchmark	298.08 ns	20.47 ns	-277.61 ns	93%
System.Numerics.Tests.Perf_VectorOf<SByte>.LessThanOrEqualAnyBenchmark	294.38 ns	20.33 ns	-274.05 ns	93%
System.Numerics.Tests.Perf_VectorOf<SByte>.LessThanAnyBenchmark	298.45 ns	20.63 ns	-277.82 ns	93%
System.Numerics.Tests.Perf_VectorOf<Byte>.GreaterThanOrEqualAllBenchmark	331.73 ns	24.25 ns	-307.48 ns	93%
System.Numerics.Tests.Perf_VectorOf<UInt16>.GreaterThanOrEqualAllBenchmark	218.05 ns	20.58 ns	-197.47 ns	91%
System.Numerics.Tests.Perf_VectorOf<Int16>.GreaterThanAllBenchmark	209.57 ns	20.48 ns	-189.08 ns	90%
System.Numerics.Tests.Perf_VectorOf<Int16>.GreaterThanOrEqualAllBenchmark	231.47 ns	23.03 ns	-208.44 ns	90%
System.Numerics.Tests.Perf_VectorOf<Int16>.LessThanOrEqualAnyBenchmark	188.87 ns	20.02 ns	-168.84 ns	89%
System.Numerics.Tests.Perf_VectorOf<Int16>.LessThanAnyBenchmark	186.21 ns	20.05 ns	-166.16 ns	89%
System.Numerics.Tests.Perf_VectorOf<UInt16>.LessThanOrEqualAnyBenchmark	189.87 ns	20.76 ns	-169.11 ns	89%
System.Numerics.Tests.Perf_VectorOf<UInt16>.LessThanAnyBenchmark	186.54 ns	21.38 ns	-165.15 ns	89%
System.Memory.Span<Byte>.IndexOfAnyFourValues(Size: 512)	11.82 μs	1.60 μs	-10.23 μs	87%
System.Memory.Span<Byte>.IndexOfAnyFiveValues(Size: 512)	14.32 μs	2.42 μs	-11.90 μs	83%
System.Numerics.Tests.Perf_VectorOf<Int32>.GreaterThanAllBenchmark	120.71 ns	20.59 ns	-100.11 ns	83%
System.Numerics.Tests.Perf_VectorOf<UInt32>.GreaterThanAllBenchmark	124.72 ns	21.39 ns	-103.32 ns	83%
System.Numerics.Tests.Perf_VectorOf<Single>.GreaterThanOrEqualAllBenchmark	136.11 ns	24.20 ns	-111.91 ns	82%
System.Numerics.Tests.Perf_VectorOf<Single>.GreaterThanAllBenchmark	128.50 ns	24.30 ns	-104.20 ns	81%
System.Numerics.Tests.Perf_VectorOf<UInt64>.GreaterThanAllBenchmark	105.81 ns	20.48 ns	-85.33 ns	81%
System.Numerics.Tests.Perf_VectorOf<Int64>.GreaterThanAllBenchmark	105.16 ns	20.57 ns	-84.60 ns	80%

A number of improvements introduced in Preview 1 are worth calling out individually. The following section presents only the major improvements, with high-level analysis.
The analysis should be treated with caution, and readers are encouraged to examine the benchmark reports for a thorough analysis.

The most improved benchmark groupings are System.Runtime.Vectors, System.Runtime.Intrinsics, and System.Collections, as outlined here and in dotnet/perf-autofiling-issues#10468.
Adding a stobj.vt.noref version for the no-reference case, which is twice as fast as stobj.v, improved over 400 microbenchmarks, as outlined in dotnet/perf-autofiling-issues#10468 and dotnet/perf-autofiling-issues#10464.

SpanHelpers are widely used in the BCL, so improvements to them can significantly improve performance. Changes in 200a90a, 7fa0d5b, and c0447bc removed Mono-specific SpanHelpers, replaced branch patterns with super-instructions, and improved detection of dead bblocks. Over 300 microbenchmarks improved, as outlined in dotnet/perf-autofiling-issues#10989 and dotnet/perf-autofiling-issues#11155.
Change #77331 simplified the getitem.span opcode and avoided the typical use of ldloca with it, which improved over 50 microbenchmarks.
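
To illustrate the general idea behind super-instructions, here is a toy sketch: a peephole pass fuses a compare followed by a conditional branch into a single opcode, so the interpreter loop dispatches fewer instructions for the same work. The stack machine, the opcode names (`LOAD`, `LT`, `BRTRUE`, `BLT`, and so on), and both functions are hypothetical and are not Mono's opcodes or code.

```python
def run(program, n):
    """Interpret program; return (total, number of dispatched instructions)."""
    labels = {ins[1]: pc for pc, ins in enumerate(program) if ins[0] == "LABEL"}
    env = {"i": 0, "total": 0, "n": n}
    stack, pc, dispatched = [], 0, 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "LABEL":
            continue  # labels are markers, not executed instructions
        dispatched += 1
        if op == "LOAD":
            stack.append(env[args[0]])
        elif op == "STORE":
            env[args[0]] = stack.pop()
        elif op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "LT":
            b, a = stack.pop(), stack.pop()
            stack.append(a < b)
        elif op == "BRTRUE":
            if stack.pop():
                pc = labels[args[0]]
        elif op == "BLT":  # super-instruction: fused LT + BRTRUE
            b, a = stack.pop(), stack.pop()
            if a < b:
                pc = labels[args[0]]
        else:
            raise ValueError(f"unknown opcode {op}")
    return env["total"], dispatched


def fuse_compare_branch(program):
    """Peephole pass: replace each adjacent LT, BRTRUE pair with one BLT."""
    fused, pc = [], 0
    while pc < len(program):
        if (program[pc][0] == "LT" and pc + 1 < len(program)
                and program[pc + 1][0] == "BRTRUE"):
            fused.append(("BLT", program[pc + 1][1]))
            pc += 2
        else:
            fused.append(program[pc])
            pc += 1
    return fused


# total = 0 + 1 + ... + (n - 1), written as a do-while loop.
LOOP = [
    ("LABEL", "top"),
    ("LOAD", "total"), ("LOAD", "i"), ("ADD",), ("STORE", "total"),
    ("LOAD", "i"), ("PUSH", 1), ("ADD",), ("STORE", "i"),
    ("LOAD", "i"), ("LOAD", "n"), ("LT",), ("BRTRUE", "top"),
]

plain = run(LOOP, 10)
fused = run(fuse_compare_branch(LOOP), 10)
print(plain, fused)
```

Both runs compute the same total, but the fused program dispatches one instruction fewer per loop iteration. Because labels are separate entries in this toy encoding, a branch target can never fall between an adjacent `LT` and `BRTRUE`, which is what makes the fusion safe here.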

Allowing vtypes with a single scalar field to be passed to native code using the faster code path improved the System.Text and System.Collections benchmark groupings, as outlined in dotnet/perf-autofiling-issues#10987 and dotnet/perf-autofiling-issues#10938. The assumption is that those libraries rely on ObjectHandleOnStack types.

The newstr intrinsic for string allocation, added in #79392, improved various microbenchmarks, as outlined in dotnet/perf-autofiling-issues#10694 and dotnet/perf-autofiling-issues#10670.

9a65109 contributed to dotnet/perf-autofiling-issues#10695 and dotnet/perf-autofiling-issues#10671.

All of the changes mentioned above are speed improvements of microbenchmarks. There was also a significant size improvement in WebAssembly from #79672, which reduced size on disk (SOD) of the Blazor template application by ~270kb by trimming the S.N.Vector class in non-SIMD cases. Deduplication of symbols in WebAssembly achieves additional size savings.

Regressions

Here is a list of the top 20 microbenchmark regressions in Preview 1.

Name	Baseline Value	Compare Value	Difference	% Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.CountBenchmark	0.10 ns	1.10 ns	1.00 ns	-969%
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql	11.63 μs	101.96 μs	90.33 μs	-777%
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58l", ol	1.30 μs	8.82 μs	7.52 μs	-578%
System.Tests.Perf_Byte.ToString(value: 255)	38.31 ns	257.96 ns	219.65 ns	-573%
System.Tests.Perf_String.Replace_String(text: "This is a very nice sentence. This is another very nice sentence.", oldValue: "a", newValue: "b")	962.59 ns	6.30 μs	5335.40 ns	-554%
PerfLabTests.LowLevelPerf.IntegerFormatting	6.08 ms	34.30 ms	28.21 ms	-464%
System.Tests.Perf_Int32.ToString(value: 2147483647)	59.17 ns	332.19 ns	273.01 ns	-461%
System.Tests.Perf_Int16.ToString(value: 32767)	53.24 ns	297.84 ns	244.60 ns	-459%
System.Tests.Perf_Int32.ToString(value: 12345)	52.90 ns	293.56 ns	240.66 ns	-455%
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'i', newChar: 'I')	531.46 ns	2.89 μs	2355.30 ns	-443%
System.Tests.Perf_SByte.ToString(value: 127)	52.62 ns	276.41 ns	223.79 ns	-425%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix4x4Benchmark	21.70 ns	108.97 ns	87.28 ns	-402%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix4x4Benchmark	26.37 ns	114.02 ns	87.65 ns	-332%
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByMatrixOperatorBenchmark	246.08 ns	1.04 μs	797.11 ns	-324%
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByMatrixBenchmark	243.24 ns	1.02 μs	779.98 ns	-321%
System.Tests.Perf_Byte.ToString(value: 0)	7.06 ns	27.18 ns	20.11 ns	-285%
System.Numerics.Tests.Perf_Matrix4x4.CreateTranslationFromScalarXYZ	25.27 ns	91.61 ns	66.34 ns	-263%
System.Numerics.Tests.Perf_Matrix4x4.AddBenchmark	90.93 ns	304.20 ns	213.27 ns	-235%
System.Numerics.Tests.Perf_Matrix4x4.LerpBenchmark	141.51 ns	443.45 ns	301.94 ns	-213%
System.Numerics.Tests.Perf_Matrix4x4.SubtractOperatorBenchmark	100.31 ns	307.60 ns	207.29 ns	-207%

Here is a list of ongoing regressions in the Preview 1 snapshot, each with a short description.

Issue report	Description
dotnet/perf-autofiling-issues#12299	Extracted code outside of interp main loop
dotnet/perf-autofiling-issues#11449	Investigating
dotnet/perf-autofiling-issues#11453	Redundant `ldloca` and `stfld` opcodes in the new `Matrix4x4` implementation
dotnet/perf-autofiling-issues#11147	New ASCII APIs
#79973	Dependencies update
#79336	Managed implementation of UInt32ToDecStr
#79876	Unoptimized pattern `ldstr; if (uncommon) throw ex (string)`

ghost · 2023-04-04T15:05:13Z

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

This report provides an overview of the major performance improvements and regressions in Mono AOT and Interpreter during the timeframe of .NET 8 per-preview releases.

[WIP] Preview 3

This report presents an overview of the major performance improvements and regressions in Mono AOT and Interpreter in .NET 8 Preview 3.
The full benchmark report will be available in a form similar to #79245 and https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7/ when .NET 8 is released.

A number of improvements introduced in Preview 3 are worth calling out individually. The following section presents only the major improvements, with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Setup

According to https://github.com/dotnet/perf-autofiling-issues, the following configurations are used.

Operating System	Bit	Processor Name
macOS 13.0	Arm64	Apple M1
ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz

More details on .NET performance benchmarking are available at https://github.com/dotnet/performance.

Mono AOT compiler

The following sections present improvements and regressions introduced in the Mono AOT compiler in Preview 3.

Improvements

The most improved benchmark grouping is System.Numerics, as outlined in dotnet/perf-autofiling-issues#14023, dotnet/perf-autofiling-issues#14224, dotnet/perf-autofiling-issues#14573, and dotnet/perf-autofiling-issues#14322. The changes implemented in #82420, #83337, and #83094 introduced Arm64 SIMD operations and improved about 1000 microbenchmarks.

Regressions

This report focuses on relevant regressions that are either in progress or under investigation; they are tracked separately. Reports #77490 and #79288 track active speed and size regressions respectively.

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 3.

Improvements

The most improved benchmark groupings are System.Buffers, System.Collections, System.Memory, and System.Text, as outlined in dotnet/perf-autofiling-issues#14324, dotnet/perf-autofiling-issues#14325, dotnet/perf-autofiling-issues#14326, dotnet/perf-autofiling-issues#14355, dotnet/perf-autofiling-issues#14359, and dotnet/perf-autofiling-issues#14361. The changes implemented in #83498 and #83490 increased the inlining length limit from 20 to 30 and implemented shr.un.imm, which together improved over 1000 microbenchmarks.

The change "Add vector horizontal sums on Arm64" (#83675) improved about 20 microbenchmarks, as detailed in dotnet/perf-autofiling-issues#14531.

Regressions

This report focuses on relevant regressions that are either in progress or under investigation; they are tracked separately. Reports #77490 and #79288 track active speed and size regressions respectively.

Preview 2

This report presents an overview of the major performance improvements and regressions in Mono AOT and Interpreter in .NET 8 Preview 2.
The full benchmark report will be available in a form similar to #79245 and https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7/ when .NET 8 is released.

A number of improvements introduced in Preview 2 are worth calling out individually. The following section presents only the major improvements, with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Setup

According to https://github.com/dotnet/perf-autofiling-issues, the following configurations are used.

Operating System	Bit	Processor Name
macOS 13.0	Arm64	Apple M1
ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz

More details on .NET performance benchmarking are available at https://github.com/dotnet/performance.

Mono AOT compiler

The following sections present improvements and regressions introduced in the Mono AOT compiler in Preview 2.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 2. Full report available here.

Name	Baseline Value	Compare Value	Difference	% Difference
System.Collections.Concurrent.Count<Int32>.Dictionary(Size: 512)	34.07 μs	310.43 ns	-33756.76 ns	99%
System.Collections.Concurrent.Count<String>.Dictionary(Size: 512)	17.32 μs	314.25 ns	-17007.28 ns	98%
System.Tests.Perf_Decimal.Floor	81.17 ns	16.81 ns	-64.36 ns	79%
System.Tests.Perf_Decimal.Round	82.24 ns	18.69 ns	-63.55 ns	77%
System.Tests.Perf_UInt32.TryFormat(value: 0)	78.23 ns	20.05 ns	-58.18 ns	74%
System.Tests.Perf_Int32.TryFormat(value: 4)	78.02 ns	20.47 ns	-57.55 ns	74%
System.Collections.TryGetValueFalse<String, String>.ConcurrentDictionary(Size: 512)	44.69 μs	12.92 μs	-31.77 μs	71%
System.Tests.Perf_Decimal.Divide	346.08 ns	102.16 ns	-243.92 ns	70%
System.Collections.ContainsKeyFalse<String, String>.ConcurrentDictionary(Size: 512)	45.29 μs	13.50 μs	-31.79 μs	70%
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_HeavyEscaping(NumberOfBytes: 1000)	8.93 μs	2.77 μs	-6.16 μs	69%
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_NoEscaping(NumberOfBytes: 1000)	8.83 μs	2.77 μs	-6.06 μs	69%
System.Tests.Perf_UInt64.TryFormat(value: 0)	84.40 ns	26.53 ns	-57.87 ns	69%
System.Tests.Perf_Byte.ToString(value: 255)	91.65 ns	29.95 ns	-61.69 ns	67%
System.Tests.Perf_Version.TryFormat3	265.42 ns	88.04 ns	-177.38 ns	67%
System.Tests.Perf_Version.TryFormat4	345.05 ns	115.05 ns	-230.00 ns	67%
System.Collections.TryGetValueTrue<String, String>.ConcurrentDictionary(Size: 512)	49.50 μs	16.53 μs	-32.97 μs	67%
System.Tests.Perf_Version.TryFormat2	176.63 ns	59.61 ns	-117.02 ns	66%
System.Collections.ContainsKeyTrue<String, String>.ConcurrentDictionary(Size: 512)	50.43 μs	17.54 μs	-32.89 μs	65%
LinqBenchmarks.Where01ForX	1.57 secs	548.00 ms	-1022.61 ms	65%
LinqBenchmarks.Where01LinqMethodX	1.68 secs	588.39 ms	-1095.38 ms	65%
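The rows above are consistent with computing % Difference as (baseline - compare) / baseline, so that improvements come out positive and regressions negative. That formula is inferred from the numbers in the tables and is not stated by the report. The sketch below checks it against a few rows; the helper names and the unit table are assumptions made for this illustration.

```python
# Conversion factors to nanoseconds for the unit suffixes used in the tables.
# Both the Greek mu and the micro sign are accepted, plus a plain "us" alias.
UNIT_TO_NS = {
    "ns": 1.0,
    "μs": 1e3,
    "µs": 1e3,
    "us": 1e3,
    "ms": 1e6,
    "secs": 1e9,
}


def to_ns(text):
    """Parse a value such as '34.07 μs' into nanoseconds."""
    value, unit = text.split()
    return float(value) * UNIT_TO_NS[unit]


def percent_difference(baseline, compare):
    """Relative change versus the baseline, in whole percent.

    Positive means the compare run is faster (an improvement);
    negative means it is slower (a regression).
    """
    base_ns = to_ns(baseline)
    return round((base_ns - to_ns(compare)) / base_ns * 100)


rows = [
    ("34.07 μs", "310.43 ns"),   # Count<Int32>.Dictionary, listed as 99%
    ("81.17 ns", "16.81 ns"),    # Perf_Decimal.Floor, listed as 79%
    ("1.57 secs", "548.00 ms"),  # LinqBenchmarks.Where01ForX, listed as 65%
]
results = [percent_difference(baseline, compare) for baseline, compare in rows]
```

Normalizing both values to the same unit before dividing matters here, because several rows mix μs baselines with ns compare values.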

The most improved benchmark groupings are System.Collections, System.Decimal, System.Int, and System.Text, as outlined in dotnet/perf-autofiling-issues#12996, dotnet/perf-autofiling-issues#13006, dotnet/perf-autofiling-issues#13217, and dotnet/perf-autofiling-issues#13264. The changes implemented in #81695 intrinsified RuntimeHelpers.CreateSpan<T>, which is widely used in the BCL, and replaced the icall path.

Arm64 SIMD operations implemented in #83094 and #82420 improved over 1000 microbenchmarks, according to dotnet/perf-autofiling-issues#13808, dotnet/perf-autofiling-issues#13807, dotnet/perf-autofiling-issues#14023, and dotnet/perf-autofiling-issues#13990.

The System.Collections benchmark grouping has been improved by the changes made in #81902, as outlined in dotnet/perf-autofiling-issues#13220. The changes added support for v128 constants and improved performance in about 75 microbenchmarks.

The System.Text benchmark grouping has been improved by the addition of S.R.I Vectors in JsonReaderHelper, introduced in #81758 and outlined in dotnet/perf-autofiling-issues#12993. Furthermore, improved handling of the ldtoken+ldtoken+Type::op_Equality pattern, implemented in #81277, has significantly improved the System.Text benchmark grouping, as detailed in dotnet/perf-autofiling-issues#12313.

The changes introduced in #81306, which removed types deriving from JsonTypeInfo<T>, have had a positive impact on the System.Numerics and System.Collections benchmark groupings, as reported in dotnet/perf-autofiling-issues#12488 and dotnet/perf-autofiling-issues#12550.

All of the changes mentioned above are speed-related improvements of microbenchmarks. There was also a significant size improvement on WASM and iOS from enabling deduplication of generics. Issue #80419 contains references to changes that reduced size on disk (SOD) by about 11% and 3%, respectively.

Regressions

Here is a list of the top 20 microbenchmark regressions in Preview 2. Full report available here.

Name	Baseline Value	Compare Value	Difference	% Difference
System.Tests.Perf_Random.Next_long_unseeded	10.17 ns	28.84 ns	18.67 ns	-184%
System.Numerics.Tests.Perf_Vector4.EqualityOperatorBenchmark	0.79 ns	1.96 ns	1.17 ns	-148%
System.Numerics.Tests.Perf_Vector3.TransformByMatrix4x4Benchmark	60.14 ns	140.30 ns	80.17 ns	-133%
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark	60.73 ns	132.19 ns	71.46 ns	-118%
System.Numerics.Tests.Perf_Vector4.TransformVector3ByMatrix4x4Benchmark	62.72 ns	131.48 ns	68.76 ns	-110%
System.Numerics.Tests.Perf_Vector4.TransformByMatrix4x4Benchmark	63.09 ns	131.10 ns	68.00 ns	-108%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix4x4Benchmark	56.47 ns	112.12 ns	55.65 ns	-99%
System.Numerics.Tests.Perf_Quaternion.LengthSquaredBenchmark	7.76 ns	14.35 ns	6.59 ns	-85%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix4x4Benchmark	56.66 ns	103.10 ns	46.44 ns	-82%
System.Numerics.Tests.Perf_Vector4.TransformVector2ByMatrix4x4Benchmark	61.08 ns	103.66 ns	42.58 ns	-70%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix3x2Benchmark	20.85 ns	35.00 ns	14.15 ns	-68%
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_uint	667.85 ns	1.10 μs	428.39 ns	-64%
System.Tests.Perf_Random.Next_long_long_unseeded	14.28 ns	22.44 ns	8.15 ns	-57%
System.Numerics.Tests.Perf_Quaternion.ConjugateBenchmark	18.32 ns	28.76 ns	10.44 ns	-57%
System.Numerics.Tests.Perf_Quaternion.InverseBenchmark	26.70 ns	41.60 ns	14.89 ns	-56%
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark	13.45 ns	20.35 ns	6.90 ns	-51%
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_ulong	745.74 ns	1.10 μs	357.01 ns	-48%
System.Numerics.Tests.Perf_BitOperations.Log2_ulong	894.61 ns	1.32 μs	425.98 ns	-48%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix3x2Benchmark	21.03 ns	30.87 ns	9.85 ns	-47%
System.Numerics.Tests.Perf_Vector3.ReflectBenchmark	37.23 ns	54.13 ns	16.90 ns	-45%

This report focuses on relevant regressions that are either in progress or under investigation; they are tracked separately. Reports #77490 and #79288 track active speed and size regressions, respectively.

Here is a list of ongoing regressions in the Preview 2 snapshot, each with a short description.

Issue report	Description
dotnet/perf-autofiling-issues#12546	Quaternion and Plane SIMD intrinsics
dotnet/perf-autofiling-issues#12957	Improve `ConcurrentDictionary` performance for strings
dotnet/perf-autofiling-issues#12660	Improved codegen of the vector accelerated `System.Numerics.*` types
dotnet/perf-autofiling-issues#13187	Implementation of Lemire's nearly divisionless method
dotnet/perf-autofiling-issues#13500	Use of `Array.Reverse<T>` in `ImmutableArray<T>.Builder.Reverse`
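One of the entries above refers to Lemire's nearly divisionless method for drawing a uniform integer in a bounded range. For context, here is a minimal sketch of the published algorithm in Python. It is not the .NET implementation; the function name, the `next_word` callback, and the `bits` parameter (which lets the word size be shrunk for testing) are assumptions made for this illustration.

```python
import random


def lemire_bounded(s, next_word, bits=32):
    """Return a uniform integer in [0, s) using Lemire's nearly divisionless method.

    next_word() must return a uniform integer in [0, 2**bits).
    """
    if not 0 < s < (1 << bits):
        raise ValueError("s must be in (0, 2**bits)")
    mask = (1 << bits) - 1
    m = next_word() * s          # double-width product
    low = m & mask               # low half decides acceptance
    if low < s:
        # Slow path, taken with probability below s / 2**bits.
        # This modulo is the only division in the whole routine.
        threshold = (1 << bits) % s
        while low < threshold:   # reject the few words that would cause bias
            m = next_word() * s
            low = m & mask
    return m >> bits             # high half is the result


if __name__ == "__main__":
    rng = random.Random(8)
    sample = [lemire_bounded(6, lambda: rng.getrandbits(32)) for _ in range(10)]
    print(sample)
```

The common case costs one multiplication and one comparison; the modulo is only reached when the low half of the product is smaller than the range. How this interacts with a particular runtime's code generation, and why it shows up as a regression here, is tracked in the linked issue.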

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 2.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 2. Full report available here.

Name	Baseline Value	Compare Value	Difference	% Difference
System.Collections.Concurrent.Count<Int32>.Dictionary(Size: 512)	140.03 μs	1.76 μs	-138.26 μs	99%
System.Collections.Concurrent.Count<String>.Dictionary(Size: 512)	136.03 μs	1.86 μs	-134.17 μs	99%
System.Threading.Tests.Perf_Interlocked.CompareExchange_long	37.56 ns	6.66 ns	-30.90 ns	82%
System.Threading.Tests.Perf_Interlocked.CompareExchange_int	34.18 ns	8.33 ns	-25.85 ns	76%
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False)	3.81 μs	1.09 μs	-2.72 μs	71%
System.Numerics.Tests.Perf_Vector4.ZeroBenchmark	3.21 ns	0.99 ns	-2.22 ns	69%
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False)	3.42 μs	1.06 μs	-2.36 μs	69%
System.Tests.Perf_Decimal.Floor	175.25 ns	65.77 ns	-109.48 ns	62%
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark	63.64 ns	24.08 ns	-39.56 ns	62%
System.Numerics.Tests.Perf_Quaternion.InequalityOperatorBenchmark	89.74 ns	34.82 ns	-54.93 ns	61%
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: False)	4.34 μs	1.70 μs	-2.64 μs	61%
System.Tests.Perf_Decimal.Round	191.52 ns	75.77 ns	-115.76 ns	60%
System.Numerics.Tests.Perf_Quaternion.DotBenchmark	77.60 ns	31.33 ns	-46.27 ns	60%
System.Numerics.Tests.Perf_Quaternion.DivideBenchmark	88.55 ns	36.47 ns	-52.07 ns	59%
System.Tests.Perf_Random.Next_int_int_unseeded	154.47 ns	65.37 ns	-89.11 ns	58%
System.Numerics.Tests.Perf_Quaternion.IsIdentityBenchmark	81.52 ns	35.06 ns	-46.46 ns	57%
System.Numerics.Tests.Perf_Quaternion.SubtractionOperatorBenchmark	83.75 ns	36.09 ns	-47.67 ns	57%
System.Numerics.Tests.Perf_Quaternion.SubtractBenchmark	84.49 ns	36.50 ns	-47.99 ns	57%
System.Collections.CtorFromCollection<Int32>.ConcurrentDictionary(Size: 512)	461.77 μs	200.10 μs	-261.67 μs	57%
System.Tests.Perf_UInt64.TryFormat(value: 0)	250.12 ns	109.72 ns	-140.40 ns	56%

The most improved benchmark groupings are System.Collections, System.Numerics, and System.Decimal, as outlined in dotnet/perf-autofiling-issues#12504, dotnet/perf-autofiling-issues#12544, dotnet/perf-autofiling-issues#13303, dotnet/perf-autofiling-issues#13247, dotnet/perf-autofiling-issues#13752, dotnet/perf-autofiling-issues#13761, and dotnet/perf-autofiling-issues#12744. Together, the changes in #81335, which intrinsified System.Numerics.* types, in #82093, which intrinsified CreateSpan, and in #81782, which introduced common Vector128 SIMD operations widely used in the BCL, improved over 1000 microbenchmarks.

The implementation of synch block fast paths caused a regression in the Mono AOT compiler (#81380) but improved about 100 microbenchmarks in the Mono Interpreter, as detailed in dotnet/perf-autofiling-issues#13245.

As with the AOT compiler, the changes introduced in #81306, which removed types deriving from JsonTypeInfo<T>, improved several microbenchmarks in the Mono Interpreter. The change "Improve ConcurrentDictionary performance for strings" (#81557) led to the improvements reported in dotnet/perf-autofiling-issues#13003. Code refactoring also led to several improvements, presented in dotnet/perf-autofiling-issues#12301.

Regressions

Here is a list of the top 20 microbenchmark regressions in Preview 2. Full report available here.

Name	Baseline Value	Compare Value	Difference	% Difference
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark	0.06 ns	3.10 ns	3.04 ns	-5,059%
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.CountBenchmark	0.36 ns	1.75 ns	1.39 ns	-391%
System.Collections.TryAddDefaultSize<String>.ConcurrentDictionary(Count: 512)	297.96 μs	574.34 μs	276.38 μs	-93%
System.Numerics.Tests.Perf_Vector2.UnitYBenchmark	7.38 ns	13.69 ns	6.31 ns	-85%
HardwareIntrinsics.RayTracer.SoA.Render	2.41 ns	4.38 ns	1.97 ns	-82%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix3x2Benchmark	48.06 ns	86.28 ns	38.22 ns	-80%
System.IO.Compression.Brotli.Compress_WithoutState(level: Fastest, file: "TestDocument.pdf")	291.36 μs	522.83 μs	231.47 μs	-79%
System.IO.Compression.Brotli.Compress_WithState(level: Fastest, file: "TestDocument.pdf")	296.93 μs	525.99 μs	229.06 μs	-77%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix3x2Benchmark	44.65 ns	75.61 ns	30.96 ns	-69%
System.Memory.Constructors_ValueTypesOnly<Byte>.ReadOnlyFromPointerLength	6.33 ns	10.49 ns	4.16 ns	-66%
PerfLabTests.EnumPerf.ObjectGetTypeNoBoxing	3.87 ns	6.20 ns	2.32 ns	-60%
System.Numerics.Tests.Perf_Vector3.SquareRootBenchmark	23.34 ns	37.02 ns	13.68 ns	-59%
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark	124.53 ns	196.66 ns	72.12 ns	-58%
System.Diagnostics.Perf_Process.StartAndWaitForExit	871.51 μs	1.35 ms	474.57 μs	-54%
System.Numerics.Tests.Perf_Vector3.TransformByMatrix4x4Benchmark	144.68 ns	217.99 ns	73.31 ns	-51%
System.Collections.AddGivenSize<String>.List(Size: 512)	12.21 μs	18.32 μs	6.11 μs	-50%
System.IO.Tests.BinaryWriterExtendedTests.WriteAsciiCharArray(StringLengthInChars: 2000000)	8.14 ms	12.20 ms	4.06 ms	-50%
System.Numerics.Tests.Perf_VectorOf<Int32>.ZeroBenchmark	3.20 ns	4.80 ns	1.59 ns	-50%
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True)	5.73 μs	8.56 μs	2.83 μs	-49%
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True)	5.62 μs	8.37 μs	2.75 μs	-49%

This report focuses on relevant regressions that are either in progress or under investigation; they are tracked separately. Reports #77490 and #79288 track active speed and size regressions, respectively.

Here is a list of ongoing regressions in the Preview 2 snapshot, each with a short description.

Issue report	Description
dotnet/perf-autofiling-issues#12707	Use of unimplemented Vector operations
dotnet/perf-autofiling-issues#13747	Intrinsified common `Vector128` operations

Preview 1

This report presents an overview of the major performance improvements and regressions in the Mono Interpreter in .NET 8 Preview 1.
A full benchmark report will be available in a form similar to #79245 and https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7/.

Setup

According to https://github.com/dotnet/perf-autofiling-issues, the following configurations are used.

Operating System	Bit	Processor Name
macOS 13.0	Arm64	Apple M1
ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz

More details on .NET performance benchmarking are available at https://github.com/dotnet/performance.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 1.

Name	Baseline Value	Compare Value	Difference	% Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanAnyBenchmark	292.17 ns	18.88 ns	-273.29 ns	94%
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanOrEqualAnyBenchmark	298.08 ns	20.47 ns	-277.61 ns	93%
System.Numerics.Tests.Perf_VectorOf<SByte>.LessThanOrEqualAnyBenchmark	294.38 ns	20.33 ns	-274.05 ns	93%
System.Numerics.Tests.Perf_VectorOf<SByte>.LessThanAnyBenchmark	298.45 ns	20.63 ns	-277.82 ns	93%
System.Numerics.Tests.Perf_VectorOf<Byte>.GreaterThanOrEqualAllBenchmark	331.73 ns	24.25 ns	-307.48 ns	93%
System.Numerics.Tests.Perf_VectorOf<UInt16>.GreaterThanOrEqualAllBenchmark	218.05 ns	20.58 ns	-197.47 ns	91%
System.Numerics.Tests.Perf_VectorOf<Int16>.GreaterThanAllBenchmark	209.57 ns	20.48 ns	-189.08 ns	90%
System.Numerics.Tests.Perf_VectorOf<Int16>.GreaterThanOrEqualAllBenchmark	231.47 ns	23.03 ns	-208.44 ns	90%
System.Numerics.Tests.Perf_VectorOf<Int16>.LessThanOrEqualAnyBenchmark	188.87 ns	20.02 ns	-168.84 ns	89%
System.Numerics.Tests.Perf_VectorOf<Int16>.LessThanAnyBenchmark	186.21 ns	20.05 ns	-166.16 ns	89%
System.Numerics.Tests.Perf_VectorOf<UInt16>.LessThanOrEqualAnyBenchmark	189.87 ns	20.76 ns	-169.11 ns	89%
System.Numerics.Tests.Perf_VectorOf<UInt16>.LessThanAnyBenchmark	186.54 ns	21.38 ns	-165.15 ns	89%
System.Memory.Span<Byte>.IndexOfAnyFourValues(Size: 512)	11.82 μs	1.60 μs	-10.23 μs	87%
System.Memory.Span<Byte>.IndexOfAnyFiveValues(Size: 512)	14.32 μs	2.42 μs	-11.90 μs	83%
System.Numerics.Tests.Perf_VectorOf<Int32>.GreaterThanAllBenchmark	120.71 ns	20.59 ns	-100.11 ns	83%
System.Numerics.Tests.Perf_VectorOf<UInt32>.GreaterThanAllBenchmark	124.72 ns	21.39 ns	-103.32 ns	83%
System.Numerics.Tests.Perf_VectorOf<Single>.GreaterThanOrEqualAllBenchmark	136.11 ns	24.20 ns	-111.91 ns	82%
System.Numerics.Tests.Perf_VectorOf<Single>.GreaterThanAllBenchmark	128.50 ns	24.30 ns	-104.20 ns	81%
System.Numerics.Tests.Perf_VectorOf<UInt64>.GreaterThanAllBenchmark	105.81 ns	20.48 ns	-85.33 ns	81%
System.Numerics.Tests.Perf_VectorOf<Int64>.GreaterThanAllBenchmark	105.16 ns	20.57 ns	-84.60 ns	80%

Preview 1 introduced too many improvements to call out individually. The following section presents only major improvements with high-level analysis.
The analysis should be treated with caution, and readers are encouraged to examine the benchmark reports for a thorough analysis.

The most improved benchmark groupings are System.Runtime.Vectors, System.Runtime.Intrinsics, and System.Collections, as outlined here and in dotnet/perf-autofiling-issues#10468.
Adding a stobj.vt.noref version for the no-reference case, which is twice as fast as stobj.vt, improved over 400 microbenchmarks, as outlined in dotnet/perf-autofiling-issues#10468 and dotnet/perf-autofiling-issues#10464.

SpanHelpers are widely used in the BCL, so improvements related to them can significantly improve performance. Changes in 200a90a, 7fa0d5b, and c0447bc removed Mono-specific SpanHelpers, replaced branch patterns with super-instructions, and improved detection of dead basic blocks. Over 300 microbenchmarks improved, as outlined in dotnet/perf-autofiling-issues#10989 and dotnet/perf-autofiling-issues#11155.
Change #77331 simplified the getitem.span opcode and avoided the typical use of ldloca with it, which improved over 50 microbenchmarks.

Allowing vtypes with a single scalar field to be passed to native code using the faster code path improved the System.Text and System.Collections benchmark groupings, as outlined in dotnet/perf-autofiling-issues#10987 and dotnet/perf-autofiling-issues#10938. The assumption is that those libraries rely on ObjectHandleOnStack types.

The newstr intrinsic for string allocation, added in #79392, improved various microbenchmarks, as outlined in dotnet/perf-autofiling-issues#10694 and dotnet/perf-autofiling-issues#10670.

9a65109 contributed to dotnet/perf-autofiling-issues#10695 and dotnet/perf-autofiling-issues#10671.

We encourage readers to examine the benchmark reports and to call out major improvements not mentioned in this section.

All of the changes mentioned above are speed improvements of microbenchmarks. There was also a significant size improvement in WebAssembly: #79672 reduced size on disk (SOD) of the Blazor template application by ~270 KB by trimming the S.N.Vector class in non-SIMD cases. Deduplication of symbols in WebAssembly achieves additional size savings.

Regressions

Here is a list of the top 20 microbenchmark regressions in Preview 1.

Name	Baseline Value	Compare Value	Difference	% Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.CountBenchmark	0.10 ns	1.10 ns	1.00 ns	-969%
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql	11.63 μs	101.96 μs	90.33 μs	-777%
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58l", ol	1.30 μs	8.82 μs	7.52 μs	-578%
System.Tests.Perf_Byte.ToString(value: 255)	38.31 ns	257.96 ns	219.65 ns	-573%
System.Tests.Perf_String.Replace_String(text: "This is a very nice sentence. This is another very nice sentence.", oldValue: "a", newValue: "b")	962.59 ns	6.30 μs	5335.40 ns	-554%
PerfLabTests.LowLevelPerf.IntegerFormatting	6.08 ms	34.30 ms	28.21 ms	-464%
System.Tests.Perf_Int32.ToString(value: 2147483647)	59.17 ns	332.19 ns	273.01 ns	-461%
System.Tests.Perf_Int16.ToString(value: 32767)	53.24 ns	297.84 ns	244.60 ns	-459%
System.Tests.Perf_Int32.ToString(value: 12345)	52.90 ns	293.56 ns	240.66 ns	-455%
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'i', newChar: 'I')	531.46 ns	2.89 μs	2355.30 ns	-443%
System.Tests.Perf_SByte.ToString(value: 127)	52.62 ns	276.41 ns	223.79 ns	-425%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix4x4Benchmark	21.70 ns	108.97 ns	87.28 ns	-402%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix4x4Benchmark	26.37 ns	114.02 ns	87.65 ns	-332%
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByMatrixOperatorBenchmark	246.08 ns	1.04 μs	797.11 ns	-324%
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByMatrixBenchmark	243.24 ns	1.02 μs	779.98 ns	-321%
System.Tests.Perf_Byte.ToString(value: 0)	7.06 ns	27.18 ns	20.11 ns	-285%
System.Numerics.Tests.Perf_Matrix4x4.CreateTranslationFromScalarXYZ	25.27 ns	91.61 ns	66.34 ns	-263%
System.Numerics.Tests.Perf_Matrix4x4.AddBenchmark	90.93 ns	304.20 ns	213.27 ns	-235%
System.Numerics.Tests.Perf_Matrix4x4.LerpBenchmark	141.51 ns	443.45 ns	301.94 ns	-213%
System.Numerics.Tests.Perf_Matrix4x4.SubtractOperatorBenchmark	100.31 ns	307.60 ns	207.29 ns	-207%

This report focuses on relevant regressions that are either in progress or under investigation; they are tracked separately. Reports #77490 and #79288 track active speed and size regressions, respectively.

Here is a list of ongoing regressions in the Preview 1 snapshot, each with a short description.

Issue report	Description
dotnet/perf-autofiling-issues#12299	Extracted code outside of interp main loop
dotnet/perf-autofiling-issues#11449	Investigating
dotnet/perf-autofiling-issues#11453	Redundant `ldloca` and `stfld` opcodes in the new `Matrix4x4` implementation
dotnet/perf-autofiling-issues#11147	New ASCII APIs
#79973	Dependencies update
#79336	Managed implementation of UInt32ToDecStr
#79876	Unoptimized pattern `ldstr; if (uncommon) throw ex (string)`

Author:	kotlarmilos
Assignees:	kotlarmilos
Labels:	`area-System.Numerics`, `tenet-performance`, `tenet-performance-benchmarks`, `tracking`
Milestone:	Future

kotlarmilos added the tenet-performance, tenet-performance-benchmarks, and tracking labels Apr 4, 2023

kotlarmilos added this to the Future milestone Apr 4, 2023

kotlarmilos self-assigned this Apr 4, 2023

dotnet-issue-labeler bot added the area-System.Numerics label Apr 4, 2023

kotlarmilos removed the area-System.Numerics label Apr 4, 2023

SamMonoRT added the area-Codegen-AOT-mono label Apr 4, 2023

kotlarmilos modified the milestones: Future, 8.0.0 May 22, 2023

kotlarmilos assigned LeVladIonescu Jun 2, 2023

kotlarmilos changed the title ~~.NET 8 Per-Preview Performance report on Mono AOT and Interpreter~~ .NET 8 Per-Preview Performance report on WASM, Mono AOT, and Interpreter Jul 12, 2023

kotlarmilos modified the milestones: 8.0.0, 9.0.0 Aug 11, 2023

SamMonoRT assigned matouskozak and unassigned LeVladIonescu Oct 11, 2023

kotlarmilos closed this as completed Dec 8, 2023

github-actions bot locked and limited conversation to collaborators Jan 8, 2024
