Improve a vector implementation to support alignment and non-temporal tores #93296

tannergooding · 2023-10-10T16:32:50Z

No description provided.

ghost · 2023-10-10T16:33:03Z

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author:	tannergooding
Assignees:	tannergooding
Labels:	`area-System.Numerics`
Milestone:	-

… stores

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netcore.cs

EgorBo · 2023-10-10T17:24:12Z

If a function is idempotent (which most of them are) it's enough a single load to align data (like I did in https://github.com/dotnet/runtime/pull/93214/files) without any unrolled loops, etc - presumably, you can expose a bool IsIdempotent {get;} property in these abstraction and only align those for simplicity

…dard

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netcore.cs

tannergooding · 2023-10-11T17:12:27Z

Here are the before/after results for Abs, where Abs1 is the previous implementation and Abs2 is the implementation from this PR. We are faster across the board (except for 16 elements) and more than 2x faster for large inputs, with the gains really showing past 256 elements (1024 bytes).

Method	TensorLength	Mean	Error	StdDev
Abs1	1	2.609 ns	0.0611 ns	0.0572 ns
Abs2	1	2.562 ns	0.0518 ns	0.0485 ns
Abs1	2	3.142 ns	0.0487 ns	0.0456 ns
Abs2	2	2.563 ns	0.0121 ns	0.0101 ns
Abs1	3	3.309 ns	0.0294 ns	0.0229 ns
Abs2	3	2.742 ns	0.0171 ns	0.0133 ns
Abs1	4	3.090 ns	0.0496 ns	0.0464 ns
Abs2	4	2.572 ns	0.0187 ns	0.0166 ns
Abs1	5	5.198 ns	0.0190 ns	0.0148 ns
Abs2	5	3.133 ns	0.0788 ns	0.0774 ns
Abs1	6	5.275 ns	0.0790 ns	0.0739 ns
Abs2	6	3.098 ns	0.0782 ns	0.0732 ns
Abs1	7	5.322 ns	0.0916 ns	0.0857 ns
Abs2	7	3.164 ns	0.0544 ns	0.0509 ns
Abs1	8	2.936 ns	0.0553 ns	0.0517 ns
Abs2	8	2.573 ns	0.0442 ns	0.0414 ns
Abs1	9	5.454 ns	0.0910 ns	0.0852 ns
Abs2	9	3.165 ns	0.0536 ns	0.0501 ns
Abs1	10	5.431 ns	0.0675 ns	0.0631 ns
Abs2	10	3.230 ns	0.0308 ns	0.0288 ns
Abs1	11	5.369 ns	0.0447 ns	0.0418 ns
Abs2	11	3.199 ns	0.0443 ns	0.0393 ns
Abs1	12	5.571 ns	0.0593 ns	0.0555 ns
Abs2	12	3.211 ns	0.0320 ns	0.0299 ns
Abs1	13	5.585 ns	0.0491 ns	0.0459 ns
Abs2	13	3.262 ns	0.0295 ns	0.0276 ns
Abs1	14	5.766 ns	0.0550 ns	0.0514 ns
Abs2	14	3.264 ns	0.0292 ns	0.0273 ns
Abs1	15	5.865 ns	0.0454 ns	0.0354 ns
Abs2	15	3.254 ns	0.0451 ns	0.0377 ns
Abs1	16	2.790 ns	0.0207 ns	0.0193 ns
Abs2	16	4.984 ns	0.0479 ns	0.0448 ns
Abs1	32	6.515 ns	0.0886 ns	0.0785 ns
Abs2	32	6.012 ns	0.1203 ns	0.1067 ns
Abs1	64	17.126 ns	0.1647 ns	0.1540 ns
Abs2	64	15.343 ns	0.3172 ns	0.3115 ns
Abs1	128	38.650 ns	0.7464 ns	0.6982 ns
Abs2	128	37.150 ns	0.4134 ns	0.3665 ns
Abs1	256	82.113 ns	0.5631 ns	0.4702 ns
Abs2	256	51.379 ns	0.2897 ns	0.2710 ns
Abs1	512	169.559 ns	0.7197 ns	0.6732 ns
Abs2	512	78.668 ns	0.4391 ns	0.4107 ns
Abs1	1024	336.791 ns	1.9587 ns	1.8321 ns
Abs2	1024	133.500 ns	0.7219 ns	0.6400 ns
Abs1	2048	698.482 ns	1.3343 ns	1.1142 ns
Abs2	2048	244.729 ns	1.6655 ns	1.5579 ns
Abs1	4096	1,411.587 ns	5.8918 ns	5.2229 ns
Abs2	4096	471.249 ns	1.2273 ns	1.0880 ns
Abs1	65536	22,871.118 ns	97.4602 ns	91.1643 ns
Abs2	65536	7,306.615 ns	32.3811 ns	30.2893 ns
Abs1	131072	45,773.050 ns	244.2154 ns	228.4393 ns
Abs2	131072	18,461.823 ns	105.6972 ns	93.6978 ns

tannergooding · 2023-10-11T17:41:15Z

Similar results for Exp

Method	TensorLength	Mean	Error	StdDev
Exp1	1	5.224 ns	0.1122 ns	0.1050 ns
Exp2	1	4.863 ns	0.0823 ns	0.0770 ns
Exp1	2	7.359 ns	0.1578 ns	0.1689 ns
Exp2	2	6.872 ns	0.0975 ns	0.0912 ns
Exp1	3	9.315 ns	0.1091 ns	0.0852 ns
Exp2	3	9.018 ns	0.1643 ns	0.1537 ns
Exp1	4	4.820 ns	0.1094 ns	0.1075 ns
Exp2	4	4.863 ns	0.1028 ns	0.0962 ns
Exp1	5	9.103 ns	0.0672 ns	0.0629 ns
Exp2	5	6.563 ns	0.1169 ns	0.1036 ns
Exp1	6	9.280 ns	0.2022 ns	0.2076 ns
Exp2	6	6.685 ns	0.1498 ns	0.1538 ns
Exp1	7	9.111 ns	0.0848 ns	0.0793 ns
Exp2	7	6.660 ns	0.1413 ns	0.1321 ns
Exp1	8	5.006 ns	0.0963 ns	0.0901 ns
Exp2	8	4.848 ns	0.1006 ns	0.0941 ns
Exp1	9	10.272 ns	0.1189 ns	0.1112 ns
Exp2	9	7.593 ns	0.0876 ns	0.0820 ns
Exp1	10	10.188 ns	0.0713 ns	0.0596 ns
Exp2	10	7.581 ns	0.0935 ns	0.0875 ns
Exp1	11	10.274 ns	0.1070 ns	0.0949 ns
Exp2	11	7.631 ns	0.1412 ns	0.1320 ns
Exp1	12	10.246 ns	0.0755 ns	0.0670 ns
Exp2	12	7.555 ns	0.0742 ns	0.0658 ns
Exp1	13	10.270 ns	0.1266 ns	0.1184 ns
Exp2	13	7.651 ns	0.0916 ns	0.0857 ns
Exp1	14	10.252 ns	0.1509 ns	0.1411 ns
Exp2	14	7.585 ns	0.1017 ns	0.0951 ns
Exp1	15	10.428 ns	0.0935 ns	0.0875 ns
Exp2	15	7.795 ns	0.0801 ns	0.0749 ns
Exp1	16	7.271 ns	0.0845 ns	0.0749 ns
Exp2	16	19.323 ns	0.3291 ns	0.3078 ns
Exp1	32	18.716 ns	0.2705 ns	0.2530 ns
Exp2	32	25.631 ns	0.3860 ns	0.3610 ns
Exp1	64	48.860 ns	0.4013 ns	0.3753 ns
Exp2	64	50.285 ns	0.9766 ns	0.9135 ns
Exp1	128	109.969 ns	1.6557 ns	1.5487 ns
Exp2	128	110.952 ns	1.4802 ns	1.3846 ns
Exp1	256	230.547 ns	2.4931 ns	2.0819 ns
Exp2	256	193.106 ns	0.8509 ns	0.7105 ns
Exp1	512	477.460 ns	5.6110 ns	5.2486 ns
Exp2	512	317.518 ns	5.5631 ns	5.2037 ns
Exp1	1024	977.141 ns	13.0740 ns	12.2295 ns
Exp2	1024	573.851 ns	7.7362 ns	7.2364 ns
Exp1	2048	1,931.447 ns	12.4202 ns	10.3714 ns
Exp2	2048	1,017.003 ns	19.3625 ns	20.7177 ns
Exp1	4096	3,892.225 ns	27.1610 ns	22.6807 ns
Exp2	4096	1,975.208 ns	26.4692 ns	23.4642 ns
Exp1	65536	61,938.616 ns	354.4827 ns	296.0091 ns
Exp2	65536	32,390.331 ns	590.1723 ns	552.0475 ns
Exp1	131072	125,798.377 ns	2,260.7971 ns	2,114.7510 ns
Exp2	131072	61,126.610 ns	1,091.3727 ns	1,020.8707 ns

stephentoub · 2023-10-12T17:16:46Z

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netcore.cs

@@ -12,6 +12,8 @@ namespace System.Numerics.Tensors
 {
    public static partial class TensorPrimitives
    {
+        private const nuint NonTemporalByteThreshold = 256 * 1024;


Can you add a comment about what this is and how the value was chosen?

Suggested change

private const nuint NonTemporalByteThreshold = 256 * 1024;

/// <summary>Defines the threshold, in bytes, at which non-temporal stores will be used.</summary>

/// <remarks>

/// A non-temporal store is one that allows the CPU to bypass the cache when writing to memory.

///

/// This can be beneficial when working with large amounts of memory where the writes would otherwise

/// cause large amounts of repeated updates and evictions. The hardware optimization manuals recommend

/// the threshold to be roughly half the size of the last level of on-die cache -- that is, if you have approximately

/// 4MB of L3 cache per core, you'd want this to be approx. 1-2MB, depending on if hyperthreading was enabled.

///

/// However, actually computing the amount of L3 cache per core can be tricky or error prone. Native memcpy

/// algorithms use a constant threshold that is typically around 256KB and we match that here for simplicity. This

/// threshold accounts for most processors in the last 10-15 years that had approx. 1MB L3 per core and support

/// hyperthreading, giving a per core last level cache of approx. 512KB.

/// </remarks>

private const nuint NonTemporalByteThreshold = 256 * 1024;

stephentoub · 2023-10-12T17:24:15Z

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netcore.cs

+                        if (canAlign)
+                        {
+                            // Compute by how many elements we're misaligned and adjust the pointers accordingly
+                            //
+                            // Noting that we are only actually aligning dPtr. THis is because unaligned stores
+                            // are more expensive than unaligned loads and aligning both is significantly more
+                            // complex.
+
+                            nuint misalignment = ((uint)(sizeof(Vector128<float>)) - ((nuint)(dPtr) % (uint)(sizeof(Vector128<float>)))) / sizeof(float);
+
+                            xPtr += misalignment;
+                            dPtr += misalignment;
+
+                            Debug.Assert(((nuint)(dPtr) % (uint)(sizeof(Vector128<float>))) == 0);
+
+                            remainder -= misalignment;
+                        }
+
+                        Vector128<float> vector1;
+                        Vector128<float> vector2;
+                        Vector128<float> vector3;
+                        Vector128<float> vector4;
+
+                        if (canAlign && (remainder > (NonTemporalByteThreshold / sizeof(float))))


We have two checks on canAlign here. Would it help to eliminate a branch if this were instead structured as:

if (canAlign) { ... // do alignment if (remainder > (NonTemporalByteThreshold / sizeof(float))) { ... // handle non-temporal path goto AdjustingRefs; } } ... // what's currently in the else block AdjustingRefs: ... // stuff currently after the else block

?

I don't think the additional complexity is worth it here and it can cause subtle issues with control flow analysis that are probably undesirable.

If we were concerned, I'd prefer switching the order so its if ((remainder > (NonTemporalByteThreshold / sizeof(float)) && canAlign) instead.

stephentoub · 2023-10-12T17:29:31Z

src/libraries/System.Numerics.Tensors/tests/TensorPrimitivesTests.cs

@@ -20,7 +20,7 @@ public static partial class TensorPrimitivesTests
            TensorLengths.Concat(new object[][] { [0] });

        public static IEnumerable<object[]> TensorLengths =>
-            from length in Enumerable.Range(1, 128)
+            from length in Enumerable.Range(1, 256)


This is good, but it's also going to double the number of test cases we're running. Not a big deal except that on netfx it seems to struggle with the theory count. We should probably subsequently change some of those theories to be loops, e.g. instead of:

[Theory] [MemberData(nameof(TensorLengths))] public void Foo(int tensorLength) { ... }

do:

[Fact] public void Foo() { foreach (int tensorLength in TensorLengths) { ... } }

I'll handle this in the immediately following PR covering binary/ternary.

lewing · 2023-10-12T18:41:52Z

This appears to have broken the outerloop tests

tannergooding · 2023-10-12T18:45:04Z

@lewing is there an issue or a link that can be shared so it can be investigated?

lewing · 2023-10-12T18:47:57Z

https://github.com/dotnet/runtime/runs/17652611530
https://github.com/dotnet/runtime/pull/93407/checks?check_run_id=17652687038
etc

tannergooding · 2023-10-12T19:02:26Z

#93412 fixes the issue.

Another PR went in that removed unsafe and it didn't conflict, so the nothing caught the issue.

… tores (dotnet#93296) * Improve a vector implementation to support alignment and non-temporal stores * Fix a build error and mark a couple methods as AggressiveInlining * Fix the remaining block count computation * Ensure overlapping for small data on the V256/512 is handled * Ensure we only go down the vectorized path when supported for netstandard

* Use FMA in TensorPrimitives (#92205) * Simplify TensorPrimitive's AbsoluteOperator (#92577) Vector{128/256/512} all provide Abs; no need to do this manually. * Reduce some boilerplate in TensorPrimitive's IBinaryOperator (#92576) Change a few of the static abstract interface methods to be virtual, as most implementations throw from these methods; we can consolidate that throwing to the base. * Minor code cleanup in TensorPrimitives tests (#92575) * Normalize some test naming * Alphabetize tests * Improve mistmatched length tests with all positions of the shorter tensor * Alphabetize methods in TensorPrimitives.cs * Vectorize TensorPrimitives.Min/Max{Magnitude} (#92618) * Vectorize TensorPrimitives.Min/Max{Magnitude} * Use AdvSimd.Max/Min * Rename some parameters/locals for consistency * Improve HorizontalAggregate * Move a few helpers * Avoid scalar path for returning found NaN * Update TensorPrimitives aggregations to vectorize handling of remaining elements (#92672) * Update TensorPrimitives.CosineSimilarity to vectorize handling of remaining elements * Vectorize remainder handling for Aggregate helpers * Flesh out TensorPrimitives XML docs (#92749) * Flesh out TensorPrimitives XML docs * Address PR feedback - Remove use of FusedMultiplyAdd from all but CosineSimilarity - Remove comments about platform/OS-specific behavior from Add/AddMultiply/Subtract/Multiply/MultiplyAdd/Divide/Negate - Loosen comments about NaN and which exact one is returned * Address PR feedback * Vectorize TensorPrimitives.ConvertToHalf (#92715) * Enable TensorPrimitives to perform in-place operations (#92820) Some operations would produce incorrect results if the same span was passed as both an input and an output. When vectorization was employed but the span's length wasn't a perfect multiple of a vector, we'd do the standard trick of performing one last operation on the last vector's worth of data; however, that relies on the operation being idempotent, and if a previous operation has overwritten input with a new value due to the same memory being used for input and output, some operations won't be idempotent. This fixes that by masking off the already processed elements. It adds tests to validate in-place use works, and it updates the docs to carve out this valid overlapping. * Vectorize TensorPrimitives.ConvertToSingle (#92779) * Vectorize TensorPrimitives.ConvertToSingle * Address PR feedback * Throw exception in TensorPrimitives for unsupported span overlaps (#92838) * This vectorizes TensorPrimitives.Log2 (#92897) * Add a way to support operations that can't be vectorized on netstandard * Updating TensorPrimitives.Log2 to be vectorized on .NET Core * Update src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netstandard.cs Co-authored-by: Stephen Toub <[email protected]> * Ensure we do an arithmetic right shift in the Log2 vectorization * Ensure the code can compile on .NET 7 * Ensure that edge cases are properly handled and don't resolve to `x` * Ensure that Log2 special results are explicitly handled. --------- Co-authored-by: Stephen Toub <[email protected]> * Adding Log2 tests covering some special values (#92946) * [wasm] Disable `TensorPrimitivesTests.ConvertToHalf_SpecialValues` (#92953) Failing test: `System.Numerics.Tensors.Tests.TensorPrimitivesTests.ConvertToHalf_SpecialValues` Issue: #92885 * Adding a vectorized implementation of TensorPrimitives.Log (#92960) * Adding a vectorized implementation of TensorPrimitives.Log * Make sure to hit Ctrl+S * Consolidate some TensorPrimitivesTests logic around special values (#92982) * Vectorize TensorPrimitives.Exp (#93018) * Vectorize TensorPrimitives.Exp * Update src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netstandard.cs * Vectorize TensorPrimitives.Sigmoid and TensorPrimitives.SoftMax (#93029) * Vectorize TensorPrimitives.Sigmoid and TensorPrimitives.SoftMax - Adds a SigmoidOperator that just wraps the ExpOperator - Vectorizes both passes of SoftMax, on top of ExpOperator. Simplest way to do this was to augment the existing InvokeSpanScalarIntoSpan to take a transform operator. - In doing so, found some naming inconsistencies I'd previously introduced, so I did some automatic renaming to make things more consistent. - Added XML comments to all the internal/private surface area. - Fleshes out some tests (and test values). * Disable tests on mono * Address PR feedback * Vectorize TensorPrimitives.Tanh/Cosh/Sinh (#93093) * Vectorize TensorPrimitives.Tanh/Cosh/Sinh Tanh and Cosh are based on AOCL-LibM. AOCL-LibM doesn't appear to have a sinh implementation, so this Sinh is just based on the sinh formula based on exp(x). I also augmented the tests further, including: - Added more tests for sinh/cosh/tanh - Add an equality routine that supports comparing larger values with a tolerance - Tightened the tolerance for most functions - Changed some tests to be theories to be consistent with style elsewhere in the tests - Fixed some use of Math to be MathF * Remove unnecessary special-handling path from cosh * Remove unnecessary special-handling path from tanh * Redo sinh based on cosh * Address PR feedback * Replace confusing new T[] { ... } * Remove a few unnecessary `unsafe` keyword uses in TensorPrimitives (#93219) * Consolidate a few exception throws in TensorPrimitives (#93168) * Fix TensorPrimitives.IndexOfXx corner-case when first element is seed value (#93169) * Fix TensorPrimitives.IndexOfXx corner-case when first element is seed value Found as part of adding more tests for Min/Max{Magnitude} to validate they match their IndexOfXx variants. * Address PR feedback * Improve a vector implementation to support alignment and non-temporal tores (#93296) * Improve a vector implementation to support alignment and non-temporal stores * Fix a build error and mark a couple methods as AggressiveInlining * Fix the remaining block count computation * Ensure overlapping for small data on the V256/512 is handled * Ensure we only go down the vectorized path when supported for netstandard * Mark TensorPrimitives as unsafe (#93412) * Use the improved vectorization algorithm for binary and ternary TensorPrimitives operations (#93409) * Update InvokeSpanSpanIntoSpan<TBinaryOperator> for TensorPrimitives to use the better SIMD algorithm * Update InvokeSpanScalarIntoSpan<TTransformOperator, TBinaryOperator> for TensorPrimitives to use the better SIMD algorithm * Update InvokeSpanSpanSpanIntoSpan<TTernaryOperator> for TensorPrimitives to use the better SIMD algorithm * Update InvokeSpanSpanScalarIntoSpan<TTernaryOperator> for TensorPrimitives to use the better SIMD algorithm * Update InvokeSpanScalarSpanIntoSpan<TTernaryOperator> for TensorPrimitives to use the better SIMD algorithm * Improve codegen slightly by using case 0, rather than default * Adjust the canAlign check to be latter, to reduce branch count for data under the threshold * Add a comment explaining the NonTemporalByteThreshold * Make sure xTransformOp.CanVectorize is checked on .NET Standard * Use the improved vectorization algorithm for aggregate TensorPrimitives operations (#93695) * Improve the handling of the IAggregationOperator implementations * Update Aggregate<TTransformOperator, TAggregationOperator> for TensorPrimitives to use the better SIMD algorithm * Update Aggregate<TBinaryOperator, TAggregationOperator> for TensorPrimitives to use the better SIMD algorithm * Respond to PR feedback * [wasm] Remove more active issues for #92885 (#93596) * adding patch from pr 93556 * Vectorizes IndexOfMin/Max/Magnitude (#93469) * resolved merge conflicts * net core full done * minor code cleanup * NetStandard and PR fixes. * minor pr changes * Fix IndexOfMaxMagnitudeOperator * Fix IndexOfMaxMagnitudeOperator on netcore * updates from PR comments * netcore fixed * net standard updated * add reference assembly exclusions * made naive approach better * resolved PR comments * minor comment changes * minor formatting fixes * added inlining * fixes from PR comments * comments from pr * fixed spacing --------- Co-authored-by: Eric StJohn <[email protected]> --------- Co-authored-by: Stephen Toub <[email protected]> Co-authored-by: Tanner Gooding <[email protected]> Co-authored-by: Ankit Jain <[email protected]> Co-authored-by: Radek Doulik <[email protected]> Co-authored-by: Eric StJohn <[email protected]>

ghost assigned tannergooding Oct 10, 2023

dotnet-issue-labeler bot added the area-System.Numerics label Oct 10, 2023

tannergooding force-pushed the vectorize-align branch from e948b16 to 51da7cb Compare October 10, 2023 16:33

Improve a vector implementation to support alignment and non-temporal…

0547c8a

… stores

tannergooding force-pushed the vectorize-align branch from 51da7cb to 0547c8a Compare October 10, 2023 16:41

stephentoub reviewed Oct 10, 2023

View reviewed changes

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netcore.cs Outdated Show resolved Hide resolved

stephentoub reviewed Oct 10, 2023

View reviewed changes

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netcore.cs Show resolved Hide resolved

stephentoub reviewed Oct 10, 2023

View reviewed changes

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netcore.cs Show resolved Hide resolved

This was referenced Oct 10, 2023

[7.0][mono][arm64] futex-internal.h: No such file or directory - futex_wait_cancelable native crash in Microsoft.Extensions.Logging.Generators.Roslyn3.11.Tests #87444

Closed

LibraryImportGenerator unit test segfault illegal memory access #89054

Closed

tannergooding added 4 commits October 10, 2023 12:08

Fix a build error and mark a couple methods as AggressiveInlining

208a8df

Fix the remaining block count computation

94caf6e

Ensure overlapping for small data on the V256/512 is handled

20b3e01

Ensure we only go down the vectorized path when supported for netstan…

2314c0c

…dard

tannergooding marked this pull request as ready for review October 10, 2023 23:21

stephentoub reviewed Oct 11, 2023

View reviewed changes

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netcore.cs Show resolved Hide resolved

stephentoub reviewed Oct 11, 2023

View reviewed changes

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netcore.cs Show resolved Hide resolved

stephentoub reviewed Oct 12, 2023

View reviewed changes

stephentoub approved these changes Oct 12, 2023

View reviewed changes

tannergooding merged commit a451e68 into dotnet:main Oct 12, 2023
109 checks passed

tannergooding deleted the vectorize-align branch October 12, 2023 17:37

tannergooding mentioned this pull request Oct 12, 2023

Use the improved vectorization algorithm for binary and ternary TensorPrimitives operations #93409

Merged

ghost locked as resolved and limited conversation to collaborators Nov 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve a vector implementation to support alignment and non-temporal tores #93296

Improve a vector implementation to support alignment and non-temporal tores #93296

tannergooding commented Oct 10, 2023

ghost commented Oct 10, 2023

EgorBo commented Oct 10, 2023

tannergooding commented Oct 11, 2023 •

edited

Loading

tannergooding commented Oct 11, 2023

stephentoub Oct 12, 2023

tannergooding Oct 12, 2023

stephentoub Oct 12, 2023 •

edited

Loading

tannergooding Oct 12, 2023

stephentoub Oct 12, 2023

tannergooding Oct 12, 2023

lewing commented Oct 12, 2023

tannergooding commented Oct 12, 2023

lewing commented Oct 12, 2023 •

edited

Loading

tannergooding commented Oct 12, 2023

-        private const nuint NonTemporalByteThreshold = 256 * 1024;
+        /// <summary>Defines the threshold, in bytes, at which non-temporal stores will be used.</summary>
+        /// <remarks>
+        ///     A non-temporal store is one that allows the CPU to bypass the cache when writing to memory.
+        ///
+        ///     This can be beneficial when working with large amounts of memory where the writes would otherwise
+        ///     cause large amounts of repeated updates and evictions. The hardware optimization manuals recommend
+        ///     the threshold to be roughly half the size of the last level of on-die cache -- that is, if you have approximately
+        ///     4MB of L3 cache per core, you'd want this to be approx. 1-2MB, depending on if hyperthreading was enabled.
+        ///
+        ///     However, actually computing the amount of L3 cache per core can be tricky or error prone. Native memcpy
+        ///     algorithms use a constant threshold that is typically around 256KB and we match that here for simplicity. This
+        ///     threshold accounts for most processors in the last 10-15 years that had approx. 1MB L3 per core and support
+        ///     hyperthreading, giving a per core last level cache of approx. 512KB.
+        /// </remarks>
+        private const nuint NonTemporalByteThreshold = 256 * 1024;

Improve a vector implementation to support alignment and non-temporal tores #93296

Improve a vector implementation to support alignment and non-temporal tores #93296

Conversation

tannergooding commented Oct 10, 2023

ghost commented Oct 10, 2023

EgorBo commented Oct 10, 2023

tannergooding commented Oct 11, 2023 • edited Loading

tannergooding commented Oct 11, 2023

stephentoub Oct 12, 2023

Choose a reason for hiding this comment

tannergooding Oct 12, 2023

Choose a reason for hiding this comment

stephentoub Oct 12, 2023 • edited Loading

Choose a reason for hiding this comment

tannergooding Oct 12, 2023

Choose a reason for hiding this comment

stephentoub Oct 12, 2023

Choose a reason for hiding this comment

tannergooding Oct 12, 2023

Choose a reason for hiding this comment

lewing commented Oct 12, 2023

tannergooding commented Oct 12, 2023

lewing commented Oct 12, 2023 • edited Loading

tannergooding commented Oct 12, 2023

tannergooding commented Oct 11, 2023 •

edited

Loading

stephentoub Oct 12, 2023 •

edited

Loading

lewing commented Oct 12, 2023 •

edited

Loading