Introduce pause intrinsics in order to support spin wait loop indication #53532
Comments
Tagging subscribers to this area: @tannergooding
Issue Details
Proposed API (using the alphanumerical order of method names):
public abstract partial class Sse2 : System.Runtime.Intrinsics.X86.Sse
    public static System.Runtime.Intrinsics.Vector128<sbyte> PackSignedSaturate(System.Runtime.Intrinsics.Vector128<short> left, System.Runtime.Intrinsics.Vector128<short> right) { throw null; }
    public static System.Runtime.Intrinsics.Vector128<short> PackSignedSaturate(System.Runtime.Intrinsics.Vector128<int> left, System.Runtime.Intrinsics.Vector128<int> right) { throw null; }
    public static System.Runtime.Intrinsics.Vector128<byte> PackUnsignedSaturate(System.Runtime.Intrinsics.Vector128<short> left, System.Runtime.Intrinsics.Vector128<short> right) { throw null; }
+   public static void Pause() { }
    public static System.Runtime.Intrinsics.Vector128<short> ShiftLeftLogical(System.Runtime.Intrinsics.Vector128<short> value, byte count) { throw null; }
    public static System.Runtime.Intrinsics.Vector128<short> ShiftLeftLogical(System.Runtime.Intrinsics.Vector128<short> value, System.Runtime.Intrinsics.Vector128<short> count) { throw null; }
    public static System.Runtime.Intrinsics.Vector128<int> ShiftLeftLogical(System.Runtime.Intrinsics.Vector128<int> value, byte count) { throw null; }
This was briefly discussed back when intrinsics were first implemented/exposed and the recommendation from at the time was that we have more data/discussion on exposing Intrinsics are low-level and unsafe, but It would be good to get a weigh-in from @jkotas and @stephentoub on whether we feel this is useful enough to power users doing customized threading or other synchronization primitives to expose here.
We have been introducing hw intrinsics for many corner cases to support all sorts of micro-optimizations (that I wish would be just handled by the JIT transparently instead). I do not see a fundamental problem with adding an intrinsic for the pause instruction.
Should it be Also, we should add
According to Wikipedia [1], the PAUSE instruction itself was added as part of SSE2. However, it seems to be encoded in a clever way, so that it is a no-op (rep; nop) on older architectures. This is why I added it to Sse2 instead of X86Base. [1] https://en.wikipedia.org/wiki/MOV_(x86_instruction)#Added_with_SSE2
This one is strictly documented as being SSE2 in the architecture manuals. In practice it's everywhere, much like CPUID, but since we already have the Sse2 class, I think it should just go there.
Agreed. However, that raises the question: should this be an architecture-dependent intrinsic or something like a new
My copy of the manual says "This instruction was introduced in the Pentium 4 processors, but is backward compatible with all IA-32 processors.". I do not see any mention of SSE2 in the instruction manual page that describes this instruction.
We have the architecture-neutral version of this already:
It's listed here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#!=undefined&text=Pause&expand=4141, likely because the P4 is the CPU that SSE2 was introduced in. It probably doesn't matter too much given both Intel and AMD list this as being a "nop" on hardware that doesn't recognize it and so I think it might be reasonable to put it in X86Base.
I meant something that users could rely on being this intrinsic ( I don't have a preference either way, just trying to determine if that's something we want or are interested in. If not, then we can update the top post with the System.Runtime.Intrinsics proposal for
@zpodlovics, could you please update the top post to place After that we can mark this
@tannergooding @jkotas Thanks a lot for your comments. As suggested, I updated the proposal to use X86Base.Pause and ArmBase.Yield.
Looks good as proposed.
namespace System.Runtime.Intrinsics.Arm
{
    public abstract class ArmBase
    {
+       public static void Yield();
    }
}
namespace System.Runtime.Intrinsics.X86
{
    public abstract partial class X86Base
    {
+       public static void Pause();
    }
}
Background and Motivation
Some hardware platforms may benefit greatly from a software indication that a spin-wait loop is in progress.
Some common execution benefits may be observed:
The reaction time of a spin-wait loop construct may be improved when a spin-wait indication is used, due to various factors, reducing thread-to-thread latencies in spinning wait situations.
The power consumed by the core or hardware thread involved in the spin wait loop construct may be reduced, benefitting overall power consumption of a program, and possibly allowing other cores or hardware threads to execute at faster speeds within the same power consumption envelope.
As a practical example and use case, current x86 processors support a PAUSE instruction that can be used to indicate spinning behavior. Using a PAUSE instruction demonstrably reduces thread-to-thread round-trip latency. Due to its benefits and commonly recommended use, the x86 PAUSE instruction is commonly used in kernel spin locks, in POSIX libraries that perform heuristic spins prior to blocking, and even by .NET itself (mm_pause). However, because there is no way to indicate that a .NET loop is spinning, its benefits are not available to regular .NET code.
In the prototype the round-trip latencies were demonstrably reduced by ~29-69 nsec across a wide percentile spectrum (from the 10%'ile to the 99.9%'ile). This reduction can represent an improvement as high as ~30%-50% in best-case thread-to-thread communication latency.
Please note that, as with any other instruction, the latency of the PAUSE instruction may vary depending on the processor architecture [8][9].
Thanks to @jkotas's suggestion, a Yield intrinsic will also be provided on the ARM architecture at the same time, for parity.
Proposed API changes
(using the alphanumerical order of file names):
namespace System.Runtime.Intrinsics.Arm
{
    public abstract class ArmBase
    {
        public static uint ReverseElementBits(uint value);
+       public static void Yield();
    }
}
namespace System.Runtime.Intrinsics.X86
{
    public abstract partial class X86Base
    {
        public static unsafe (int Eax, int Ebx, int Ecx, int Edx) CpuId(int functionId, int subFunctionId);
+       public static void Pause();
    }
}
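A minimal sketch of how the proposed methods might be used in a spin-wait loop, assuming they are exposed as X86Base.Pause() and ArmBase.Yield() as listed above. The SpinHint class and its Relax and SpinUntilSet methods are hypothetical names used only for illustration, and the Thread.SpinWait(1) fallback for platforms where neither intrinsic is supported is an assumption, not part of the proposal:

```csharp
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;
using System.Threading;

// Hypothetical helper, not part of the proposed API surface.
public static class SpinHint
{
    // Emits one spin-wait hint: PAUSE on x86/x64, YIELD on Arm,
    // and Thread.SpinWait(1) as an assumed portable fallback elsewhere.
    public static void Relax()
    {
        if (X86Base.IsSupported)
        {
            X86Base.Pause();
        }
        else if (ArmBase.IsSupported)
        {
            ArmBase.Yield();
        }
        else
        {
            Thread.SpinWait(1);
        }
    }

    // Spins until the flag is observed as true or maxSpins hints have been issued.
    // Returns true if the flag was observed set, false if the spin budget ran out.
    public static bool SpinUntilSet(ref bool flag, int maxSpins)
    {
        for (int i = 0; i < maxSpins; i++)
        {
            if (Volatile.Read(ref flag))
            {
                return true;
            }
            Relax();
        }
        return Volatile.Read(ref flag);
    }
}
```

A caller would typically bound the spin this way and fall back to a blocking wait once SpinUntilSet returns false.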
Usage Examples
Efficient thread-to-thread communication is needed to implement highly performant (and often latency-sensitive) concurrent data structures and communication patterns. A simple set of thread-to-thread communication latency and throughput tests measures and reports the behavior of thread-to-thread ping-pong latencies when spinning on a shared volatile field, along with the impact of using the Stopwatch/BusySpin/SpinWait/Pause call on that latency behavior.
The test can be used to measure and document the impact of Sse2.Pause() on thread-to-thread communication latencies. E.g., when the two threads are pinned to the two hardware threads of a shared x86 core (with a shared L1), this test will provide an estimate of the best-case thread-to-thread latencies possible on the platform, if the latency of measuring time with Stopwatch.GetTimestamp() is discounted (GetTimestamp latency can be separately estimated across the percentile spectrum using the PauseIntrinsics.GetTimestamp.Benchmark.Cli test in the PauseIntrinsics project).
The thread-to-thread communication benchmark project (which measures the latency of the timestamp, busyspin, spinwait, and pause wait methods) is available at: https://github.com/zpodlovics/PauseIntrinsics
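The ping-pong measurement described above can be sketched roughly as follows. This is not the code from the PauseIntrinsics project: the PingPong class and its members are hypothetical, it assumes the intrinsic is available as X86Base.Pause() (with Thread.SpinWait(1) as an assumed fallback), it does not pin threads to cores, and it reports only a mean rather than a percentile spectrum:

```csharp
using System.Diagnostics;
using System.Runtime.Intrinsics.X86;
using System.Threading;

// Hypothetical illustration of a two-thread ping-pong over shared fields.
public static class PingPong
{
    private static long _ping; // written by the initiator, read by the responder
    private static long _pong; // written by the responder, read by the initiator

    // One spin-wait hint; falls back to Thread.SpinWait(1) off x86 (assumption).
    private static void Relax()
    {
        if (X86Base.IsSupported) X86Base.Pause();
        else Thread.SpinWait(1);
    }

    // Runs `rounds` round trips and returns the mean round-trip time in nanoseconds.
    // The cost of Stopwatch.GetTimestamp() itself is not subtracted here.
    public static double MeasureMeanRoundTripNs(int rounds)
    {
        Volatile.Write(ref _ping, 0);
        Volatile.Write(ref _pong, 0);

        var responder = new Thread(() =>
        {
            for (long i = 1; i <= rounds; i++)
            {
                // Wait for the initiator's i-th ping, then answer it.
                while (Volatile.Read(ref _ping) != i) Relax();
                Volatile.Write(ref _pong, i);
            }
        });
        responder.Start();

        long start = Stopwatch.GetTimestamp();
        for (long i = 1; i <= rounds; i++)
        {
            // Send the i-th ping, then spin until the matching pong arrives.
            Volatile.Write(ref _ping, i);
            while (Volatile.Read(ref _pong) != i) Relax();
        }
        long elapsedTicks = Stopwatch.GetTimestamp() - start;
        responder.Join();

        // Convert Stopwatch ticks to nanoseconds and average over all rounds.
        return elapsedTicks * (1e9 / Stopwatch.Frequency) / rounds;
    }
}
```

Swapping the body of Relax for an empty method or Thread.SpinWait(1) gives a rough way to compare wait strategies, although meaningful numbers would need thread pinning and per-round percentile recording as in the benchmark project.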
Non-official, non-validated, non-compatible proof-of-concept .NET SDK benchmarking results (using a modified CentOS8 dotnet5 package). WARNING: due to the fixed public API surface, an existing API call (Sse2.MemoryFence()) will emit the PAUSE instruction instead of the MFENCE instruction.
Example .NET results plot (two threads on a shared core on a Xeon E5-2660v1 with SMT disabled and using all spectre / meltdown / related mitigations enabled by default):
Example .NET results plot (two threads on a shared core on a Kaveri 7850K using all spectre / meltdown / related mitigations enabled by default):
Alternative Designs
DllImport or DllImport alternatives (e.g.: function pointers) can be used to spin loop with a spin-loop-indicating CPU instruction, but the DllImport / DllImport alternative boundary crossing overhead tends to be larger than the benefit provided by the instruction.
.NET pattern matching could attempt to have the JIT compilers deduce spin-wait-loop situations in code and choose to automatically include a spin-loop-indicating CPU instruction, with no .NET code indication required. I would expect the complexity of automatically and reliably detecting spinning situations, coupled with questions about potential tradeoffs in using the indication on some platforms, to delay the availability of viable implementations significantly.
Risks
An intrinsic x86 implementation will involve modifications to multiple .NET components and exposing a new Sse2.Pause intrinsic API, and as such it carries some risk, but no more than other simple intrinsics added to .NET.
Some processor architectures may have a significantly different latency profile for the PAUSE instruction (e.g., Intel® Xeon® Scalable processors based on the Skylake architecture: 140 cycles). However, this is also true for every other available intrinsic and should not prevent its use, and it seems that the latency has improved greatly since then (e.g., 2nd generation Intel® Xeon® Scalable processors based on the Cascade Lake architecture: 40 cycles).
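To put the cycle counts above into time units, a small conversion sketch may help. The PauseCost class is a hypothetical name, and the 2.0 GHz clock used in the usage note is an assumed round figure for illustration only, not a measured frequency of either processor:

```csharp
// Hypothetical helper for converting a cycle count into wall-clock time.
public static class PauseCost
{
    // Nanoseconds taken by `cycles` core cycles at a clock of `clockGHz` gigahertz.
    // At 1 GHz one cycle takes exactly 1 ns, so the result is cycles / clockGHz.
    public static double CyclesToNanoseconds(double cycles, double clockGHz)
        => cycles / clockGHz;
}
```

Under the assumed 2.0 GHz clock, CyclesToNanoseconds(140, 2.0) gives 70 ns per PAUSE and CyclesToNanoseconds(40, 2.0) gives 20 ns, so the per-instruction cost can differ by several times between the two architectures mentioned.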
References
[1] LMAX Disruptor .NET implementation
[2] Pause intrinsics latency and throughput benchmarks (C#)
[3] Pause intrinsics latency and throughput benchmarks (Java): https://github.com/giltene/GilExamples/tree/master/SpinWaitTest
[4] Chart depicting Java onSpinWait() intrinsification impact
[5] .NET prototype Sse2.Pause intrinsics implementation branch: https://github.com/zpodlovics/runtime/tree/sse2pause
[6] Implementations on other platforms (other than x86) may choose to use the same instructions as linux cpu_relax and/or plasma_spin
[7] https://software.intel.com/content/www/us/en/develop/articles/benefitting-power-and-performance-sleep-loops.html
[8] Andreas Abel: Automatic Generation of Models of Microarchitectures
[9] https://uops.info/table.html?search=PAUSE&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SNB=on&cb_IVB=on&cb_HSW=on&cb_BDW=on&cb_SKL=on&cb_SKX=on&cb_KBL=on&cb_CFL=on&cb_CNL=on&cb_CLX=on&cb_ICL=on&cb_ZENp=on&cb_ZEN2=on&cb_measurements=on&cb_iaca30=on&cb_doc=on&cb_base=on&cb_sse=on&cb_others=on