[NativeAOT] Thin locks #79519
Conversation
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas

Issue Details

This change implements "thin" locks - lightweight spinlocks that use object header bits for mutual exclusion on NativeAOT. After this change the performance of the C# lock statement is significantly better in both scenarios, but CoreClr is still a bit faster - about 20% faster in the thin lock scenario and 50% faster in the fat lock scenario. These locks are fast: an uncontended lock/unlock cycle takes ~5-7 nsec (~1 billion in 5 sec), so even small suboptimal things have an effect. I had to make some changes in the fat Lock as well.
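To make the mechanism concrete, here is a minimal sketch of a thin-lock fast path, assuming a header word whose low bits hold a small owner thread ID. The type and member names are illustrative, not the PR's actual code, and the sketch ignores recursion, hash-code bits, and inflation to a fat lock:

```csharp
using System.Threading;

internal struct ThinLockSketch
{
    // 0 == unowned; otherwise the owner's small thread ID is stored here.
    // The real implementation packs this into spare object header bits.
    private int _header;

    public bool TryAcquire(int currentThreadId)
    {
        // Uncontended acquire is a single compare-exchange: claim the header
        // only if nobody owns it. No allocation on this path.
        return Interlocked.CompareExchange(ref _header, currentThreadId, 0) == 0;
    }

    public bool TryRelease(int currentThreadId)
    {
        if (_header != currentThreadId)
            return false; // not the owner, or the lock has been inflated

        // A volatile write is sufficient to publish the release.
        Volatile.Write(ref _header, 0);
        return true;
    }
}
```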
Related issues:
Results from the same benchmark as in #68891. We see a lot of improvements in throughput and allocations too (since the thin lock is a no-allocation path).
Results from my microbenchmark: just doing 1 billion lock/unlock iterations and reporting time in milliseconds (smaller is better).

Code for reference:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

namespace ConsoleApp12
{
    internal class Program
    {
        static Program l1 = new Program();
        static readonly Program l2 = new Program();

        private const int iters = 1000000000;

        static void Main(string[] args)
        {
            // Taking the hash code stores it in l1's header and forces
            // its lock onto the fat path; l2 stays on the thin path.
            l1 = new Program();
            l1.GetHashCode();

            for (; ; )
            {
                System.Console.WriteLine("Fat:");
                FatLock();
                System.Console.WriteLine("Thin:");
                ThinLock();
            }
            //Fairness();
        }

        static void FatLock()
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iters; i++)
            {
                lock (l1)
                {
                }
            }
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static void ThinLock()
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iters; i++)
            {
                lock (l2)
                {
                }
            }
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static void Fairness()
        {
            object l = new object();
            const int n = 32;
            Task[] workers = new Task[n];
            for (int i = 0; i < n; i++)
            {
                workers[i] = Task.Run(() =>
                {
                    for (; ; )
                    {
                        lock (l)
                        {
                            Console.Write("{0:00} ", Environment.CurrentManagedThreadId);
                            // hold the lock for ~50 ms
                            int ticks = Environment.TickCount;
                            while (Environment.TickCount - ticks < 50)
                                Thread.SpinWait(1);
                        }
                    }
                });
            }
            Task.WaitAll(workers);
        }
    }
}
```
```diff
@@ -1,7 +1,12 @@
 // Licensed to the .NET Foundation under one or more agreements.
 // The .NET Foundation licenses this file to you under the MIT license.

-EXTERN_C DECLSPEC_THREAD ThreadBuffer tls_CurrentThread;
+#ifdef _MSC_VER
+// a workaround to prevent tls_CurrentThread from becoming dynamically checked/initialized.
```
Can we keep extern "C" on this workaround, to avoid introducing name mangling into the asm code?
I just ported the workaround from a similar change in CoreClr. I am not sure I completely understand why it is needed.
If removing extern "C" is not a part of the fix, I guess, why not. I will try.
The fix works fine with extern "C", so I will add it and revert the mangling in the asm.
```cpp
// a workaround to prevent tls_CurrentThread from becoming dynamically checked/initialized.
__declspec(selectany) __declspec(thread) ThreadBuffer tls_CurrentThread;
#else
EXTERN_C __thread ThreadBuffer tls_CurrentThread;
```
Does non-Windows pay the penalty for TLS init checks? (i.e. a similar workaround is not possible on non-Windows)
From the discussion of the original fix (#33347), I gathered that even on Windows the workaround was not needed until some version of MSVC.
It is probably more a function of the C runtime than the OS (glibc, musl, etc.). It should be fairly easy to check what happens on glibc.
I also wonder if there are differences on arm64.
Linux/glibc does not require a similar fix. There are no checks when accessing TLS. Here is the disassembly for reference.

Linux/glibc x64:

```
System.Collections.Tests`::RhGetCurrentManagedThreadId():
0x5555554d5db0 <+0>:  pushq %rbp
0x5555554d5db1 <+1>:  movq  %rsp, %rbp
0x5555554d5db4 <+4>:  movq  %fs:0x0, %rax
0x5555554d5dbd <+13>: leaq  -0x100(%rax), %rax
0x5555554d5dc4 <+20>: movl  0xc0(%rax), %eax
0x5555554d5dca <+26>: popq  %rbp
0x5555554d5dcb <+27>: retq
```

Linux/glibc arm64:

```
System.Collections.Tests`::RhGetCurrentManagedThreadId():
0xaaaaaab87400 <+0>:  stp  x29, x30, [sp, #-0x10]!
0xaaaaaab87404 <+4>:  mov  x29, sp
0xaaaaaab87408 <+8>:  adrp x0, 4173
0xaaaaaab8740c <+12>: ldr  x0, [x0, #0xa30]
0xaaaaaab87410 <+16>: nop
0xaaaaaab87414 <+20>: nop
0xaaaaaab87418 <+24>: mrs  x8, TPIDR_EL0
0xaaaaaab8741c <+28>: add  x8, x8, x0
0xaaaaaab87420 <+32>: ldr  w0, [x8, #0xc0]
0xaaaaaab87424 <+36>: ldp  x29, x30, [sp], #0x10
0xaaaaaab87428 <+40>: ret
```

MSVC (with the fix):

```
    return ThreadStore::RawGetCurrentThread()->GetManagedThreadId();
00007FF6A99D7AE0 mov ecx,dword ptr [_tls_index (07FF6A9CB3A28h)]
00007FF6A99D7AE6 mov rax,qword ptr gs:[58h]
00007FF6A99D7AEF mov edx,10h
00007FF6A99D7AF4 mov rax,qword ptr [rax+rcx*8]
00007FF6A99D7AF8 mov eax,dword ptr [rax+rdx+0C0h]
}
00007FF6A99D7AFF ret
```
```diff
@@ -51,11 +51,11 @@ EXTERN_C NATIVEAOT_API void __cdecl RhSpinWait(int32_t iterations)
     ASSERT(iterations > 0);

     // limit the spin count in coop mode.
-    ASSERT_MSG(iterations <= 10000 || !ThreadStore::GetCurrentThread()->IsCurrentThreadInCooperativeMode(),
+    ASSERT_MSG(iterations <= 1024 || !ThreadStore::GetCurrentThread()->IsCurrentThreadInCooperativeMode(),
```
This change is just because we switched to a SpinWait that is calibrated to 35 nsec.
The previous numbers were assumed in terms of the pre-Skylake "pause", or, simply put, were divided by 8 in the underlying APIs (so 10000 in the old units corresponds to roughly 1250 in the new ones).
I also made it a power of 2, which is more convenient; hence 1024.
```diff
@@ -51,7 +52,7 @@ internal static class SyncTable
     /// </summary>
 #if DEBUG
     // Exercise table expansion more frequently in debug builds
-    private const int InitialSize = 1;
+    private const int InitialSize = 2;
```
Valid indices start at 2 so that we can return 1 and 0 as true/false, with the rest meaning a syncblock index for the slow path to retry with.
Nit: We can use -1 and 0 to return the result, and still start at 1.
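For illustration, here is a small sketch of the convention being discussed; the names are assumptions, not the PR's actual code. The fast-path helper folds three outcomes into one int, which works because valid SyncTable indices start at 2:

```csharp
internal static class FastPathSketch
{
    private const int Failed = 0;   // doubles as "false"
    private const int Acquired = 1; // doubles as "true"

    // Hypothetical fast-path helper: returns Failed, Acquired, or the
    // object's SyncTable index (always >= 2). Stubbed out for the sketch.
    private static int TryAcquireThin(object obj) => Acquired;

    internal static void Enter(object obj)
    {
        int result = TryAcquireThin(obj);
        if (result == Acquired)
        {
            return; // thin lock taken, no allocation
        }
        else if (result == Failed)
        {
            // contended thin lock: a real implementation would spin,
            // retry, and eventually inflate to a fat lock here
        }
        else
        {
            int syncIndex = result; // >= 2: the lock is already inflated
            // slow path: acquire through the SyncTable entry at syncIndex
            _ = syncIndex;
        }
    }
}
```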
```csharp
private static int s_unusedEntryIndex = 2;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static ref Entry EntryRef(int i)
```
Are the bounds checks that expensive everywhere? I would rename it to UnsafeEntryRef and only use it in the two perf-critical places.
The bounds check adds some code into the caller and reads memory at an arbitrary distance from the location that we need.
It makes a difference only in two critical places though. We can use the indexer in the others.
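For reference, a sketch of what such an unchecked accessor can look like using public APIs; `Entry` and `s_entries` here are stand-ins, not the PR's actual declarations:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

internal static class SyncTableSketch
{
    private struct Entry { public int Owner; }

    private static Entry[] s_entries = new Entry[64];

    // Unchecked accessor for the two perf-critical call sites.
    // The caller must guarantee 0 <= i < s_entries.Length; skipping the
    // bounds check keeps the caller's code small and avoids touching
    // the array length, which may be in a different cache line.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static ref Entry UnsafeEntryRef(int i) =>
        ref Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(s_entries), i);
}
```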
```diff
-fixed (byte* pRawData = &o.GetRawData())
+fixed (MethodTable** ppMethodTable = &o.GetMethodTableRef())
```
Pinning the object using GetMethodTableRef is not generally ok, but it is ok here since this is very low-level ObjectHeader code.
Pinning the method table pointer saves computing the offset to the data; we are not interested in the object's data here. Yes, this is a very unusual scenario. As long as the byref is within the range of the object, it is all the same to the GC: it will "align" to the containing object and pin that object.
I wish the object header belonged to the object, though. We would not need pinning at all.
```diff
@@ -1346,6 +1361,16 @@ COOP_PINVOKE_HELPER(uint64_t, RhCurrentOSThreadId, ())
     return PalGetCurrentThreadIdForLogging();
 }

+COOP_PINVOKE_HELPER(int32_t, RhGetCurrentManagedThreadId, ())
```
Once/if we do the codegen work to make thread statics fast, this change will be a de-optimization and we will go back to a managed thread static.
Yes, once threadstatics are faster, we can just use a threadstatic.
I am thinking that we should keep using regular threadstatics for this even if it means that we are leaving some perf on the table for now.
Using threadstatics makes the lock noticeably slower.

The straightforward way, with IdNone == 0 and without CurrentManagedThreadIdUnchecked, is 36% slower. The -1 adjusted way, with IdNone == -1 and CurrentManagedThreadIdUnchecked, is 33% slower.

=== current:

```
Fat:
8031
Thin:
5684
Fat:
8034
Thin:
5680
Fat:
8000
Thin:
5810
```

=== use threadstatic without CurrentManagedThreadIdUnchecked (thin lock gets ~36% slower):

```
Fat:
9122
Thin:
7796
Fat:
9135
Thin:
7766
Fat:
9119
Thin:
7813
```

=== use threadstatic with CurrentManagedThreadIdUnchecked and -1 adjustment so that uninitialized looks like -1 (thin lock gets ~33% slower):

```
Fat:
8759
Thin:
7567
Fat:
8747
Thin:
7583
Fat:
8713
Thin:
7574
```
I am going to switch to the "threadstatic with CurrentManagedThreadIdUnchecked and -1" variant then.
It is still better than before this PR, and a possible gain of at least 33% should make improving threadstatics more interesting.
One more codegen issue that can help this a bit: #63397
Right, there is no reason for a pinned local to be on the stack; it is just a matter of root reporting. I experimented with using unpinned refs - that is obviously incorrect, but failures are rare enough to measure how much we would gain. It was not much - within noise.
```diff
@@ -104,7 +104,7 @@ internal static int GetNewHashCode()
     // where two threads consistently give out the same hash codes.
     // Choice of multiplier guarantees period of 2**32 - see Knuth Vol 2 p16 (3.2.1.2 Theorem A).
     t_hashSeed = t_hashSeed * multiplier + 1;
-    return t_hashSeed;
+    return t_hashSeed + Environment.TickCount;
```
Instead of all this, we can just call Random.Shared.Next().
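For context, here is a self-contained sketch of the seed update shown in the diff above. The multiplier value is an illustrative stand-in for the real constant (any multiplier that is 1 mod 4, paired with an odd increment, gives the full 2^32 period per Knuth's Theorem A):

```csharp
using System;

internal static class HashSeedSketch
{
    [ThreadStatic]
    private static int t_hashSeed;

    internal static int GetNewHashCode()
    {
        // 1664525 is a classic LCG multiplier (illustrative); integer
        // overflow wraps, which is exactly the mod-2^32 arithmetic we want.
        const int multiplier = 1664525;
        t_hashSeed = t_hashSeed * multiplier + 1;

        // Adding TickCount decorrelates threads whose seeds start equal.
        return t_hashSeed + Environment.TickCount;
    }

    // The simpler alternative suggested above:
    internal static int GetNewHashCodeViaRandom() => Random.Shared.Next();
}
```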
```csharp
private static int s_unusedEntryIndex = 2;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static ref Entry UnsafeEntryRef(int i)
```
I think it would be fine to use UnsafeEntryRef for GetHashCode too.
Reverted one too many.
/azp run runtime-extra-platforms

Azure Pipelines successfully started running 1 pipeline(s).
```csharp
if (acquired)
    Debug.Assert((_state & Locked) != 0);
```

Suggested change:

```diff
-if (acquired)
-    Debug.Assert((_state & Locked) != 0);
+Debug.Assert(!acquired || (_state & Locked) != 0);
```
```diff
@@ -174,7 +176,7 @@ public ImmutableIdDispenser AllocateId(out int id)
     }
 }

-public const int IdNone = 0;
+public const int IdNone = -1;
```
Was there any actual problem with 0?
Zero is a tiny bit more efficient - the instructions to check for zero tend to have smaller encodings.
-1 is so that I could just fetch the ID unchecked on the critical paths and it would automatically go on a slow path if not yet initialized, since it would not fit into 16 bits.
Where is this trick used? I am not able to find it.
CurrentManagedThreadIdUnchecked is not checking for initialization. Not supposed to, unless I mixed up something.
Ok, I see that now. Using -1 to signal "not yet initialized" state will make it harder to switch back to thread statics once we get them optimized.
Depending on how thread statics end up working, I think we may just initialize the ID to the right value, if that does not entail checking "is initialized" all the time.
Or try initializing the ID early enough, before it can possibly be used in any lock. It may already be the case, but I am not 100% sure and did not want to force such an invariant just yet.
Alternatively we may do -1 when accessing the value; then uninitialized 0 will be seen as -1.
Thread statics are initialized to 0.
The current initialization to -1 in the unmanaged runtime is piggybacking on the "is initialized" check done by reverse pinvoke, which is the only way to enter managed code in native AOT. Once this is switched back to a managed thread static, the initialization to -1 will need the "is initialized" check all the time.

> Alternatively we may do -1 when accessing the value.

Yes, I think it would be better.
Co-authored-by: Jan Kotas <[email protected]>
```diff
 {
     // Use RhGetProcessCpuCount directly to avoid Environment.ProcessorCount->ClassConstructorRunner->Lock->Environment.ProcessorCount cycle
-    s_maxSpinCount = (RuntimeImports.RhGetProcessCpuCount() > 1) ? MaxSpinningValue : SpinningDisabled;
+    s_processorCount = RuntimeImports.RhGetProcessCpuCount();
```

Suggested change:

```diff
-s_processorCount = RuntimeImports.RhGetProcessCpuCount();
+if (s_processorCount == 0)
+    s_processorCount = RuntimeImports.RhGetProcessCpuCount();
```
Avoid unnecessary writes of global state?
It would make sense if RhGetProcessCpuCount() is expensive. I guess we do not know, so we can do the check.
The potentially most expensive part is cache thrashing caused by repeated writes.
Ah, I see, we also do not want to overwrite and invalidate the shared state every time another lock initializes its spin count.
Nice work!
```diff
@@ -246,11 +248,16 @@ public static void RecycleId(int id)
     }
 }

+// We do -1 adjustment so that uninitialized result would be -1 and
```
The -1 trick can be local to the few places in ObjectHeader.cs that need it, like `if ((uint)(currentThreadID - 1) < (uint)SBLK_MASK_LOCK_THREADID)`. It does not need to be spread over multiple files like this.
Hmm, that would add an extra condition though. Otherwise Release can't tell if the lock is owned by nobody or by a thread with an uninitialized thread ID.
When the uninitialized ID is -1, it is too big to own a thin lock, so we do not need to check whether the ID is initialized.
0 would match the unowned state.
ok, it can stay as is then.
I think Release can apply `currentThreadID |= (currentThreadID - 1) >> 31` to have the same effect as if the default was -1.
If we store the ID in off-by-one form, we still need to do the "-1", so this would be just 2 extra operations, but the default can be 0.
The Acquire can do `if ((uint)(currentThreadID - 1) < (uint)SBLK_MASK_LOCK_THREADID)` as you suggested.
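Here is a compilable sketch of both tricks from this thread; the mask value mirrors CoreCLR's 10-bit thin-lock thread-ID field and is illustrative:

```csharp
internal static class ThreadIdTricksSketch
{
    // Stand-in for the header bits available to store an owner thread ID.
    private const uint SBLK_MASK_LOCK_THREADID = 0x3FF;

    // Release-side normalization: (x - 1) >> 31 is -1 (all ones) when x == 0
    // and 0 when x > 0, so or-ing it in maps an uninitialized 0 to -1 while
    // leaving initialized (positive) IDs unchanged - no branch needed.
    internal static int NormalizeId(int currentThreadID) =>
        currentThreadID | ((currentThreadID - 1) >> 31);

    // Acquire-side range check: one unsigned compare rejects both 0
    // (uninitialized, wraps to uint.MaxValue) and IDs too big for the field.
    internal static bool FitsInThinLock(int currentThreadID) =>
        (uint)(currentThreadID - 1) < SBLK_MASK_LOCK_THREADID;
}
```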
/azp run runtime-extra-platforms

Azure Pipelines successfully started running 1 pipeline(s).

Thanks!!!
This change implements "thin" locks - lightweight spinlocks that use object header bits for mutual exclusion on NativeAOT. In more complex scenarios, such as blocking waits, a thin lock transparently transitions into a "fat" lock (implemented by System.Threading.Lock) that supports more scenarios.

Fixes: #75108
Fixes: #68891

After this change the performance of the C# lock statement is significantly better in both scenarios - Thin (76% faster) and Fat (30% faster). For NativeAOT this is a considerable improvement. The thin lock is also allocation-free. CoreClr is still a bit faster - about 20% faster in the thin lock scenario and 50% faster in the fat lock scenario.

The managed implementation may eventually match or be faster than native, but we need to address a few codegen inefficiencies. Fixing that should happen separately from this PR; see the referenced issues. The main difference is accessing CurrentManagedThreadId: accessing a thread static requires a method call, while the C++ implementation of locks in CoreClr just inlines the access to a native TLS variable. The cost of this call really stands out, and we need to make the call on both the acquire and release paths.

These locks are fast. An uncontended lock/unlock cycle takes ~5-7 nsec (~1 billion in 5 sec), so even small suboptimal things have an effect.

I had to make some changes in the fat Lock as well.
Related issues: