[NativeAOT] Thin locks #79519
Conversation
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas

Issue Details

This change implements "thin" locks - lightweight spinlocks that use object header bits for mutual exclusion on NativeAOT. After this change the performance of the C# lock statement is significantly better in both scenarios, but CoreClr is still a bit faster - about 20% faster in the thin lock scenario and 50% faster in the fat lock scenario. These locks are fast: an uncontended lock/unlock cycle takes ~5-7 nsec (~1 billion in 5 sec), so even small suboptimal things have an effect. I had to make some changes in the fat Lock as well.
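To make the mechanism concrete, here is a minimal sketch of a thin-lock fast path, assuming a header word whose low bits hold a small owner thread ID. The type and member names are illustrative, not the PR's actual code, and the sketch ignores recursion, hash-code bits, and inflation to a fat lock:

```csharp
using System.Threading;

internal struct ThinLockSketch
{
    // 0 == unowned; otherwise the owner's small thread ID is stored here.
    // The real implementation packs this into spare object header bits.
    private int _header;

    public bool TryAcquire(int currentThreadId)
    {
        // Uncontended acquire is a single compare-exchange: claim the header
        // only if nobody owns it. No allocation on this path.
        return Interlocked.CompareExchange(ref _header, currentThreadId, 0) == 0;
    }

    public bool TryRelease(int currentThreadId)
    {
        if (_header != currentThreadId)
            return false; // not the owner, or the lock has been inflated

        // A volatile write is sufficient to publish the release.
        Volatile.Write(ref _header, 0);
        return true;
    }
}
```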
Related issues:
Results from the same benchmark as in #68891. We see a lot of improvements in throughput and allocations too (since the thin lock is a no-allocation path).
Results from my microbenchmark: just doing 1 billion lock/unlock iterations and reporting time in milliseconds (smaller is better).

Code for reference:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

namespace ConsoleApp12
{
    internal class Program
    {
        static Program l1 = new Program();
        static readonly Program l2 = new Program();

        private const int iters = 1000000000;

        static void Main(string[] args)
        {
            // Taking the hash code stores it in l1's header and forces
            // its lock onto the fat path; l2 stays on the thin path.
            l1 = new Program();
            l1.GetHashCode();

            for (; ; )
            {
                System.Console.WriteLine("Fat:");
                FatLock();
                System.Console.WriteLine("Thin:");
                ThinLock();
            }
            //Fairness();
        }

        static void FatLock()
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iters; i++)
            {
                lock (l1)
                {
                }
            }
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static void ThinLock()
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iters; i++)
            {
                lock (l2)
                {
                }
            }
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static void Fairness()
        {
            object l = new object();
            const int n = 32;
            Task[] workers = new Task[n];
            for (int i = 0; i < n; i++)
            {
                workers[i] = Task.Run(() =>
                {
                    for (; ; )
                    {
                        lock (l)
                        {
                            Console.Write("{0:00} ", Environment.CurrentManagedThreadId);
                            // hold the lock for ~50 ms
                            int ticks = Environment.TickCount;
                            while (Environment.TickCount - ticks < 50)
                                Thread.SpinWait(1);
                        }
                    }
                });
            }
            Task.WaitAll(workers);
        }
    }
}
```
```diff
@@ -1,7 +1,12 @@
 // Licensed to the .NET Foundation under one or more agreements.
 // The .NET Foundation licenses this file to you under the MIT license.

-EXTERN_C DECLSPEC_THREAD ThreadBuffer tls_CurrentThread;
+#ifdef _MSC_VER
+// a workaround to prevent tls_CurrentThread from becoming dynamically checked/initialized.
```
Can we keep extern "C" on this workaround, to avoid introducing name mangling into the asm code?
I just ported the workaround from a similar change in CoreClr. I am not sure I completely understand why it is needed.
If removing extern "C" is not a part of the fix, I guess, why not. I will try.
The fix works fine with extern "C", so I will add it and revert the mangling in the asm.
```cpp
// a workaround to prevent tls_CurrentThread from becoming dynamically checked/initialized.
__declspec(selectany) __declspec(thread) ThreadBuffer tls_CurrentThread;
#else
EXTERN_C __thread ThreadBuffer tls_CurrentThread;
```
Does non-Windows pay the penalty for TLS init checks? (i.e. a similar workaround is not possible on non-Windows)
From the discussion of the original fix (#33347), I gathered that even on Windows the workaround was not needed until some version of MSVC.
It is probably more a function of the C runtime than the OS (glibc, musl, etc.). It should be fairly easy to check what happens on glibc.
I also wonder if there are differences on arm64.
Linux/glibc does not require a similar fix. There are no checks when accessing TLS. Here is the disassembly for reference.

Linux/glibc x64:

```
System.Collections.Tests`::RhGetCurrentManagedThreadId():
0x5555554d5db0 <+0>:  pushq %rbp
0x5555554d5db1 <+1>:  movq  %rsp, %rbp
0x5555554d5db4 <+4>:  movq  %fs:0x0, %rax
0x5555554d5dbd <+13>: leaq  -0x100(%rax), %rax
0x5555554d5dc4 <+20>: movl  0xc0(%rax), %eax
0x5555554d5dca <+26>: popq  %rbp
0x5555554d5dcb <+27>: retq
```

Linux/glibc arm64:

```
System.Collections.Tests`::RhGetCurrentManagedThreadId():
0xaaaaaab87400 <+0>:  stp  x29, x30, [sp, #-0x10]!
0xaaaaaab87404 <+4>:  mov  x29, sp
0xaaaaaab87408 <+8>:  adrp x0, 4173
0xaaaaaab8740c <+12>: ldr  x0, [x0, #0xa30]
0xaaaaaab87410 <+16>: nop
0xaaaaaab87414 <+20>: nop
0xaaaaaab87418 <+24>: mrs  x8, TPIDR_EL0
0xaaaaaab8741c <+28>: add  x8, x8, x0
0xaaaaaab87420 <+32>: ldr  w0, [x8, #0xc0]
0xaaaaaab87424 <+36>: ldp  x29, x30, [sp], #0x10
0xaaaaaab87428 <+40>: ret
```

MSVC (with the fix):

```
    return ThreadStore::RawGetCurrentThread()->GetManagedThreadId();
00007FF6A99D7AE0 mov ecx,dword ptr [_tls_index (07FF6A9CB3A28h)]
00007FF6A99D7AE6 mov rax,qword ptr gs:[58h]
00007FF6A99D7AEF mov edx,10h
00007FF6A99D7AF4 mov rax,qword ptr [rax+rcx*8]
00007FF6A99D7AF8 mov eax,dword ptr [rax+rdx+0C0h]
}
00007FF6A99D7AFF ret
```
```diff
@@ -51,11 +51,11 @@ EXTERN_C NATIVEAOT_API void __cdecl RhSpinWait(int32_t iterations)
     ASSERT(iterations > 0);

     // limit the spin count in coop mode.
-    ASSERT_MSG(iterations <= 10000 || !ThreadStore::GetCurrentThread()->IsCurrentThreadInCooperativeMode(),
+    ASSERT_MSG(iterations <= 1024 || !ThreadStore::GetCurrentThread()->IsCurrentThreadInCooperativeMode(),
```
This change is just because we switched to a SpinWait that is calibrated to 35 nsec.
The previous numbers were assumed in terms of the pre-Skylake "pause", or, simply put, were divided by 8 in the underlying APIs (so 10000 in the old units corresponds to roughly 1250 in the new ones).
I also made it a power of 2, which is more convenient; hence 1024.
```diff
@@ -51,7 +52,7 @@ internal static class SyncTable
     /// </summary>
 #if DEBUG
     // Exercise table expansion more frequently in debug builds
-    private const int InitialSize = 1;
+    private const int InitialSize = 2;
```
Valid indices start at 2 so that we can return 1 and 0 as true/false, with the rest meaning a syncblock index for the slow path to retry with.
Nit: We can use -1 and 0 to return the result, and still start at 1.
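For illustration, here is a small sketch of the convention being discussed; the names are assumptions, not the PR's actual code. The fast-path helper folds three outcomes into one int, which works because valid SyncTable indices start at 2:

```csharp
internal static class FastPathSketch
{
    private const int Failed = 0;   // doubles as "false"
    private const int Acquired = 1; // doubles as "true"

    // Hypothetical fast-path helper: returns Failed, Acquired, or the
    // object's SyncTable index (always >= 2). Stubbed out for the sketch.
    private static int TryAcquireThin(object obj) => Acquired;

    internal static void Enter(object obj)
    {
        int result = TryAcquireThin(obj);
        if (result == Acquired)
        {
            return; // thin lock taken, no allocation
        }
        else if (result == Failed)
        {
            // contended thin lock: a real implementation would spin,
            // retry, and eventually inflate to a fat lock here
        }
        else
        {
            int syncIndex = result; // >= 2: the lock is already inflated
            // slow path: acquire through the SyncTable entry at syncIndex
            _ = syncIndex;
        }
    }
}
```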
```csharp
private static int s_unusedEntryIndex = 2;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static ref Entry EntryRef(int i)
```
Are the bounds checks that expensive everywhere? I would rename it to UnsafeEntryRef and only use it in the two perf-critical places.
The bounds check adds some code into the caller and reads memory at an arbitrary distance from the location that we need.
It makes a difference only in two critical places though. We can use the indexer in the others.
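For reference, a sketch of what such an unchecked accessor can look like using public APIs; `Entry` and `s_entries` here are stand-ins, not the PR's actual declarations:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

internal static class SyncTableSketch
{
    private struct Entry { public int Owner; }

    private static Entry[] s_entries = new Entry[64];

    // Unchecked accessor for the two perf-critical call sites.
    // The caller must guarantee 0 <= i < s_entries.Length; skipping the
    // bounds check keeps the caller's code small and avoids touching
    // the array length, which may be in a different cache line.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static ref Entry UnsafeEntryRef(int i) =>
        ref Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(s_entries), i);
}
```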
```diff
-fixed (byte* pRawData = &o.GetRawData())
+fixed (MethodTable** ppMethodTable = &o.GetMethodTableRef())
```
Pinning the object using GetMethodTableRef is not generally ok, but it is ok here since this is very low-level ObjectHeader code.
Pinning the method table pointer saves computing the offset to the data; we are not interested in the object's data here. Yes, this is a very unusual scenario. As long as the byref is within the range of the object, it is all the same to the GC: it will "align" to the containing object and pin that object.
I wish the object header belonged to the object, though. We would not need pinning at all.
```diff
@@ -1346,6 +1361,16 @@ COOP_PINVOKE_HELPER(uint64_t, RhCurrentOSThreadId, ())
     return PalGetCurrentThreadIdForLogging();
 }

+COOP_PINVOKE_HELPER(int32_t, RhGetCurrentManagedThreadId, ())
```
Once/if we do the codegen work to make thread statics fast, this change will be a de-optimization and we will go back to a managed thread static.
Yes, once threadstatics are faster, we can just use a threadstatic.
I am thinking that we should keep using regular threadstatics for this even if it means that we are leaving some perf on the table for now.
Using threadstatics makes the lock noticeably slower.

The straightforward way, with IdNone == 0 and without CurrentManagedThreadIdUnchecked, is 36% slower. The -1 adjusted way, with IdNone == -1 and CurrentManagedThreadIdUnchecked, is 33% slower.

=== current:

```
Fat:
8031
Thin:
5684
Fat:
8034
Thin:
5680
Fat:
8000
Thin:
5810
```

=== use threadstatic without CurrentManagedThreadIdUnchecked (thin lock gets ~36% slower):

```
Fat:
9122
Thin:
7796
Fat:
9135
Thin:
7766
Fat:
9119
Thin:
7813
```

=== use threadstatic with CurrentManagedThreadIdUnchecked and -1 adjustment so that uninitialized looks like -1 (thin lock gets ~33% slower):

```
Fat:
8759
Thin:
7567
Fat:
8747
Thin:
7583
Fat:
8713
Thin:
7574
```
I am going to switch to the "threadstatic with CurrentManagedThreadIdUnchecked and -1" variant then.
It is still better than before this PR, and a possible gain of at least 33% should make improving threadstatics more interesting.
One more codegen issue that can help this a bit: #63397
Right, there is no reason for a pinned local to be on the stack; it is just a matter of root reporting. I experimented with using unpinned refs - that is obviously incorrect, but failures are rare enough to measure how much we would gain. It was not much - within noise.
```diff
@@ -104,7 +104,7 @@ internal static int GetNewHashCode()
     // where two threads consistently give out the same hash codes.
     // Choice of multiplier guarantees period of 2**32 - see Knuth Vol 2 p16 (3.2.1.2 Theorem A).
     t_hashSeed = t_hashSeed * multiplier + 1;
-    return t_hashSeed;
+    return t_hashSeed + Environment.TickCount;
```
Instead of all this, we can just call Random.Shared.Next().
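For context, here is a self-contained sketch of the seed update shown in the diff above. The multiplier value is an illustrative stand-in for the real constant (any multiplier that is 1 mod 4, paired with an odd increment, gives the full 2^32 period per Knuth's Theorem A):

```csharp
using System;

internal static class HashSeedSketch
{
    [ThreadStatic]
    private static int t_hashSeed;

    internal static int GetNewHashCode()
    {
        // 1664525 is a classic LCG multiplier (illustrative); integer
        // overflow wraps, which is exactly the mod-2^32 arithmetic we want.
        const int multiplier = 1664525;
        t_hashSeed = t_hashSeed * multiplier + 1;

        // Adding TickCount decorrelates threads whose seeds start equal.
        return t_hashSeed + Environment.TickCount;
    }

    // The simpler alternative suggested above:
    internal static int GetNewHashCodeViaRandom() => Random.Shared.Next();
}
```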
```csharp
private static int s_unusedEntryIndex = 2;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static ref Entry UnsafeEntryRef(int i)
```
I think it would be fine to use UnsafeEntryRef for GetHashCode too.
Reverted one too many.
/azp run runtime-extra-platforms

Azure Pipelines successfully started running 1 pipeline(s).
```csharp
if (acquired)
    Debug.Assert((_state & Locked) != 0);
```

Suggested change:

```diff
-if (acquired)
-    Debug.Assert((_state & Locked) != 0);
+Debug.Assert(!acquired || (_state & Locked) != 0);
```
```diff
@@ -174,7 +176,7 @@ public ImmutableIdDispenser AllocateId(out int id)
     }
 }

-public const int IdNone = 0;
+public const int IdNone = -1;
```
Was there any actual problem with 0?
Zero is a tiny bit more efficient - the instructions to check for zero tend to have smaller encodings.
-1 is so that I could just fetch the ID unchecked on the critical paths and it would automatically go on a slow path if not yet initialized, since it would not fit into 16 bits.
Where is this trick used? I am not able to find it.
CurrentManagedThreadIdUnchecked is not checking for initialization. Not supposed to, unless I mixed up something.
Ok, I see that now. Using -1 to signal "not yet initialized" state will make it harder to switch back to thread statics once we get them optimized.
Depending on how thread statics end up working, I think we may just initialize the ID to the right value, if that does not entail checking "is initialized" all the time.
Or try initializing the ID early enough, before it can possibly be used in any lock. It may already be the case, but I am not 100% sure and did not want to force such an invariant just yet.
Alternatively we may do -1 when accessing the value; then uninitialized 0 will be seen as -1.
Thread statics are initialized to 0.
The current initialization to -1 in the unmanaged runtime is piggybacking on the "is initialized" check done by reverse pinvoke, which is the only way to enter managed code in native AOT. Once this is switched back to a managed thread static, the initialization to -1 will need the "is initialized" check all the time.

> Alternatively we may do -1 when accessing the value.

Yes, I think it would be better.
Co-authored-by: Jan Kotas <[email protected]>
```diff
 {
     // Use RhGetProcessCpuCount directly to avoid Environment.ProcessorCount->ClassConstructorRunner->Lock->Environment.ProcessorCount cycle
-    s_maxSpinCount = (RuntimeImports.RhGetProcessCpuCount() > 1) ? MaxSpinningValue : SpinningDisabled;
+    s_processorCount = RuntimeImports.RhGetProcessCpuCount();
```

Suggested change:

```diff
-s_processorCount = RuntimeImports.RhGetProcessCpuCount();
+if (s_processorCount == 0)
+    s_processorCount = RuntimeImports.RhGetProcessCpuCount();
```
Avoid unnecessary writes of global state?
It would make sense if RhGetProcessCpuCount() is expensive. I guess we do not know, so we can do the check.
The potentially most expensive part is cache thrashing caused by repeated writes.
Ah, I see, we also do not want to overwrite and invalidate the shared state every time another lock initializes its spin count.
Nice work!
```diff
@@ -246,11 +248,16 @@ public static void RecycleId(int id)
     }
 }

+// We do -1 adjustment so that uninitialized result would be -1 and
```
The -1 trick can be local to the few places in ObjectHeader.cs that need it, like `if ((uint)(currentThreadID - 1) < (uint)SBLK_MASK_LOCK_THREADID)`. It does not need to be spread over multiple files like this.
Hmm, that would add an extra condition though. Otherwise Release can't tell if the lock is owned by nobody or by a thread with an uninitialized thread ID.
When the uninitialized ID is -1, it is too big to own a thin lock, so we do not need to check whether the ID is initialized.
0 would match the unowned state.
ok, it can stay as is then.
I think Release can apply `currentThreadID |= (currentThreadID - 1) >> 31` to have the same effect as if the default was -1.
If we store the ID in off-by-one form, we still need to do the "-1", so this would be just 2 extra operations, but the default can be 0.
The Acquire can do `if ((uint)(currentThreadID - 1) < (uint)SBLK_MASK_LOCK_THREADID)` as you suggested.
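Here is a compilable sketch of both tricks from this thread; the mask value mirrors CoreCLR's 10-bit thin-lock thread-ID field and is illustrative:

```csharp
internal static class ThreadIdTricksSketch
{
    // Stand-in for the header bits available to store an owner thread ID.
    private const uint SBLK_MASK_LOCK_THREADID = 0x3FF;

    // Release-side normalization: (x - 1) >> 31 is -1 (all ones) when x == 0
    // and 0 when x > 0, so or-ing it in maps an uninitialized 0 to -1 while
    // leaving initialized (positive) IDs unchanged - no branch needed.
    internal static int NormalizeId(int currentThreadID) =>
        currentThreadID | ((currentThreadID - 1) >> 31);

    // Acquire-side range check: one unsigned compare rejects both 0
    // (uninitialized, wraps to uint.MaxValue) and IDs too big for the field.
    internal static bool FitsInThinLock(int currentThreadID) =>
        (uint)(currentThreadID - 1) < SBLK_MASK_LOCK_THREADID;
}
```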
/azp run runtime-extra-platforms

Azure Pipelines successfully started running 1 pipeline(s).

Thanks!!!
This change implements "thin" locks - lightweight spinlocks that use object header bits for mutual exclusion on NativeAOT. In more complex scenarios, such as blocking waits, a thin lock transparently transitions into a "fat" lock (implemented by System.Threading.Lock) that supports more scenarios.

Fixes: #75108
Fixes: #68891

After this change the performance of the C# lock statement is significantly better in both scenarios - Thin (76% faster) and Fat (30% faster). For NativeAOT this is a considerable improvement. The thin lock is also allocation-free. CoreClr is still a bit faster - about 20% faster in the thin lock scenario and 50% faster in the fat lock scenario.

The managed implementation may eventually match or be faster than native, but we need to address a few codegen inefficiencies. Fixing that should happen separately from this PR; see the referenced issues. The main difference is accessing CurrentManagedThreadId: accessing a thread static requires a method call, while the C++ implementation of locks in CoreClr just inlines the access to a native TLS variable. The cost of this call really stands out, and we need to make the call on both the acquire and release paths.

These locks are fast. An uncontended lock/unlock cycle takes ~5-7 nsec (~1 billion in 5 sec), so even small suboptimal things have an effect.

I had to make some changes in the fat Lock as well.
Related issues: