Description
When multiple threads race into NDirect::CreateStructMarshalILStub to create a struct marshaling stub for the same type, it is currently possible for the threads to interleave in a manner where the generated stub has a DynamicMethodDesc::m_pSig field which points to freed memory. (In these cases, the field specifically points to a freed "szMetaSig" allocation from NDirect::CreateStructMarshalILStub.)
Since the problem depends on exact thread interleaving patterns, the associated failures are highly intermittent and therefore crop up as reliability problems.
As detailed in the "Other Information" section below, in the common case where the type being marshaled is defined in a "normal" non-corelib module (i.e., any module that uses GlobalLoaderAllocator::m_ILStubCache), the "szMetaSig" management problem is "silent" unless it happens to occur when processing a type defined in the one-and-only module that happened to drive the first ILStubCache interaction that occurred anywhere in the program. For many scenarios, the move to precompiled interop stubs in Net7 changed the identity of this "first module", and in many cases made it possible for types in non-FX modules to trigger this problem for the first time.
The problem came to light when a high-scale service started to see intermittent crashes after moving to Net7. These crashes occurred during concurrent marshaling of a layout-bearing struct which was defined in a non-FX module and which used a ByValArray to model the trailing part of the Win32 REPARSE_DATA_BUFFER structure (mimicked by the DataBuffer* types in the repro app). The crash signature in these cases was obscure because it generally involved MethodDescCallSite::CallTargetWorker running ArgIterator against invalid DynamicMethodDesc::m_pSig content. With an invalid signature, the incoming "pArguments" are not moved correctly into the outgoing CallDescrData, so garbage arguments flow into the struct marshaling stub and the effects are unpredictable (large-scale stack corruption in these specific cases).
The linked repro app uses racing Marshal.StructureToPtr operations to generate situations where this problem occurs and leads to an observable crash. In my experiments on a VM where %NUMBER_OF_PROCESSORS% is 8, running "TryStructMarshal.exe 40" crashes at least 10% of the time (e.g., running it 100 times in a loop reliably produces 10 or more crashes).
Configuration
The problem generally exists in any configuration where the IL_STUB_StructMarshal feature is used. Running on Win11, experiments with the linked repro app have concretely confirmed that the problem is present across at least Net6, Net7, and Net8.
Regression?
No.
The problem has existed for as long as the broader IL_STUB_StructMarshal feature has existed.
As described above and below, there are a number of scenarios where this problem was "suddenly" reachable in Net7, but this was not due to any regression in Net7 (and was instead just a reflection of crash exposure existing only in the "first module" to drive ILStubCache interaction, combined with this "first module" often changing in Net7 due to the introduction of precompiled interop stubs).
Other information
The root cause is a latent problem in how the "pGeneratedNewStub" indicator is computed in CreateInteropILStub. This latent problem generally leads to intermittent crashes whenever multiple threads race to create a struct marshaling stub for a type defined in whatever module drove the program's first non-corelib ILStubCache interaction (i.e., whatever module owns the GlobalLoaderAllocator::m_ILStubCache m_pStubMT MethodTable).
In CreateInteropILStub, stub generation is done in two phases: "phase 1" initialization creates the stub MD, and "phase 2" initialization generates the associated IL. Crucially, locks are released and re-acquired between the two phases, meaning the thread which drives phase 1 is not guaranteed to be the thread which drives phase 2.
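The two-phase shape described above can be sketched with a small model. This is only an illustration under stated assumptions: StubState, TwoPhaseStub, and ReplaySplitInterleaving are hypothetical names, not runtime types, and the interleaving is replayed on a single thread so that the outcome is deterministic.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical model of one stub's initialization state (not a runtime type).
struct StubState {
    bool mdCreated = false;    // phase 1 finished: the stub MD exists
    bool ilGenerated = false;  // phase 2 finished: the IL exists
    int phase1Caller = 0;      // caller that ran phase 1 (0 = nobody yet)
    int phase2Caller = 0;      // caller that ran phase 2 (0 = nobody yet)
};

class TwoPhaseStub {
public:
    // Phase 1 runs under the lock. Returns true if this caller created the MD.
    bool RunPhase1(int caller) {
        std::lock_guard<std::mutex> hold(m_lock);
        if (m_state.mdCreated) return false;
        m_state.mdCreated = true;
        m_state.phase1Caller = caller;
        return true;
    }  // the lock is released here, before phase 2 begins

    // Phase 2 re-acquires the lock. Returns true if this caller generated the IL.
    bool RunPhase2(int caller) {
        std::lock_guard<std::mutex> hold(m_lock);
        if (m_state.ilGenerated) return false;
        m_state.ilGenerated = true;
        m_state.phase2Caller = caller;
        return true;
    }

    StubState Snapshot() {
        std::lock_guard<std::mutex> hold(m_lock);
        return m_state;
    }

private:
    std::mutex m_lock;
    StubState m_state;
};

// Replays one legal interleaving of two callers (1 and 2) in a fixed order.
inline StubState ReplaySplitInterleaving() {
    TwoPhaseStub stub;
    stub.RunPhase1(1);  // caller 1 creates the MD
    stub.RunPhase1(2);  // caller 2 finds the MD already present
    stub.RunPhase2(2);  // caller 2 gets to phase 2 first and generates the IL
    stub.RunPhase2(1);  // caller 1 finds the IL already generated
    return stub.Snapshot();
}
```

In this replay, phase 1 is attributed to caller 1 and phase 2 to caller 2, which is the split that the rest of this section depends on.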
The "pGeneratedNewStub" indicator is only used in NDirect::CreateStructMarshalILStub. In this context, the indicator is used to determine whether the preceding CreateInteropILStub call created a new DynamicMethodDesc (i.e., whether the preceding call carried out the phase 1 initialization of the returned stub). If the preceding call created the DynamicMethodDesc, the "szMetaSig" allocation needs to remain allocated since it may be referenced directly from the DynamicMethodDesc::m_pSig field. Otherwise, the "szMetaSig" allocation is no longer needed and can be released.
The current CreateInteropILStub implementation sets "pGeneratedNewStub" to true if and only if the current call carried out phase 2 initialization of the returned stub. But this is a mismatch because, as described above, the caller will use "pGeneratedNewStub" to determine whether the call carried out phase 1 initialization.
This mismatch can turn into crashes when multiple threads race into NDirect::CreateStructMarshalILStub to create the struct marshaling stub for the same struct type.
Specifically, the racing calls to CreateInteropILStub can lead to an execution pattern where phase 1 for a given stub happens on thread T1, but phase 2 for the same stub happens on a different thread T2. In this case, the "szMetaSig" allocation on T1 needs to be retained because that is where DynamicMethodDesc construction occurred. But the current CreateInteropILStub code sets "pGeneratedNewStub" to true only on T2. This means the crucial "szMetaSig" allocation on T1 gets released, the unreferenced and unneeded "szMetaSig" allocation on T2 gets retained, and the generated DynamicMethodDesc::m_pSig field potentially ends up pointing to freed memory as a result.
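The T1/T2 outcome above can be checked with a small bookkeeping model. This is a sketch, not runtime code: CallerSig, Outcome, FlagPolicy, and SimulateRace are hypothetical names, and buffer lifetime is tracked with a flag so the sketch never touches freed memory. The SetByPhase1 policy only shows what the caller expects the indicator to mean; it is an assumption for comparison, not a statement about how the runtime was or should be fixed.

```cpp
#include <cassert>

// Hypothetical stand-in for one caller's "szMetaSig" buffer.
struct CallerSig {
    bool released = false;
};

struct Outcome {
    bool t1SigReleased;
    bool t2SigReleased;
    bool descriptorSigDangling;  // true if the descriptor references a released buffer
};

// Which phase decides the "generated new stub" indicator.
enum class FlagPolicy { SetByPhase2, SetByPhase1 };

inline Outcome SimulateRace(FlagPolicy policy) {
    CallerSig sigT1;
    CallerSig sigT2;

    // Interleaving from the text: T1 ran phase 1, T2 ran phase 2.
    const bool t1RanPhase1 = true;
    const bool t1RanPhase2 = false;
    const bool t2RanPhase1 = false;
    const bool t2RanPhase2 = true;

    // Phase 1 ran on T1, so the descriptor may reference T1's buffer directly.
    const CallerSig* descriptorSig = &sigT1;

    const bool t1Flag = (policy == FlagPolicy::SetByPhase1) ? t1RanPhase1 : t1RanPhase2;
    const bool t2Flag = (policy == FlagPolicy::SetByPhase1) ? t2RanPhase1 : t2RanPhase2;

    // Caller-side rule: keep the buffer only when the indicator is true.
    if (!t1Flag) sigT1.released = true;
    if (!t2Flag) sigT2.released = true;

    return Outcome{sigT1.released, sigT2.released, descriptorSig->released};
}
```

With SetByPhase2 the model releases T1's buffer and keeps T2's, leaving the descriptor's signature dangling. With SetByPhase1 the buffer that the descriptor references is the one that is kept.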
In my tests, the repro app crash rate is similar across Net6 and Net7. That said, it is generally easier to see crashes on Net7, and the repro app takes careful steps to ensure that Net6 crashes still occur. Specifically, the move to precompiled interop stubs in Net7 makes it much more common for the GlobalLoaderAllocator::m_ILStubCache m_pStubMT MethodTable to end up being owned by a non-FX module. If m_pStubMT is owned by an FX module instead, then "szMetaSig" lifetime problems become "invisible" because marshaling of app-owned structs will always take the CreateModuleIndependentSignature path in ILStubCache::CreateNewMethodDesc. In Net6, ILStubCache usage while running FX code could easily steer m_pStubMT to an FX module (e.g., to System.Console.dll if your program starts with a Console.WriteLine). In Net7, precompiled interop stubs in the FX code have seemingly eliminated a lot of this ILStubCache usage, making it much easier to end up with an app-owned m_pStubMT. See #WHY_MANY_SCENARIOS_DID_NOT_CRASH_PRIOR_TO_NET7 in the repro app for more details on all of this.
Relevant to the discussion above, running "TryStructMarshal.exe 40 StartWithFxPInvoke" adjusts the repro to start with a Console.WriteLine call instead of an app-module-driven PInvoke. On Net6, this change injects ILStubCache interaction at the very start of the program; this steers m_pStubMT to System.Console.dll and the crashes disappear as a result. On Net7, this change does not inject any new ILStubCache interaction, so m_pStubMT remains app-owned and the crash rate remains unchanged.
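The "first module" effect described above can be reduced to a small decision model. This is a simplified sketch of the behavior as described in this issue, not the actual ILStubCache::CreateNewMethodDesc logic: the function names are hypothetical, modules are represented by name only, and "App.dll" is a made-up module name. The assumption is that a direct reference to the caller's signature is possible only when the module that owns m_pStubMT is also the module that defines the struct; otherwise a module-independent copy is stored.

```cpp
#include <cassert>
#include <string>

enum class SigStorage { DirectPointer, IndependentCopy };

// Hypothetical simplification: a direct pointer is used only when the stub
// MethodTable's module matches the module that defines the marshaled type.
inline SigStorage ChooseSigStorage(const std::string& stubMtModule,
                                   const std::string& typeModule) {
    return stubMtModule == typeModule ? SigStorage::DirectPointer
                                      : SigStorage::IndependentCopy;
}

// A premature release of the caller's buffer can only be observed when the
// descriptor still points at that buffer.
inline bool PrematureReleaseIsObservable(const std::string& stubMtModule,
                                         const std::string& typeModule) {
    return ChooseSigStorage(stubMtModule, typeModule) == SigStorage::DirectPointer;
}
```

Under this model, PrematureReleaseIsObservable("System.Console.dll", "App.dll") is false, matching the case where an early Console.WriteLine steers m_pStubMT to an FX module, while PrematureReleaseIsObservable("App.dll", "App.dll") is true, matching the app-owned case.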