Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recurring test failures in tracing due to class loader alloc tracker heap corruption #57461

Closed
VincentBu opened this issue Aug 16, 2021 · 12 comments

Comments

@VincentBu
Copy link
Contributor

VincentBu commented Aug 16, 2021

Run: runtime-coreclr jitstress-isas-x86 20210814.1

Failed test:

CoreCLR windows x86 Checked jitstress_isas_x86_nopopcnt @ Windows.10.Amd64.Open
- tracing\\eventpipe\\reverseouter\\reverseouter\\reverseouter.cmd

CoreCLR windows x86 Checked jitstress_isas_x86_nossse3 @ Windows.10.Amd64.Open
- tracing\\eventpipe\\reverseouter\\reverseouter\\reverseouter.cmd

Error message:

cmdLine:C:\h\w\B11509CA\w\BA4B09A3\e\tracing\eventpipe\reverseouter\reverseouter\reverseouter.cmd Timed Out (timeout in milliseconds: 1800000 from variable __TestTimeout, start: 8/14/2021 11:39:39 PM, end: 8/15/2021 12:09:39 AM)

Return code:      -100
Raw output file:      C:\h\w\B11509CA\w\BA4B09A3\uploads\Reports\tracing.eventpipe\reverseouter\reverseouter\reverseouter.output.txt
Raw output:
BEGIN EXECUTION
"C:\h\w\B11509CA\p\corerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false" reverseouter.dll
0.0s: ::== Running test: TEST_ReverseConnectionCanRecycleWhileTracing
0.2s: Server name is 'DOTNET_TRACE_TESTS_detbzyk5.yth'
1.0s: running sub-process: C:\h\w\B11509CA\p\corerun.exe C:\h\w\B11509CA\w\BA4B09A3\e\tracing\eventpipe\reverseouter\reverseouter\reverseouter.dll 0
2.2s: subprocess started: True
2.3s: subprocess PID: 8652
4.4s: Starting EventPipeSession over standard connection
4.6s: Waiting for connection
4.7s: Started EventPipeSession over standard connection with session id: 0x8a232a0
4.8s: Got a connection
4.8s: Shutting down Reverse Server at DOTNET_TRACE_TESTS_detbzyk5.yth
4.9s: { Magic=System.Byte[]; ClrInstanceId=00d25c2b-aef8-4f44-b2e1-eac3813f1b3f; ProcessId=8652; Unused=0; }
4.9s: Waiting for connection
5.0s: Got a connection
5.0s: Shutting down Reverse Server at DOTNET_TRACE_TESTS_detbzyk5.yth
5.0s: { Magic=System.Byte[]; ClrInstanceId=00d25c2b-aef8-4f44-b2e1-eac3813f1b3f; ProcessId=8652; Unused=0; }
5.0s: Waiting for connection
5.0s: Got a connection
5.0s: Shutting down Reverse Server at DOTNET_TRACE_TESTS_detbzyk5.yth
5.0s: { Magic=System.Byte[]; ClrInstanceId=00d25c2b-aef8-4f44-b2e1-eac3813f1b3f; ProcessId=8652; Unused=0; }
5.0s: Waiting for connection
5.0s: Got a connection
5.0s: Shutting down Reverse Server at DOTNET_TRACE_TESTS_detbzyk5.yth
5.0s: { Magic=System.Byte[]; ClrInstanceId=00d25c2b-aef8-4f44-b2e1-eac3813f1b3f; ProcessId=8652; Unused=0; }
5.0s: Waiting for connection
5.0s: Got a connection
5.0s: Shutting down Reverse Server at DOTNET_TRACE_TESTS_detbzyk5.yth
5.0s: { Magic=System.Byte[]; ClrInstanceId=00d25c2b-aef8-4f44-b2e1-eac3813f1b3f; ProcessId=8652; Unused=0; }
5.0s: Waiting for connection
5.0s: Got a connection
5.0s: Shutting down Reverse Server at DOTNET_TRACE_TESTS_detbzyk5.yth
5.0s: { Magic=System.Byte[]; ClrInstanceId=00d25c2b-aef8-4f44-b2e1-eac3813f1b3f; ProcessId=8652; Unused=0; }
5.0s: Waiting for connection
5.0s: Got a connection
5.0s: Shutting down Reverse Server at DOTNET_TRACE_TESTS_detbzyk5.yth
5.0s: { Magic=System.Byte[]; ClrInstanceId=00d25c2b-aef8-4f44-b2e1-eac3813f1b3f; ProcessId=8652; Unused=0; }
5.0s: Waiting for connection
5.0s: Got a connection
5.0s: Shutting down Reverse Server at DOTNET_TRACE_TESTS_detbzyk5.yth
5.0s: { Magic=System.Byte[]; ClrInstanceId=00d25c2b-aef8-4f44-b2e1-eac3813f1b3f; ProcessId=8652; Unused=0; }
5.0s: Waiting for connection
5.0s: Got a connection
5.0s: Shutting down Reverse Server at DOTNET_TRACE_TESTS_detbzyk5.yth
5.0s: { Magic=System.Byte[]; ClrInstanceId=00d25c2b-aef8-4f44-b2e1-eac3813f1b3f; ProcessId=8652; Unused=0; }
5.0s: Waiting for connection
5.0s: Got a connection
5.0s: Shutting down Reverse Server at DOTNET_TRACE_TESTS_detbzyk5.yth
5.0s: { Magic=System.Byte[]; ClrInstanceId=00d25c2b-aef8-4f44-b2e1-eac3813f1b3f; ProcessId=8652; Unused=0; }
5.0s: Waiting for connection
5.0s: Got a connection
5.0s: Shutting down Reverse Server at DOTNET_TRACE_TESTS_detbzyk5.yth
5.0s: { Magic=System.Byte[]; ClrInstanceId=00d25c2b-aef8-4f44-b2e1-eac3813f1b3f; ProcessId=8652; Unused=0; }
5.0s: Waiting for connection
5.0s: Got a connection
5.0s: Shutting down Reverse Server at DOTNET_TRACE_TESTS_detbzyk5.yth
5.0s: { Magic=System.Byte[]; ClrInstanceId=00d25c2b-aef8-4f44-b2e1-eac3813f1b3f; ProcessId=8652; Unused=0; }
5.0s: Waiting for connection
5.0s: Got a conne


Stack trace
   at tracing_eventpipe._reverseouter_reverseouter_reverseouter_._reverseouter_reverseouter_reverseouter_cmd()

Runfo Tracking Issue: reverseouter tests

Build Definition Kind Run Name

Build Result Summary

Day Hit Count Week Hit Count Month Hit Count
0 0 0
@VincentBu VincentBu added arch-x86 os-windows JitStress CLR JIT issues involving JIT internal stress modes labels Aug 16, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-Tracing-coreclr untriaged New issue has not been triaged by the area owner labels Aug 16, 2021
@josalem
Copy link
Contributor

josalem commented Aug 16, 2021

This and #57254 appear to be the same failure:

        6.0s: Assert failure(PID 8652 [0x000021cc], Thread: 6180 [0x1824]): Consistency check failed: Linked list pointer == NULL, memory corruption likely
        6.0s: FAILED: pDebugBlock != nullptr
        6.0s: CORECLR! CHECK::Trigger + 0x310 (0x70b88f0d)
        6.0s: CORECLR! AllocMemTracker::~AllocMemTracker + 0x24C (0x70b8b6b1)
        6.0s: CORECLR! IBCLoggerAwareAllocMemTracker::~IBCLoggerAwareAllocMemTracker + 0xDC (0x708ff85d)
        6.0s: CORECLR! ClassLoader::DoIncrementalLoad + 0x437 (0x706c677a)
        6.0s: CORECLR! ClassLoader::LoadTypeHandleForTypeKey_Body + 0x894 (0x706cf4c8)
        6.0s: CORECLR! ClassLoader::LoadTypeHandleForTypeKey + 0x19A (0x706ce70c)
        6.0s: CORECLR! ClassLoader::LoadTypeDefThrowing + 0x8A5 (0x706ce38c)
        6.0s: CORECLR! ClassLoader::LoadTypeDefOrRefThrowing + 0x6E1 (0x706cd96f)
        6.0s: CORECLR! SigPointer::GetTypeHandleThrowing + 0x926 (0x707fe472)

This looks like it is the same failure that was happening in #54469.

@josalem
Copy link
Contributor

josalem commented Aug 16, 2021

Popped open the dumps from these failures.

Looks to be coming from here (the destructor on the alloc tracker being hit at the end of the scope):

case CLASS_LOAD_BEGIN :
{
IBCLoggerAwareAllocMemTracker amTracker;
typeHnd = CreateTypeHandleForTypeKey(pTypeKey, &amTracker);
CONSISTENCY_CHECK(!typeHnd.IsNull());
TypeHandle published = PublishType(pTypeKey, typeHnd);
if (published == typeHnd)
amTracker.SuppressRelease();
typeHnd = published;
}
break;

The scope of the tracking seems to be limited to this class load if I understand the alloc tracker code correctly.

We're reaching this during loading from an R2R image and doing fixups:
image

@davidwrighton or @mangod9 or @janvorli is there something I should take a look at that might give us more information? I have the MD for the method being loaded: it's System.Diagnostics.Tracing.EventProvider.GetDataFromController which is coming from SPCL.

@josalem
Copy link
Contributor

josalem commented Aug 16, 2021

Let me amend that previous comment:

The method being reverse p/invoked is System.Diagnostics.Tracing.EventProvider.GetDataFromController, but the type being loaded as part of the fixup when the failure happens is System.Globalization.NumberFormatInfo.

@josalem
Copy link
Contributor

josalem commented Aug 17, 2021

I'm unable to reproduce this locally. @hoyosjs helped me scrape some more stats about these failures:

  • overwhelming majority are win-x86, but there are a couple win-arm32 failures as well
  • all on Windows
  • checked build
  • no tiered compilation
  • 12 failures in the last 10 days (~1 per day)
  • happens across multiple tracing/eventing tests since they all end reverse p/invoking the same way

There is at least one example of this failure happening in a CI run: https://dev.azure.com/dnceng/public/_build/results?buildId=1285558&view=ms.vss-test-web.build-test-results-tab&runId=37949368&resultId=110729&paneView=debug

Looking at the failure there, it gets a little farther, but fails to load a different type (System.Numerics.Vector). The critical bit here, is that this failure doesn't involve CG2 while the rest seemingly are all R2R images. This means the failure happens under the JIT as well.

The node in the linked list of allocations that is messed up always seems to be coming from the same place:

BYTE *
MethodTableBuilder::AllocateFromLowFrequencyHeap(S_SIZE_T cbMem)
{
CONTRACTL
{
THROWS;
GC_NOTRIGGER;
MODE_ANY;
}
CONTRACTL_END;
return (BYTE *)GetMemTracker()->Track(
GetLoaderAllocator()->GetLowFrequencyHeap()->AllocMem(cbMem));
}

@josalem josalem changed the title Test failure:tracing\\eventpipe\\reverseouter\\reverseouter\\reverseouter.cmd Recurring test failures in tracing due to class loader alloc tracker heap corruption Aug 17, 2021
@josalem josalem self-assigned this Aug 17, 2021
@josalem josalem removed JitStress CLR JIT issues involving JIT internal stress modes untriaged New issue has not been triaged by the area owner labels Aug 17, 2021
@josalem josalem added this to the 7.0.0 milestone Aug 17, 2021
@josalem
Copy link
Contributor

josalem commented Aug 17, 2021

cc @dotnet/dotnet-diag

@davidwrighton
Copy link
Member

@josalem This looks to me like something is setting the m_pFirstBlock field on the AllocMemTracker to NULL unexpectedly. I think you should add an assert to AllocMemTracker::Track_NoThrow that validates that m_pFirstBlock is not NULL. I think you'll find that there is some case where it IS null at the same rate as the assert you're looking at in this bug, and this assert will cease to repro.

However, I think the actual fault is likely tied in with the IPC work that's happening in EventPipe. In particular, if attaching and detaching the event pipe has some timeout/cancellation scenarios there is a risk that the async operation may complete at an unexpected time, and write to a local value on the stack if any stack pointers are ever passed to the file i/o functions. Of particular interest to me are the various failure paths which invove GetOverlappedResult failing. If that happens, I believe the contract is that the async operation could still complete, even if the IO has been cancelled, and such you can't reuse OVERLAPPED structures, etc until the IO is done. However, I'm not an expert in Windows IO api usage, so I'd push for someone who is to take a look.

The other possibility I see is freeing a heap object in the event pipe code, but continuing to use it. (I suggest the issue is eventpipe related as these failures seem to only happen with the event pipe shutdown code on the stack).

@lateralusX
Copy link
Member

lateralusX commented Sep 3, 2021

Found a case where we can set streaming_thread to NULL after the session have been freed on different thread (thread running IPC stop tracing command), #58615.

streaming_thread is first field of the session struct, and looking at the m_pNext in AllocMemTrackerBlock it also seems to be the first field in that struct. It is possible for IPC thread to disable the session (triggering streaming thread to exit), free session, start to run callbacks that in turn will lead to new allocations reusing the freed session memory, before streaming thread get's back and sets session->streaming_thread = NULL causing a memory corruption. The window is very small, but more possible on machines with few cores, since streaming thread can't get any cpu time between signalling rt_thread_shutdown_event and IPC thread calling ep_session_free after completing its wait on rt_thread_shutdown_event. The implementation of EP_GCX_PREEMP_EXIT could also impact size of window to repro, since on CoreCLR the descrutor will be run between signaling and setting field. On MonoVM EP_GCX_PREEMP_EXIT is a noop. I was able to manually reproduce this write after free in debugger and implemented fix in #58615 closing the race.

@lateralusX
Copy link
Member

lateralusX commented Sep 9, 2021

Failures in https://dev.azure.com/dnceng/public/_build/results?buildId=1350047 are different and most likely due to changes done in PR, several tests hits OOM:

 Unhandled exception. System.AggregateException: One or more errors occurred. (Exception of type 'System.OutOfMemoryException' was thrown.)
       ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
         at System.Collections.Hashtable.rehash(Int32 newsize)
         at System.Collections.Hashtable.Insert(Object key, Object nvalue, Boolean add)
         at System.Environment.GetEnvironmentVariables()
         at System.Diagnostics.ProcessStartInfo.get_Environment() in /_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/ProcessStartInfo.cs:line 84
         at Tracing.Tests.Common.Utils.RunSubprocess(Assembly currentAssembly, Dictionary`2 environment, Func`1 beforeExecution, Func`2 duringExecution, Func`1 afterExecution)
         --- End of inner exception stack trace ---
         at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
         at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
         at Tracing.Tests.ProcessEnvironmentValidation.ProcessEnvironmentValidation.Main(String[] args)

so not the same as the previous crash identified by this issue.

@josalem
Copy link
Contributor

josalem commented Jan 5, 2022

@lateralusX I believe the problem identified in this issue was fixed with #58710. Can you confirm whether we can close this?

@lateralusX
Copy link
Member

Believe we can close it, have not seen or heard of any failures similar to this since this fix went in.

@josalem josalem closed this as completed Jan 12, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Feb 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants