Difficulties diagnosing SIGSEGV in dotnet core 2.2.8 #485

Closed
taion809 opened this issue Dec 3, 2019 · 18 comments
Labels
area-VM-coreclr question Answer questions and provide assistance, not an issue with source code or documentation.
@taion809

taion809 commented Dec 3, 2019

Hello,
We're currently running into a segfault with a .NET Core application running in a Linux Docker container. We managed to produce a core dump on signal SIGSEGV; however, our investigation hasn't turned up anything related to our code specifically.

Are there additional steps we can take to debug what may be causing this segfault?

Info

dotnet --info

root@d80b6d51aaa1:/app# dotnet --info
  It was not possible to find any installed .NET Core SDKs
  Did you mean to run .NET Core SDK commands? Install a .NET Core SDK from:
      https://aka.ms/dotnet-download

Host (useful for support):
  Version: 3.0.1
  Commit:  19942e7199

.NET Core SDKs installed:
  No SDKs were found.

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.2.8 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.2.8 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.0.1 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.2.8 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.0.1 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

To install additional .NET Core runtimes or SDKs:
  https://aka.ms/dotnet-download
bt
* thread #1, name = 'Api', stop reason = signal SIGSEGV
  * frame #0: 0x00007f525d914730 libcoreclr.so`CLREventBase::Set()
    frame #1: 0x00007f525db5b533 libcoreclr.so`ObjectNative::Pulse(Object*) + 211
    frame #2: 0x00007f51e429b2f3
    frame #3: 0x00007f51e7c83d3e
    frame #4: 0x00007f51e7c5d521
    frame #5: 0x00007f51e7c5cd7b
    frame #6: 0x00007f51e7c5cd01
    frame #7: 0x00007f51e7c5c5ab
    frame #8: 0x00007f51e7c5c1a2
    frame #9: 0x00007f51e7c5c08b
    frame #10: 0x00007f51e7c5bc98
    frame #11: 0x00007f51e7c5b63b
    frame #12: 0x00007f51e7c5b5c1
    frame #13: 0x00007f51e7c58e7e
    frame #14: 0x00007f51e7c588fb
    frame #15: 0x00007f51e7c58881
    frame #16: 0x00007f51e7c57fb8
    frame #17: 0x00007f51e7c57e1b
    frame #18: 0x00007f51e7c57da4
    frame #19: 0x00007f51e7c56197
    frame #20: 0x00007f51e7c5533b
    frame #21: 0x00007f51e7c552b3
    frame #22: 0x00007f51e7c54f80
    frame #23: 0x00007f51e7c54ddb
    frame #24: 0x00007f51e7c54d54
    frame #25: 0x00007f51e7c54c80
    frame #26: 0x00007f51e7c54586
    frame #27: 0x00007f51e7c5385b
    frame #28: 0x00007f51e7c53563
    frame #29: 0x00007f51e7c53280
    frame #30: 0x00007f51e7c52e2b
    frame #31: 0x00007f51e7c52b49
    frame #32: 0x00007f51e7c52a15
    frame #33: 0x00007f51e47cd1da
    frame #34: 0x00007f51e47cc553
    frame #35: 0x00007f51e75df182
    frame #36: 0x00007f51e41b34dd
    frame #37: 0x00007f51e42b0588
    frame #38: 0x00007f525d94f17f libcoreclr.so`CallDescrWorkerInternal + 124
    frame #39: 0x00007f525d86d67a libcoreclr.so`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 954
    frame #40: 0x00007f525d9f0d15 libcoreclr.so`QueueUserWorkItemManagedCallback(void*) + 181
    frame #41: 0x00007f525d83e7df libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 431
    frame #42: 0x00007f525d83ef90 libcoreclr.so`ManagedThreadBase::ThreadPool(ADID, void (*)(void*), void*) + 64
    frame #43: 0x00007f525d9d4e47 libcoreclr.so`ManagedPerAppDomainTPCount::DispatchWorkItem(bool*, bool*) + 295
    frame #44: 0x00007f525d85d903 libcoreclr.so`ThreadpoolMgr::WorkerThreadStart(void*) + 1267
    frame #45: 0x00007f525dbdc525 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 309
    frame #46: 0x00007f525f1a36db libpthread.so.0`start_thread(arg=0x00007f514b7fe700) at pthread_create.c:463
    frame #47: 0x00007f525e58d88f libc.so.6`__GI___clone at clone.S:95
clrstack -all -a
(lldb) clrstack -all -a
OS Thread Id: 0xd
        Child SP               IP Call Site
00007FFD376A7790 00007f525f1a99f3 [GCFrame: 00007ffd376a7790]
00007FFD376A7870 00007f525f1a99f3 [HelperMethodFrame_1OBJ: 00007ffd376a7870] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
00007FFD376A79A0 00007F51E42AD4A2 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
    PARAMETERS:
        this (0x00007FFD376A79B0) = 0x00007f51c4132140
        millisecondsTimeout (<CLR reg>) = 0x00000000ffffffff
        cancellationToken = <no data>
    LOCALS:
        <CLR reg> = 0x0000000000000000
        <CLR reg> = 0x0000000000000000
        <CLR reg> = 0x00000000ffffffff
        <no data>
        <no data>
        <no data>
        0x00007FFD376A79A8 = 0x00007f51c4132168
        <no data>
        0x00007FFD376A79CC = 0x0000000000000000
        <no data>

00007FFD376A7A30 00007F51E42789E9 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)
    PARAMETERS:
        this (0x00007FFD376A7A48) = 0x00007f51c41320b0
        millisecondsTimeout = <no data>
        cancellationToken = <no data>
    LOCALS:
        <no data>
        <no data>
        <no data>
        0x00007FFD376A7A40 = 0x00007f51c4132140
        <no data>

00007FFD376A7A90 00007F51E4278879 System.Threading.Tasks.Task.InternalWaitCore(Int32, System.Threading.CancellationToken)
    PARAMETERS:
        this (<CLR reg>) = 0x00007f51c41320b0
        millisecondsTimeout = <no data>
        cancellationToken = <no data>
    LOCALS:
        <no data>
        <CLR reg> = 0x00007f51c4021088
        0x00007FFD376A7AB4 = 0x0000000000000000
        <no data>
        <no data>

00007FFD376A7AF0 00007F51E42996B6 System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
    PARAMETERS:
        task (<CLR reg>) = 0x00007f51c41320b0
    LOCALS:
        <no data>
        <no data>

00007FFD376A7B10 00007F51E72A10AB /app/Microsoft.AspNetCore.Hosting.dll!Unknown

00007FFD376A7B30 00007F51E4791C72 /app/Api.dll!Unknown

00007FFD376A7E58 00007f525d94f17f [GCFrame: 00007ffd376a7e58]
00007FFD376A8260 00007f525d94f17f [GCFrame: 00007ffd376a8260]
OS Thread Id: 0x15
        Child SP               IP Call Site
00007F5259E06D40 00007f525f1a9ed9 [DebuggerU2MCatchHandlerFrame: 00007f5259e06d40]
OS Thread Id: 0x16
        Child SP               IP Call Site
OS Thread Id: 0x1a
        Child SP               IP Call Site
00007F5248C9D818 00007f525e58dbb7 [InlinedCallFrame: 00007f5248c9d818] /app/System.Net.Sockets.dll!Unknown
00007F5248C9D818 00007f51e47da778 [InlinedCallFrame: 00007f5248c9d818] /app/System.Net.Sockets.dll!Unknown
00007F5248C9D810 00007F51E47DA778 <unknown method>
    PARAMETERS:
        <no data>
        <no data>
        <no data>

OS Thread Id: 0x3b
        Child SP               IP Call Site
00007F51C27F9818 00007f525e58dbb7 [InlinedCallFrame: 00007f51c27f9818] /app/System.Net.Sockets.dll!Unknown
00007F51C27F9818 00007f51e47da778 [InlinedCallFrame: 00007f51c27f9818] /app/System.Net.Sockets.dll!Unknown
00007F51C27F9810 00007F51E47DA778 <unknown method>
    PARAMETERS:
        <no data>
        <no data>
        <no data>

OS Thread Id: 0x44
        Child SP               IP Call Site
00007F51697F9818 00007f525e58dbb7 [InlinedCallFrame: 00007f51697f9818] /app/System.Net.Sockets.dll!Unknown
00007F51697F9818 00007f51e47da778 [InlinedCallFrame: 00007f51697f9818] /app/System.Net.Sockets.dll!Unknown
00007F51697F9810 00007F51E47DA778 <unknown method>
    PARAMETERS:
        <no data>
        <no data>
        <no data>

OS Thread Id: 0x45
        Child SP               IP Call Site
00007F5168FF8818 00007f525e58dbb7 [InlinedCallFrame: 00007f5168ff8818] /app/System.Net.Sockets.dll!Unknown
00007F5168FF8818 00007f51e47da778 [InlinedCallFrame: 00007f5168ff8818] /app/System.Net.Sockets.dll!Unknown
00007F5168FF8810 00007F51E47DA778 <unknown method>
    PARAMETERS:
        <no data>
        <no data>
        <no data>

OS Thread Id: 0x82
        Child SP               IP Call Site
OS Thread Id: 0x84
        Child SP               IP Call Site
OS Thread Id: 0x88
        Child SP               IP Call Site
OS Thread Id: 0x8b
        Child SP               IP Call Site
00007F5246344520 00007f525f1a9ed9 [GCFrame: 00007f5246344520]
00007F5246344600 00007f525f1a9ed9 [HelperMethodFrame_1OBJ: 00007f5246344600] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
OS Thread Id: 0x90
        Child SP               IP Call Site
OS Thread Id: 0x91
        Child SP               IP Call Site
00007F514B7FC680 00007f525d914730 [HelperMethodFrame_1OBJ: 00007f514b7fc680] System.Threading.Monitor.ObjPulse(System.Object)
OS Thread Id: 0x93
        Child SP               IP Call Site
00007F51E3AD0C90 00007f525d8bd011 [DebuggerU2MCatchHandlerFrame: 00007f51e3ad0c90]
OS Thread Id: 0x94
        Child SP               IP Call Site
OS Thread Id: 0x96
        Child SP               IP Call Site
OS Thread Id: 0x97
        Child SP               IP Call Site
OS Thread Id: 0x98
        Child SP               IP Call Site
OS Thread Id: 0x99
        Child SP               IP Call Site
OS Thread Id: 0x9a
        Child SP               IP Call Site
OS Thread Id: 0x9b
        Child SP               IP Call Site
OS Thread Id: 0x9c
        Child SP               IP Call Site
OS Thread Id: 0x9d
        Child SP               IP Call Site
clrthreads
(lldb) clrthreads
ThreadCount:      23
UnstartedThread:  0
BackgroundThread: 22
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
                                                                                                        Lock
 DBG   ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
  29    1    d 000000000269A600  2020020 Preemptive  0000000000000000:0000000000000000 00000000025D1150 0     Ukn
  19    2   15 000000000256F1B0    21220 Preemptive  0000000000000000:0000000000000000 00000000025D1150 0     Ukn (Finalizer)
  17    3   16 00007F51B4000C50  1020220 Preemptive  0000000000000000:0000000000000000 00000000025D1150 0     Ukn (Threadpool Worker)
  10    6   1a 00007F51B800D580    21220 Preemptive  00007F51C7016680:00007F51C7018550 00000000025D1150 0     Ukn
   8   35   3b 00007F51940CB5E0    21220 Preemptive  00007F51C70146A8:00007F51C7016550 00000000025D1150 0     Ukn
  14   14   44 00007F5178009B00    21220 Preemptive  00007F51C70105E0:00007F51C7012550 00000000025D1150 0     Ukn
  26   21   45 00007F51580193F0    21220 Preemptive  00007F51C7018748:00007F51C701A550 00000000025D1150 0     Ukn
  20    5   82 00007F514C0F21F0  1021220 Preemptive  00007F51C7299260:00007F51C729A5B0 00000000025D1150 0     Ukn (Threadpool Worker)
   5   30   84 00007F514C15B760  1021220 Preemptive  00007F51C732AB58:00007F51C732C5B0 00000000025D1150 0     Ukn (Threadpool Worker)
  15   24   88 00007F51B8218330  1021220 Preemptive  00007F51C72785E0:00007F51C727A5B0 00000000025D1150 0     Ukn (Threadpool Worker)
  12   32   8b 00007F51B80E92E0  3021220 Preemptive  00007F51C7240988:00007F51C72425B0 00000000025D1150 0     Ukn (Threadpool Worker)
   7   22   90 00007F5160158AD0  1021220 Preemptive  0000000000000000:0000000000000000 00000000025D1150 0     Ukn (Threadpool Worker)
   1   39   91 00007F513C1987A0  1021220 Cooperative 00007F51C7248008:00007F51C72485B0 00000000025D1150 1     Ukn (Threadpool Worker)
   6   20   93 00007F5198414CF0  1021220 Cooperative 00007F51C73BC118:00007F51C73BC5B0 00000000025D1150 0     Ukn (Threadpool Worker)
  18   42   94 00007F51382ED4B0  1021220 Preemptive  00007F51C72D8F78:00007F51C72DA5B0 00000000025D1150 0     Ukn (Threadpool Worker)
  27   40   96 00007F51A813A7D0  1021220 Preemptive  00007F51C735C5B0:00007F51C735C5B0 00000000025D1150 0     Ukn (Threadpool Worker)
   3   19   97 00007F5140045DF0  1021220 Preemptive  0000000000000000:0000000000000000 00000000025D1150 0     Ukn (Threadpool Worker)
   2   12   98 00007F5138224000    21220 Preemptive  0000000000000000:0000000000000000 00000000025D1150 0     Ukn
   4    7   99 00007F513814B4F0  1021220 Preemptive  0000000000000000:0000000000000000 00000000025D1150 0     Ukn (Threadpool Worker)
  13    8   9a 00007F516011B120  1021220 Preemptive  00007F51C736AF58:00007F51C736C5B0 00000000025D1150 0     Ukn (Threadpool Worker)
  16   17   9b 00007F512C08F090  1021220 Preemptive  00007F51C7370688:00007F51C73725B0 00000000025D1150 0     Ukn (Threadpool Worker)
   9    4   9c 00007F513C10DF00  1021220 Preemptive  00007F51C73AFCB8:00007F51C73B05B0 00000000025D1150 0     Ukn (Threadpool Worker)
  24   18   9d 00007F51A007A170  1021220 Preemptive  0000000000000000:0000000000000000 00000000025D1150 0     Ukn (Threadpool Worker)
(lldb) syncblk
Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
   44 00000000025F73B8            1         1 00007F513C1987A0 91   1   00007f51c4441200 System.Object
-----------------------------
Total           172
Free            26
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 3, 2019
@wfurt
Member

wfurt commented Dec 4, 2019

cc: @janvorli

@janvorli
Member

janvorli commented Dec 5, 2019

Thread #1 is most likely not the one that's crashing. When you load a crash dump, lldb doesn't show the crashing thread as the current one, and it also prints "stop reason = signal SIGSEGV" for each thread.
Can you please run clrstack -f -all -a? This will also print native frames in the stack traces. The result will probably be quite large, so it may be better to share it as a gist instead of pasting it directly into this issue.

@taion809
Author

taion809 commented Dec 5, 2019

Hello,
Thanks for taking a look. I've attached the output here.

clrstack-aaf.txt

@janvorli
Member

janvorli commented Dec 6, 2019

My guess is that the crashing thread is most likely thread 6 (OS Thread Id: 0x93). It could also possibly be thread 1, but it is definitely not any other.
Can you please run the following lldb commands and share the results?

thread select 1
disass
reg read
thread select 6
disass
reg read
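
The reply above does not say how the two candidates were chosen. One thing that is visible in the clrthreads table from the original report is that the rows with DBG ids 1 and 6 are the only ones in Cooperative GC mode. Below is a minimal sketch, in Python, of filtering that table on the GC Mode column. The helper name `parse_clrthreads` is hypothetical, the column layout is assumed from the header printed by clrthreads above, and only a subset of the rows is embedded. Whether this was the reasoning actually used is an assumption.

```python
# Subset of rows copied from the clrthreads output in the original report.
CLRTHREADS = """\
 DBG   ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
  29    1    d 000000000269A600  2020020 Preemptive  0000000000000000:0000000000000000 00000000025D1150 0     Ukn
  12   32   8b 00007F51B80E92E0  3021220 Preemptive  00007F51C7240988:00007F51C72425B0 00000000025D1150 0     Ukn (Threadpool Worker)
   1   39   91 00007F513C1987A0  1021220 Cooperative 00007F51C7248008:00007F51C72485B0 00000000025D1150 1     Ukn (Threadpool Worker)
   6   20   93 00007F5198414CF0  1021220 Cooperative 00007F51C73BC118:00007F51C73BC5B0 00000000025D1150 0     Ukn (Threadpool Worker)
"""


def parse_clrthreads(text):
    """Parse clrthreads rows into dicts (hypothetical helper).

    Assumed columns: DBG, ID, OSID, ThreadOBJ, State, GC Mode,
    GC Alloc Context, Domain, Lock Count, Apt, then optional notes.
    Rows whose first column is not a number (header, summary lines)
    are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        rows.append({
            "dbg": int(parts[0]),          # index used by `thread select`
            "osid": parts[2].lower(),      # OS thread id, hex without 0x
            "gc_mode": parts[5],
            "lock_count": int(parts[8]),
        })
    return rows


rows = parse_clrthreads(CLRTHREADS)
cooperative = [(r["dbg"], r["osid"]) for r in rows if r["gc_mode"] == "Cooperative"]
print(cooperative)
```

For the embedded rows this lists DBG 1 (OS thread 0x91) and DBG 6 (OS thread 0x93), the same two threads that the lldb commands above select.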

@danmoseley
Member

@taion809 as an aside, note that 2.2 goes out of support this month. You will likely want to move to 3.1, which will be in support for 3 years.

See https://dotnet.microsoft.com/platform/support/policy/dotnet-core (needs an update for the 3.1 release that just occurred)

@taion809
Author

taion809 commented Dec 9, 2019

@janvorli attached thanks
@danmosemsft we're planning to move. We had a dependency on the Azure BotFramework moving to dotnetcore 3; that has since been resolved, and we plan to move after January :)

disass.txt

@janvorli
Member

janvorli commented Dec 9, 2019

OK, so it really is thread 1: CLREventBase::Set() gets a NULL "this" pointer. The actual place where CLREventBase::Set is called is here:

pWaitEventLink->m_EventWait->Set();

On the call stack it looks a bit different, but that's just because of inlining.
That means that the m_EventWait of the WaitEventLink that we get from ThreadQueue::DequeueThread(this) is NULL. It could be due to memory corruption or another bug.
I am not sure whether that's something that was fixed in 3.0 or something that is still unfixed and surfaces due to the specifics of your app.
How easy is it to repro the issue in your app?
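
As a cross-check on the thread 1 conclusion, the outputs in the original report can be lined up mechanically. The sketch below, in Python, uses excerpts copied from that report and two hypothetical helpers, `threads_at_ip` and `monitor_owner_osid`. It looks for the managed thread whose frame IP equals the frame #0 address from bt, and reads the owning thread from the single syncblk row. The syncblk column positions are assumed from the header shown in the report. This only shows that the three outputs agree with each other; it says nothing about why m_EventWait is NULL.

```python
import re

# Frame #0 address of thread #1 in the `bt` output (CLREventBase::Set).
NATIVE_TOP_FRAME = 0x00007F525D914730

# Excerpt of the `clrstack -all -a` output from the original report.
CLRSTACK = """\
OS Thread Id: 0x8b
        Child SP               IP Call Site
00007F5246344520 00007f525f1a9ed9 [GCFrame: 00007f5246344520]
00007F5246344600 00007f525f1a9ed9 [HelperMethodFrame_1OBJ: 00007f5246344600] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
OS Thread Id: 0x91
        Child SP               IP Call Site
00007F514B7FC680 00007f525d914730 [HelperMethodFrame_1OBJ: 00007f514b7fc680] System.Threading.Monitor.ObjPulse(System.Object)
"""

# The only data row of the `syncblk` output from the original report.
SYNCBLK_ROW = (
    "   44 00000000025F73B8            1         1 "
    "00007F513C1987A0 91   1   00007f51c4441200 System.Object"
)


def threads_at_ip(clrstack_text, ip):
    """Return OS thread ids (hex, no 0x) that have a frame whose IP equals ip."""
    hits, current = [], None
    for line in clrstack_text.splitlines():
        header = re.match(r"OS Thread Id: 0x([0-9a-fA-F]+)", line)
        if header:
            current = header.group(1).lower()
            continue
        # Frame lines start with two 16-digit hex fields: Child SP and IP.
        frame = re.match(r"([0-9A-Fa-f]{16}) ([0-9A-Fa-f]{16}) ", line)
        if frame and current and int(frame.group(2), 16) == ip:
            if current not in hits:
                hits.append(current)
    return hits


def monitor_owner_osid(syncblk_row):
    """Return the OS thread id of the monitor owner.

    Assumed columns: Index, SyncBlock, MonitorHeld, Recursion,
    owning ThreadOBJ, OSID, DBG id, object address, type.
    """
    return syncblk_row.split()[5].lower()


matching = threads_at_ip(CLRSTACK, NATIVE_TOP_FRAME)
owner = monitor_owner_osid(SYNCBLK_ROW)
print("threads at faulting IP:", matching)
print("monitor owner OSID:", owner)
```

In the excerpts used here, the thread in Monitor.ObjPulse (OS thread 0x91) is both the one whose IP matches the native frame #0 address and the one that syncblk lists as holding the monitor.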

@taion809
Author

taion809 commented Dec 9, 2019

We're running in Docker containers scheduled by Nomad, so it's not super easy to obtain a core dump or a procdump-on-crash because the containers run in unprivileged mode, but the crash is happening roughly once or twice a day in production.

@SheetalShah23

Is it possible that the thread was already dequeued?

@taion809
Author

taion809 commented Dec 9, 2019

I was able to reproduce this locally and obtained a minidump (as well as a Linux core dump). However, it appears that trying to dump the stack causes lldb to crash with its own segfault.

clrthreads.txt, syncblk.txt, clrsaaf.txt, and regreadmini.txt were taken from the minidump
regreadlnx.txt was taken from the linux coredump

clrthreads.txt
syncblk.txt
clrsaaf.txt
regreadmini.txt
regreadlnx.txt

@janvorli
Member

janvorli commented Dec 9, 2019

@taion809 thank you for the additional details. Can you also please dump the native stack traces using bt all?

@taion809
Author

taion809 commented Dec 10, 2019

Here is the output. FYI, we're running a soak test of the same code on dotnetcore 3.1; we will know more tomorrow.

btall.txt

@janvorli
Member

@taion809 thank you for the stack traces. The failing thread is thread #15 here. Can you please get me the output of the following in lldb?

thread select 15
disass
reg read

@taion809
Author

Hello,
Attached here

regread15.txt

@SheetalShah23

Any idea what might be the issue here?

@janvorli
Member

@taion809 I am sorry for the late response. I've made a mistake in the instructions, it should have been:

thread select 15
f 4
disass
reg read

@jeffschwMSFT jeffschwMSFT added question Answer questions and provide assistance, not an issue with source code or documentation. area-VM-coreclr labels Jan 8, 2020
@jeffschwMSFT jeffschwMSFT added this to the Future milestone Jan 8, 2020
@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Jan 8, 2020
@taion809
Author

Hello,
Sorry for the late response, I have attached the output here.

Thanks!

redread15.txt

@taion809
Author

Hi,
As an update here, we've moved to dotnetcore 3.1 and we have not experienced this issue.

Thanks so much for all the help!

@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
7 participants