App crashes with an output "Trace/Breakpoint Trap" on Linux when a P/Invoke callback is called from a native library if the dotnet debugger is attached. #104459

Open
walterlv opened this issue Jul 5, 2024 · 20 comments

Comments

@walterlv

walterlv commented Jul 5, 2024

Description

  1. Write a .NET 8 application that calls a native library using P/Invoke with a callback.
  2. Run the app, then attach the dotnet debugger before the callback is called.
  3. We'll see an output "Trace/Breakpoint Trap" and the app crashes.

Note: Not all native callbacks cause this issue so I've written a minimal reproducible example below.

Reproduction Steps

Minimal reproducible example 1:

  1. Clone this repo: https://github.com/walterlv/Walterlv.Issues.TraceBreakpointTrap
  2. Build the demo for a Linux machine.
  3. Run the app, then attach the dotnet debugger.
dotnet publish -c debug -r linux-x64 --self-contained
$ ./TraceBreakpointTrapDemo
### Trace/Breakpoint Trap issue on .NET debugger ###
Please attach a dotnet debugger and use 'Set next statement'.
Trace/breakpoint trap

Reproducible example 2:

Expected behavior

The app should not crash when the dotnet debugger is attached.

Actual behavior

The app crashes with an output "Trace/Breakpoint Trap".
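
For context, the "Trace/breakpoint trap" text is what the launching shell prints when a child process is killed by SIGTRAP. A minimal shell sketch, assuming a Linux shell where SIGTRAP is signal 5, shows how to recognize this from the exit status. The `signal_from_status` helper name is hypothetical and not part of the repro, and a child shell stands in for the crashing app:

```shell
#!/bin/sh
# signal_from_status STATUS: print the signal name encoded in a shell exit
# status (shells report "killed by signal N" as 128 + N), or "none".
signal_from_status() {
    if [ "$1" -gt 128 ]; then
        kill -l "$(($1 - 128))"
    else
        echo "none"
    fi
}

# Stand-in for the crashing app: a child shell that sends SIGTRAP to itself
# while the signal still has its default disposition (core dumps disabled).
status=0
sh -c 'ulimit -c 0; kill -TRAP $$' || status=$?

echo "exit status: $status"
echo "terminating signal: $(signal_from_status "$status")"
```

Running `signal_from_status $?` right after the real demo exits should report the same signal if the crash is the one described here.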

Regression?

I've only tested this on .NET 8.0.302

Known Workarounds

I've found several workarounds:

  1. Detect if the debugger is attached and don't call the callback.
  2. Use the "Native (GDB)" or "Native (LLDB)" debugger instead of the "Managed (.NET Core for Unix)" debugger.

Note:

  • The Debugger.IsAttached property cannot detect the native debugger so I added alternative options --sleep <seconds> and --skip-attach for the minimal reproducible example above.
  • The native debugger is very difficult to use, so I hope this issue can be fixed.
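
Since Debugger.IsAttached only reports the managed debugger, one possible way to notice a native debugger from outside the process is the TracerPid field in /proc/<pid>/status, which Linux sets to the PID of a ptrace-based tracer such as GDB or LLDB. This is a hedged sketch, not what the demo does; the `tracer_pid` helper name is hypothetical, and the current shell's PID is used only as a placeholder:

```shell
#!/bin/sh
# tracer_pid PID: print the TracerPid field of /proc/PID/status.
# A value of 0 means no ptrace-based tracer is attached.
tracer_pid() {
    awk '/^TracerPid:/ { print $2 }' "/proc/$1/status"
}

pid=$$   # placeholder: replace with the PID of the demo process
tracer=$(tracer_pid "$pid")

if [ "$tracer" -eq 0 ]; then
    echo "no native debugger attached to $pid"
else
    echo "process $pid is traced by PID $tracer"
fi
```

This only covers ptrace-based debuggers, so it complements Debugger.IsAttached rather than replacing it.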

Configuration

  • .NET: 8.0.302
  • OS:
    • Ubuntu 22.04 LTS
    • Debian 12
    • UnionTech OS GNU/Linux 20
    • Kylin V10 SP1
  • Architecture:
    • x64
    • ARM64

I didn't find any environment that doesn't have this issue.

Other information

  1. dotnet tool install -g dotnet-sos
  2. dotnet sos install
  3. ulimit -c unlimited
  4. Run echo "0x3F"> /proc/<pid>/coredump_filter after the process starts and the pid is known.
  5. Attach the debugger and wait for the output Trace/Breakpoint Trap (core dumped).
  6. lldb --core core TraceBreakpointTrapDemo
$ lldb --core core TraceBreakpointTrapDemo
SOS_HOSTING: Failed to find runtime directory
Unrecognized command 'setsymbolserver' because managed hosting failed or was disabled. See sethostruntime command for details.
(lldb) target create "TraceBreakpointTrapDemo" --core "core"
Core file '/home/uos/lvyi/Walterlv.Issue.TraceBreakpointTrap/core' (x86_64) was loaded.
(lldb) clrstack
OS Thread Id: 0x7ef9 (1)
        Child SP               IP Call Site
00007F4AF37DBA38 00007F4AF45F3B41 Walterlv.Issues.TraceBreakpointTrap.VolumeManager.ContextStateCallback(IntPtr, IntPtr)
(lldb) bt
* thread #1, name = 'TraceBreakpoint', stop reason = signal SIGTRAP
  * frame #0: 0x00007f4af45f3b41
    frame #1: 0x00007f4b6ba904f9 libpulse.so.0`___lldb_unnamed_symbol12$$libpulse.so.0 + 73
    frame #2: 0x00007f4b6ba93002 libpulse.so.0`___lldb_unnamed_symbol28$$libpulse.so.0 + 514
    frame #3: 0x00007f4b6ba931d2 libpulse.so.0`___lldb_unnamed_symbol29$$libpulse.so.0 + 98
    frame #4: 0x00007f4b6ba459b2 libpulsecommon-14.2.so`___lldb_unnamed_symbol101$$libpulsecommon-14.2.so + 258
    frame #5: 0x00007f4b6baa63c0 libpulse.so.0`pa_mainloop_dispatch + 672
    frame #6: 0x00007f4b6baa65cc libpulse.so.0`pa_mainloop_iterate + 60
    frame #7: 0x00007f4b6baa6670 libpulse.so.0`pa_mainloop_run + 32
    frame #8: 0x00007f4b6bab43f9 libpulse.so.0`___lldb_unnamed_symbol111$$libpulse.so.0 + 105
    frame #9: 0x00007f4b6ba51628 libpulsecommon-14.2.so`___lldb_unnamed_symbol119$$libpulsecommon-14.2.so + 88
    frame #10: 0x00007f4b73452fa3 libpthread.so.0`start_thread(arg=<unavailable>) at pthread_create.c:486
    frame #11: 0x00007f4b7305d60f libc.so.6`__GI___clone at clone.S:95
(lldb) dis
->  0x7f4af45f3b41: subq   $0x20, %rsp
    0x7f4af45f3b45: leaq   0x20(%rsp), %rbp
    0x7f4af45f3b4a: movq   %rdi, -0x8(%rbp)
    0x7f4af45f3b4e: movq   %rsi, -0x10(%rbp)
    0x7f4af45f3b52: movq   %rdx, -0x18(%rbp)
    0x7f4af45f3b56: cmpl   $0x0, 0x897d3(%rip)
    0x7f4af45f3b5d: je     0x7f4af45f3b64
(lldb) 
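
Steps 3 and 4 of the core-dump procedure above can be scripted. The sketch below is an illustration under stated assumptions rather than part of the original report: `set_coredump_filter` is a hypothetical helper, and it is applied to the current shell only so the snippet can run on its own; in practice you would pass the PID of TraceBreakpointTrapDemo. Per the Linux core(5) manual page, 0x3F enables bits 0 to 5 (anonymous and file-backed private/shared mappings, ELF headers, and private huge pages).

```shell
#!/bin/sh
# set_coredump_filter PID MASK: write MASK to /proc/PID/coredump_filter and
# print the value the kernel reports back. Returns non-zero on failure.
set_coredump_filter() {
    echo "$2" > "/proc/$1/coredump_filter" || return 1
    cat "/proc/$1/coredump_filter"
}

# Step 3: allow core files of any size for this shell and its children.
ulimit -c unlimited 2>/dev/null || echo "could not raise the core size limit"

# Step 4, applied to the current shell ($$) as a stand-in for the demo's PID.
set_coredump_filter "$$" 0x3F || echo "could not set the coredump filter"
```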
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jul 5, 2024
Contributor

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

@tommcdon
Member

tommcdon commented Jul 8, 2024

Hi @walterlv! Thanks for reporting this bug!

I didn't find any environment that doesn't have this issue.

Do you know if this issue reproduces on Windows?

@tommcdon
Member

tommcdon commented Jul 8, 2024

Do you know if this issue reproduces on Windows?

Ahh, never mind this question, as the repro is very specific to Linux.

Do you know if the callback/debugging issue is specific to the libpulse API (e.g. does a standalone repo that uses callback from C++ to C# on Linux reproduce the issue)? I am curious if there is something specific to libpulse that is causing the problem, for example a difference in calling convention, etc...

@lindexi
Contributor

lindexi commented Jul 9, 2024

@tommcdon I can repro this issue with @walterlv's repo on my Linux system. And I am sure it's not a libpulse bug, because I can also repro it with https://github.com/Haltroy/CefGlue


I cannot reproduce it on Windows because I failed to run libpulse on Windows... I mean I do not know whether it can be reproduced on Windows.

@tommcdon
Member

tommcdon commented Jul 9, 2024

Possible duplicate of #102767. @hoyosjs

@walterlv
Author

walterlv commented Jul 10, 2024

Thanks to my friend @kkwpsv, who helped me find out more information about this issue.

@tommcdon This issue is quite different from #102767:

  1. This issue is related to the dotnet debugger on Linux (and only on Linux).
  2. This issue might not be related to the callback, but I can't figure out whether it is or not.

Let's see more details here.

  1. Run the app under a dotnet debugger (I was using the Linux version of JetBrains Rider) and let the app stop at a breakpoint.
  2. Attach lldb to the running process.
  3. Continue the app in the dotnet debugger.
  4. Continue the app in the lldb debugger.

Then,

  1. List all the threads in lldb using thread backtrace all; we see that thread 3 (.NET EventPipe) is stopped with signal SIGTRAP.
  2. Resume the app, and thread 3 receives signal SIGSEGV: address not mapped to object (fault address: 0xbafa13a0).

The stack traces are shown as follows:

image

image

[UnmanagedCallersOnly]
private static unsafe void Callback(byte* sourceId, int isEnabled, byte level,
    long matchAnyKeywords, long matchAllKeywords, Interop.Advapi32.EVENT_FILTER_DESCRIPTOR* filterData, void* callbackContext)
{
    EventPipeEventProvider _this = (EventPipeEventProvider)GCHandle.FromIntPtr((IntPtr)callbackContext).Target!;
    if (_this._eventProvider.TryGetTarget(out EventProvider? target))
    {
        _this.ProviderCallback(target, sourceId, isEnabled, level, matchAnyKeywords, matchAllKeywords, filterData);
    }
}

@tommcdon tommcdon added this to the 9.0.0 milestone Jul 20, 2024
@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Jul 20, 2024
@tommcdon
Member

@hoyosjs

@mdh1418
Member

mdh1418 commented Aug 9, 2024

Hi @walterlv and @lindexi,

We haven't been able to repro the exact issue from your repros yet, but the SIGSEGV for the EventPipeEventProvider callback looks eerily similar to #80666 (comment), where the _gchandle used in the callback had been freed before the callback completes.

If the dotnet debugger is hitting the same EventPipeEventProvider Callback issue, then there is a partial fix already merged through #106040 and a second PR, #106156, that is still open.

@tommcdon tommcdon modified the milestones: 9.0.0, 10.0.0 Aug 9, 2024
@lindexi
Contributor

lindexi commented Aug 10, 2024

@mdh1418 Thank you. Which Visual Studio version and dotnet version did you use? And did you debug the application running on Linux?

Can I test with a daily dotnet build that includes #106040?

@tommcdon
Member

Which Visual Studio version and dotnet version did you use? And did you debug the application running on Linux?

We used the latest version of the C# extension in VS Code.

Can I test with a daily dotnet build that includes #106040?

Yes - the daily builds from https://github.com/dotnet/sdk/blob/main/documentation/package-table.md contain the fix.

@kkwpsv

kkwpsv commented Aug 12, 2024

@tommcdon I tested again with https://aka.ms/dotnet/9.0.1xx/daily/dotnet-sdk-linux-x64.tar.gz.
There is no SIGSEGV now, but the process still exits with SIGTRAP.

I debugged it with lldb. Here's the output:
image

@jwilliamsonveeam

Seems like the same problem I'm seeing here: microsoft/DockerTools#444

@lindexi
Contributor

lindexi commented Sep 28, 2024

@jwilliamsonveeam Sorry, microsoft/DockerTools#444 is too long; I'm afraid I may be missing important information.

@jwilliamsonveeam

@lindexi I updated my last comment with a small self-contained example of a program that fails with a SIGTRAP in the native C code callback.
microsoft/DockerTools#444 (comment)
and a zip of the whole solution is in this thread if you have access.
https://developercommunity.visualstudio.com/t/dotnet-process-silently-crashes-when-deb/10740222?

@Alxe

Alxe commented Oct 1, 2024

I've run @walterlv's reproducer (Walterlv.Issues.TraceBreakpointTrap) and reproduced the issue as well.

I've been debugging a similar issue where the scenario is as follows:

  • A C# callback (annotated with UnmanagedFunctionPointer) is sent to a C function through P/Invoke (annotated with DllImport).
  • The C code is run in a thread distinct from the one that installed the C# callback.
  • If the debugger is attached when the C# callback is executed for the first time, the application crashes with a SIGTRAP.
  • If the debugger is attached after the C# callback has been executed once, the application works correctly.

Using @walterlv's reproducer as a base, I've modified it with these changes and managed to avoid the crash. The output from my execution is as follows:

$ ./artifacts/bin/Walterlv.Issues.TraceBreakpointTrap/debug/TraceBreakpointTrapDemo --skip-attach
### Trace/Breakpoint Trap issue on .NET debugger ###

Context state changed: 1
If you want to debug this demo using other debuggers (e.g. GDB, LLDB), you can use the following options:

  --sleep <seconds>  Sleep for a while before attaching debugger.
  --skip-attach      Skip attaching debugger and run directly.

Please attach a dotnet debugger and use 'Set next statement'.
Context state changed: 2
Context state changed: 3
Context state changed: 4
Context state changed: 5
Issue may not be reproduced. Exit.

In the output, changes 1 to 4 are from before the debugger is attached. Once the debugger is attached, change 5 is printed but there's no crash.

Additionally, in my own (non-shareable) projects, I've been able to use a C debugger (lldb or gdb) to manually call the callback (through a function pointer) directly from the debugger. This led to the C# application throwing the following error:

Fatal error. Invalid Program: attempted to call a UnmanagedCallersOnly method from managed code.

This error is seemingly thrown here, but I don't have a fine understanding of the dotnet runtime.
However, it leads me to believe that the key is that there are two distinct threads.

@janvorli
Member

janvorli commented Oct 1, 2024

  • If the debugger is attached when the C# callback is executed for the first time, the application crashes with a SIGTRAP.
  • If the debugger is attached after the C# callback has been executed once, the application works correctly.

I think this may have revealed the culprit. The .NET runtime only handles signals when the thread they occurred on is known to the runtime. That means the thread was either created by the runtime or has already called into the runtime. If the debugger sets the breakpoint on the UnmanagedCallersOnly-marked method before the thread calls into the runtime and is registered as one that runs managed code, the SIGTRAP would not call the handler in the runtime and it would invoke the default signal handler, which terminates the process.
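
The last step of that explanation, an unhandled SIGTRAP terminating the process, can be illustrated outside of .NET. The following shell sketch is only an analogy under stated assumptions: it uses two throwaway child shells rather than the runtime, and it does not model the runtime's per-thread bookkeeping. One child receives SIGTRAP with the default disposition; the other installs a handler first.

```shell
#!/bin/sh
# Child 1: SIGTRAP arrives with the default disposition, so the child is
# killed before it can print anything (core dumps disabled for tidiness).
default_status=0
default_out=$(sh -c 'ulimit -c 0; kill -TRAP $$; echo "still running"') || default_status=$?

# Child 2: a handler is installed before SIGTRAP arrives, so the child
# survives the signal and continues to its next command.
handled_status=0
handled_out=$(sh -c 'trap "echo handled SIGTRAP" TRAP; kill -TRAP $$; echo "still running"') || handled_status=$?

echo "default disposition: status=$default_status output='$default_out'"
echo "handler installed:   status=$handled_status output='$handled_out'"
```

The first child mirrors the crash reported in this issue; the second mirrors the case where a handler takes the signal and execution continues.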

This error is seemingly thrown here

That code is for NativeAOT; in CoreCLR, the error comes from here:

extern "C" VOID STDCALL ReversePInvokeBadTransition()
{
    STATIC_CONTRACT_THROWS;
    STATIC_CONTRACT_GC_TRIGGERS;
    // Fail
    EEPOLICY_HANDLE_FATAL_ERROR_WITH_MESSAGE(
        COR_E_EXECUTIONENGINE,
        W("Invalid Program: attempted to call a UnmanagedCallersOnly method from managed code.")
    );
}

@Alxe

Alxe commented Oct 1, 2024

@janvorli Hello and thanks for your input!

I'll be reviewing the ReversePInvokeBadTransition function, as I think I already added a native breakpoint there (it's an extern "C" function) and was able to hit it once.

However, I'd like to point out that the yet-unregistered thread is receiving a SIGTRAP regardless of whether I had a .NET breakpoint or not. Is there anything relevant that the debugger could be doing on thread registration? Could you share some links to code?

@jwilliamsonveeam

jwilliamsonveeam commented Oct 1, 2024

https://github.com/jwilliamsonveeam/TimerCallBackDemo
I created a repo with my failing case. I also do not need any breakpoints in order for this to fail with a SIGTRAP with the debugger attached.

@janvorli
Member

janvorli commented Oct 1, 2024

The debugger can set some breakpoints on its own for its internal purposes. @tommcdon would most likely know if it can be the case here.

@Alxe

Alxe commented Oct 7, 2024

@janvorli If the debugger is setting its own breakpoint (e.g. on managed-to-unmanaged transitions) and then reaching it before the thread is properly registered with the .NET runtime (e.g. on the first .NET interaction of a thread), then the SIGTRAP and subsequent crash would make sense.

@tommcdon Could you please confirm if my assumption is correct?

8 participants