assert in ProcessWaitState on Linux arm64 #74795
Tagging subscribers to this area: @dotnet/area-system-diagnostics-process

Issue details:

Related to #69125.
errno 10 -> ECHILD
and with (some) symbols: …
cc: @tmds
We had this last year for … This …
In such a case, is a fail fast necessary? Would it be reasonable to just continue?
Someone proposed such a change. We didn't take it because we obtain the exit code when we reap the child (from …).
I read that comment as a general statement that we prefer to fail fast rather than continue in a corrupted or undefined state (cc @jkotas). In this case, can we reasonably recover from this error? It seems we use the error code to release resources -- can we just do that? Or is a leak inevitable here? It seems fragile to terminate the app suddenly just because some child process terminates or is killed for some reason. I guess this does not happen on Windows.
Another question is why this started happening now. It can certainly be a race condition that always existed. I tried to reproduce it locally (on ARM64) but so far no luck.
The corrupted state here is akin to calling … The situation is not triggered by a child process terminating or being killed for some reason. The situation is triggered by some code in the current process handling it incorrectly. It is also possible that the root cause has nothing to do with process reaping. Instead, the root cause may be a memory corruption somewhere else that this crash is a side-effect of.
I did check the test suites and I don't see any other attempts to reap or execute besides our …
I'm running the test suite in a loop but, as usual, no repro. It is interesting that the dump above has …
The process exit handling code hasn't changed much in the past years. I don't expect it to have a bug, as it has been used probably millions of times in various situations without users reporting this error. I think we should treat this as a glitch unless it reproduces more.
I ran it the whole night on an ARM machine without a repro. However, it fails somewhat often in CI. As @jkotas mentioned, it may be data corruption, but I did not see any signs of that, and it fails in the same way on different OSes and versions. On the glitch path: is there a chance that this is some kernel bug or behavior? I was thinking of running it in CI with tracing enabled (and dumping it on failure), but it seems like the process code does not really have any tracing. What would we need to get to the bottom of this (and how do we expect users to debug it)? Running …
I thought this was a single occurrence. Since when did it start failing in CI?
It feels like it started happening more in the last few weeks. But as far as I can tell, #55618 was never solved, right? I see ~200+ failures in the last two weeks. Mono was disabled in #74668 (see #74667).
#55618 was reported over a year ago, and it was closed after the issue had not occurred for 3 weeks. Since then it wasn't reported until this issue was created this week. Having an idea of when it started to occur more can give us a clue as to what change triggered it.
We've only seen this on arm64? Does it occur less with coreclr than with mono?

The code asks the kernel if there are children that terminated, using a non-blocking wait. If there is one, it passes the pid of that terminated child to a wait for that specific pid. In between these calls the child suddenly goes missing. So either something has messed up the pid, or something else is reaping the child.

If it's something else reaping the child, it's expected to be a native library making a blocking wait call. If there is no native library doing this, then somehow the pid must have gotten corrupted.

Note that the thread the signal handling is running on is a native pthread. It's not a managed thread.
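A minimal standalone C sketch of that two-step pattern. To be clear, this is not the runtime's implementation, and the exact syscalls the runtime uses may differ; it only shows how ECHILD (errno 10 on Linux, the value in "Error while reaping child. errno = 10") appears when something else consumes the exit status between the check and the per-pid wait.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0)
        _exit(42);                        /* short-lived child */

    sleep(1);                             /* let the child terminate */

    /* Step 1: ask the kernel whether any child has terminated, without
       consuming its exit status (WNOWAIT), and learn its pid. */
    siginfo_t si;
    memset(&si, 0, sizeof(si));
    waitid(P_ALL, 0, &si, WEXITED | WNOHANG | WNOWAIT);
    pid_t pid = si.si_pid;
    printf("step 1: child %d has terminated\n", (int)pid);

    /* Interloper: some other code in the same process (a native library,
       or a pid-1 style "reap everything" loop) consumes the exit status. */
    int status;
    waitpid(-1, &status, WNOHANG);

    /* Step 2: reap that specific pid. The status is already gone, so the
       kernel reports ECHILD (errno 10 on Linux), which is the condition
       behind "Error while reaping child. errno = 10". */
    if (waitpid(pid, &status, 0) == -1 && errno == ECHILD)
        printf("step 2: waitpid(%d) failed with ECHILD (errno %d)\n",
               (int)pid, errno);

    return 0;
}
```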
I see it recently only on ARM. Some of the older cases are gone, e.g. there is a note of a crash on Kusto, but we no longer have the console log so we cannot check if the pattern is the same. I searched our code base (and msquic) and I did not see any direct uses of … Now, since you mentioned the …
Actually, the runs are in containers. I'm wondering if this is applicable: runtime/src/native/libs/System.Native/pal_signal.c, lines 349 to 354 (at commit 3afb168).
That would eventually call pid = Interop.Sys.WaitPidExitedNoHang(-1, out _);
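For reference, this is roughly the shape of an init-style reap loop that a pid 1 process runs. This is a hedged sketch, not the code at the referenced lines: the point is that waitpid(-1, ...) consumes the exit status of any terminated child, including one that other code is tracking by pid and expects to reap itself.

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sketch of an init-style "reap any orphan" loop of the kind a pid 1
   process runs, e.g. from a SIGCHLD handler. waitpid(-1, ...) consumes
   the exit status of *any* terminated child, including children that
   other code in the process is tracking by pid. */
void ReapAllTerminatedChildren(void)
{
    for (;;)
    {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid <= 0)
            break;            /* no more terminated children right now */
        printf("reaped pid %d\n", (int)pid);
        /* This pid's exit status is now gone; a later waitpid(pid, ...)
           made by whoever spawned this child fails with ECHILD. */
    }
}

int main(void)
{
    if (fork() == 0)
        _exit(0);             /* stands in for an orphan re-parented to pid 1 */
    sleep(1);
    ReapAllTerminatedChildren();
    return 0;
}
```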
Yes, both of these can have the same root cause. Notice that both of these issues have only been seen in System.Net.Requests.Tests. One possible explanation is that a method passes a pointer to a stack location to some asynchronous logic. The method goes out of scope, and some completely unrelated code starts running. And finally, the asynchronous logic wakes up and overwrites the location on the stack that is used for something completely different by now. Does it ring any bells? Do we have code in sockets that is passing a pointer to a stack location to some asynchronous logic?
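Sketched in C with hypothetical names (not code from the repo), the failure mode being described looks like this. Whether the two stack slots actually overlap depends on the compiler, but it shows how a delayed write through an escaped stack pointer can corrupt an unrelated value, such as a pid about to be waited on.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* The "asynchronous logic": wakes up later and writes through a pointer
   it was handed earlier. */
static void *DelayedWriter(void *arg)
{
    int *target = (int *)arg;
    sleep(1);                 /* the caller returned long ago */
    *target = -1;             /* scribbles over whatever lives there now */
    return NULL;
}

/* Passes the address of a local (stack) variable to the async work and
   then returns, so the pointer dangles. */
static void StartWork(pthread_t *thread)
{
    int localResult = 0;      /* stack location that escapes */
    pthread_create(thread, NULL, DelayedWriter, &localResult);
}                             /* localResult goes out of scope here */

/* Unrelated code that may later reuse the same region of the stack,
   e.g. for a pid it is about to wait on. */
static int UnrelatedWork(void)
{
    int pid = 12345;          /* may land on the recycled stack slot */
    sleep(2);                 /* the delayed write happens in this window */
    return pid;               /* possibly corrupted by the stale write */
}

int main(void)
{
    pthread_t t;
    StartWork(&t);
    printf("pid-ish value after unrelated work: %d\n", UnrelatedWork());
    pthread_join(t, NULL);
    return 0;
}
```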
I will take a look. #72830 happens in System.Net.Mail. That has really only one functional change recently, and I don't see how that would cause either of the problems.

Right, but the mysterious crash is still in sockets.
It isn't. It is applicable only for the main process in the container (pid 1). You can see … I don't see any reports of this on x64. Was Mono disabled because it happens more on Mono than it does on coreclr?
Our tests have a specific pattern that is prone to hit the lttng bug: the tests are initializing Quic on one thread while there is a remote executor exec storm of short-lived processes on other threads. I think that it is unlikely for real-world applications out there to have the same pattern. Until we get the lttng fix in place, we may want to set …
This is not just the docker, right? Since this is showing up so far only on ARM64, we may choose to update only that. I'm working on an issue for LTTng. They don't use GitHub for tracking, and my registration to submit an issue is pending approval. I looked at the code again, and changing …
@wfurt do we understand why it is happening only in System.Net.Requests and not in other test suites?

Also, does it impact the release/7.0 branch as well? Should we backport it as a test-only change?
Any test that launches child processes and ends up loading Quic is prone to hitting this issue. I expect that we will see this problem in a few more tests over time. An alternative big-hammer fix that would address it everywhere would be to set …

Yes.

Note this is an initialization race. So when Quic is initialized before the test runs and the child processes are launched, it is OK. That, for example, happens when we have conditional tests where XUnit evaluates the conditions before the test runs.
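A hedged sketch of the initialization race described above, with hypothetical names standing in for the Quic/lttng setup and the RemoteExecutor exec storm. This is not the actual lttng bug, only the shape of the race: one thread performs a multi-step initialization while other threads fork short-lived children, and a child forked inside the window inherits a half-initialized state.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical two-step library initialization (a stand-in for the Quic /
   lttng-ust setup). Deliberately unsynchronized: the race is the point. */
static bool g_handlersRegistered = false;
static bool g_sessionCreated = false;

static void *InitLibrary(void *arg)
{
    (void)arg;
    g_handlersRegistered = true;
    usleep(50 * 1000);        /* window where the state is inconsistent */
    g_sessionCreated = true;
    return NULL;
}

/* Stand-in for the RemoteExecutor exec storm: fork short-lived children
   while the initialization above may still be in flight. */
static void *ForkStorm(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100; i++)
    {
        pid_t pid = fork();
        if (pid == 0)
        {
            /* A child that sees handlers registered but no session has
               inherited the half-initialized state. */
            _exit(g_handlersRegistered && !g_sessionCreated ? 1 : 0);
        }
        int status;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status) && WEXITSTATUS(status) == 1)
            printf("child %d inherited half-initialized state\n", i);
    }
    return NULL;
}

int main(void)
{
    pthread_t init, storm;
    pthread_create(&init, NULL, InitLibrary, NULL);
    pthread_create(&storm, NULL, ForkStorm, NULL);
    pthread_join(init, NULL);
    pthread_join(storm, NULL);
    return 0;
}
```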
Tagging subscribers to this area: @dotnet/ncl

Issue details:

Disabled tests:
Crash in System.Net.Requests.Tests Work Item - Last 7 days in Runfo:
Related to #69125.
Console log: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-aaacaf8e0a7f46c4ad/System.Net.Requests.Tests/1/console.1429bd54.log?%3Fhelixlogtype%3Dresult
Dump: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-aaacaf8e0a7f46c4ad/System.Net.Requests.Tests/1/core.1001.21
errno 10 -> ECHILD
and with (some) symbols: …
cc: @tmds
{ "ErrorMessage":"Error while reaping child. errno = 10" }
Reopening for 7.0.

#75266 does not seem to have made it to 7.0, so that part is missing the back-port PR.

Yep, sorry, I accidentally looked only at the 7.0 change where it was not disabled. I updated the post and removed the disabled-test label.
For reference, the lttng issue created by @wfurt: https://bugs.lttng.org/issues/1359. |