Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[linux-arm64] System.ArgumentOutOfRangeException in AwaitableSocketAsyncEventArgs.InvokeContinuation #84407

Closed
kevingosse opened this issue Apr 6, 2023 · 17 comments

Comments

@kevingosse
Copy link
Contributor

kevingosse commented Apr 6, 2023

Description

We've seen a crash in our CI caused by an ArgumentOutOfRangeException thrown by System.Threading.ThreadPool+<>c.<.cctor>b__86_0(System.Object), itself invoked by System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.InvokeContinuation.

The full callstack:

System.Private.CoreLib.dll!System.ThrowHelper.ThrowArgumentOutOfRangeException(System.ExceptionArgument)+0x44
System.Private.CoreLib.dll!System.Threading.ThreadPool+<>c.<.cctor>b__86_0(System.Object)+0x48
System.Net.Sockets.dll!System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.InvokeContinuation(System.Action`1<System.Object>, System.Object, Boolean, Boolean)+0x5c
System.Net.Sockets.dll!System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.OnCompleted(System.Net.Sockets.SocketAsyncEventArgs)+0xa8
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.OnCompletedInternal()+0x58
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.FinishOperationAsyncSuccess(Int32, System.Net.Sockets.SocketFlags)+0x48
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.CompletionCallback(Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError)+0x24
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.TransferCompletionCallbackCore(Int32, Byte[], Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError)+0x3c
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext+OperationQueue`1[[System.__Canon, System.Private.CoreLib]].ProcessAsyncOperation(System.__Canon)+0x134
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext+ReadOperation.System.Threading.IThreadPoolWorkItem.Execute()+0x40
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext.HandleEvents(SocketEvents)+0x98
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()+0xd0
System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch()+0x324
System.Private.CoreLib.dll!System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()+0x164
System.Private.CoreLib.dll!System.Threading.Thread.StartCallback()+0x68

I believe this is the same issue as #70486 and #72365.

It's entirely possible that the crash is caused by our instrumentation code (we use the profiler API to rewrite IL at runtime, though we don't rewrite any of the methods in this callstack). However, since the exact same issue has been reported by somebody else, and because in both cases it happened only on ARM64, it makes me think there is actually an issue in the runtime or the BCL.

We managed to capture a memory dump. I can share it if you have a place to upload it (47MB zipped).

Reproduction Steps

The issue occurs very rarely in our CI so I don't have a repro. This is a test application that sends HTTP requests to itself (HttpClient + HttpListener).

Expected behavior

No crashes.

Actual behavior

An ArgumentOutOfRangeException is thrown and crashes the process.

Regression?

We observed the issue on .NET 6. I don't know if previous versions are impacted.

Known Workarounds

No response

Configuration

.NET 6.0.15
Linux Debian/ARM64. The crash was never observed on x86/x64.

Other information

No response

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Apr 6, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Apr 6, 2023
@kevingosse kevingosse changed the title [linux-arm64] System.ArgumentOutOfRangeException in System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.InvokeContinuation [linux-arm64] System.ArgumentOutOfRangeException in AwaitableSocketAsyncEventArgs.InvokeContinuation Apr 6, 2023
@kevingosse
Copy link
Contributor Author

kevingosse commented Apr 6, 2023

This is a wild guess as I'm unfamiliar with this code, but I feel like there is a race condition in AwaitableSocketAsyncEventArgs.

In OnCompleted(Action<object?> continuation, object? state, short token, ValueTaskSourceOnCompletedFlags flags):

Action<object>? prevContinuation = Interlocked.CompareExchange(ref _continuation, continuation, null);
if (ReferenceEquals(prevContinuation, s_completedSentinel))
{
    // Lost the race condition and the operation has now already completed.
    // We need to invoke the continuation, but it must be asynchronously to
    // avoid a stack dive.  However, since all of the queueing mechanisms flow
    // ExecutionContext, and since we're still in the same context where we
    // captured it, we can just ignore the one we captured.
    bool requiresExecutionContextFlow = _executionContext != null;
    _executionContext = null;
    UserToken = null; // we have the state in "state"; no need for the one in UserToken
    InvokeContinuation(continuation, state, forceAsync: true, requiresExecutionContextFlow);
}

The comment implies that we can get there while a continuation is already enqueued. If we fail to swap the previous continuation and find the sentinel instead, we assign null to UserToken.

However, in OnCompleted(SocketAsyncEventArgs _):

Action<object?>? c = _continuation;
if (c != null || (c = Interlocked.CompareExchange(ref _continuation, s_completedSentinel, null)) != null)
{
    Debug.Assert(c != s_completedSentinel, "The delegate should not have been the completed sentinel.");

    object? continuationState = UserToken;
    UserToken = null;
    _continuation = s_completedSentinel; // in case someone's polling IsCompleted

    ExecutionContext? ec = _executionContext;
    if (ec == null)
    {
        InvokeContinuation(c, continuationState, forceAsync: false, requiresExecutionContextFlow: false);
    }

First Interlocked.CompareExchange replaces the continuation with the sentinel, and then UserToken is read. So, assuming OnCompleted(Action<object?> continuation, object? state, short token, ValueTaskSourceOnCompletedFlags flags) and OnCompleted(SocketAsyncEventArgs _) can execute simultaneously (I don't know if that's actually the case, I just got that idea from the comment), then there is a small window of time where OnCompleted(Action<object?> continuation, object? state, short token, ValueTaskSourceOnCompletedFlags flags) can see the sentinel and assign null to UserToken before OnCompleted(SocketAsyncEventArgs _) can read it.

In the memory dump, state (stored in the ArgumentOutOfRangeException) and UserToken are null, and the sentinel is assigned to _continuation, so this is consistent with this explanation. No definitive proof though.

@stephentoub
Copy link
Member

stephentoub commented Apr 6, 2023

Thanks! How frequently do you see this in CI? This whole implementation was replaced in .NET 8 to instead use ManualResetValueTaskSourceCore. I'm wondering if you might be able to try with a recent preview or daily build and see if it still happens.

As for the theory, I don't think that's what's happening. Those two blocks are guarded on different conditions. The block in OnCompleted(SocketAsyncEventArgs) only runs if a continuation was already hooked up and the other one runs only if the operation completed before hooking up the continuation; they should never both execute for the same operation. As this is only happening on arm, it's more likely we're missing a necessary volatile somewhere... I exipect the issue is _continuation should be volatile or that fast path read on it should be a Volatile.Read, as otherwise the read of UserToken could actually be reordered to before we read the continuation.

@kevingosse
Copy link
Contributor Author

I checked for the past month and found only two other occurrences of the problem (out of ~34400 runs in total).

Unfortunately I don't think we can deploy .NET 8 just yet (we first need to update our instrumentation code to support it).

@ghost
Copy link

ghost commented Apr 6, 2023

Tagging subscribers to this area: @dotnet/area-system-threading-tasks
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

We've seen a crash in our CI caused by an ArgumentOutOfRangeException thrown by System.Threading.ThreadPool+<>c.<.cctor>b__86_0(System.Object), itself invoked by System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.InvokeContinuation.

The full callstack:

System.Private.CoreLib.dll!System.ThrowHelper.ThrowArgumentOutOfRangeException(System.ExceptionArgument)+0x44
System.Private.CoreLib.dll!System.Threading.ThreadPool+<>c.<.cctor>b__86_0(System.Object)+0x48
System.Net.Sockets.dll!System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.InvokeContinuation(System.Action`1<System.Object>, System.Object, Boolean, Boolean)+0x5c
System.Net.Sockets.dll!System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.OnCompleted(System.Net.Sockets.SocketAsyncEventArgs)+0xa8
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.OnCompletedInternal()+0x58
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.FinishOperationAsyncSuccess(Int32, System.Net.Sockets.SocketFlags)+0x48
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.CompletionCallback(Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError)+0x24
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.TransferCompletionCallbackCore(Int32, Byte[], Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError)+0x3c
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext+OperationQueue`1[[System.__Canon, System.Private.CoreLib]].ProcessAsyncOperation(System.__Canon)+0x134
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext+ReadOperation.System.Threading.IThreadPoolWorkItem.Execute()+0x40
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncContext.HandleEvents(SocketEvents)+0x98
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()+0xd0
System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch()+0x324
System.Private.CoreLib.dll!System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()+0x164
System.Private.CoreLib.dll!System.Threading.Thread.StartCallback()+0x68

I believe this is the same issue as #70486 and #72365.

It's entirely possible that the crash is caused by our instrumentation code (we use the profiler API to rewrite IL at runtime, though we don't rewrite any of the methods in this callstack). However, since the exact same issue has been reported by somebody else, and because in both cases it happened only on ARM64, it makes me think there is actually an issue in the runtime or the BCL.

We managed to capture a memory dump. I can share it if you have a place to upload it (47MB zipped).

Reproduction Steps

The issue occurs very rarely in our CI so I don't have a repro. This is a test application that sends HTTP requests to itself (HttpClient + HttpListener).

Expected behavior

No crashes.

Actual behavior

An ArgumentOutOfRangeException is thrown and crashes the process.

Regression?

We observed the issue on .NET 6. I don't know if previous versions are impacted.

Known Workarounds

No response

Configuration

.NET 6.0.15
Linux Debian/ARM64. The crash was never observed on x86/x64.

Other information

No response

Author: kevingosse
Assignees: -
Labels:

area-System.Threading.Tasks, untriaged, needs-area-label

Milestone: -

@vcsjones vcsjones removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Apr 6, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 6, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 14, 2023
@freddyrios
Copy link

I have seen something very similar in a raspberry pi. It is a hard to reproduce failure, in 10 pis running for 6 months we have seen it happen 4 times in 2 of the systems (2 each). All run the same code and are sending/receiving data on the network many times x second.

That said, our stack no longer matches the one in this issue after the 5th level. In all 4 cases it looks like this for us:

System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values. (Parameter 'state')
   at System.Threading.ThreadPool.<>c.<.cctor>b__86_0(Object state)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _)
   at System.Net.Sockets.SocketAsyncEventArgs.OnCompletedInternal()
   at System.Net.Sockets.SocketAsyncEventArgs.ConnectCompletionCallback(SocketError socketError)
   at System.Net.Sockets.SocketAsyncContext.ConnectOperation.InvokeCallback(Boolean allowPooling)
   at System.Net.Sockets.SocketAsyncContext.WriteOperation.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()
%m.%d %T:

However, since 4 levels are the same and the first is a method without arguments (OnCompletedInternal), it is reasonable to think these are related.

@antoniofreire
Copy link

Hello.

I've also got this exception but in the System.Threading.Channels AsyncOperation class when trying to write a item to a channel. It's rare, but, when it occurs, it's stalls the consumer side of the channel since the Reader.WaitToReadAsync() method never returns. Could the issue found on AwaitableSocketAsyncEventArgs also be present here?

Unfortunately, I can only send the part of the stack trace that originated the exception.

at System.Threading.Channels.AsyncOperation`1.SignalCompletion()
at System.Threading.Channels.BoundedChannel`1.BoundedChannelWriter.TryWrite(T item)

My app also runs on linux-arm64

@stephentoub
Copy link
Member

I have seen something very similar in a raspberry pi.

On what .NET version? This was recently fixed (we think).

I've also got this exception but in the System.Threading.Channels AsyncOperation class

On what .NET version?

@freddyrios
Copy link

freddyrios commented May 6, 2023

@stephentoub 6. Do you mean the fix is in a patch for 6 or are you referring to fixed in 7?

@stephentoub
Copy link
Member

6.0.which?

@stephentoub
Copy link
Member

It should be fixed in a patch for 6 (and 7).

@antoniofreire
Copy link

On what .NET version?

7.0.5

@stephentoub
Copy link
Member

On what .NET version?

7.0.5

And you're getting an ArgumentOutOfRangeException from the call to SignalCompletion in TryWrite, with no further frames on the stack in SignalCompletion?

@freddyrios
Copy link

freddyrios commented May 8, 2023

6.0.which?

The app was built with the 6.0.406 SDK via this runner of github actions https://github.com/actions/runner-images/releases/tag/ubuntu22%2F20230219.1.

It is a single file - self contained app, so I understand the runtime should be then 6.0.14 https://dotnet.microsoft.com/en-us/download/dotnet/6.0

@LarsWithCA
Copy link

Hi @stephentoub which version of 6 has this patch? thanks

@karelz
Copy link
Member

karelz commented May 27, 2023

Fixed in 7.0.7 in PR #84641 and in 6.0.18 in PR #84432.
Main (8.0) is not affected - see description in #84432.

@rodzac
Copy link

rodzac commented Jun 16, 2023

hey @stephentoub, I was reading PR #84432, and the fix was to use the volatile keyword. I'm a dotnet/C# newbie person, and I was curious to understand the keyword impact and I found some a lot people discussing about it.

Is this fix a solution for multi-processor use cases or just when both threads run on the same processor?

@stephentoub
Copy link
Member

hey @stephentoub, I was reading PR #84432, and the fix was to use the volatile keyword. I'm a dotnet/C# newbie person, and I was curious to understand the keyword impact and I found some a lot people discussing about it.

Is this fix a solution for multi-processor use cases or just when both threads run on the same processor?

Compilers and hardware are free to reorder reads/writes in ways that are unobservable for the current thread. But such reorderings can be visible to other threads running concurrently. Marking a variable as volatile tells the system to use read-acquire / store-release semantics, preventing those reorderings.

The .NET memory model is described at

## ECMA 335 vs. .NET memory models
.

The volatile keyword is described at https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/volatile.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants