HttpHeaders thread safety and behavior of invalid values #61798

Closed
MihaZupan opened this issue Nov 18, 2021 · 27 comments · Fixed by #68115
Labels: area-System.Net.Http, tenet-reliability (Reliability/stability related issue: stress, load problems, etc.)
Milestone: 7.0.0

@MihaZupan
Member

The HttpHeaders collections were never thread-safe. Accessing a header may force lazy parsing of its value, resulting in modifications to the underlying data structures.

Before .NET 6.0, reading from the collection concurrently happened to be thread-safe in most cases.
If the collection contained an invalid header value, that value would be removed during enumeration. In rare cases, this could cause problems due to the underlying Dictionary being modified concurrently.

Starting with 6.0, less locking is performed around header parsing as it was no longer needed internally (#54130).
Due to this change, multiple examples of users accessing the headers concurrently have surfaced. The ones currently known to the team are gRPC's .NET client (#55898, #55896) and NewRelic's HttpClient instrumentation (newrelic/newrelic-dotnet-agent#803).
Violating thread safety in 6.0 may result in the header values being duplicated/malformed or various exceptions being thrown during enumeration/header accesses.
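
For illustration, the kind of access that is unsafe looks roughly like this (a minimal sketch, not taken from any of the libraries above; being a race, it will not reproduce deterministically):

using System.Net.Http;
using System.Threading.Tasks;

using HttpClient client = new();
HttpResponseMessage response = await client.GetAsync("https://example.com");

// Two concurrent readers of the same headers collection. Each read may lazily parse a raw
// header value and rewrite the underlying store, so the other reader can observe a torn state.
await Task.WhenAll(
    Task.Run(() => { foreach (var _ in response.Headers) { } }),    // enumeration forces parsing
    Task.Run(() => response.Headers.TryGetValues("Date", out _)));  // so does TryGetValues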

I am posting a few known call stacks below to help other users hitting this issue discover the root cause:

IndexOutOfRangeException
System.IndexOutOfRangeException: Index was outside the bounds of the array.
    at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex)
    at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringOrStringArray(HeaderDescriptor descriptor, Object sourceValues, String& singleValue, String[]& multiValue)
    at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringArray(HeaderDescriptor descriptor, HeaderStoreItemInfo info)
    at System.Net.Http.Headers.HttpHeaders.GetEnumeratorCore()+MoveNext()
NullReferenceException
System.NullReferenceException: Object reference not set to an instance of an object.
    at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex)
    at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringOrStringArray(HeaderDescriptor descriptor, Object sourceValues, String& singleValue, String[]& multiValue)
    at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringArray(HeaderDescriptor descriptor, HeaderStoreItemInfo info)
    at System.Net.Http.Headers.HttpHeaders.GetEnumeratorCore()+MoveNext()
InvalidCastException
InvalidCastException: Unable to cast object of type 'System.Collections.Generic.List`1[System.Object]' to type 'System.Net.Http.Headers.MediaTypeHeaderValue'.
    at System.Net.Http.Headers.HttpContentHeaders.get_ContentType()

There is precedent for concurrent reads being thread-safe on some collections (notably Dictionary), and users may assume HttpHeaders behaves the same.

A related issue here is that forcing the validation of an invalid value will result in its removal from the collection. These Schrödinger's headers are difficult to reason about as their existence depends on whether/how/when they are observed.
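
To illustrate (a minimal sketch of the current behavior; exact output depends on the runtime version):

using System;
using System.Net.Http;

var request = new HttpRequestMessage();
request.Headers.TryAddWithoutValidation("Date", "not-a-valid-date"); // stored raw, without validation

Console.WriteLine(request.Headers.NonValidated["Date"]);  // the raw value is visible before anything parses it

// Reading the typed property forces parsing; the invalid value fails validation and is removed.
Console.WriteLine(request.Headers.Date is null);                    // True: no valid value
Console.WriteLine(request.Headers.NonValidated.Contains("Date"));   // False today: the raw value is gone too
                                                                     // (under the proposal below, it would remain visible)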

We can look to make the behavior more intuitive. We should investigate whether we can (efficiently) implement the collection in a way that satisfies the following:

  • concurrent reads are thread-safe and return the same result to all readers
  • concurrent write/modification/remove operations are not supported and their behavior continues to be undefined
  • a "validating read" of an invalid value does not result in the removal of the invalid value. The invalid value will still be visible to the NonValidated view of the collection

Should enumerating the collection (with or without validation) while a concurrent write operation is in progress have defined behavior?

  • We can choose to document that concurrent reads are supported, but not while a write operation is in progress. This matches Dictionary's thread-safety model
  • Concurrent reads are expected to occur over the response headers where write operations are uncommon
@MihaZupan added this to the 7.0.0 milestone Nov 18, 2021
@amin-lu-xero

amin-lu-xero commented Nov 19, 2021

I hit an issue when calling DynamoDB, as shown below. It looks like the same HTTP headers issue.

(Screenshot of the exception attached: Screen Shot 2021-11-19 at 11 29 17 AM)

I'm using AWSSDK.DynamoDBv2

@MihaZupan
Member Author

@amin-lu-xero the root cause is definitely the same, but looking at the AWS SDK code, it doesn't look like the problem is on their side.
Are you perhaps using NewRelic for monitoring your application?

@amin-lu-xero

Yes, we are using NewRelic for APM

@MihaZupan
Member Author

They have fixed the issue in their instrumentation newrelic/newrelic-dotnet-agent#803 - please try updating their agent to the 9.2.0 version that was released a few hours ago.

@amin-lu-xero

Thanks a lot!

@Strepto

Strepto commented Feb 15, 2022

I'm seeing this exact issue in a .NET 6 console app started via dotnet run on a TeamCity agent. It seems very random and triggers in about 1 in 10 sessions when I upload around 1000 tiny files to the blob container.

I'm using Azure.Storage.Blobs v12.10.0 (Latest at the time of writing), and no other NuGet packages.

Source code (slightly modified; inspired by this official example: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-scalable-app-upload-files?tabs=dotnet):

var blobContainerClient = new BlobContainerClient(new Uri(assetUploadToken.UploadDestination));
BlobUploadOptions uploadOptions = new BlobUploadOptions()
{
    TransferOptions = new StorageTransferOptions()
    {
        MaximumConcurrency = 8
    }
};
var tasks = new Queue<Task<Response<BlobContentInfo>>>();

foreach (string sourceFilePath in Directory.GetFiles(fileDirectoryPath, "*", searchOption)
                .Where(x => regexFileExtensions.Match(x).Success))
{
    var fileToUpload = new FileInfo(sourceFilePath);
    var blobTargetPath = $"my-files";
    var blobClient = blobContainerClient.GetBlobClient(blobTargetPath);
    var uploadTask = blobClient.UploadAsync(fileToUpload.FullName, uploadOptions);
    tasks.Enqueue(uploadTask);
}

// Run all the tasks asynchronously. The `uploadOptions` will handle concurrency.
await Task.WhenAll(tasks);

Console.WriteLine($"Uploaded {tasks.Count} files.");

Stack Trace:

System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex)
-- SNIP (See below for full trace)
Full stack trace
System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex)
   at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringOrStringArray(HeaderDescriptor descriptor, Object sourceValues, String& singleValue, String[]& multiValue)
   at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringArray(HeaderDescriptor descriptor, HeaderStoreItemInfo info)
   at System.Net.Http.Headers.HttpHeaders.TryGetValues(HeaderDescriptor descriptor, IEnumerable`1& values)
   at System.Net.Http.HttpConnectionPool.EstablishProxyTunnelAsync(Boolean async, HttpRequestHeaders headers, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at Azure.Core.Pipeline.HttpClientTransport.ProcessAsync(HttpMessage message, Boolean async)
   at Azure.Core.Pipeline.ResponseBodyPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
   at Azure.Core.Pipeline.HttpPipelineSynchronousPolicy.<ProcessAsync>g__ProcessAsyncInner|4_0(HttpMessage message, ReadOnlyMemory`1 pipeline)
   at Azure.Core.Pipeline.RedirectPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
   at Azure.Core.Pipeline.RetryPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
   at Azure.Core.Pipeline.RetryPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)
   at Azure.Storage.Blobs.BlockBlobRestClient.UploadAsync(Int64 contentLength, Stream body, Nullable`1 timeout, Byte[] transactionalContentMD5, String blobContentType, String blobContentEncoding, String blobContentLanguage, Byte[] blobContentMD5, String blobCacheControl, IDictionary`2 metadata, String leaseId, String blobContentDisposition, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, String encryptionScope, Nullable`1 tier, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, String blobTagsString, Nullable`1 immutabilityPolicyExpiry, Nullable`1 immutabilityPolicyMode, Nullable`1 legalHold, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.BlockBlobClient.UploadInternal(Stream content, BlobHttpHeaders blobHttpHeaders, IDictionary`2 metadata, IDictionary`2 tags, BlobRequestConditions conditions, Nullable`1 accessTier, BlobImmutabilityPolicy immutabilityPolicy, Nullable`1 legalHold, IProgress`1 progressHandler, String operationName, Boolean async, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.BlockBlobClient.<>c__DisplayClass62_0.<<GetPartitionedUploaderBehaviors>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Storage.PartitionedUploader`2.UploadInternal(Stream content, TServiceSpecificArgs args, IProgress`1 progressHandler, Boolean async, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.BlobClient.StagedUploadInternal(Stream content, BlobUploadOptions options, Boolean async, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.BlobClient.UploadAsync(String path, BlobUploadOptions options, CancellationToken cancellationToken)
   at UploadArtifacts.UploadArtifacts.ModelDistributionApiFetcher.UploadFiles(AssetUploadToken assetUploadToken, String fileDirectoryPath, String fileExtensionRegex, Boolean searchAllSubDirectories) 

Should I also report this to the Azure.Storage.Blobs team? See: Azure/azure-sdk-for-net#27018

@MihaZupan
Member Author

@Strepto are you by chance using a NewRelic agent on the machine running this code?

@Strepto

Strepto commented Feb 15, 2022

No, we do not use NewRelic on the machine(s).

We do have TeamCity's Performance Monitor enabled though, in case performance analysis might have an impact. https://www.jetbrains.com/help/teamcity/performance-monitor.html

Edit: I have now disabled TeamCity's Performance Monitor and will monitor the jobs.

@MihaZupan
Member Author

@Strepto I opened #65379 as a possible explanation for the behavior you observed. Would it be possible for you to test the scenario without using a proxy?

@Strepto

Strepto commented Feb 15, 2022

@Strepto I opened #65379 as a possible explanation for the behavior you observed. Would it be possible for you to test the scenario without using a proxy?

I cannot disable the proxy on the machines where the error occurs, but I can try running the scenario on my dev machine, which has no proxy.

@karelz
Member

karelz commented Feb 17, 2022

Triage: We will keep it open for a few more months to catch other use cases like the one in #65379. Then close it.

@toddca

toddca commented Mar 3, 2022

I believe I am hitting this issue using RestSharp 107.3.0: restsharp/RestSharp#1772

@a-jackson

I'm hitting this issue with concurrent requests using the AWS SDK for SQS. Is there any workaround? We are using a proxy where this error is seen, and are not getting it on a dev machine where there is no proxy.

@MihaZupan
Member Author

@a-jackson are you saying you are hitting #65379 specifically?

@a-jackson

@MihaZupan I would guess that's the problem I'm having. I can't say specifically that it's exactly that; I only have error logs to go on. This is only reproducible on our servers, where I'm unable to debug.

I've seen all the exceptions listed in the top post here, as well as one from AWS: "The request signature we calculated does not match the signature you provided." It appears the headers from one request are being sent with the body of another, or something to that effect, although I can't confirm that.

@MihaZupan
Member Author

Are you able to share stack traces when this happens?

@a-jackson

a-jackson commented Mar 10, 2022

This is the most recent; it's not exactly one of the ones above, but I think it's the same thing. I have seen the other errors as well, but we don't keep logs that long on our test servers.

Stack trace
System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
   at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
   at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex)
   at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringOrStringArray(HeaderDescriptor descriptor, Object sourceValues, String& singleValue, String[]& multiValue)
   at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringArray(HeaderDescriptor descriptor, HeaderStoreItemInfo info)
   at System.Net.Http.HttpConnectionPool.EstablishProxyTunnelAsync(Boolean async, HttpRequestHeaders headers, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken)
   at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.SQS.Internal.ValidationResponseHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CredentialsRetriever.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.ErrorCallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.MetricsHandler.InvokeAsync[T](IExecutionContext executionContext)

@MihaZupan
Member Author

MihaZupan commented Mar 10, 2022

There are two main known issues users are hitting here:

  • NewRelic reading response headers concurrently - you will see exceptions around response headers after the request is sent
    • There may be other components doing this, but this is the currently known one
  • CONNECT tunnel bug in HttpConnectionPool - you will see exceptions around request headers (even before the request is sent)

In your case, it looks like you are hitting the latter - #65379.

There are possible mitigations here if you have access to how the HttpClient is created, or to the HttpRequestMessage or its headers before it is sent. Is that the case with the AWS SDK?

@a-jackson

I agree this looks like the latter case.
It looks like it is possible to provide a custom HttpClient factory. What do I need to do?

@MihaZupan
Member Author

MihaZupan commented Mar 28, 2022

Hey @a-jackson, sorry I didn't see your reply sooner.
If you have access to the handler chain, you can include a custom handler that will prevent the race condition.

using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

var socketsHttpHandler = new SocketsHttpHandler();
var customHandler = new CustomHandler(socketsHttpHandler);
var httpClient = new HttpClient(customHandler);

public sealed class CustomHandler : DelegatingHandler
{
    public CustomHandler(HttpMessageHandler innerHandler) : base(innerHandler) { }

    protected override HttpResponseMessage Send(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        _ = request.Headers.TryGetValues("User-Agent", out _); // Force parsing

        return base.Send(request, cancellationToken);
    }

    protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        _ = request.Headers.TryGetValues("User-Agent", out _); // Force parsing

        return base.SendAsync(request, cancellationToken);
    }
}
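
If only an HttpClient factory hook is available (as asked above for the AWS SDK), the handler could be plugged in there. The sketch below is hypothetical: the AWS SDK type and member names it assumes (Amazon.Runtime.HttpClientFactory, ClientConfig.HttpClientFactory, Amazon.SQS.AmazonSQSConfig) should be verified against the SDK version in use.

using System.Net.Http;
using Amazon.Runtime;

// Reuses the CustomHandler from the snippet above.
public sealed class WorkaroundHttpClientFactory : HttpClientFactory
{
    public override HttpClient CreateHttpClient(IClientConfig clientConfig) =>
        new HttpClient(new CustomHandler(new SocketsHttpHandler()));
}

// Usage (assumed property/constructor names):
// var sqsConfig = new AmazonSQSConfig { HttpClientFactory = new WorkaroundHttpClientFactory() };
// var sqsClient = new AmazonSQSClient(sqsConfig);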

@a-jackson

@MihaZupan Thanks. I've just done some testing and that seems to work. I'm going to start rolling it out to our projects where needed.

Is there any idea when this will be fixed? I see the milestone of this issue is set for 7.0.0 but we were hoping we could stick to LTS releases. Is there any chance this will be fixed in 6.x? We have several projects still working on upgrades to 6 which we've put on hold due to this.

@MihaZupan
Member Author

MihaZupan commented Mar 29, 2022

@a-jackson Is the above fix sufficient to unblock your upgrades to 6.0?

The issue will have to be addressed in main (currently planned for 7.0) first.

Given we've only heard of two users hitting this, and that a workaround exists, I think it's unlikely we would service an eventual fix for this at this time.

cc: @karelz regarding any potential servicing changes here (fix for #65379).

@a-jackson

I can understand that if it's only two users, although we've had issues on every system running .NET 6.

I don't know yet if we'll continue with upgrades. We are in the process of phasing out using a proxy at all so we'll probably hold further upgrades until then.

I have another exception though that may or may not be related, just to give you some more info.

We were getting a TimeoutException on around 5% of requests to DynamoDB when using the proxy. The exception was thrown 100 seconds after the request. We tested without the proxy and didn't see a single error. These requests were not logged at the proxy at all, and there's nothing else we could explain it with. Other requests, even from the same process, were being logged through the proxy at the same time, which is why I think this is probably a different manifestation of the same issue.

Stack trace
System.TimeoutException: A task was canceled.
---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
--- End of inner exception stack trace ---
at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken)
at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CredentialsRetriever.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.ErrorCallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.MetricsHandler.InvokeAsync[T](IExecutionContext executionContext)

@MihaZupan added the tenet-reliability label Mar 30, 2022
@MihaZupan
Member Author

System.Threading.Tasks.TaskCanceledException: A task was canceled.
at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync
at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync

This stack indicates that the request was indeed never sent -- it timed out before it could obtain a working connection.
While the headers issue may cause connection attempts to fail more often (if the exception occurs inside System.Net.Http.HttpConnectionPool.EstablishProxyTunnelAsync like in your previous stack trace), it should never cause requests to hang without trying to establish a connection / failing outright.

The presence of this stack trace indicates that either:

  • a) the proxy was taking a really long time to accept connections
    • consider setting the ConnectTimeout on SocketsHttpHandler to confirm (e.g. to 15 seconds); see the sketch after this list
  • b) you have MaxConnectionsPerServer set and connections weren't able to keep up with incoming requests, resulting in some requests timing out
  • c) The connection pool got stuck somehow and never tried to establish the connection, while also failing to report an exception. I doubt this is happening, and it would indicate a different underlying issue.
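
For option a), a minimal sketch of setting the timeout (the 15-second value is only an example):

using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Surface a slow/unreachable proxy as a prompt connect failure instead of waiting out
    // the full 100-second request timeout.
    ConnectTimeout = TimeSpan.FromSeconds(15)
};
var client = new HttpClient(handler);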

@a-jackson

a) DynamoDB's client code appears to set a connect timeout of 3100 ms. If it were this, the exception would have been thrown after that long, wouldn't it? Not 100 s as we saw. Also, the proxy logs indicated continuous activity throughout this period, and we upgraded the instances it runs on to test that; it made no difference to the frequency of errors.
b) We're not setting MaxConnectionsPerServer and, as far as I can tell, neither is the DynamoDB client. There's an option for it, which is not set.
c) It looks like this to me, but I wouldn't know where to begin with proving that.

We have just moved to using VPC endpoints in AWS, so DynamoDB traffic no longer goes through the proxy and we are not seeing this error now; it's no longer a problem for us.

@MihaZupan
Member Author

MihaZupan commented Mar 30, 2022

If it were this the exception would have been thrown after that long wouldn't it? Not 100s as we saw

If MaxConnectionsPerServer isn't set and ConnectTimeout is set to ~3 seconds, you shouldn't ever hit the 100s timeout before getting a connection/failing.

It's hard to say what exactly is happening without detailed info about what the connection pool was doing. If you were still interested in investigating this further, would you be able to provide traces from an app hitting such failures (would potentially include PII)?

@ghost ghost locked as resolved and limited conversation to collaborators Jun 5, 2022