
FeedIterator.ReadNextAsync bug #3047

Closed
bengabayAtM opened this issue Feb 23, 2022 · 11 comments
@bengabayAtM

Hi, following a request from the Azure Cosmos DB internal discussion, I am filing an issue here about a bug we are facing in PRD. The problem concerns the ReadNextAsync method: FeedIterator.ReadNextAsync(CancellationToken) Method (Microsoft.Azure.Cosmos) - Azure for .NET Developers | Microsoft Docs.
We are using the FeedIterator class:
using (FeedIterator<ChangeFeedProcessorState> estimatorIterator = changeFeedEstimator.GetCurrentStateIterator(m_changeFeedEstimatorRequestOptions))

where MaxItem = null in m_changeFeedEstimatorRequestOptions.
We then call ReadNextAsync

FeedResponse<ChangeFeedProcessorState> states = await estimatorIterator.ReadNextAsync(cancellationToken).ConfigureAwait(false);

where the cancellationToken should be canceled after 50 seconds:

[Screenshot: cancellation token configured with a 50-second timeout]
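To make the pattern concrete, here is a minimal sketch of the call sequence described above (names taken from the snippets; the estimator/options setup and the per-lease processing are assumptions, not our full production code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class EstimatorSketch
{
    // Sketch of the failing call pattern. The estimator and request options
    // are assumed to be configured elsewhere, with MaxItem left null.
    public static async Task ReadEstimatorStateAsync(
        ChangeFeedEstimator changeFeedEstimator,
        ChangeFeedEstimatorRequestOptions requestOptions)
    {
        // Token that cancels after 50 seconds, matching the setup above.
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(50));

        using FeedIterator<ChangeFeedProcessorState> estimatorIterator =
            changeFeedEstimator.GetCurrentStateIterator(requestOptions);

        while (estimatorIterator.HasMoreResults)
        {
            FeedResponse<ChangeFeedProcessorState> states =
                await estimatorIterator.ReadNextAsync(cts.Token).ConfigureAwait(false);
            // ... aggregate states[i].EstimatedLag per lease ...
        }
    }
}
```

With this setup the expectation is that ReadNextAsync either returns or throws OperationCanceledException within roughly 50 seconds; the behavior reported below is that it does neither.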

What ended up happening is that we call ReadNextAsync and never get a response from that method; while waiting for the response, we suffer from 100% CPU usage.

We tried updating the SDK to the latest version and adding a timeout task on our side; neither solved the issue.
We rolled back to the last known-good release, and the issue still remains for some reason, even though it never happened before on the release currently deployed.

Usually, after a couple of minutes (sometimes as long as 30 minutes), we get an error thrown back from the ReadNextAsync method of one of the following types:
ServiceUnavailable (503)
NotFound (404)
RequestTimeout (408)

Here is one of the latest errors we got, for example:

GetChangeFeedEstimationForSingleCollectionAsync: Ctx-1df0279b, Collection cyberprofiles: Failed to get change feed estimation. Elapsed: 657196. StatusCode: ServiceUnavailable, Exception:

Exception Details
Exception Type: Microsoft.Azure.Cosmos.CosmosException
Message: Response status code does not indicate success: ServiceUnavailable (503); Substatus: 0; ActivityId: 46d15989-1ef0-4630-8b44-be8fa4a38b67; Reason: (Service is currently unavailable. More info: https://aka.ms/cosmosdb-tsg-service-unavailable
ActivityId: 46d15989-1ef0-4630-8b44-be8fa4a38b67, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Linux/10 cosmos-netstandard-sdk/3.21.1);; Diagnostics:{"name":"NoOp","id":"00000000-0000-0000-0000-000000000000","caller info":{"member":"NoOp","file":"NoOp","line":9001},"start time":"12:00:00:000","duration in milliseconds":0}
Response Body: Service is currently unavailable. More info: https://aka.ms/cosmosdb-tsg-service-unavailable
ActivityId: 46d15989-1ef0-4630-8b44-be8fa4a38b67, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Linux/10 cosmos-netstandard-sdk/3.21.1
Status Code: ServiceUnavailable
Sub Status Code: 0
Request Charge: 0
Activity Id: 46d15989-1ef0-4630-8b44-be8fa4a38b67
Retry After:
Headers: Microsoft.Azure.Cosmos.Headers
Diagnostics: {"name":"NoOp","id":"00000000-0000-0000-0000-000000000000","caller info":{"member":"NoOp","file":"NoOp","line":9001},"start time":"12:00:00:000","duration in milliseconds":0}
Target Site: Microsoft.Azure.Cosmos.ResponseMessage EnsureSuccessStatusCode()
Help Link:
Source: Microsoft.Azure.Cosmos.Client
HResult: -2146233088

Stack Trace Details
at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)
at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)
at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy, Func`2 prepareRequestAsyncDelegate)
at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)

Exception Details
Exception Type: Microsoft.Azure.Documents.GoneException
Message: The requested resource is no longer available at the server.
ActivityId: 46d15989-1ef0-4630-8b44-be8fa4a38b67, Linux/10 cosmos-netstandard-sdk/3.21.1
Error: {
"code": "Gone",
"message": "The requested resource is no longer available at the server.\nActivityId: 46d15989-1ef0-4630-8b44-be8fa4a38b67, Linux/10 cosmos-netstandard-sdk/3.21.1"
}
Activity Id: 46d15989-1ef0-4630-8b44-be8fa4a38b67
Retry After: 00:00:00
Response Headers: System.Collections.Specialized.NameValueCollection
Status Code: Gone
Request Charge: 0
Script Log:
Target Site: Void ThrowGoneIfElapsed()
Help Link:
Source: Microsoft.Azure.Cosmos.Direct
HResult: -2146233088

Stack Trace Details
at Microsoft.Azure.Documents.TimeoutHelper.ThrowGoneIfElapsed()
at Microsoft.Azure.Documents.StoreReader.ReadMultipleReplicasInternalAsync(DocumentServiceRequest entity, Boolean includePrimary, Int32 replicaCountToRead, Boolean requiresValidLsn, Boolean useSessionToken, ReadMode readMode, Boolean checkMinLSN, Boolean forceReadAll)
at Microsoft.Azure.Documents.StoreReader.ReadMultipleReplicaAsync(DocumentServiceRequest entity, Boolean includePrimary, Int32 replicaCountToRead, Boolean requiresValidLsn, Boolean useSessionToken, ReadMode readMode, Boolean checkMinLSN, Boolean forceReadAll)
at Microsoft.Azure.Documents.ConsistencyReader.ReadSessionAsync(DocumentServiceRequest entity, ReadMode readMode)
at Microsoft.Azure.Documents.BackoffRetryUtility`1.ExecuteRetryAsync(Func`1 callbackMethod, Func`3 callShouldRetry, Func`1 inBackoffAlternateCallbackMethod, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action`1 preRetryCallback)
at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)
at Microsoft.Azure.Documents.BackoffRetryUtility`1.ExecuteRetryAsync(Func`1 callbackMethod, Func`3 callShouldRetry, Func`1 inBackoffAlternateCallbackMethod, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action`1 preRetryCallback)
at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass30_0.<b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)

Just to clarify: we double-checked the configuration and the existence of the containers and the collection, including the leases. Everything exists, and we can even read changes using the ChangeFeedProcessor class, so we have no clue why these errors started. This currently happens in only one region (West Europe).

Thanks,
Ben Gabay.

@j82w j82w added bug Something isn't working ChangeFeed needs-investigation labels Feb 23, 2022
@j82w
Contributor

j82w commented Feb 23, 2022

Please upgrade to the latest SDK version. The following fix to the diagnostics will help troubleshoot this issue:

#2883 Change Feed Processor: Fixes diagnostics on Estimator and ChangeFeedProcessorContext. Introduced in 3.15.0 PR #1933

@ealsur
Member

ealsur commented Feb 23, 2022

If CPU is at 100% then that will affect network operations. Are you running the Estimator on the same machine as the Processor? Have you investigated the source of the high CPU?

@bengabayAtM
Author

Hi all, thanks for the quick response. I am currently working through this guide:
https://docs.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-dot-net-sdk-request-timeout?tabs=cpu-new#high-cpu-utilization
and have updated to the latest SDK version to see the diagnostic details.
Will update here soon.

@bengabayAtM
Author

Update:
I updated the SDK to the latest version, and it seems the timeout is caused by thread starvation, since the diagnostic details contain "isThreadStarving":"True":

env_time: 2022-02-27 07:34:59.881
Source:
Level:
Message: {"env_time":"2022-02-27T07:34:27.1813091Z","Level":"Error","LevelId":2,"EventId":4746,"Message":"[Partial Trace 8ff99fdf-7d28-4d8d-87da-752eb201aa9e 1/2] GetChangeFeedEstimationForSingleCollectionAsync: Ctx-8c0ecf00, Collection cyberprofiles: Failed to get change feed estimation. Elapsed: 560678. StatusCode: RequestTimeout, Exception:

Exception Details

Exception Type: Microsoft.Azure.Cosmos.CosmosException
Message: Response status code does not indicate success: RequestTimeout (408); Substatus: 0; ActivityId: 87527a6c-c571-4e41-a3d5-ecb38a10ae1b; Reason: (GatewayStoreClient Request Timeout. Start Time UTC:02/27/2022 07:25:07; Total Duration:539253.7587 Ms; Request Timeout 65000 Ms; Http Client Timeout:65000 Ms; Activity id: 87527a6c-c571-4e41-a3d5-ecb38a10ae1b;);; Diagnostics:{"Summary":{"RegionsContacted":1,"GatewayCalls":{"(0, 0)":3}},"name":"FeedIterator Read Next Async","id":"71f68f9b-29ae-4073-ae37-a1b63149dcd0","start time":"07:25:07:127","duration in milliseconds":547953.186,"data":{"Client Configuration":{"Client Created Time Utc":"2022-02-27T07:23:05.4860608Z","NumberOfClientsCreated":1,"User Agent":"cosmos-netstandard-sdk/3.25.0|3.24.1|1|X64|Linux 5.4.0-1070-azure 73 18.|.NET Core 3.1.22|L|","ConnectionConfig":{"gw":"(cps:50, urto:60, p:False, httpf: False)","rntbd":"(cto: 5, icto: 18000, mrpc: 30, mcpe: 5000, erd: True, pr: ReuseUnicastPort)","other":"(ed:False, be:False)"},"ConsistencyConfig":"(consistency: NotSet, prgns:[West Europe, North Europe])"},"Query Correlated ActivityId":"b968d8bf-f71b-4ab2-8a6b-9d46966e9935"},"children":[{"name":"Create Query Pipeline","id":"928fd61d-b9ed-4b1f-ad66-e08344f2ebf3","start time":"07:25:07:128","duration in milliseconds":539255.3269,"children":[{"name":"Get Container Properties","id":"4a96be03-e531-4fd3-882d-36691e37dc72","start time":"07:25:07:128","duration in milliseconds":0.0104,"children":[{"name":"Get Collection Cache","id":"51ce4f81-be89-469b-a51c-ec7a2e805ff7","start time":"07:25:07:128","duration in milliseconds":0.0013}]},{"name":"Gateway QueryPlan","id":"b3be9087-d9fa-467b-99d9-1980b3195b68","start time":"07:25:07:128","duration in milliseconds":1.0805},{"name":"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler","id":"494e67de-9e99-4678-85c5-1fc4c268b6af","start time":"07:25:07:128","duration in 
milliseconds":539254.4997,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler","id":"7c9e09c5-4b58-4c31-a8bb-04489ae93c85","start time":"07:25:07:128","duration in milliseconds":539254.4286,"data":{"System Info":{"systemHistory":[{"dateUtc":"2022-02-27T07:24:47.2851573Z","cpu":12.596,"memory":97107412.000,"threadInfo":{"isThreadStarving":"True","threadWaitIntervalInMs":20881.7367,"availableThreads":32764,"minThreads":1,"maxThreads":32767}},{"dateUtc":"2022-02-27T07:24:57.4139160Z","cpu":22.459,"memory":98228292.000,"threadInfo":{"isThreadStarving":"True","threadWaitIntervalInMs":31010.497,"availableThreads":32760,"minThreads":1,"maxThreads":32767}},{"dateUtc":"2022-02-27T07:25:09.3891403Z","cpu":17.334,"memory":97682600.000,"threadInfo":{"isThreadStarving":"True","threadWaitIntervalInMs":42985.723,"availableThreads":32764,"minThreads":1,"maxThreads":32767}},{"dateUtc":"2022-02-27T07:25:41.9807223Z","cpu":30.114,"memory":91675256.000,"threadInfo":{"isThreadStarving":"True","threadWaitIntervalInMs":75577.3049,"availableThreads":32739,"minThreads":1,"maxThreads":32767}},{"dateUtc":"2022-02-27T07:26:08.8806595Z","cpu":28.615,"memory":89361728.000,"threadInfo":{"isThreadStarving":"True","threadWaitIntervalInMs":102477.2429,"availableThreads":32708,"minThreads":1,"maxThreads":32767}},{"dateUtc":"2022-02-27T07:33:12.5824814Z","cpu":27.779,"memory":96095100.000,"threadInfo":{"isThreadStarving":"True","threadWaitIntervalInMs":526179.0633,"availableThreads":32588,"minThreads":1,"maxThreads":32767}}]}},"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RetryHandler","id":"ee6dd3a8-922f-43d9-bb22-08e29199a642","start time":"07:25:07:128","duration in milliseconds":539254.4162,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RouterHandler","id":"cb058dbf-b588-4f17-9a77-a2af68d5afc9","start time":"07:25:07:128","duration in 
milliseconds":539254.3955,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.TransportHandler","id":"eac1ff51-26ba-48f3-a0ba-ebedf7902369","start time":"07:25:07:128","duration in milliseconds":539254.3914,"children":[{"name":"Microsoft.Azure.Cosmos.GatewayStoreModel Transport Request","id":"1875dfce-2e0d-4d26-8382-639f91c3f3f9","start time":"07:25:07:128","duration in milliseconds":539254.2627,"data":{"Client Side Request Stats":{"Id":"AggregatedClientSideRequestStatistics","ContactedReplicas":[],"RegionsContacted":["https://detection-cyberprofiles-prd-weu-westeurope.documents.azure.com/dbs/leases/colls/cyberprofilesleases/docs"],"FailedReplicas":[],"AddressResolutionStatistics":[],"StoreResponseStatistics":[],"HttpResponseStats":[{"StartTimeUTC":"2022-02-27T07:25:07.1292981Z","DurationInMs":1956.2947,"RequestUri":"https://detection-cyberprofiles-prd-weu-westeurope.documents.azure.com/dbs/leases/colls/cyberprofilesleases/docs","ResourceType":"Document","HttpMethod":"POST","ActivityId":"87527a6c-c571-4e41-a3d5-ecb38a10ae1b","ExceptionType":"System.Threading.Tasks.TaskCanceledException","ExceptionMessage":"A task was canceled."},{"StartTimeUTC":"2022-02-27T07:25:09.0858138Z","DurationInMs":28799.4182,"RequestUri":"https://detection-cyberprofiles-prd-weu-westeurope.documents.azure.com/dbs/leases/colls/cyberprofilesleases/docs","ResourceType":"Document","HttpMethod":"POST","ActivityId":"87527a6c-c571-4e41-a3d5-ecb38a10ae1b","ExceptionType":"System.Threading.Tasks.TaskCanceledException","ExceptionMessage":"A task was canceled."},{"StartTimeUTC":"2022-02-27T07:25:53.0838820Z","DurationInMs":493299.0657,"RequestUri":"https://detection-cyberprofiles-prd-weu-westeurope.documents.azure.com/dbs/leases/colls/cyberprofilesleases/docs","ResourceType":"Document","HttpMethod":"POST","ActivityId":"87527a6c-c571-4e41-a3d5-ecb38a10ae1b","ExceptionType":"System.Threading.Tasks.TaskCanceledException","ExceptionMessage":"A task was canceled."}]}}}]}]}]}]}]}]}]}
Response Body: GatewayStoreClient Request Timeout. Start Time UTC:02/27/2022 07:25:07; Total Duration:539253.7587 Ms; Request Timeout 65000 Ms; Http Client Timeout:65000 Ms; Activity id: 87527a6c-c571-4e41-a3d5-ecb38a10ae1b;
Status Code: RequestTimeout
Sub Status Code: 0
Request Charge: 0
Activity Id: 87527a6c-c571-4e41-a3d5-ecb38a10ae1b
Retry After:
Headers: Microsoft.Azure.Cosmos.Headers
Diagnostics: {"Summary":{"RegionsContacted":1,"GatewayCalls":{"(0, 0)":3}},"name":"FeedIterator Read Next Async","id":"71f68f9b-29ae-4073-ae37-a1b63149dcd0","start time":"07:25:07:127","duration in milliseconds":547953.186,"data":{"Client Configuration":{"Client Created Time Utc":"2022-02-27T07:23:05.4860608Z","NumberOfClientsCreated":1,"User Agent":"cosmos-netstandard-sdk/3.25.0|3.24.1|1|X64|Linux 5.4.0-1070-azure 73 18.|.NET Core 3.1.22|L|","ConnectionConfig":{"gw":"(cps:50, urto:60, p:False, httpf: False)","rntbd":"(cto: 5, icto: 18000, mrpc: 30, mcpe: 5000, erd: True, pr: ReuseUnicastPort)","other":"(ed:False, be:False)"},"ConsistencyConfig":"(consistency: NotSet, prgns:[West Europe, North Europe])"},"Query Correlated ActivityId":"b968d8bf-f71b-4ab2-8a6b-9d46966e9935"},"children":[{"name":"Create Query Pipeline","id":"928fd61d-b9ed-4b1f-ad66-e08344f2ebf3","start time":"07:25:07:128","duration in milliseconds":539255.3269,"children":[{"name":"Get Container Properties","id":"4a96be03-e531-4fd3-882d-36691e37dc72","start time":"07:25:07:128","duration in milliseconds":0.0104,"children":[{"name":"Get Collection Cache","id":"51ce4f81-be89-469b-a51c-ec7a2e805ff7","start time":"07:25:07:128","duration in milliseconds":0.0013}]},{"name":"Gateway QueryPlan","id":"b3be9087-d9fa-467b-99d9-1980b3195b68","start time":"07:25:07:128","duration in milliseconds":1.0805},{"name":"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler","id":"494e67de-9e99-4678-85c5-1fc4c268b6af","start time":"07:25:07:128","duration in milliseconds":539254.4997,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler","id":"7c9e09c5-4b58-4c31-a8bb-04489ae93c85","start time":"07:25:07:128","duration in milliseconds":539254.4286,"data":{"System 
Info":{"systemHistory":[{"dateUtc":"2022-02-27T07:24:47.2851573Z","cpu":12.596,"memory":97107412.000,"threadInfo":{"isThreadStarving":"True","threadWaitIntervalInMs":20881.7367,"availableThreads":32764,"minThreads":1,"maxThreads":32767}},{"dateUtc":"2022-02-27T07:24:57.4139160Z","cpu":22.459,"memory":98228292.000,"threadInfo":{"isThreadStarving":"True","threadWaitIntervalInMs":31010.497,"availableThreads":32760,"minThreads":1,"maxThreads":32767}},{"dateUtc":"2022-02-27T07:25:09.3891403Z","cpu":17.334,"memory":97682600.000,"threadInfo":{"isThreadStarving":"True","threadWaitIntervalInMs":42985.723,"availableThreads":32764,"minThreads":1,"maxThreads":32767}},{"dateUtc":"2022-02-27T07:25:41.9807223Z","cpu":30.114,"memory":91675256.000,"threadInfo":{"isThreadStarving":"True","threadWaitIntervalInMs":75577.3049,"availableThreads":32739,"minThreads":1,"maxThreads":32767}},{"dateUtc":"2022-02-27T07:26:08.8806595Z","cpu":28.615,"memory":89361728.000,"threadInfo":{"isThreadStarving":"True","threadWaitIntervalInMs":102477.2429,"availableThreads":32708,"minThreads":1,"maxThreads":32767}},{"dateUtc":"2022-02-27T07:33:12.5824814Z","cpu":27.779,"memory":96095100.000,"threadInfo":{"isThreadStarving":"True","threadWaitIntervalInMs":526179.0633,"availableThreads":32588,"minThreads":1,"maxThreads":32767}}]}},"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RetryHandler","id":"ee6dd3a8-922f-43d9-bb22-08e29199a642","start time":"07:25:07:128","duration in milliseconds":539254.4162,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RouterHandler","id":"cb058dbf-b588-4f17-9a77-a2af68d5afc9","start time":"07:25:07:128","duration in milliseconds":539254.3955,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.TransportHandler","id":"eac1ff51-26ba-48f3-a0ba-ebedf7902369","start time":"07:25:07:128","duration in milliseconds":539254.3914,"children":[{"name":"Microsoft.Azure.Cosmos.GatewayStoreModel Transport Request","id":"1875dfce-2e0d-4d26-8382-639f91c3f3f9","start 
time":"07:25:07:128","duration in milliseconds":539254.2627,"data":{"Client Side Request Stats":{"Id":"AggregatedClientSideRequestStatistics","ContactedReplicas":[],"RegionsContacted":["https://detection-cyberprofiles-prd-weu-westeurope.documents.azure.com/dbs/leases/colls/cyberprofilesleases/docs"],"FailedReplicas":[],"AddressResolutionStatistics":[],"StoreResponseStatistics":[],"HttpResponseStats":[{"StartTimeUTC":"2022-02-27T07:25:07.1292981Z","DurationInMs":1956.2947,"RequestUri":"https://detection-cyberprofiles-prd-weu-westeurope.documents.azure.com/dbs/leases/colls/cyberprofilesleases/docs","ResourceType":"Document","HttpMethod":"POST","ActivityId":"87527a6c-c571-4e41-a3d5-ecb38a10ae1b","ExceptionType":"System.Threading.Tasks.TaskCanceledException","ExceptionMessage":"A task was canceled."},{"StartTimeUTC":"2022-02-27T07:25:09.0858138Z","DurationInMs":28799.4182,"RequestUri":"https://detection-cyberprofiles-prd-weu-westeurope.documents.azure.com/dbs/leases/colls/cyberprofilesleases/docs","ResourceType":"Document","HttpMethod":"POST","ActivityId":"87527a6c-c571-4e41-a3d5-ecb38a10ae1b","ExceptionType":"System.Threading.Tasks.TaskCanceledException","ExceptionMessage":"A task was canceled."},{"StartTimeUTC":"2022-02-27T07:25:53.0838820Z","DurationInMs":493299.0657,"RequestUri":"https://detection-cyberprofiles-prd-weu-westeurope.documents.azure.com/dbs/leases/colls/cyberprofilesleases/docs","ResourceType":"Document","HttpMethod":"POST","ActivityId":"87527a6c-c571-4e41-a3d5-ecb38a10ae1b","ExceptionType":"System.Threading.Tasks.TaskCanceledException","ExceptionMessage":"A task was canceled."}]}}}]}]}]}]}]}]}]}
Target Site: Void MoveNext()
Help Link:
Source: Microsoft.Azure.Cosmos.Client
HResult: -2146233088

Stack Trace Details

at Microsoft.Azure.Cosmos.ChangeFeed.ChangeFeedEstimatorIterator.ReadNextAsync(ITrace trace, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](ITrace trace, Func`2 task)
at Microsoft.Azure.Cosmos.ClientContextCore.OperationHelperWithRootTraceAsync[TResult](String operationName, RequestOptions requestOptions, Func`2 task, TraceComponent traceComponent, TraceLevel traceLevel)
at Detection.GlobalProfileSynchronizer.Service.ChangeFeedEstimatorBackgroundService.GetChangeFeedEstimationForSingleCollectionAsync(String collectionName, ChangeFeedEstimator changeFeedEstimator, CancellationToken cancellationToken) in /app/Detection.GlobalProfileSynchronizer/Service/ChangeFeedEstimatorBackgroundService.cs:line 159

Exception Details

Exception Type: System.Threading.Tasks.TaskCanceledException
Task: System.Threading.Tasks.Task`1[System.Net.Http.HttpConnection]
Cancellation Token: System.Threading.CancellationToken
Target Site: Void ThrowForNonSuccess(System.Threading.Tasks.Task)
Message: A task was canceled.
Additional Data:
--> Message: GatewayStoreClient Request Timeout. Start Time UTC:02/27/2022 07:25:07; Total Duration:539253.7587 Ms; Request Timeout 65000 Ms; Http Client Timeout:65000 Ms; Activity id: 87527a6c-c571-4e41-a3d5-ecb38a10ae1b;
Help Link:
Source: System.Private.CoreLib
HResult: -2146233029

Stack Trace Details

at System.Net.Http.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.FinishS","Source":"ChangeFeedEstimatorBackgroundService","CustomFields":{"EventId":{"Id":4746}},"ContextId":"","WdatpExceptionFormatter":{"IncludeStackTrace":false},"Exception":{"Type":"Microsoft.Azure.Cosmos.CosmosException","Message":"Response status code does not indicate success: RequestTimeout (408); Substatus: 0; ActivityId: 87527a6c-c571-4e41-a3d5-ecb38a10ae1b; Reason: (GatewayStoreClient Request Timeout. Start Time UTC:02/27/2022 07:25:07; Total Duration:539253.7587 Ms; Request Timeout 65000 Ms; Http Client Timeout:65000 Ms; Activity id: 87527a6c-c571-4e41-a3d5-ecb38a10ae1b;);; Diagnostics:{"Summary":{"RegionsContacted":1,"GatewayCalls":{"(0, 0)":3}},"name":"FeedIterator Read Next Async","id":"71f68f9b-29ae-4073-ae37-a1b63149dcd0","start time":"07:25:07:127","duration in milliseconds":547953.186,"data":{"Client Configuration":{"Client Created Time Utc":"2022-02-27T07:23:05.4860608Z","NumberOfClientsCreated":1,"User Agent":"cosmos-netstandard-sdk/3.25.0|3.24.1|1|X64|Linux 5.4.0-1070-azure 73 18.|.NET Core 3.1.22|L|","ConnectionConfig":{"gw":"(cps:50, urto:60, p:False, httpf: False)","rntbd":"(cto: 5, icto: 18000, mrpc: 30, mcpe: 5000, erd: True, pr: ReuseUnicastPort)","other":"(ed:False, b
pod_name: globalprofilesynchronizer-estimator-6bcb84f47-x2zpm
DC: WEU
service_name: globalprofilesynchronizer-estimator
node: k8s-prdweu6p4-32499165-vmss000hmn
Target Site: Void MoveNext()
Help Link:
Source: Microsoft.Azure.Cosmos.Client
HResult: -2146233088

Stack Trace Details

at Microsoft.Azure.Cosmos.ChangeFeed.ChangeFeedEstimatorIterator.ReadNextAsync(ITrace trace, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](ITrace trace, Func`2 task)
at Microsoft.Azure.Cosmos.ClientContextCore.OperationHelperWithRootTraceAsync[TResult](String operationName, RequestOptions requestOptions, Func`2 task, TraceComponent traceComponent, TraceLevel traceLevel)
at Detection.GlobalProfileSynchronizer.Service.ChangeFeedEstimatorBackgroundService.GetChangeFeedEstimationForSingleCollectionAsync(String collectionName, ChangeFeedEstimator changeFeedEstimator, CancellationToken cancellationToken) in /app/Detection.GlobalProfileSynchronizer/Service/ChangeFeedEstimatorBackgroundService.cs:line 159

Exception Details

Exception Type: System.Threading.Tasks.TaskCanceledException
Task: System.Threading.Tasks.Task`1[System.Net.Http.HttpConnection]
Cancellation Token: System.Threading.CancellationToken
Target Site: Void ThrowForNonSuccess(System.Threading.Tasks.Task)
Message: A task was canceled.
Additional Data:
--> Message: GatewayStoreClient Request Timeout. Start Time UTC:02/27/2022 07:25:07; Total Duration:539253.7587 Ms; Request Timeout 65000 Ms; Http Client Timeout:65000 Ms; Activity id: 87527a6c-c571-4e41-a3d5-ecb38a10ae1b;
Help Link:
Source: System.Private.CoreLib
HResult: -2146233029

Stack Trace Details

at System.Net.Http.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.FinishS","Source":"ChangeFeedEstimatorBackgroundService","CustomFields":{"EventId":{"Id":4746}},"ContextId":"","WdatpExceptionFormatter":{"IncludeStackTrace":false},"Exception":{"Type":"Microsoft.Azure.Cosmos.CosmosException","Message":"Response status code does not indicate success: RequestTimeout (408); Substatus: 0; ActivityId: 87527a6c-c571-4e41-a3d5-ecb38a10ae1b; Reason: (GatewayStoreClient Request Timeout. Start Time UTC:02/27/2022 07:25:07; Total Duration:539253.7587 Ms; Request Timeout 65000 Ms; Http Client Timeout:65000 Ms; Activity id: 87527a6c-c571-4e41-a3d5-ecb38a10ae1b;);; Diagnostics:{"Summary":{"RegionsContacted":1,"GatewayCalls":{"(0, 0)":3}},"name":"FeedIterator Read Next Async","id":"71f68f9b-29ae-4073-ae37-a1b63149dcd0","start time":"07:25:07:127","duration in milliseconds":547953.186,"data":{"Client Configuration":{"Client Created Time Utc":"2022-02-27T07:23:05.4860608Z","NumberOfClientsCreated":1,"User Agent":"cosmos-netstandard-sdk/3.25.0|3.24.1|1|X64|Linux 5.4.0-1070-azure 73 18.|.NET Core 3.1.22|L|","ConnectionConfig":{"gw":"(cps:50, urto:60, p:False, httpf: False)","rntbd":"(cto: 5, icto: 18000, mrpc: 30, mcpe: 5000, erd: True, pr: ReuseUnicastPort)","other":"(ed:False, b
container_name: wdatp-service
LevelId:

Still investigating the issue.

@bengabayAtM
Author

If CPU is at 100% then that will affect network operations. Are you running the Estimator on the same machine as the Processor? Have you investigated the source of the high CPU?

The Estimator is running alone on its machine; the Processor is running on a different machine. Still looking for the source of the high CPU.

@ealsur
Member

ealsur commented Feb 28, 2022

Do you have any code that might be doing .Wait()/.Result/.GetAwaiter().GetResult()/Task.Run().Result?

The "threadWaitIntervalInMs":42985.723 numbers are indeed pointing at something going on the thread-pool.

@bengabayAtM
Author

bengabayAtM commented Mar 2, 2022

We have this section that calls GetAwaiter().GetResult() on RunInParallelAsync, which returns a task:
[screenshot]

RunInParallelAsync is our own implementation; it uses Task.WhenAll() to wait for all tasks. As you can see, it takes a collection and a method that operates on that collection. The collection in this case has only 2 items: 1 changeFeedEstimator for each Cosmos DB collection we are tracking.

Regarding GetChangeFeedEstimationAsyncWrapper: it is a method that wraps the original call to GetChangeFeedEstimationForSingleCollectionAsync (which is where we actually call ReadNextAsync). We also use it to create our own timeout task on the client side (our side).
It looks like this:
[screenshot]

And the interesting part of the _ is this, where you can also see the ReadNextAsync call; our code gets stuck there for a long time and CPU goes to 100%.
[screenshot]

This is what RunInParallelAsync looks like:
[screenshot]
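Since the screenshots above are not captured as text in this thread, here is a minimal sketch of what a Task.WhenAll-based RunInParallelAsync helper of the shape described above typically looks like. The name matches the thread; the signature and body are assumptions, not the reporter's actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelHelper
{
    // Hypothetical reconstruction: run the given async operation over every
    // item in the collection and complete when all of them complete
    // (Task.WhenAll based, as described in the comment above).
    public static Task RunInParallelAsync<T>(IEnumerable<T> items, Func<T, Task> operation)
    {
        return Task.WhenAll(items.Select(operation));
    }
}
```

Note that calling .GetAwaiter().GetResult() on the returned task blocks the calling thread; awaiting it keeps the call non-blocking.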

any suggestion?

@ealsur
Member

ealsur commented May 26, 2022

Yes, as a suggestion: remove the GetAwaiter().GetResult(); you are blocking threads. Use the async/await pattern instead.

Or change the page size to 1 on ChangeFeedEstimatorRequestOptions, that might help reduce the parallelism.
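For reference, limiting the page size on the estimator would look roughly like this. MaxItemCount is the page-size knob on ChangeFeedEstimatorRequestOptions (null means the service decides); treating this as a mitigation is the suggestion above, not a confirmed fix, and the surrounding variable names are assumptions:

```csharp
using Microsoft.Azure.Cosmos;

// Sketch: limit each ReadNextAsync page to a single lease state,
// reducing the work done per iteration.
var options = new ChangeFeedEstimatorRequestOptions
{
    MaxItemCount = 1
};

// changeFeedEstimator is assumed to be an existing ChangeFeedEstimator instance.
using FeedIterator<ChangeFeedProcessorState> iterator =
    changeFeedEstimator.GetCurrentStateIterator(options);
```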

@bengabayAtM
Author

Hi @ealsur. So I have simplified this service a lot, and nothing helps. (We also use ChangeFeedEstimatorRequestOptions.MaxItemCount = null, if that is what you meant by changing the page size to 1.)
This service used to have 2 instances running; now it only has 1. The way it works: we get 2 different changeFeedEstimators (1 for every collection; we have 2 collections we are monitoring in total):

currently, the code is like this:

            foreach (ChangeFeedConfiguration changeFeedConfiguration in m_changeFeedConfigurationList)
            {
                Container leaseContainer = m_cosmosClient.GetContainer(changeFeedConfiguration.LeasesDatabaseName, changeFeedConfiguration.LeasesContainerName);
                Container monitoredContainer = m_cosmosClient.GetContainer(changeFeedConfiguration.MonitoredDatabaseName, changeFeedConfiguration.MonitoredContainerName);

                m_collectionNameToChangeFeedEstimatorMap.TryAdd(changeFeedConfiguration.MonitoredContainerName, monitoredContainer.GetChangeFeedEstimator(GlobalProfileSynchronizerBackgroundService.ProcessorName, leaseContainer));
            }

            m_maxParallelism = m_collectionNameToChangeFeedEstimatorMap.Count;

            m_tracer.TraceInformation($"Starting Change Feed Estimator");

            // Start periodic estimation; this will call GetChangeFeedEstimation every EstimationPeriod (1 minute).
            m_changeFeedStatePoller = new PollingProvider().CreatePoller(
                m_monitoringClient,
                () => GetChangeFeedEstimation(cancellationToken),
                m_estimatorConfiguration.EstimationPeriod,
                "ChangeFeedStatePoller",
                invokeImmediately: false);

            m_tracer.TraceInformation("Change Feed Estimator started.");

The GetChangeFeedEstimation function looks like this. It basically creates 2 tasks that run in parallel and calculate the lag estimation for each collection (only 2 tasks, since we monitor only 2 collections). Then we wait for both tasks to complete (we wait no more than 50 seconds, the EstimationTimeout we configured):

        private void GetChangeFeedEstimation(CancellationToken cancellationToken)
        {
            IList<Task> tasks = new List<Task>();
            ContextId contextId = new ContextId();
            using (CancellationTokenSource cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken))
            {
                cts.CancelAfter(m_estimatorConfiguration.EstimationTimeout);
                foreach (var kvp in m_collectionNameToChangeFeedEstimatorMap)
                {
                    tasks.Add(GetChangeFeedEstimationForSingleCollectionAsync(collectionName: kvp.Key, changeFeedEstimator: kvp.Value, contextId, cts.Token));
                }

                Task allCompleted = Task.WhenAll(tasks);
                try
                {
                    allCompleted.Wait(cts.Token); // waiting for the estimation lag to be calculated for all collections in parallel
                    m_tracer.TraceInformation(contextId, $"Finished to calculate change feed estimation for all collections in less than {m_estimatorConfiguration.EstimationTimeout.TotalSeconds} seconds. collections: {string.Join(", ", m_collectionNameToChangeFeedEstimatorMap.Keys)}");
                }
                catch (Exception ex)
                {
                    m_tracer.TraceError(ex, contextId, $"Unexpected exception, failed to get change feed estimation for all collections in less than {m_estimatorConfiguration.EstimationTimeout.TotalSeconds} seconds. collections: {string.Join(", ", m_collectionNameToChangeFeedEstimatorMap.Keys)}");
                }
            }
        }

And the GetChangeFeedEstimationForSingleCollectionAsync function looks like this. It is where all the problems happen, and it is why I still think you have a bug somewhere in your ReadNextAsync function...
Every time ReadNextAsync is called, it drives our CPU to 100%, and after about 60-70 seconds we get a CosmosOperationCanceledException.

        private async Task GetChangeFeedEstimationForSingleCollectionAsync(string collectionName, ChangeFeedEstimator changeFeedEstimator, ContextId contextId, CancellationToken cancellationToken)
        {
            using (var op = m_actionScopeMetric.Start("GetChangeFeedEstimation", collectionName))
            {
                m_tracer.TraceInformation(contextId, $"Checking estimation for collection: {collectionName}...\nIsCancellationRequested: {cancellationToken.IsCancellationRequested}");

                ThreadPool.GetAvailableThreads(out int workerThreads, out int completionPortThreads);
                m_tracer.TraceInformation(contextId, $"number of available worker threads: {workerThreads}, number of available asynchronous I/O threads: {completionPortThreads}");

                Stopwatch sw = Stopwatch.StartNew();

                try
                {
                    cancellationToken.ThrowIfCancellationRequested();

                    List<ChangeFeedProcessorState> results = new List<ChangeFeedProcessorState>();
                    long totalEstimation = 0;

                    using (FeedIterator<ChangeFeedProcessorState> estimatorIterator = changeFeedEstimator.GetCurrentStateIterator(m_changeFeedEstimatorRequestOptions))
                    {
                        m_tracer.TraceInformation(contextId, $"starting to read states from changeFeedEstimator, collection: {collectionName}");
                        while (estimatorIterator.HasMoreResults)
                        {
                            cancellationToken.ThrowIfCancellationRequested();
                            FeedResponse<ChangeFeedProcessorState> states = await estimatorIterator.ReadNextAsync(cancellationToken).ConfigureAwait(false);
                            m_tracer.TraceInformation(contextId, $"num of states read so far: {results.Count}. Read {states.Count()} states, collection: {collectionName}. time Elapsed: {sw.ElapsedMilliseconds}");
                            m_reportRusMetric.LogValue((int)states.RequestCharge, HttpStatusCode.Accepted.AsString(), collectionName);
                            results.AddRange(states);
                        }
                    }

                    if (results.Count > 0)
                    {
                        var estimatedLagByInstance = results.GroupBy(leaseState => leaseState.InstanceName)
                            .Select(g => (InstanceName: g.Key, OwnedLeases: g.Count(), EstimatedLag: g.Sum(estimation => estimation.EstimatedLag)));

                        foreach (var instanceEstimation in estimatedLagByInstance)
                        {
                            string host = instanceEstimation.InstanceName == null ? $"not owned by any host currently" : $"owned by host {instanceEstimation.InstanceName}";
                            m_tracer.TraceInformation(contextId, $"{instanceEstimation.OwnedLeases} leases {host} reports sum of {instanceEstimation.EstimatedLag} as estimated lag in collection: {collectionName}. time Elapsed: {sw.ElapsedMilliseconds}");

                            m_changeFeedBacklogByInstanceMetric.LogValue(instanceEstimation.EstimatedLag, instanceEstimation.InstanceName ?? "NotOwned", collectionName);
                            totalEstimation += instanceEstimation.EstimatedLag;
                        }
                    }

                    m_changeFeedBacklogMetric.LogValue(totalEstimation, collectionName);
                    m_tracer.TraceInformation(contextId, $"Backlog estimation: {totalEstimation} in collection: {collectionName}, number of states read: {results.Count}. Elapsed: {sw.ElapsedMilliseconds}");
                }
                catch (OperationCanceledException oce)
                {
                    m_tracer.TraceError(oce, contextId, $"Failed to get change feed estimation for collection: {collectionName}, operation was canceled, Elapsed: {sw.ElapsedMilliseconds}");
                    op.Failure();
                }
                catch (Exception e)
                {
                    CosmosException cosmosException = GetCosmosClientException(e);
                    if (cosmosException != null)
                    {
                        m_reportRusMetric.LogValue((int)cosmosException.RequestCharge, cosmosException.StatusCode.AsString(), collectionName);
                    }

                    m_tracer.TraceError(e, contextId, $"Failed to get change feed estimation in collection: {collectionName}. Elapsed: {sw.ElapsedMilliseconds}.{(cosmosException != null ? $" StatusCode: {cosmosException.StatusCode}" : string.Empty)}");
                    op.Failure();
                }
            }
        }

Here is an example of an exception. If you still think the problem is in the way we use the function, or generally in our code, please let me know. I will also open a post on StackOverflow; maybe someone will be able to help me there.

GetChangeFeedEstimationForSingleCollectionAsync: Ctx-651a1ab6: Failed to get change feed estimation for collection: cyberprofiles, operation was canceled, Elapsed: 57100, Exception:

Exception Details

Exception Type: Microsoft.Azure.Cosmos.CosmosOperationCanceledException
Source: System.Private.CoreLib
Message: The operation was canceled.
CosmosDiagnostics: {"Summary":{},"name":"Change Feed Estimator Read Next Async","id":"dbc7444c-a468-4deb-ab46-f322c2a5fc48","start time":"11:31:08:143","duration in milliseconds":53389.1395,"data":{"Client Configuration":{"Client Created Time Utc":"2022-05-29T11:15:16.5436709Z","NumberOfClientsCreated":1,"User Agent":"cosmos-netstandard-sdk/3.25.0|3.24.1|1|X64|Linux 5.4.0-1080-azure 83 18.|.NET 5.0.17|L|","ConnectionConfig":{"gw":"(cps:50, urto:60, p:False, httpf: False)","rntbd":"(cto: 5, icto: 18000, mrpc: 30, mcpe: 5000, erd: True, pr: ReuseUnicastPort)","other":"(ed:False, be:False)"},"ConsistencyConfig":"(consistency: NotSet, prgns:[Central US, East US 2])"},"Operation Cancelled Exception":"System.OperationCanceledException: The operation was canceled.\n
at System.Threading.CancellationToken.ThrowOperationCanceledException()\n
at Microsoft.Azure.Cosmos.Common.CollectionCache.ResolveByNameAsync(String apiVersion, String resourceAddress, Boolean forceRefesh, ITrace trace, IClientSideRequestStatistics clientSideRequestStatistics, CancellationToken cancellationToken)\n
at Microsoft.Azure.Cosmos.ContainerCore.GetCachedContainerPropertiesAsync(Boolean forceRefresh, ITrace trace, CancellationToken cancellationToken)\n
at Microsoft.Azure.Cosmos.ContainerCore.GetCachedRIDAsync(Boolean forceRefresh, ITrace trace, CancellationToken cancellationToken)\n
at Microsoft.Azure.Cosmos.ChangeFeed.Utils.CosmosContainerExtensions.GetMonitoredDatabaseAndContainerRidAsync(Container monitoredContainer, CancellationToken cancellationToken)\n
at Microsoft.Azure.Cosmos.ChangeFeed.ChangeFeedEstimatorIterator.InitializeLeaseStoreAsync(ITrace trace, CancellationToken cancellationToken)\n
at Microsoft.Azure.Cosmos.ChangeFeed.ChangeFeedEstimatorIterator.ReadNextAsync(ITrace trace, CancellationToken cancellationToken)\n
at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](ITrace trace, Func2 task)"},"children":[{"name":"Initialize Lease Store","id":"7ea232b8-0b1d-4796-9de6-05d501b20056","start time":"11:31:51:728","duration in milliseconds":403.3695}]} Diagnostics: {"Summary":{},"name":"Change Feed Estimator Read Next Async","id":"dbc7444c-a468-4deb-ab46-f322c2a5fc48","start time":"11:31:08:143","duration in milliseconds":53389.1395,"data":{"Client Configuration":{"Client Created Time Utc":"2022-05-29T11:15:16.5436709Z","NumberOfClientsCreated":1,"User Agent":"cosmos-netstandard-sdk/3.25.0|3.24.1|1|X64|Linux 5.4.0-1080-azure 83 18.|.NET 5.0.17|L|","ConnectionConfig":{"gw":"(cps:50, urto:60, p:False, httpf: False)","rntbd":"(cto: 5, icto: 18000, mrpc: 30, mcpe: 5000, erd: True, pr: ReuseUnicastPort)","other":"(ed:False, be:False)"},"ConsistencyConfig":"(consistency: NotSet, prgns:[Central US, East US 2])"},"Operation Cancelled Exception":"System.OperationCanceledException: The operation was canceled.\n at System.Threading.CancellationToken.ThrowOperationCanceledException()\n at Microsoft.Azure.Cosmos.Common.CollectionCache.ResolveByNameAsync(String apiVersion, String resourceAddress, Boolean forceRefesh, ITrace trace, IClientSideRequestStatistics clientSideRequestStatistics, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ContainerCore.GetCachedContainerPropertiesAsync(Boolean forceRefresh, ITrace trace, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ContainerCore.GetCachedRIDAsync(Boolean forceRefresh, ITrace trace, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ChangeFeed.Utils.CosmosContainerExtensions.GetMonitoredDatabaseAndContainerRidAsync(Container monitoredContainer, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ChangeFeed.ChangeFeedEstimatorIterator.InitializeLeaseStoreAsync(ITrace trace, CancellationToken cancellationToken)\n at 
Microsoft.Azure.Cosmos.ChangeFeed.ChangeFeedEstimatorIterator.ReadNextAsync(ITrace trace, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](ITrace trace, Func2 task)"},"children":[{"name":"Initialize Lease Store","id":"7ea232b8-0b1d-4796-9de6-05d501b20056","start time":"11:31:51:728","duration in milliseconds":403.3695}]}
Help Link:
Cancellation Token: System.Threading.CancellationToken
Target Site: Void MoveNext()
HResult: -2146233029

Stack Trace Details

at System.Threading.CancellationToken.ThrowOperationCanceledException()
at Microsoft.Azure.Cosmos.Common.CollectionCache.ResolveByNameAsync(String apiVersion, String resourceAddress, Boolean forceRefesh, ITrace trace, IClientSideRequestStatistics clientSideRequestStatistics, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.ContainerCore.GetCachedContainerPropertiesAsync(Boolean forceRefresh, ITrace trace, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.ContainerCore.GetCachedRIDAsync(Boolean forceRefresh, ITrace trace, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.ChangeFeed.Utils.CosmosContainerExtensions.GetMonitoredDatabaseAndContainerRidAsync(Container monitoredContainer, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.ChangeFeed.ChangeFeedEstimatorIterator.InitializeLeaseStoreAsync(ITrace trace, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.ChangeFeed.ChangeFeedEstimatorIterator.ReadNextAsync(ITrace trace, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](ITrace trace, Func`2 task)

P.S. Does our 50-second timeout seem enough in your opinion, or should we make it larger? I didn't touch it, since we still get 100% CPU every time ReadNextAsync is called, so I never thought making the timeout longer would help...

Thanks for your time and help.

@ealsur
Member

ealsur commented May 30, 2022

The exception you are sharing just shows a CancellationToken cancelling:

System.OperationCanceledException: The operation was canceled.\n at System.Threading.CancellationToken.ThrowOperationCanceledException

The code is not showing what m_changeFeedEstimatorRequestOptions contains, but are you setting the MaxItemCount to 1 (not null)?

It's impossible to know why your CPU is at 100% while inside GetChangeFeedEstimationForSingleCollectionAsync just by looking at the code; you should take a dump and do CPU analysis to understand where the time is spent.

Another point is, your code is still blocking threads:

allCompleted.Wait(cts.Token); // waiting for the estimation lag to be calculated for all collections in parallel

Reference: https://github.com/davidfowl/AspNetCoreDiagnosticScenarios/blob/master/AsyncGuidance.md#avoid-using-taskresult-and-taskwait
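A non-blocking shape for that wait could look like the sketch below: await Task.WhenAll under a linked timeout token instead of blocking on allCompleted.Wait(cts.Token). The helper name and signature are hypothetical, not part of the SDK or the reporter's code:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class EstimationLoop
{
    // Sketch: run all operations with a shared timeout, awaiting rather than
    // blocking. Awaiting Task.WhenAll releases the calling thread back to the
    // pool, which is the point of the AsyncGuidance reference above.
    public static async Task RunAllWithTimeoutAsync(
        IEnumerable<Func<CancellationToken, Task>> operations,
        TimeSpan timeout,
        CancellationToken cancellationToken)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(timeout);

        var tasks = new List<Task>();
        foreach (var op in operations)
        {
            tasks.Add(op(cts.Token));
        }

        await Task.WhenAll(tasks).ConfigureAwait(false);
    }
}
```

The caller (the poller delegate in this thread) would then need to accept an async delegate instead of a void one; whether PollingProvider supports that is not shown in the thread.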

@ealsur
Member

ealsur commented Jun 29, 2022

Closing due to lack of activity. Please re-open with more information or insights if needed.

@ealsur ealsur closed this as completed Jun 29, 2022