FeedIterator.ReadNextAsync bug #3047
Comments
If CPU is at 100%, then that will affect network operations. Are you running the Estimator on the same machine as the Processor? Have you investigated the source of the high CPU?
Hi all, thanks for the quick response. I am working right now with this guide:
update: env_time: 2022-02-27 07:34:59.881
Exception Details
Exception Type: Microsoft.Azure.Cosmos.CosmosException
Stack Trace Details
at Microsoft.Azure.Cosmos.ChangeFeed.ChangeFeedEstimatorIterator.ReadNextAsync(ITrace trace, CancellationToken cancellationToken)
Exception Details
Exception Type: System.Threading.Tasks.TaskCanceledException
Stack Trace Details
at System.Net.Http.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
Still investigating the issue.
The Estimator is running alone on its machine; the Processor is running on a different machine. Still looking for the source of the high CPU.
Do you have any code that might be doing .Wait() / .Result / .GetAwaiter().GetResult() / Task.Run(...).Result?
Yes. As a suggestion, remove that blocking code, or change the page size to 1 on the ChangeFeedEstimatorRequestOptions.
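For illustration, a minimal sketch of that page-size suggestion (a hedged reconstruction, not code from this thread; the helper name is a placeholder):

```csharp
using Microsoft.Azure.Cosmos;

// Sketch only: build the estimator iterator with a page size of 1 instead of
// leaving MaxItemCount unset (null).
static FeedIterator<ChangeFeedProcessorState> CreateSmallPageIterator(
    ChangeFeedEstimator changeFeedEstimator)
{
    var options = new ChangeFeedEstimatorRequestOptions { MaxItemCount = 1 };
    return changeFeedEstimator.GetCurrentStateIterator(options);
}
```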
Hi @ealsur. I have simplified this service a lot and nothing helps. (We also use ChangeFeedEstimatorRequestOptions.MaxItemCount = null, if this is what you meant by changing the page size to 1.) Currently, the code looks like this:
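A rough sketch of the kind of setup being described (the database name "mydb", lease container name "leases", connection string, and processor name "EstimatorApp" are placeholders, not the actual configuration):

```csharp
using Microsoft.Azure.Cosmos;

// Sketch with placeholder names, not the poster's actual code: one estimator
// for the monitored collection, options left with MaxItemCount = null as described.
CosmosClient client = new CosmosClient("<connection-string>");
Container monitoredContainer = client.GetContainer("mydb", "cyberprofiles");
Container leaseContainer = client.GetContainer("mydb", "leases");

ChangeFeedEstimator changeFeedEstimator =
    monitoredContainer.GetChangeFeedEstimator("EstimatorApp", leaseContainer);

ChangeFeedEstimatorRequestOptions m_changeFeedEstimatorRequestOptions =
    new ChangeFeedEstimatorRequestOptions { MaxItemCount = null };
```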
The GetChangeFeedEstimation function looks like this. It basically creates two tasks that run in parallel and calculate the lag estimation for each collection (only two tasks, since we are monitoring only two collections). We then wait for both tasks to complete, for no more than 50 seconds, which is the EstimationTimeout we configured:
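A hedged sketch of that pattern, assuming one estimator per collection and EstimationTimeout = 50 seconds; the per-collection helper it calls is sketched below. This is a reconstruction, not the exact code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Reconstruction sketch: run one estimation task per monitored collection
// in parallel, bounded by the configured timeout via a cancellation token.
static async Task GetChangeFeedEstimation(
    IReadOnlyList<ChangeFeedEstimator> estimators,   // two estimators in this scenario
    TimeSpan estimationTimeout)                      // e.g. TimeSpan.FromSeconds(50)
{
    using CancellationTokenSource cts = new CancellationTokenSource(estimationTimeout);

    List<Task<long>> tasks = estimators
        .Select(e => GetChangeFeedEstimationForSingleCollectionAsync(e, cts.Token))
        .ToList();

    // Wait for both estimations, but no longer than the configured timeout.
    await Task.WhenAll(tasks).ConfigureAwait(false);
}
```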
And the GetChangeFeedEstimationForSingleCollectionAsync function looks like this. It is the function where all the problems are happening, which is why I still think you have a bug somewhere in your ReadNextAsync implementation:
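A hedged sketch of what such a method could look like, built around GetCurrentStateIterator and ReadNextAsync (logging and error handling omitted; not the poster's exact code):

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Reconstruction sketch: drain the estimator iterator and sum the estimated
// lag across all leases for one monitored collection.
static async Task<long> GetChangeFeedEstimationForSingleCollectionAsync(
    ChangeFeedEstimator estimator,
    CancellationToken cancellationToken)
{
    long totalEstimatedLag = 0;

    using (FeedIterator<ChangeFeedProcessorState> estimatorIterator =
        estimator.GetCurrentStateIterator(
            new ChangeFeedEstimatorRequestOptions { MaxItemCount = null }))
    {
        while (estimatorIterator.HasMoreResults)
        {
            FeedResponse<ChangeFeedProcessorState> states =
                await estimatorIterator.ReadNextAsync(cancellationToken).ConfigureAwait(false);

            foreach (ChangeFeedProcessorState state in states)
            {
                totalEstimatedLag += state.EstimatedLag;
            }
        }
    }

    return totalEstimatedLag;
}
```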
Here is an example of an exception. If you still think the problem is in the way we use the function, or generally in our code, please let me know. I will also open a post on Stack Overflow; maybe someone there will be able to help me.
GetChangeFeedEstimationForSingleCollectionAsync: Ctx-651a1ab6: Failed to get change feed estimation for collection: cyberprofiles, operation was canceled, Elapsed: 57100, Exception:
Exception Details
Exception Type: Microsoft.Azure.Cosmos.CosmosOperationCanceledException
Stack Trace Details
at System.Threading.CancellationToken.ThrowOperationCanceledException()
P.S. Does our 50-second timeout seem long enough in your opinion, or should we make it larger? I didn't touch it, since we still get 100% CPU every time ReadNextAsync is called, so I never thought a longer timeout would help... Thanks for your time and help.
The exception you are sharing just shows a CancellationToken cancelling:
The code you shared does not show everything. It's impossible to know why your CPU is at 100% while inside ReadNextAsync. Another point: your code is still blocking threads:
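To illustrate the point about blocking (a generic example, not taken from the code in question): the first method ties up a thread-pool thread for the whole call, the second does not.

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class EstimationCalls
{
    // Anti-pattern: sync-over-async. .Result / .GetAwaiter().GetResult() /
    // Task.Run(...).Result block a thread-pool thread until the I/O completes.
    public static long GetTotalLagBlocking(FeedIterator<ChangeFeedProcessorState> iterator)
    {
        FeedResponse<ChangeFeedProcessorState> states =
            iterator.ReadNextAsync(CancellationToken.None).GetAwaiter().GetResult();
        return states.Sum(s => s.EstimatedLag);
    }

    // Preferred: stay async end to end so the thread is released while waiting.
    public static async Task<long> GetTotalLagAsync(
        FeedIterator<ChangeFeedProcessorState> iterator,
        CancellationToken cancellationToken)
    {
        FeedResponse<ChangeFeedProcessorState> states =
            await iterator.ReadNextAsync(cancellationToken).ConfigureAwait(false);
        return states.Sum(s => s.EstimatedLag);
    }
}
```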
Closing due to lack of activity. Please re-open with more information or insights if needed. |
Hi, following the request from an Azure Cosmos DB internal discussion, I am filing an issue here regarding a bug we are facing in production (PRD). We are having an issue with the ReadNextAsync method: FeedIterator.ReadNextAsync(CancellationToken) Method (Microsoft.Azure.Cosmos) - Azure for .NET Developers | Microsoft Docs
We are using the FeedIterator class:
using (FeedIterator<ChangeFeedProcessorState> estimatorIterator = changeFeedEstimator.GetCurrentStateIterator(m_changeFeedEstimatorRequestOptions))
where MaxItemCount = null in m_changeFeedEstimatorRequestOptions.
We then call ReadNextAsync:
FeedResponse<ChangeFeedProcessorState> states = await estimatorIterator.ReadNextAsync(cancellationToken).ConfigureAwait(false);
where the cancellationToken should be canceled after 50 seconds:
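A hedged sketch of how the statements above typically combine, assuming a CancellationTokenSource with a 50-second timeout (a reconstruction under assumptions, not the exact production code; the Diagnostics logging is an addition for troubleshooting):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Reconstruction sketch: a token that cancels after 50 seconds, driving the
// usual HasMoreResults/ReadNextAsync loop.
static async Task ReadEstimatorStatesAsync(
    ChangeFeedEstimator changeFeedEstimator,
    ChangeFeedEstimatorRequestOptions requestOptions)   // MaxItemCount = null here
{
    using CancellationTokenSource cts = new CancellationTokenSource(TimeSpan.FromSeconds(50));
    CancellationToken cancellationToken = cts.Token;

    using (FeedIterator<ChangeFeedProcessorState> estimatorIterator =
        changeFeedEstimator.GetCurrentStateIterator(requestOptions))
    {
        while (estimatorIterator.HasMoreResults)
        {
            FeedResponse<ChangeFeedProcessorState> states =
                await estimatorIterator.ReadNextAsync(cancellationToken).ConfigureAwait(false);

            // FeedResponse.Diagnostics can help show where time was spent when
            // a call is unexpectedly slow.
            Console.WriteLine(states.Diagnostics.ToString());
        }
    }
}
```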
What ends up happening is that we call ReadNextAsync and never get a response from that method, and while waiting for the response we suffer from 100% CPU usage.
We tried updating the SDK to the latest version and adding a timeout task on our side; neither solved the issue.
We rolled back to the last known working release and the issue still remains, even though it never happened before on the release currently deployed.
Usually after a couple of minutes, though sometimes it can take as long as 30 minutes, we get an error thrown back from the ReadNextAsync method of one of the following types:
ServiceUnavailable (503)
NotFound (404)
RequestTimeout (408)
Here is one of the latest errors we got, for example:
GetChangeFeedEstimationForSingleCollectionAsync: Ctx-1df0279b, Collection cyberprofiles: Failed to get change feed estimation. Elapsed: 657196. StatusCode: ServiceUnavailable, Exception:
Exception Details
Exception Type: Microsoft.Azure.Cosmos.CosmosException
Message: Response status code does not indicate success: ServiceUnavailable (503); Substatus: 0; ActivityId: 46d15989-1ef0-4630-8b44-be8fa4a38b67; Reason: (Service is currently unavailable. More info: https://aka.ms/cosmosdb-tsg-service-unavailable
ActivityId: 46d15989-1ef0-4630-8b44-be8fa4a38b67, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Linux/10 cosmos-netstandard-sdk/3.21.1);; Diagnostics:{"name":"NoOp","id":"00000000-0000-0000-0000-000000000000","caller info":{"member":"NoOp","file":"NoOp","line":9001},"start time":"12:00:00:000","duration in milliseconds":0}
Response Body: Service is currently unavailable. More info: https://aka.ms/cosmosdb-tsg-service-unavailable
ActivityId: 46d15989-1ef0-4630-8b44-be8fa4a38b67, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Linux/10 cosmos-netstandard-sdk/3.21.1
Status Code: ServiceUnavailable
Sub Status Code: 0
Request Charge: 0
Activity Id: 46d15989-1ef0-4630-8b44-be8fa4a38b67
Retry After:
Headers: Microsoft.Azure.Cosmos.Headers
Diagnostics: {"name":"NoOp","id":"00000000-0000-0000-0000-000000000000","caller info":{"member":"NoOp","file":"NoOp","line":9001},"start time":"12:00:00:000","duration in milliseconds":0}
Target Site: Microsoft.Azure.Cosmos.ResponseMessage EnsureSuccessStatusCode()
Help Link:
Source: Microsoft.Azure.Cosmos.Client
HResult: -2146233088
Stack Trace Details
at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)
at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)
at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy, Func`2 prepareRequestAsyncDelegate)
at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)
Exception Details
Exception Type: Microsoft.Azure.Documents.GoneException
Message: The requested resource is no longer available at the server.
ActivityId: 46d15989-1ef0-4630-8b44-be8fa4a38b67, Linux/10 cosmos-netstandard-sdk/3.21.1
Error: {
"code": "Gone",
"message": "The requested resource is no longer available at the server.\nActivityId: 46d15989-1ef0-4630-8b44-be8fa4a38b67, Linux/10 cosmos-netstandard-sdk/3.21.1"
}
Activity Id: 46d15989-1ef0-4630-8b44-be8fa4a38b67
Retry After: 00:00:00
Response Headers: System.Collections.Specialized.NameValueCollection
Status Code: Gone
Request Charge: 0
Script Log:
Target Site: Void ThrowGoneIfElapsed()
Help Link:
Source: Microsoft.Azure.Cosmos.Direct
HResult: -2146233088
Stack Trace Details
at Microsoft.Azure.Documents.TimeoutHelper.ThrowGoneIfElapsed()
at Microsoft.Azure.Documents.StoreReader.ReadMultipleReplicasInternalAsync(DocumentServiceRequest entity, Boolean includePrimary, Int32 replicaCountToRead, Boolean requiresValidLsn, Boolean useSessionToken, ReadMode readMode, Boolean checkMinLSN, Boolean forceReadAll)
at Microsoft.Azure.Documents.StoreReader.ReadMultipleReplicaAsync(DocumentServiceRequest entity, Boolean includePrimary, Int32 replicaCountToRead, Boolean requiresValidLsn, Boolean useSessionToken, ReadMode readMode, Boolean checkMinLSN, Boolean forceReadAll)
at Microsoft.Azure.Documents.ConsistencyReader.ReadSessionAsync(DocumentServiceRequest entity, ReadMode readMode)
at Microsoft.Azure.Documents.BackoffRetryUtility`1.ExecuteRetryAsync(Func`1 callbackMethod, Func`3 callShouldRetry, Func`1 inBackoffAlternateCallbackMethod, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action`1 preRetryCallback)
at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)
at Microsoft.Azure.Documents.BackoffRetryUtility`1.ExecuteRetryAsync(Func`1 callbackMethod, Func`3 callShouldRetry, Func`1 inBackoffAlternateCallbackMethod, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action`1 preRetryCallback)
at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass30_0.<b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)
Just to clarify: we double-checked the configuration and the existence of the containers and the collection, including the leases. Everything exists, and we can even read changes from it using the ChangeFeedProcessor class, so we have no clue why these errors started. This currently happens only in one region (West Europe).
Thanks,
Ben Gabay.