SDK Warmup requests should be hedged with the AvailabilityStrategy #4672

Open
adamnova opened this issue Sep 11, 2024 · 2 comments
Labels
customer-reported Issue created by a customer needs-investigation

Comments

@adamnova

Is your feature request related to a problem? Please describe.
When the Cosmos SDK starts up, it makes a warm-up request for pkranges to what appears to be a built-in database (its name looks base64-encoded). That request fails if the primary Cosmos region is unavailable for any reason. Since this is a read request, I believe it should be covered by the hedging availability strategy, so that clients are not blocked from making GET calls even when the primary region is temporarily unavailable.

During testing I specified two application preferred regions and blocked requests to the first one to simulate an outage. If I make the first request before the simulated outage, everything works correctly and my ReadItemAsync falls back to the secondary region. But if I simulate the outage before the first request, the request for pkranges (https://<your-cosmos-db-account>.documents.azure.com/dbs/<database-id>/colls/<collection-id>/pkranges) fails and never falls back to the secondary region, even though it is a read request.

Describe the solution you'd like
All Cosmos SDK warm-up requests that read from the database should be able to fall back to a secondary region.
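
To make the requested behavior concrete, here is a rough, language-neutral simulation in Python. It is not SDK code and does not reflect how the SDK is implemented. `read_pkranges`, `hedged_read`, `RegionUnavailable`, the region names, and the timings are all hypothetical; `threshold` and `threshold_step` only loosely mirror the hedging parameters of the AvailabilityStrategy API. The idea is that the metadata read goes to the first preferred region, and if no response arrives within the threshold (or the request fails), the same read is also sent to the next region and the first success wins.

```python
import asyncio


class RegionUnavailable(Exception):
    """Raised by the simulated transport when a region cannot serve the read."""


async def read_pkranges(region, blocked, outage_delay=1.0):
    # Hypothetical stand-in for the warm-up read of .../pkranges in one region.
    if region in blocked:
        # A blocked region hangs for a while and then fails.
        await asyncio.sleep(outage_delay)
        raise RegionUnavailable(region)
    await asyncio.sleep(0.01)
    return {"region": region, "ranges": ["0"]}


async def hedged_read(regions, send, threshold=0.05, threshold_step=0.02):
    """Try regions in preference order and return the first successful response."""
    remaining = list(regions)
    pending = set()
    errors = []
    wait = threshold
    try:
        while remaining or pending:
            if remaining:
                # Start (or hedge to) the next preferred region.
                pending.add(asyncio.create_task(send(remaining.pop(0))))
            # Once every region has been tried, wait without a deadline.
            timeout = wait if remaining else None
            done, pending = await asyncio.wait(
                pending, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
            )
            wait = threshold_step
            winners = [t for t in done if t.exception() is None]
            errors.extend(t.exception() for t in done if t.exception() is not None)
            if winners:
                return winners[0].result()
        raise RegionUnavailable(f"all regions failed: {errors}")
    finally:
        # Abandon any requests that are still in flight.
        for task in pending:
            task.cancel()


async def demo():
    regions = ["West Europe", "North Europe"]  # preferred order
    blocked = {"West Europe"}                  # simulated primary outage
    return await hedged_read(regions, lambda r: read_pkranges(r, blocked))


result = asyncio.run(demo())
print(result["region"])
```

In this simulation the outage starts before the very first request, which is the case that fails for me today, and the read is still served by the second preferred region.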

Describe alternatives you've considered
I do not think there are any alternatives.

@kirankumarkolli
Member

@adamnova thank you for reporting it.

It's a good one to follow-up on.

Trying to understand the impact you have seen: is this a live-site issue resulting in availability loss for your service, or did you find it during test validation?

@adamnova
Author

In this case I ran into it while evaluating the new AvailabilityStrategy API. But it also happens in production, and increasing startup resiliency is always a good idea to avoid scale-up issues.
