CosmosClient Initialization: Adds implementation for opening Rntbd connections to backend replica nodes in Direct mode. #3508

Merged: 15 commits into master on Oct 25, 2022

Conversation

kundadebdatta
Member

Pull Request Template

Description

Background:

Currently, CosmosClient.CreateAndInitializeAsync() in the .NET SDK calls InitializeContainerAsync(), which issues a dummy query against all feed ranges to open connections, retrying to increase the chance of reaching as many replicas as possible. This approach has a few drawbacks:

  • It requires an unnecessary query plan request to the gateway.
  • It does not guarantee that all secondary replicas are touched.
  • It does not open a connection to the primary.
  • It is expensive in latency and RU.

The Solution:

Instead of issuing the dummy query during initialization, we can leverage the Rntbd context negotiation to complete connection establishment. This PR adds the necessary implementation to the v3 SDK codebase, on top of the Cosmos.Direct package v3.29.2 changes, which contain the methods that help establish the Rntbd connections to the backend replica nodes.

Design:

Please follow this link to learn more about the problem statement and the design of the solution.

Type of change

Please delete options that are not relevant.

  • [] New feature (non-breaking change which adds functionality)

Closing issues

To automatically close an issue: closes #3443

@github-actions github-actions bot left a comment

All good!

@kundadebdatta kundadebdatta force-pushed the users/kundadebdatta/3442_client_initialize_using_rntbd branch from 77fe491 to 58bf46c Compare October 19, 2022 08:09
@kundadebdatta kundadebdatta changed the title [Cosmos Service Upgrade Resiliency] - Adds implementation for opening Rntbd connection to backend replicas [Internal] Cosmos Service Upgrade Resiliency: Adds implementation for opening Rntbd connection to backend replicas Oct 19, 2022
ealsur
ealsur previously approved these changes Oct 20, 2022
Member

@ealsur ealsur left a comment

LGTM, nice work!

ealsur
ealsur previously approved these changes Oct 20, 2022
@kundadebdatta kundadebdatta force-pushed the users/kundadebdatta/3442_client_initialize_using_rntbd branch from 90923fd to b2331fa Compare October 24, 2022 17:08
@kundadebdatta kundadebdatta force-pushed the users/kundadebdatta/3442_client_initialize_using_rntbd branch from b2331fa to 91e4ffc Compare October 24, 2022 17:12
Member

@ealsur ealsur left a comment

Left some comments around the test cases, great progress!

@kundadebdatta kundadebdatta changed the title [Internal] Cosmos Service Upgrade Resiliency: Adds implementation for opening Rntbd connection to backend replicas CosmosClient Initialization: Adds implementation for opening Rntbd connections to backend replica nodes in Direct mode. Oct 25, 2022
Member

@xinlian12 xinlian12 left a comment

LGTM, thanks :)

@ealsur ealsur merged commit c0ac87a into master Oct 25, 2022
@ealsur ealsur deleted the users/kundadebdatta/3442_client_initialize_using_rntbd branch October 25, 2022 20:40
throw new ArgumentNullException(resourceName);
}

if (this.StoreModel != null)
Member

When is it expected?
If so, isn't it better to fail than silently eating it?

Member

I don't think there is a scenario where StoreModel is null. We have checks like this elsewhere in the same file, but they can potentially be removed without issues.

string containerLinkUri,
CancellationToken cancellationToken)
{
if (string.IsNullOrEmpty(databaseName) ||
Member

nit: Personally, I prefer one if-statement per argument for readability.
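
A minimal sketch of the style this nit suggests, assuming the two string arguments visible in the excerpt above. The `ArgumentChecks` class and `Validate` method are hypothetical names used only for illustration; this is not the SDK's actual code.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the Cosmos SDK.
public static class ArgumentChecks
{
    // One guard clause per argument, so the thrown exception names
    // exactly the argument that was null or empty.
    public static void Validate(string databaseName, string containerLinkUri)
    {
        if (string.IsNullOrEmpty(databaseName))
        {
            throw new ArgumentNullException(nameof(databaseName));
        }

        if (string.IsNullOrEmpty(containerLinkUri))
        {
            throw new ArgumentNullException(nameof(containerLinkUri));
        }
    }
}
```

With separate checks, `ParamName` on the exception identifies the offending argument, which a single combined `||` condition cannot do without extra logic.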

@@ -482,5 +482,10 @@ private Uri GetFeedUri(DocumentServiceRequest request)
{
return new Uri(this.endpointManager.ResolveServiceEndpoint(request), PathsHelper.GeneratePath(request.ResourceType, request, true));
}

public Task OpenConnectionsToAllReplicasAsync(string databaseName, string containerLinkUri, CancellationToken cancellationToken = default)
Member

Argument per line

}
}

- foreach (DocumentServiceResponse response in await Task.WhenAll(tasks))
+ foreach (TryCatch<DocumentServiceResponse> task in await Task.WhenAll(tasks))
Member

Is the Monoid type used just to capture the exception?

One alternative is a simple try-catch, using the Task to get the exception.

Member

Monad is simpler/easier to work with afterwards; you cannot use try/catch inside the task with Task.WhenAll.

Member Author

Additionally, we needed an approach where a task that throws an exception does not stop execution or fail the other pending tasks. We could achieve the same with Task.ContinueWith(), but that is not the preferred way in .NET: per the .NET best practices and recommendations, awaiting a task is the preferred design choice over .ContinueWith(). Therefore, we decided to leverage the TryCatch framework.
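
The idea discussed in this thread can be sketched as follows. This is a simplified illustration under stated assumptions, not the SDK's internal `TryCatch<T>`: the `Outcome<T>` and `OutcomeRunner` names are hypothetical. Each operation is awaited inside its own wrapper that turns an exception into a value, so awaiting `Task.WhenAll` never rethrows and every result stays available. (Awaiting `Task.WhenAll` directly on tasks that fault would rethrow the first exception instead of returning the results.)

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for a TryCatch-style result type; illustration only.
public sealed class Outcome<T>
{
    private Outcome(T value, Exception error)
    {
        this.Value = value;
        this.Error = error;
    }

    public T Value { get; }
    public Exception Error { get; }
    public bool Succeeded => this.Error == null;

    public static Outcome<T> FromResult(T value) => new Outcome<T>(value, null);
    public static Outcome<T> FromException(Exception error) => new Outcome<T>(default(T), error);
}

public static class OutcomeRunner
{
    // Awaits one operation and converts a thrown exception into a value,
    // so the returned task itself never faults.
    public static async Task<Outcome<T>> CaptureAsync<T>(Func<Task<T>> operation)
    {
        try
        {
            return Outcome<T>.FromResult(await operation());
        }
        catch (Exception ex)
        {
            return Outcome<T>.FromException(ex);
        }
    }

    // Starts every operation; a failure in one does not hide the results of the others.
    public static Task<Outcome<T>[]> RunAllAsync<T>(IEnumerable<Func<Task<T>>> operations)
    {
        return Task.WhenAll(operations.Select(op => CaptureAsync(op)));
    }
}
```

A caller can then loop over the returned array and, for each entry, either use `Value` or log `Error`, which mirrors the `foreach` over `TryCatch<DocumentServiceResponse>` shown in the diff excerpt.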

System.Diagnostics.Trace.CorrelationManager.ActivityId);
try
{
await openConnectionHandlerAsync(
Member

As per the contract, the caller is deciding to serialize creation.
Isn't it better to let the transport decide?

Member

There is only 1 transport that would call this (Direct) though?

CancellationToken cancellationToken)
{
await this.DocumentClient.EnsureValidClientAsync(NoOpTrace.Singleton);
await this.DocumentClient.OpenConnectionsToAllReplicasAsync(
Member

this.DocumentClient.InitializeCachesAsync will take care of populating both the partitionKeyRange and address caches.

One option is to leverage it and start creating connections after they are populated. It's a serial execution, but that might suffice for current scenarios. Thoughts?

/// <param name="containerLinkUri">A string containing the container's link uri.</param>
/// <param name="openConnectionHandlerAsync">The transport client callback delegate to be invoked at a later point of time.</param>
/// <param name="cancellationToken">An Instance of the <see cref="CancellationToken"/>.</param>
public async Task OpenConnectionsToAllReplicasAsync(
Member

Personally, I would prefer not to overload the caches with unnecessary context.
Caches are critical and already very complex. This pushes unnecessary complexity into them.

Member

I believe this aligns / is consistent with the approach Java took though

@kundadebdatta kundadebdatta self-assigned this Jan 10, 2023
@kundadebdatta kundadebdatta added the Engineering, improvement, and Upgrade Resiliency labels Jan 10, 2023
Labels
Engineering (engineering improvements: CI, tests, etc.), improvement (change to existing functional behavior: perf, logging, etc.), Upgrade Resiliency
Development

Successfully merging this pull request may close these issues.

[Cosmos Service Upgrade - .NET] Use RNTBD Connection Model in Cosmos V3 for connection initialization.
4 participants