[RFC] Cluster State Publication using Remote State #13257
Labels
Cluster Manager
ClusterManager:RemoteState
enhancement
Enhancement or improvement to existing feature or request
Roadmap:Stability/Availability/Resiliency
Project-wide roadmap label
Background
As part of the initial changes in remote cluster state, we are uploading the cluster metadata to remote store during every cluster state publication. This remote cluster metadata gets utilized by the cluster manager nodes during recovery scenario. But there is no change to way the cluster state is published to other nodes in the cluster. The cluster manager still sends the full or diff cluster state to data nodes and other cluster manager nodes over transport layer. This increases the load on the cluster manager mainly since the cluster manager has to send cluster state objects to all the nodes over transport layer for every cluster state update. This becomes one of the bottlenecks in increasing the size of the cluster in terms number of indices, number of shards and number of nodes as any change in these parameters tends to increase the size of the cluster state.
#11744
Proposal
Instead of sending the cluster state object over the transport layer, the cluster manager node can just send the term and version of the cluster state over the transport layer which can then be used to download the corresponding cluster state by all the nodes directly from the remote store.
Related component
Cluster Manager
Additional context
Cluster State Persistence vs Publication
Before we get into the mechanism to publish the cluster state using remote store, we should understand the some differences between the features:
Diff Computation of Cluster State in Existing Flow
The diff is computed in a different manner for each of the objects.
Summary
The computation of diff is not very optimized.
For index metadata, we compute the index metadata for the indices which have changed. This diff of index metadata will always contain full settings. It will contain full mappings even when there a single field updated in the mapping. This computation of diff is very similar to the incremental metadata that we upload currently where we upload the entire index metadata whenever there is a small change in it.
For routing table, even if we upload the diff, the code will be updating for the full routing table for each of the indices which have changed. So this would be the same as the existing approach which is designed for routing table - split the routing table by indices and update them individually as and when they change.
So we know that we have to anyway upload all the incrementally changed objects for every cluster state update. If we upload the diff as computed currently which is not optimized fully, we will be uploading a lot of redundant data. So we can just go with the approach uploading an additional object which would help in identifying which components in the cluster state have changed. This object would then be used by the follower nodes to download the incremental changes and apply to the cluster state.
Approach
Based on the above analysis, we need to achieve the below tasks for the remote state publication to work.
Upload Ephemeral Objects to Remote Store
Currently we only publish the Metadata object in the cluster state to remote store. This is sufficient for the durability/persistence use case as the rest of the objects can be recreated by the cluster manager node. But for cluster state publication all the other objects as well. So we would need to publish all the objects to remote store. Listing the extra objects to be published:
Upload a Diff Object
In order to to publish the cluster state, we will first publish a diff file to the remote store. Based on the analysis for diff calculation above it is not optimal to publish a single diff blob to the remote store since in the cases where multiple indices have been changed, the diff blob would contain the entire metadata of all these indices which will make the diff file large. Instead the diff file would just contain the the information about which of the objects that have changed. Based on this information, the file locations for these objects can be fetched from manifest and downloaded. The diff file would look like below:
This diff object can be part of the ClusterMetadataManifest itself as it should not be very large.
Cluster Manager Node Publishes the term and version
Current state: In order to publish the cluster state the cluster manager node sends the diff state to all the nodes over the transport layer. If a follower node cannot apply the diff it responds with an exception and then the cluster manager sends the full state to the node.
When a node is trying to join the cluster, the cluster manager sends the cluster state in a validate join request to the joining node.
Proposed: The cluster manager node will send only the term and version of the cluster state. The follower node will download the manifest file for the term and version. From the manifest file, the node will get the diff and determine if the diff can be applied. If the diff cannot be applied, it will download all the files for the new cluster state. If the diff can be applied, it will find the objects which have changed and then download the files for all the objects and apply on the old cluster state.
In the validate join request as well, the term and version will only be sent. The joining node would download the full state corresponding term and version and perform the validations. The OpenSearch version of the joining node will be checked to ensure we have published the cluster state for this version.
New transport request will be created for the above changes.
Considerations
Version Specific Serialization of Cluster State
When the cluster manager is on a certain OpenSearch version, there can be some follower nodes which are on a different version. So the cluster state is sent over the tranport layer by serializing the cluster state according to the target node’s version. So the serialization should is done using writeTo method since this method is version aware.
The cluster manager node would have a visibility into the different versions of nodes present in the cluster. It should publish the cluster state for all the versions of nodes present in the cluster. Ideally this scenario should only happen during version upgrade scenario, so there should be a maximum of 2 different versions present at a time in the cluster.
So we need analyze and decide how should the serialization be done for remote state publication
The text was updated successfully, but these errors were encountered: