[Remote Store] Primary term validation with replicas - New approach POC #5033
Closed
Labels
distributed framework
enhancement
Enhancement or improvement to existing feature or request
Indexing & Search
Storage:Durability
Issues and PRs related to the durability framework
v2.5.0
Issues and PRs related to version v2.5.0
Is your feature request related to a problem? Please describe.
Primary term validation with replicas
In reference to #3706, with segment replication and a remote store for the translog, storing the translog on replicas becomes obsolete. Not just that - the in-sync replication call to the replicas that happens during a write call also becomes obsolete. As we know, the replication call serves 2 use cases - 1) replicating the data for durability and 2) primary term validation. While the 1st use case is taken care of by using the remote store for the translog, the 2nd use case still needs to be handled.
Challenges
Until now, we were planning to achieve the no-op by modifying the TransportBulk/WriteAction call and making it a no-op. While we would not store any data, one concern remained: these replication calls modify the replication tracker state on the replica shard. With the older approach, we would have needed a no-op replication tracker proxy and would have had to cut off all calls to replicas that update the replication tracker on the replicas. This is to make sure that we are not updating any state on the replica - be it the data (segment/translog) or the (global/local) checkpoints - during the replication call or the async calls. This is cumbersome to implement, leaves the code heavily intertwined, and scatters logic for when to update the checkpoints (replication tracker) vs when not to. The following things hinder or add to the complexity of the older approach -
- The `assertInvariant()` method inside `ReplicationTracker` expects a PRRL to be present for each shard that is tracked in the `CheckpointState` in the `ReplicationTracker`. For replicas, since we do not have translogs present locally, a PRRL is not required anymore. Today, if a shard is tracked, it is implied that a PRRL exists.
- `TransportReplicationAction` is a replicated call which is first performed on the primary and then fans out to the active replicas. These calls perform some activity on the primary and the same or a different activity on the replica. The most common ones are `TransportWriteAction` and `TransportBulkShardAction`, which are not required to be fanned out as we do not need the translogs written on replicas anymore. There are other `TransportReplicationAction` actions (that may or may not be required with remote store for translog); making each of them a selective no-op would leave the code full of if/else conditions and at the same time make it highly unreadable and unmaintainable over the long term.
- In the `ReplicationTracker` class, there is a method `invariant()` which is used to check certain invariants with respect to the replication tracker. This ensures certain expected behaviour for retention leases, global checkpoint, and the replication group for shards that are tracked. A simplified sketch of this check follows.
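To make the retention lease coupling concrete, here is a minimal sketch of the kind of invariant that gets in the way. The types and the lease-id format are simplified, hypothetical stand-ins for the real `ReplicationTracker` internals, not the actual OpenSearch code:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical, trimmed-down model of the ReplicationTracker invariant that
// ties every tracked shard copy to a peer recovery retention lease (PRRL).
class InvariantSketch {
    static class CheckpointState {
        final boolean tracked;
        CheckpointState(boolean tracked) { this.tracked = tracked; }
    }

    private final Map<String, CheckpointState> checkpoints; // allocation id -> state
    private final Set<String> retentionLeaseIds;            // ids of held leases

    InvariantSketch(Map<String, CheckpointState> checkpoints, Set<String> retentionLeaseIds) {
        this.checkpoints = checkpoints;
        this.retentionLeaseIds = retentionLeaseIds;
    }

    boolean invariant() {
        for (Map.Entry<String, CheckpointState> e : checkpoints.entrySet()) {
            if (e.getValue().tracked) {
                // The expectation that breaks with remote translog: replicas no
                // longer keep a local translog, so no PRRL exists for them.
                assert retentionLeaseIds.contains("peer_recovery/" + e.getKey())
                    : "no PRRL for tracked shard copy [" + e.getKey() + "]";
            }
        }
        return true;
    }
}
```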
Proposal

However, this can be handled with a separate call for primary term validation (let's call it the primary term validation call), alongside keeping tracked and inSync as false in the `CheckpointState` in the `ReplicationTracker`. Whenever the cluster manager publishes an updated cluster state, the primary term gets updated on the replicas. When primary term validation happens, the primary term supplied in the request is validated against it. On an incorrect primary term, the acknowledgement fails and the assuming-isolated primary can fail itself. This also makes the approach and the code change cleaner.
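As an illustration, a minimal sketch of the replica-side check is below. The class and method names are hypothetical; only the term comparison and the fail-the-ack behaviour follow from the proposal:

```java
// Hypothetical replica-side handler for the proposed primary term validation call.
class PrimaryTermValidator {
    // Latest primary term learned from the cluster state published by the cluster manager.
    private volatile long currentPrimaryTerm;

    void onClusterStateUpdate(long publishedPrimaryTerm) {
        currentPrimaryTerm = Math.max(currentPrimaryTerm, publishedPrimaryTerm);
    }

    /** Acknowledges only if the requesting primary's term is still current. */
    boolean validate(long requestPrimaryTerm) {
        // A lower term means a newer primary has been elected elsewhere: the ack
        // fails, and the (assuming isolated) caller can fail itself.
        return requestPrimaryTerm >= currentPrimaryTerm;
    }
}
```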
With the new approach, we can do the following -
- Keep `CheckpointState.tracked` and `CheckpointState.inSync` as false. - Can be Handled.
  - `tracked` = false makes the existing replication calls not happen.
  - `inSync` = false allows us to not tweak code in the `ReplicationTracker` class, and so we create another variable called `validatePrimaryTerm`. This can be used for knowing the replicas to which we can send the primary term validation call (see the sketch after this list).
- Segment replication currently relies on `shard.initiateTracking(..)` and `shard.markAllocationIdAsInSync(..)`. The request is fanned out to replicas by making a transport action which is a no-op on the primary and publishes the checkpoint on the replicas. If a shard is not tracked and not in sync, then segment replication would stop working. - Can be Handled.
- For failover, the candidate replicas would have `validatePrimaryTerm` = true. Currently, the cluster manager probably uses the inSync allocations concept which it fetches from the cluster state. - No code change required on the Master Side.
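A sketch of how `CheckpointState` could carry the new flag, assuming a plain boolean field and a helper for selecting fan-out targets (both hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative-only model of the proposed CheckpointState extension; the real
// class is part of ReplicationTracker and carries more fields than this.
class CheckpointStateSketch {
    boolean tracked;              // kept false: existing replication calls skip this copy
    boolean inSync;               // kept false: ReplicationTracker code stays untouched
    boolean validatePrimaryTerm;  // new flag: this copy gets the primary term validation call

    /** Picks the allocation ids that should receive the validation fan-out. */
    static List<String> validationTargets(Map<String, CheckpointStateSketch> checkpoints) {
        return checkpoints.entrySet().stream()
            .filter(e -> e.getValue().validatePrimaryTerm)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

Keeping `validatePrimaryTerm` separate from `inSync` means existing `ReplicationTracker` checks continue to see such replicas as out of sync, while the new validation path gets its own independent signal.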
The above approach requires further deep dive and a POC -
- When does `ShardRouting.state` get changed to `ShardRoutingState.STARTED` or `ShardRoutingState.RELOCATING`?
- `hasAllPeerRecoveryRetentionLeases` in the `ReplicationTracker` class.
- `updateFromClusterManager` - in this, if an allocation id does not exist in the inSyncAllocationIds or the initializingAllocationIds (derived from the RoutingTable), its tracking state is removed. Based on the primaryMode field value, the checkpoints are updated. This logic will be updated to account for primaryTermValidation (see the sketch below).
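For the `updateFromClusterManager` item, a rough sketch of where the pruning happens and where `primaryTermValidation` could plug in. The types are simplified and the flagging rule at the end is an assumption about the proposal, not settled design:

```java
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the ReplicationTracker#updateFromClusterManager flow.
class UpdateFromClusterManagerSketch {
    static class State {
        boolean inSync;
        boolean validatePrimaryTerm;
    }

    private final boolean primaryMode;
    private final Map<String, State> checkpoints; // allocation id -> state

    UpdateFromClusterManagerSketch(Map<String, State> checkpoints, boolean primaryMode) {
        this.checkpoints = checkpoints;
        this.primaryMode = primaryMode;
    }

    void updateFromClusterManager(Set<String> inSyncAllocationIds,
                                  Set<String> initializingAllocationIds) {
        // Existing behaviour: drop tracking state for allocation ids that the
        // cluster manager no longer reports in the RoutingTable-derived sets.
        checkpoints.keySet().removeIf(
            aid -> !inSyncAllocationIds.contains(aid) && !initializingAllocationIds.contains(aid));

        if (primaryMode) {
            // Proposed change (assumption): instead of marking replica copies
            // tracked/inSync, flag them for the primary term validation call.
            for (State s : checkpoints.values()) {
                if (!s.inSync) {
                    s.validatePrimaryTerm = true;
                }
            }
        }
    }
}
```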
Open Questions -

Describe the solution you'd like
Mentioned above.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.