
Retain history for peer recovery using leases #41536

Closed
9 of 10 tasks
DaveCTurner opened this issue Apr 25, 2019 · 2 comments
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement Meta v7.4.0

Comments

@DaveCTurner
Contributor

DaveCTurner commented Apr 25, 2019

The goal is that we can perform an operations-based recovery for every "reasonable" shard copy C (a sketch of this invariant follows the list below):

  • There is a peer-recovery retention lease L corresponding with C.
  • Every in-sync shard copy has a complete history of operations above the retained seqno r of L.
  • The retained seqno r of L is no greater than the local checkpoint of the last safe commit (LCPoSC) of C.
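A minimal sketch of this invariant in Java (the types and names here are invented for illustration, not Elasticsearch's real API): every reasonable copy C must be covered by a lease whose retained seqno does not exceed C's LCPoSC, so an ops-based recovery of C can replay everything it is missing.

```java
import java.util.Map;

// Hypothetical types for illustration only; not the real implementation.
record ShardCopy(String allocationId, long localCheckpointOfSafeCommit) {}
record RetentionLease(String leaseId, long retainingSequenceNumber) {}

final class PeerRecoveryInvariant {
    /** True iff an operations-based recovery of {@code copy} is guaranteed to be possible. */
    static boolean holdsFor(ShardCopy copy, Map<String, RetentionLease> peerRecoveryLeasesByCopy) {
        RetentionLease lease = peerRecoveryLeasesByCopy.get(copy.allocationId());
        return lease != null
            && lease.retainingSequenceNumber() <= copy.localCheckpointOfSafeCommit();
    }
}
```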

Reasonable shard copies comprise all the copies that are currently being tracked, as well as all the copies that "might be a recovery target": if the shard is not fully allocated then any copy that has been tracked in the last index.soft_deletes.retention_lease.period (i.e. 12h) might reasonably be a recovery target.

We also require that history is eventually released: in a stable cluster, for every operation with seqno s below the maximum sequence number (MSN) of a replication group, eventually there are no leases that retain s (a sketch of the maintenance that achieves this follows the list below):

  • Every active shard copy eventually advances its LCPoSC past s.
  • Every lease for an active shard copy eventually also passes s.
  • Every inactive shard copy eventually either becomes active or else its lease expires.
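A hedged sketch of the maintenance that makes this liveness property hold, using invented names rather than the real machinery: leases for active (tracked) copies are renewed to follow their LCPoSC, while leases for inactive copies expire immediately once the shard is fully allocated, and otherwise once they exceed the retention period.

```java
import java.util.Iterator;
import java.util.Map;

final class LeaseMaintenance {
    // Hypothetical illustration of the two liveness rules above; not the real implementation.
    static void maintain(Map<String, Long> lcpOfSafeCommitByActiveCopy,   // active copy -> LCPoSC
                         Map<String, Long> leaseTimestampByInactiveCopy,  // inactive copy -> lease creation time
                         Map<String, Long> retainedSeqNoByLease,          // copy -> retained seqno of its lease
                         long nowMillis, long retentionPeriodMillis, boolean shardFullyAllocated) {
        // Renewal: a lease for an active copy eventually passes any fixed seqno s,
        // because that copy's LCPoSC keeps advancing past s.
        lcpOfSafeCommitByActiveCopy.forEach((copyId, lcp) ->
            retainedSeqNoByLease.merge(copyId, lcp, Math::max));

        // Expiry: a lease for an inactive copy expires immediately if the shard is fully
        // allocated, and otherwise once it is older than the retention period (12h by default).
        Iterator<Map.Entry<String, Long>> it = leaseTimestampByInactiveCopy.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            boolean expired = shardFullyAllocated
                || nowMillis - entry.getValue() > retentionPeriodMillis;
            if (expired) {
                retainedSeqNoByLease.remove(entry.getKey());
                it.remove();
            }
        }
    }
}
```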

Concretely, this should ensure that operations-based recoveries are possible in the following cases (subject to the copy being allocated back to the same node):

  • a shard copy C is offline for a short period (<12h)
    • even if the primary is relocated or a replica is promoted to primary while C is offline.
    • even if C was part of a closed/frozen/readonly index that was opened while C was offline
      • but not if the index was closed/frozen again before C comes back
      • TBD: maybe we are ok with this being a file-based recovery?
  • a full-cluster restart

This breaks into a few conceptually-separate pieces:

Follow-up work, out of scope for the feature branches:

  • Adjust translog retention

    • Should we retain translog generations according to retention leases too?
    • Trim translog files eagerly during the "verify-before-close" step for closed/frozen indices (Trim translog for closed indices #43156)
    • Properly support peer-recovery retention leases on indices that do not use soft deletes.
  • Make the ReplicaShardAllocator sensitive to leases, so that it prefers a location for each replica that needs only an ops-based recovery. (relates Replica allocation consider no-op #42518)

  • Seqno-based synced flush: if a copy has LCP == MSN then it needs no recovery; see the sketch after this list. (relates Replica allocation consider no-op #42518)
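A hypothetical sketch of how these two follow-ups fit together (the names and the allocator interface below are invented, not the ReplicaShardAllocator's real API): a candidate node whose copy has LCP == MSN needs no recovery at all, a node covered by a peer-recovery retention lease needs only an ops-based recovery, and anything else needs a file-based recovery, so the allocator should prefer candidates in that order.

```java
// Hypothetical ranking of replica allocation candidates; for illustration only.
enum RecoveryCost { NONE, OPS_BASED, FILE_BASED }

final class ReplicaCandidateRanking {
    static RecoveryCost costOf(long localCheckpoint, long maxSeqNo, boolean coveredByPeerRecoveryLease) {
        if (localCheckpoint == maxSeqNo) {
            return RecoveryCost.NONE;          // seqno-based synced flush: nothing to replay
        }
        return coveredByPeerRecoveryLease ? RecoveryCost.OPS_BASED : RecoveryCost.FILE_BASED;
    }
}
```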


BWC issues: during a rolling upgrade we may migrate a primary onto a new node without first establishing the appropriate leases. They cannot be established before or during this promotion, so we must weaken the assertions so that they apply only to indices created on sufficiently recent versions. We will still establish leases properly during peer recovery, and we can establish them lazily on older indices, but they may not retain all the right history when first created.

Closed replicated indices issues: a closed index permits no replicated actions, and it should not need any history to be retained. We cannot replay history into a closed index, so all recoveries must be file-based and there is no real need for leases; moreover any existing PRRLs will not be retaining any history. We cannot assert that every copy of a replicated closed index has a corresponding lease without performing replicated write actions to create such leases as we create new replicas, nor can we assert that there are no leases on a replicated closed index, since that too would require replicated write actions. We therefore elect to ignore PRRLs on closed indices: they might exist or they might not, and either way is fine.

@DaveCTurner DaveCTurner added >enhancement :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. Meta 7x labels Apr 25, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 13, 2019
This creates a peer-recovery retention lease for every shard during recovery,
ensuring that the replication group retains history for future peer recoveries.
It also ensures that leases for active shard copies do not expire, and leases
for inactive shard copies expire immediately if the shard is fully-allocated.

Relates elastic#41536
DaveCTurner added a commit that referenced this issue Jun 19, 2019
This creates a peer-recovery retention lease for every shard during recovery,
ensuring that the replication group retains history for future peer recoveries.
It also ensures that leases for active shard copies do not expire, and leases
for inactive shard copies expire immediately if the shard is fully-allocated.

Relates #41536
DaveCTurner added a commit that referenced this issue Jun 19, 2019
This creates a peer-recovery retention lease for every shard during recovery,
ensuring that the replication group retains history for future peer recoveries.
It also ensures that leases for active shard copies do not expire, and leases
for inactive shard copies expire immediately if the shard is fully-allocated.

Relates #41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 28, 2019
This commit adjusts the behaviour of the retention lease sync to first renew
any peer-recovery retention leases where either:

- the corresponding shard's global checkpoint has advanced, or

- the lease is older than half of its expiry time

Relates elastic#41536
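A hedged sketch of the renewal rule in the commit message above, using hypothetical names and an assumed interpretation of "advanced" (advanced past the lease's currently retained seqno):

```java
final class LeaseRenewalPolicy {
    // Illustration of the two renewal triggers named above; not the real implementation.
    static boolean shouldRenew(long globalCheckpoint, long retainedSeqNo,
                               long nowMillis, long lastRenewalMillis, long expiryMillis) {
        boolean checkpointAdvanced = globalCheckpoint > retainedSeqNo;
        boolean olderThanHalfExpiry = nowMillis - lastRenewalMillis > expiryMillis / 2;
        return checkpointAdvanced || olderThanHalfExpiry;
    }
}
```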
DaveCTurner added a commit that referenced this issue Jul 1, 2019
This commit adjusts the behaviour of the retention lease sync to first renew
any peer-recovery retention leases where either:

- the corresponding shard's global checkpoint has advanced, or

- the lease is older than half of its expiry time

Relates #41536
DaveCTurner added a commit that referenced this issue Jul 1, 2019
This commit adjusts the behaviour of the retention lease sync to first renew
any peer-recovery retention leases where either:

- the corresponding shard's global checkpoint has advanced, or

- the lease is older than half of its expiry time

Relates #41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jul 3, 2019
Today when a PRRL is created during peer recovery it retains history from the
sequence number provided by the recovering peer. However this sequence number
may be greater than the primary's knowledge of that peer's persisted global
checkpoint. Subsequent renewals of this lease will attempt to set the retained
sequence number back to the primary's knowledge of that peer's persisted global
checkpoint, tripping an assertion that retention leases must only advance. This
commit accounts for this.

Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14)

Relates elastic#41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jul 3, 2019
If the primary performs a file-based recovery to a node that has (or recently
had) a copy of the shard then it is possible that the persisted global
checkpoint of the new copy is behind that of the old copy since file-based
recoveries are somewhat destructive operations.

Today we leave that node's PRRL in place during the recovery with the
expectation that it can be used by the new copy. However this isn't the case if
the new copy needs more history to be retained, because retention leases may
only advance and never retreat.

This commit addresses this by removing any existing PRRL during a file-based
recovery: since we are performing a file-based recovery we have already
determined that there isn't enough history for an ops-based recovery, so there
is little point in keeping the old lease in place.

Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14)

Relates elastic#41536
DaveCTurner added a commit that referenced this issue Jul 4, 2019
If the primary performs a file-based recovery to a node that has (or recently
had) a copy of the shard then it is possible that the persisted global
checkpoint of the new copy is behind that of the old copy since file-based
recoveries are somewhat destructive operations.

Today we leave that node's PRRL in place during the recovery with the
expectation that it can be used by the new copy. However this isn't the case if
the new copy needs more history to be retained, because retention leases may
only advance and never retreat.

This commit addresses this by removing any existing PRRL during a file-based
recovery: since we are performing a file-based recovery we have already
determined that there isn't enough history for an ops-based recovery, so there
is little point in keeping the old lease in place.

Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14)

Relates #41536
DaveCTurner added a commit that referenced this issue Jul 4, 2019
If the primary performs a file-based recovery to a node that has (or recently
had) a copy of the shard then it is possible that the persisted global
checkpoint of the new copy is behind that of the old copy since file-based
recoveries are somewhat destructive operations.

Today we leave that node's PRRL in place during the recovery with the
expectation that it can be used by the new copy. However this isn't the case if
the new copy needs more history to be retained, because retention leases may
only advance and never retreat.

This commit addresses this by removing any existing PRRL during a file-based
recovery: since we are performing a file-based recovery we have already
determined that there isn't enough history for an ops-based recovery, so there
is little point in keeping the old lease in place.

Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14)

Relates #41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jul 5, 2019
Today peer recovery retention leases (PRRLs) are created when starting a
replication group from scratch and during peer recovery. However, if the
replication group was migrated from nodes running a version which does not
create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was
relocated or promoted without first establishing all the expected leases.

It's not possible to establish these leases before or during primary
activation, so we must create them as soon as possible afterwards. This gives
weaker guarantees about history retention, since there's a possibility that
history will be discarded before it can be used. In practice such situations
are expected to occur only rarely.

This commit adds the machinery to create missing leases after primary
activation, and strengthens the assertions about the existence of such leases
in order to ensure that once all the leases do exist we never again enter a
state where there's a missing lease.

Relates elastic#41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jul 5, 2019
Today when renewing PRRLs we assert that any invalid "backwards" renewals must
be because we are recovering the shard. In fact it's also possible to have
`checkpointState.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO` on a
tracked shard copy if the primary was just promoted and hasn't received
checkpoints from all of its peers too.

This commit weakens the assertion to match.

Caught by a [failure of the full cluster restart
tests](https://scans.gradle.com/s/5lllzgqtuegty/console-log#L8605)

Relates elastic#41536
DaveCTurner added a commit that referenced this issue Jul 5, 2019
Today when renewing PRRLs we assert that any invalid "backwards" renewals must
be because we are recovering the shard. In fact it's also possible to have
`checkpointState.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO` on a
tracked shard copy if the primary was just promoted and hasn't received
checkpoints from all of its peers too.

This commit weakens the assertion to match.

Caught by a [failure of the full cluster restart
tests](https://scans.gradle.com/s/5lllzgqtuegty/console-log#L8605)

Relates #41536
DaveCTurner added a commit that referenced this issue Jul 5, 2019
Today when renewing PRRLs we assert that any invalid "backwards" renewals must
be because we are recovering the shard. In fact it's also possible to have
`checkpointState.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO` on a
tracked shard copy if the primary was just promoted and hasn't received
checkpoints from all of its peers too.

This commit weakens the assertion to match.

Caught by a [failure of the full cluster restart
tests](https://scans.gradle.com/s/5lllzgqtuegty/console-log#L8605)

Relates #41536
DaveCTurner added a commit that referenced this issue Jul 8, 2019
Today peer recovery retention leases (PRRLs) are created when starting a
replication group from scratch and during peer recovery. However, if the
replication group was migrated from nodes running a version which does not
create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was
relocated or promoted without first establishing all the expected leases.

It's not possible to establish these leases before or during primary
activation, so we must create them as soon as possible afterwards. This gives
weaker guarantees about history retention, since there's a possibility that
history will be discarded before it can be used. In practice such situations
are expected to occur only rarely.

This commit adds the machinery to create missing leases after primary
activation, and strengthens the assertions about the existence of such leases
in order to ensure that once all the leases do exist we never again enter a
state where there's a missing lease.

Relates #41536
DaveCTurner added a commit that referenced this issue Jul 8, 2019
Today peer recovery retention leases (PRRLs) are created when starting a
replication group from scratch and during peer recovery. However, if the
replication group was migrated from nodes running a version which does not
create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was
relocated or promoted without first establishing all the expected leases.

It's not possible to establish these leases before or during primary
activation, so we must create them as soon as possible afterwards. This gives
weaker guarantees about history retention, since there's a possibility that
history will be discarded before it can be used. In practice such situations
are expected to occur only rarely.

This commit adds the machinery to create missing leases after primary
activation, and strengthens the assertions about the existence of such leases
in order to ensure that once all the leases do exist we never again enter a
state where there's a missing lease.

Relates #41536
@DaveCTurner DaveCTurner added v7.4.0 and removed 7x labels Jul 25, 2019
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jul 25, 2019
Thanks to peer recovery retention leases we now retain the history needed to
perform peer recoveries from the index instead of from the translog. This
commit adjusts the peer recovery process to do so, and also adjusts it to use
the existence of a retention lease to decide whether or not to attempt an
operations-based recovery.

Reverts elastic#38904 and elastic#42211
Relates elastic#41536
DaveCTurner added a commit that referenced this issue Aug 1, 2019
Thanks to peer recovery retention leases we now retain the history needed to
perform peer recoveries from the index instead of from the translog. This
commit adjusts the peer recovery process to do so, and also adjusts it to use
the existence of a retention lease to decide whether or not to attempt an
operations-based recovery.

Reverts #38904 and #42211
Relates #41536
DaveCTurner added a commit that referenced this issue Aug 1, 2019
Thanks to peer recovery retention leases we now retain the history needed to
perform peer recoveries from the index instead of from the translog. This
commit adjusts the peer recovery process to do so, and also adjusts it to use
the existence of a retention lease to decide whether or not to attempt an
operations-based recovery.

Reverts #38904 and #42211
Relates #41536
DaveCTurner added a commit that referenced this issue Aug 2, 2019
Today we recover a replica by copying operations from the primary's translog.
However we also retain some historical operations in the index itself, as long
as soft-deletes are enabled. This commit adjusts peer recovery to use the
operations in the index for recovery rather than those in the translog, and
ensures that the replication group retains enough history for use in peer
recovery by means of retention leases.

Reverts #38904 and #42211
Relates #41536
Backport of #45136 to 7.x.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Aug 5, 2019
Today if a shard is not fully allocated we maintain a retention lease for a
lost peer for up to 12 hours, retaining all operations that occur in that time
period so that we can recover this replica using an operations-based recovery
if it returns. However it is not always reasonable to perform an
operations-based recovery on such a replica: if the replica is a very long way
behind the rest of the replication group then it can be much quicker to perform
a file-based recovery instead.

This commit introduces a notion of "reasonable" recoveries. If an
operations-based recovery would involve copying only a small number of
operations, but the index is large, then an operations-based recovery is
reasonable; on the other hand if there are many operations to copy across and
the index itself is relatively small then it makes more sense to perform a
file-based recovery. We measure the size of the index by computing its number
of documents (including deleted documents) in all segments belonging to the
current safe commit, and compare this to the number of operations a lease is
retaining below the local checkpoint of the safe commit. We consider an
operations-based recovery to be reasonable iff it would involve replaying at
most 10% of the documents in the index.

The mechanism for this feature is to expire peer-recovery retention leases
early if they are retaining so much history that an operations-based recovery
using that lease would be unreasonable.

Relates elastic#41536
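A hedged sketch of the "reasonable recovery" criterion described in the commit message above (names are hypothetical; only the 10% rule itself is taken from the description):

```java
final class ReasonableRecovery {
    // An ops-based recovery is "reasonable" iff the number of operations the lease retains
    // below the safe commit's local checkpoint is at most 10% of the documents
    // (including deleted documents) in the segments of the current safe commit.
    static boolean isReasonable(long opsRetainedBelowSafeCommitLocalCheckpoint,
                                long docsInSafeCommitIncludingDeletes) {
        return opsRetainedBelowSafeCommitLocalCheckpoint * 10 <= docsInSafeCommitIncludingDeletes;
    }
}
```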
original-brownbear pushed a commit to original-brownbear/elasticsearch that referenced this issue Aug 8, 2019
Today if a shard is not fully allocated we maintain a retention lease for a
lost peer for up to 12 hours, retaining all operations that occur in that time
period so that we can recover this replica using an operations-based recovery
if it returns. However it is not always reasonable to perform an
operations-based recovery on such a replica: if the replica is a very long way
behind the rest of the replication group then it can be much quicker to perform
a file-based recovery instead.

This commit introduces a notion of "reasonable" recoveries. If an
operations-based recovery would involve copying only a small number of
operations, but the index is large, then an operations-based recovery is
reasonable; on the other hand if there are many operations to copy across and
the index itself is relatively small then it makes more sense to perform a
file-based recovery. We measure the size of the index by computing its number
of documents (including deleted documents) in all segments belonging to the
current safe commit, and compare this to the number of operations a lease is
retaining below the local checkpoint of the safe commit. We consider an
operations-based recovery to be reasonable iff it would involve replaying at
most 10% of the documents in the index.

The mechanism for this feature is to expire peer-recovery retention leases
early if they are retaining so much history that an operations-based recovery
using that lease would be unreasonable.

Relates elastic#41536
original-brownbear pushed a commit that referenced this issue Aug 8, 2019
Today if a shard is not fully allocated we maintain a retention lease for a
lost peer for up to 12 hours, retaining all operations that occur in that time
period so that we can recover this replica using an operations-based recovery
if it returns. However it is not always reasonable to perform an
operations-based recovery on such a replica: if the replica is a very long way
behind the rest of the replication group then it can be much quicker to perform
a file-based recovery instead.

This commit introduces a notion of "reasonable" recoveries. If an
operations-based recovery would involve copying only a small number of
operations, but the index is large, then an operations-based recovery is
reasonable; on the other hand if there are many operations to copy across and
the index itself is relatively small then it makes more sense to perform a
file-based recovery. We measure the size of the index by computing its number
of documents (including deleted documents) in all segments belonging to the
current safe commit, and compare this to the number of operations a lease is
retaining below the local checkpoint of the safe commit. We consider an
operations-based recovery to be reasonable iff it would involve replaying at
most 10% of the documents in the index.

The mechanism for this feature is to expire peer-recovery retention leases
early if they are retaining so much history that an operations-based recovery
using that lease would be unreasonable.

Relates #41536
original-brownbear added a commit that referenced this issue Aug 8, 2019
Today if a shard is not fully allocated we maintain a retention lease for a
lost peer for up to 12 hours, retaining all operations that occur in that time
period so that we can recover this replica using an operations-based recovery
if it returns. However it is not always reasonable to perform an
operations-based recovery on such a replica: if the replica is a very long way
behind the rest of the replication group then it can be much quicker to perform
a file-based recovery instead.

This commit introduces a notion of "reasonable" recoveries. If an
operations-based recovery would involve copying only a small number of
operations, but the index is large, then an operations-based recovery is
reasonable; on the other hand if there are many operations to copy across and
the index itself is relatively small then it makes more sense to perform a
file-based recovery. We measure the size of the index by computing its number
of documents (including deleted documents) in all segments belonging to the
current safe commit, and compare this to the number of operations a lease is
retaining below the local checkpoint of the safe commit. We consider an
operations-based recovery to be reasonable iff it would involve replaying at
most 10% of the documents in the index.

The mechanism for this feature is to expire peer-recovery retention leases
early if they are retaining so much history that an operations-based recovery
using that lease would be unreasonable.

Relates #41536
@colings86 colings86 added v7.5.0 and removed v7.4.0 labels Aug 30, 2019
@DaveCTurner
Contributor Author

The only remaining item here is to update the ReplicaShardAllocator to be sensitive to sequence numbers (i.e. leases), which is tracked in #42518, so I am closing this.
