
Use index for peer recovery instead of translog #45137

Merged
merged 38 commits into elastic:7.x from DaveCTurner:2019-08-02-merge-prrls-7.x on Aug 2, 2019

Conversation

DaveCTurner
Contributor

Today we recover a replica by copying operations from the primary's translog.
However, we also retain some historical operations in the index itself, as long
as soft-deletes are enabled. This commit adjusts peer recovery to use the
operations in the index for recovery rather than those in the translog, and
ensures that the replication group retains enough history for use in peer
recovery by means of retention leases.

Reverts #38904 and #42211
Relates #41536
Backport of #45136 to 7.x.
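
A peer-recovery retention lease is essentially a named marker for the lowest sequence number that must stay retrievable in the index for one shard copy. A minimal sketch of that idea, using illustrative class and method names rather than the actual Elasticsearch API:

```java
import java.util.Collection;

// Illustrative model only: a peer-recovery retention lease records the lowest
// sequence number that must remain retrievable for one shard copy.
record RetentionLease(String id, long retainingSequenceNumber) {}

final class HistoryRetention {
    // Soft-deleted operations must be kept for every sequence number at or above
    // the minimum retained by any lease, so that any copy holding a lease can
    // later be recovered with an operations-based peer recovery.
    static long minRetainedSeqNo(Collection<RetentionLease> leases, long localCheckpointOfSafeCommit) {
        long minRetained = localCheckpointOfSafeCommit + 1;
        for (RetentionLease lease : leases) {
            minRetained = Math.min(minRetained, lease.retainingSequenceNumber());
        }
        return minRetained;
    }
}
```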

DaveCTurner and others added 30 commits June 19, 2019 17:41
This creates a peer-recovery retention lease for every shard during recovery,
ensuring that the replication group retains history for future peer recoveries.
It also ensures that leases for active shard copies do not expire, and leases
for inactive shard copies expire immediately if the shard is fully-allocated.

Relates elastic#41536
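
A rough sketch of the expiry rule described above, using a simplified model of the replication group (all names here are illustrative, not the real API):

```java
import java.util.Map;

// Illustrative expiry rule: leases for tracked (active) copies never expire;
// leases for untracked copies expire immediately, but only once the shard is
// fully allocated, so history is not discarded while a missing copy could
// still return and be recovered from it.
final class PeerRecoveryLeaseExpiry {
    static boolean isExpired(String leaseAllocationId,
                             Map<String, Boolean> trackedAllocationIds, // allocationId -> currently tracked?
                             boolean shardFullyAllocated) {
        boolean tracked = trackedAllocationIds.getOrDefault(leaseAllocationId, false);
        if (tracked) {
            return false;               // active copy: retain its history indefinitely
        }
        return shardFullyAllocated;     // inactive copy: expire once every copy is assigned
    }
}
```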
This commit adjusts the behaviour of the retention lease sync to first renew
any peer-recovery retention leases where either:

- the corresponding shard's global checkpoint has advanced, or

- the lease is older than half of its expiry time

Relates elastic#41536
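
A minimal sketch of that renewal check, assuming simplified inputs (the real logic lives in the retention lease sync machinery; these names are illustrative):

```java
// Illustrative renewal check: renew the peer-recovery retention lease when the
// copy's global checkpoint has advanced past what the lease currently retains,
// or when the lease has consumed more than half of its expiry time.
final class PeerRecoveryLeaseRenewal {
    static boolean shouldRenew(long leaseRetainingSeqNo,
                               long copyGlobalCheckpoint,
                               long leaseTimestampMillis,
                               long nowMillis,
                               long leaseExpiryMillis) {
        boolean checkpointAdvanced = copyGlobalCheckpoint + 1 > leaseRetainingSeqNo;
        boolean olderThanHalfExpiry = nowMillis - leaseTimestampMillis > leaseExpiryMillis / 2;
        return checkpointAdvanced || olderThanHalfExpiry;
    }
}
```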
This commit updates the version in which PRRLs are expected to exist to 7.4.0.
If the primary performs a file-based recovery to a node that has (or recently
had) a copy of the shard then it is possible that the persisted global
checkpoint of the new copy is behind that of the old copy since file-based
recoveries are somewhat destructive operations.

Today we leave that node's PRRL in place during the recovery with the
expectation that it can be used by the new copy. However this isn't the case if
the new copy needs more history to be retained, because retention leases may
only advance and never retreat.

This commit addresses this by removing any existing PRRL during a file-based
recovery: since we are performing a file-based recovery we have already
determined that there isn't enough history for an ops-based recovery, so there
is little point in keeping the old lease in place.

Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14)

Relates elastic#41536
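
A sketch of the lease handling this implies, under the assumption that leases are keyed by an id derived from the target copy (illustrative names, not the actual API):

```java
import java.util.Map;

// Illustrative sketch: because a retention lease may only ever advance, a stale
// lease left behind by the old copy cannot be wound back to cover the extra
// history the new copy needs, so a file-based recovery simply discards it up
// front; a fresh lease is then established as part of the recovery itself.
final class FileBasedRecoveryLeases {
    static void removeStaleLease(Map<String, Long> retainingSeqNoByLeaseId, String targetLeaseId) {
        // Drop the old, possibly-stale lease; a new lease at the appropriate
        // sequence number is created later during the file-based recovery.
        retainingSeqNoByLeaseId.remove(targetLeaseId);
    }
}
```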
Today we perform `TransportReplicationAction` derivatives during recovery, and
these actions call their response handlers on the transport thread. This change
moves the continued execution of the recovery back onto the generic threadpool.
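
A minimal sketch of the forking pattern this describes, with a plain `ExecutorService` standing in for the generic threadpool (an assumption for illustration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch: the response handler only schedules the next recovery
// step onto a general-purpose executor instead of running it inline on the
// thread that delivered the network response.
final class ForkOffTransportThread {
    private final ExecutorService generic = Executors.newCachedThreadPool();

    void onReplicationResponse(Runnable continueRecovery) {
        // Never run the (potentially long) continuation on the transport thread.
        generic.execute(continueRecovery);
    }
}
```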
Today when renewing PRRLs we assert that any invalid "backwards" renewals must
be because we are recovering the shard. In fact it's also possible to have
`checkpointState.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO` on a
tracked shard copy if the primary was just promoted and hasn't yet received
checkpoints from all of its peers.

This commit weakens the assertion to match.

Caught by a [failure of the full cluster restart
tests](https://scans.gradle.com/s/5lllzgqtuegty/console-log#L8605)

Relates elastic#41536
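
A sketch of the weakened assertion, with simplified parameters (the real check sits in the replication tracker; `UNASSIGNED_SEQ_NO` is -2):

```java
// Illustrative form of the weakened assertion: a "backwards" renewal is
// acceptable either while the copy is still recovering, or when a freshly
// promoted primary has not yet heard a global checkpoint from that copy.
final class RenewalAssertions {
    static final long UNASSIGNED_SEQ_NO = -2;

    static boolean assertBackwardsRenewalAllowed(boolean copyIsRecovering, long copyGlobalCheckpoint) {
        assert copyIsRecovering || copyGlobalCheckpoint == UNASSIGNED_SEQ_NO
            : "unexpected backwards renewal for a tracked copy with global checkpoint " + copyGlobalCheckpoint;
        return true;
    }
}
```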
In elastic#44000 we introduced some calls to `assertNotTransportThread` that are
executed whether assertions are enabled or not. Although they have no effect if
assertions are disabled, we should have wrapped them in `assert` statements so
that they are only evaluated when assertions are enabled, which is what this
commit does.
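
The usual Elasticsearch idiom is an assertion method that returns `true` so it can be invoked inside an `assert` statement, meaning it is skipped entirely when assertions are disabled. A sketch of that pattern (the thread-name check is illustrative):

```java
// Illustrative sketch of the boolean-returning assertion idiom.
final class ThreadAssertions {
    static boolean assertNotTransportThread(String reason) {
        assert Thread.currentThread().getName().contains("[transport_worker]") == false
            : reason + " should not run on a transport thread";
        return true;
    }

    void continueRecovery() {
        // Wrapped in `assert`, the check is only evaluated when assertions are
        // enabled, rather than being executed unconditionally.
        assert assertNotTransportThread("continuing recovery");
        // ... continue with the recovery step ...
    }
}
```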
Today peer recovery retention leases (PRRLs) are created when starting a
replication group from scratch and during peer recovery. However, if the
replication group was migrated from nodes running a version which does not
create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was
relocated or promoted without first establishing all the expected leases.

It's not possible to establish these leases before or during primary
activation, so we must create them as soon as possible afterwards. This gives
weaker guarantees about history retention, since there's a possibility that
history will be discarded before it can be used. In practice such situations
are expected to occur only rarely.

This commit adds the machinery to create missing leases after primary
activation, and strengthens the assertions about the existence of such leases
in order to ensure that once all the leases do exist we never again enter a
state where there's a missing lease.

Relates elastic#41536
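
A rough sketch of that back-filling step, assuming a very simplified view of the tracked copies and their checkpoints (all names and types are illustrative):

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch: after primary activation, add a lease for every tracked
// copy that does not yet have one, retaining history from just after that
// copy's last-known global checkpoint. History before that point may already
// have been discarded, which is why the guarantee is weaker for replication
// groups migrated from versions without peer-recovery retention leases.
final class MissingLeaseBackfill {
    static void createMissingLeases(Set<String> trackedAllocationIds,
                                    Map<String, Long> globalCheckpointByAllocationId,
                                    Map<String, Long> retainingSeqNoByAllocationId) {
        for (String allocationId : trackedAllocationIds) {
            retainingSeqNoByAllocationId.computeIfAbsent(allocationId, id ->
                globalCheckpointByAllocationId.getOrDefault(id, -2L) + 1);
        }
    }
}
```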
The cluster in the full-cluster restart test only has 2 nodes, so we cannot
fully allocate an index with 2 replicas.
The full cluster restart tests run against versions prior to 7.0 in which soft
deletes are disabled by default, and against versions prior to 6.5 in which
soft deletes are not even supported. This commit adjusts the PRRL full cluster
restart test to handle such old clusters properly.
…43463)

Today we use the local checkpoint of the safe commit on replicas as the
starting sequence number of operation-based peer recovery. While this is
a good choice due to its simplicity, we need to share this information
between copies if we use retention leases in peer recovery. We can avoid
this extra work if we use the global checkpoint as the starting sequence
number.

With this change, we will try to recover the replica locally up to the
global checkpoint before performing peer recovery. This commit should
also increase the chance of operation-based recovery.
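
A minimal sketch of the starting point this gives us (illustrative only):

```java
// Illustrative sketch: the replica first replays its own translog up to the
// persisted global checkpoint, so peer recovery only has to ask the primary
// for operations strictly after that point, and the starting sequence number
// no longer needs to be exchanged between the copies.
final class LocalRecoveryPlan {
    static long startingSeqNoForPeerRecovery(long persistedGlobalCheckpoint) {
        long recoveredLocallyUpTo = persistedGlobalCheckpoint; // replayed from the local translog
        return recoveredLocallyUpTo + 1;                       // first operation needed from the primary
    }
}
```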
… step (elastic#44781)

If we force-allocate an empty or stale primary, the global checkpoint on
replicas might be higher than the primary's, as the local recovery step
(introduced in elastic#43463) loads the previous (stale) global checkpoint into
ReplicationTracker. There's no issue with the retention leases, because a new
lease with a higher term will supersede the stale one.

Relates elastic#43463
DaveCTurner and others added 8 commits July 29, 2019 23:50
For closed and frozen indices, we should not recover the shard locally up to
the global checkpoint before performing peer recovery, because that copy
might have been offline when the index was closed/frozen.

Relates elastic#43463
Closes elastic#44855
Thanks to peer recovery retention leases we now retain the history needed to
perform peer recoveries from the index instead of from the translog. This
commit adjusts the peer recovery process to do so, and also adjusts it to use
the existence of a retention lease to decide whether or not to attempt an
operations-based recovery.

Reverts elastic#38904 and elastic#42211
Relates elastic#41536
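
A sketch of the decision this implies on the primary, assuming the lease's retaining sequence number and the target's requested starting sequence number are already known (illustrative names, not the actual recovery source handler API):

```java
import java.util.Optional;

// Illustrative decision: attempt an operations-based recovery only if a
// peer-recovery retention lease exists for the target copy and it retains all
// the operations the target is missing; otherwise fall back to copying files.
final class RecoveryPlanner {
    enum Mode { OPERATIONS_BASED, FILE_BASED }

    static Mode chooseMode(Optional<Long> leaseRetainingSeqNo, long targetStartingSeqNo) {
        if (leaseRetainingSeqNo.isPresent() && leaseRetainingSeqNo.get() <= targetStartingSeqNo) {
            return Mode.OPERATIONS_BASED; // history from targetStartingSeqNo onwards is retained in the index
        }
        return Mode.FILE_BASED;           // no lease, or not enough retained history
    }
}
```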
Previously, if the metadata snapshot was empty (either because no commit was
found or because of an error), we did not compute a starting sequence number
and used -2 to opt out of operation-based recovery. With elastic#43463, we
compute a starting sequence number before reading the last commit. Thus, we
need to reset it if we fail to snapshot the store.

Closes elastic#45072
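
A minimal sketch of that reset, with -2 standing in for `SequenceNumbers.UNASSIGNED_SEQ_NO` (illustrative names):

```java
// Illustrative sketch: if snapshotting the store's metadata fails (or no commit
// is found), fall back to UNASSIGNED_SEQ_NO so that the previously computed
// starting sequence number cannot accidentally trigger an operations-based
// recovery against an unusable store.
final class StartingSeqNoReset {
    static final long UNASSIGNED_SEQ_NO = -2;

    static long safeStartingSeqNo(long computedStartingSeqNo, boolean storeSnapshotSucceeded) {
        return storeSnapshotSucceeded ? computedStartingSeqNo : UNASSIGNED_SEQ_NO;
    }
}
```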
@DaveCTurner DaveCTurner added >enhancement :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. backport v7.4.0 labels Aug 2, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Contributor

@ywelsch ywelsch left a comment


LGTM

Member

@dnhatn dnhatn left a comment


LGTM

@DaveCTurner DaveCTurner merged commit 9ff320d into elastic:7.x Aug 2, 2019
@DaveCTurner DaveCTurner deleted the 2019-08-02-merge-prrls-7.x branch August 2, 2019 14:00
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 2, 2019