Use index for peer recovery instead of translog #45137

Merged 38 commits on Aug 2, 2019

Commits on Jun 19, 2019

  1. Create peer-recovery retention leases (elastic#43190)

    This creates a peer-recovery retention lease for every shard during recovery,
    ensuring that the replication group retains history for future peer recoveries.
    It also ensures that leases for active shard copies do not expire, and leases
    for inactive shard copies expire immediately if the shard is fully allocated.
    
    Relates elastic#41536
    DaveCTurner committed Jun 19, 2019
    24941f2
  2. Fix compilation

    DaveCTurner committed Jun 19, 2019
    f57ec7b
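The expiry rule described in elastic#43190 above can be sketched as follows. This is a minimal model, not the Elasticsearch implementation: the class name, the map of node IDs to renewal timestamps, and the assumption that inactive leases on a partially allocated shard fall back to an ordinary timeout are all illustrative.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model for illustration; this is not the Elasticsearch API.
final class LeaseExpirySketch {

    /**
     * Returns the node IDs whose peer-recovery leases are kept.
     *
     * @param leaseTimestamps node ID -> time (ms) at which its lease was last renewed
     * @param assignedNodeIds nodes that currently hold an active copy of the shard
     * @param fullyAllocated  true if every copy of the shard is assigned
     */
    static Set<String> retainedLeases(Map<String, Long> leaseTimestamps,
                                      Set<String> assignedNodeIds,
                                      boolean fullyAllocated,
                                      long nowMillis,
                                      long expiryMillis) {
        Set<String> retained = new TreeSet<>();
        for (Map.Entry<String, Long> lease : leaseTimestamps.entrySet()) {
            boolean active = assignedNodeIds.contains(lease.getKey());
            boolean timedOut = nowMillis - lease.getValue() > expiryMillis;
            // Active copies never expire. Inactive copies expire at once when the
            // shard is fully allocated; otherwise (assumption) after the usual timeout.
            boolean expired = active == false && (fullyAllocated || timedOut);
            if (expired == false) {
                retained.add(lease.getKey());
            }
        }
        return retained;
    }
}
```

Keeping the lease of an inactive copy while the shard is not fully allocated gives that copy a chance to come back and recover from retained history.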

Commits on Jun 21, 2019

  1. f923fad

Commits on Jun 24, 2019

  1. d0f889d

Commits on Jun 26, 2019

  1. 465ea7b
  2. 1d0930f

Commits on Jul 1, 2019

  1. 4c97e8a
  2. Less sync

    DaveCTurner committed Jul 1, 2019
    f4b0742
  3. f8158fc
  4. Better test fix

    DaveCTurner committed Jul 1, 2019
    14ec424
  5. Checkstyle

    DaveCTurner committed Jul 1, 2019
    a455b47
  6. Advance PRRLs to match GCP of tracked shards (elastic#43751)

    This commit adjusts the behaviour of the retention lease sync to first renew
    any peer-recovery retention leases where either:
    
    - the corresponding shard's global checkpoint has advanced, or
    
    - the lease is older than half of its expiry time
    
    Relates elastic#41536
    DaveCTurner committed Jul 1, 2019
    c4f042b
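The two renewal conditions listed in elastic#43751 above can be written as a single predicate. This is a hedged sketch: the method name and parameters are hypothetical, and it assumes a lease tracks the sequence number just after the copy's global checkpoint.

```java
// Hypothetical model for illustration; this is not the Elasticsearch API.
final class LeaseRenewalSketch {

    /**
     * Decides whether a peer-recovery retention lease should be renewed.
     * Assumption: a lease retains history starting at globalCheckpoint + 1.
     */
    static boolean shouldRenew(long leaseRetainingSeqNo,
                               long globalCheckpoint,
                               long leaseTimestampMillis,
                               long nowMillis,
                               long expiryMillis) {
        // Condition 1: the copy's global checkpoint has advanced past the lease.
        boolean checkpointAdvanced = globalCheckpoint + 1 > leaseRetainingSeqNo;
        // Condition 2: the lease is older than half of its expiry time.
        boolean olderThanHalfExpiry = nowMillis - leaseTimestampMillis > expiryMillis / 2;
        return checkpointAdvanced || olderThanHalfExpiry;
    }
}
```

Renewing at half the expiry time leaves a margin so that a lease for a healthy but idle copy does not lapse between syncs.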

Commits on Jul 4, 2019

  1. be80b60
  2. Update BWC version for PRRLs (elastic#43959)

    This commit updates the version in which PRRLs are expected to exist to 7.4.0.
    DaveCTurner authored Jul 4, 2019
    570b4b9
  3. Remove PRRLs before performing file-based recovery (elastic#43928)

    If the primary performs a file-based recovery to a node that has (or recently
    had) a copy of the shard then it is possible that the persisted global
    checkpoint of the new copy is behind that of the old copy since file-based
    recoveries are somewhat destructive operations.
    
    Today we leave that node's PRRL in place during the recovery with the
    expectation that it can be used by the new copy. However, this isn't the case
    if the new copy needs more history to be retained, because retention leases
    may only advance and never retreat.
    
    This commit addresses this by removing any existing PRRL during a file-based
    recovery: since we are performing a file-based recovery we have already
    determined that there isn't enough history for an ops-based recovery, so there
    is little point in keeping the old lease in place.
    
    Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14)
    
    Relates elastic#41536
    DaveCTurner committed Jul 4, 2019
    50e9b75
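The reasoning in elastic#43928 above rests on leases being monotonic. The sketch below illustrates why an old lease has to be removed before a new copy can ask for more history; the class, its methods, and the `peer_recovery/node-b` lease ID are illustrative assumptions rather than the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model for illustration; this is not the Elasticsearch API.
final class MonotonicLeaseSketch {
    private final Map<String, Long> retainingSeqNoByLeaseId = new HashMap<>();

    void addLease(String id, long retainingSeqNo) {
        if (retainingSeqNoByLeaseId.putIfAbsent(id, retainingSeqNo) != null) {
            throw new IllegalStateException("lease [" + id + "] already exists");
        }
    }

    // Leases may only advance: a renewal that asks for more history is rejected.
    void renewLease(String id, long retainingSeqNo) {
        Long current = retainingSeqNoByLeaseId.get(id);
        if (current == null) {
            throw new IllegalStateException("lease [" + id + "] does not exist");
        }
        if (retainingSeqNo < current) {
            throw new IllegalArgumentException("lease [" + id + "] cannot retreat from "
                + current + " to " + retainingSeqNo);
        }
        retainingSeqNoByLeaseId.put(id, retainingSeqNo);
    }

    // Removing the stale lease is the only way to start again from a lower point.
    void removeLease(String id) {
        retainingSeqNoByLeaseId.remove(id);
    }

    Long retainingSeqNo(String id) {
        return retainingSeqNoByLeaseId.get(id);
    }
}
```

In this model a file-based recovery would call `removeLease` first and then `addLease` with the retaining sequence number that the new copy actually needs.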

Commits on Jul 5, 2019

  1. a4d5cf1
  2. Return recovery to generic thread post-PRRL action (elastic#44000)

    Today we perform `TransportReplicationAction` derivatives during recovery, and
    these actions call their response handlers on the transport thread. This change
    moves the continued execution of the recovery back onto the generic threadpool.
    DaveCTurner committed Jul 5, 2019
    5dd6c68
  3. Skip PRRL renewal on UNASSIGNED_SEQ_NO (elastic#44019)

    Today when renewing PRRLs we assert that any invalid "backwards" renewals must
    be because we are recovering the shard. In fact, it's also possible to have
    `checkpointState.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO` on a
    tracked shard copy if the primary was just promoted and hasn't yet received
    checkpoints from all of its peers.
    
    This commit weakens the assertion to match.
    
    Caught by a [failure of the full cluster restart
    tests](https://scans.gradle.com/s/5lllzgqtuegty/console-log#L8605)
    
    Relates elastic#41536
    DaveCTurner committed Jul 5, 2019
    b9959e3
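The case handled by elastic#44019 above can be sketched as a renewal helper that does nothing when the copy's global checkpoint is not yet known. The helper name and signature are hypothetical; the constant mirrors `SequenceNumbers.UNASSIGNED_SEQ_NO`, which is -2, and the sketch assumes a lease retains history from the global checkpoint plus one.

```java
import java.util.OptionalLong;

// Hypothetical model for illustration; this is not the Elasticsearch API.
final class SkipUnassignedRenewalSketch {
    static final long UNASSIGNED_SEQ_NO = -2L; // mirrors SequenceNumbers.UNASSIGNED_SEQ_NO

    /**
     * Returns the new retaining sequence number, or empty if the lease is left alone.
     */
    static OptionalLong renewedRetainingSeqNo(long currentRetainingSeqNo, long globalCheckpoint) {
        if (globalCheckpoint == UNASSIGNED_SEQ_NO) {
            // A freshly promoted primary may not have heard from this copy yet:
            // skip the renewal instead of treating it as an invalid backwards move.
            return OptionalLong.empty();
        }
        long candidate = globalCheckpoint + 1;
        // Leases only advance, so an equal or lower candidate leaves the lease as it is.
        return candidate > currentRetainingSeqNo ? OptionalLong.of(candidate) : OptionalLong.empty();
    }
}
```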

Commits on Jul 8, 2019

  1. Only call assertNotTransportThread if asserts on (elastic#44028)

    In elastic#44000 we introduced some calls to `assertNotTransportThread` that are
    executed whether assertions are enabled or not. Although they have no effect if
    assertions are disabled, it is cleaner to call them only when assertions are
    enabled, which is what this commit does.
    DaveCTurner committed Jul 8, 2019
    fb39bb0
  2. 59a6830
  3. Create missing PRRLs after primary activation (elastic#44009)

    Today peer recovery retention leases (PRRLs) are created when starting a
    replication group from scratch and during peer recovery. However, if the
    replication group was migrated from nodes running a version which does not
    create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was
    relocated or promoted without first establishing all the expected leases.
    
    It's not possible to establish these leases before or during primary
    activation, so we must create them as soon as possible afterwards. This gives
    weaker guarantees about history retention, since there's a possibility that
    history will be discarded before it can be used. In practice such situations
    are expected to occur only rarely.
    
    This commit adds the machinery to create missing leases after primary
    activation, and strengthens the assertions about the existence of such leases
    in order to ensure that once all the leases do exist we never again enter a
    state where there's a missing lease.
    
    Relates elastic#41536
    DaveCTurner authored Jul 8, 2019
    4b19a4b
  4. Reduce number of replicas in cluster restart test

    The cluster in the full-cluster restart test only has 2 nodes, so we cannot
    fully allocate an index with 2 replicas.
    DaveCTurner committed Jul 8, 2019
    18a0e53
  5. Enable soft deletes in PRRL restart test

    The full cluster restart tests run against versions prior to 7.0 in which soft
    deletes are disabled by default, and against versions prior to 6.5 in which
    soft deletes are not even supported. This commit adjusts the PRRL full cluster
    restart test to handle such old clusters properly.
    DaveCTurner committed Jul 8, 2019
    f1626e9
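The pattern behind elastic#44028 in the group above is a common Java idiom: a check that returns `true` (or throws) is wrapped in an `assert` statement so that the JVM skips the call when assertions are disabled. The sketch below shows the idiom in isolation. The helper is only modeled on the name in the commit message, and the `transport_worker` thread-name prefix and the call counter are placeholders added for the demonstration.

```java
// Hypothetical helper modeled on the name in the commit message;
// this is not the Elasticsearch implementation.
final class AssertOnlyCheckSketch {
    static int checksRun = 0; // counts how often the check body actually executes

    // Returns true (or throws) so that it can be the condition of an assert statement.
    static boolean assertNotTransportThread(String reason) {
        checksRun++;
        String threadName = Thread.currentThread().getName();
        if (threadName.startsWith("transport_worker")) { // placeholder thread-name prefix
            throw new AssertionError("unexpected transport thread [" + threadName + "] for " + reason);
        }
        return true;
    }

    static void continueRecovery() {
        // With assertions disabled (no -ea flag) the JVM never evaluates this call.
        assert assertNotTransportThread("recovery continuation");
        // ... recovery work would continue here ...
    }
}
```

Calling the helper as a plain statement would run the thread-name lookup on every invocation, even in production where assertions are normally off.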

Commits on Jul 9, 2019

  1. aaeb1aa

Commits on Jul 11, 2019

  1. 15a719e

Commits on Jul 15, 2019

  1. 66583fd

Commits on Jul 20, 2019

  1. 062bc8d

Commits on Jul 23, 2019

  1. 85b701a
  2. Use global checkpoint as starting seq in ops-based recovery (elastic#43463)
    
    Today we use the local checkpoint of the safe commit on replicas as the
    starting sequence number of operation-based peer recovery. While this is
    a good choice due to its simplicity, we need to share this information
    between copies if we use retention leases in peer recovery. We can avoid
    this extra work if we use the global checkpoint as the starting sequence
    number.
    
    With this change, we will try to recover the replica locally up to the
    global checkpoint before performing peer recovery. This commit should
    also increase the chance that a recovery can be operation-based.
    dnhatn committed Jul 23, 2019
    8fe0cda
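The effect of the change in elastic#43463 above can be illustrated with a little arithmetic. The sketch assumes the replica can replay its own operations up to the global checkpoint, so the primary only needs to send what comes after it. All names are hypothetical.

```java
// Hypothetical model for illustration; this is not the Elasticsearch API.
final class StartingSeqNoSketch {

    // Earlier approach: start just after the local checkpoint of the safe commit.
    static long fromSafeCommit(long safeCommitLocalCheckpoint) {
        return safeCommitLocalCheckpoint + 1;
    }

    // Approach described in the commit: recover locally first, then start
    // just after the global checkpoint.
    static long fromGlobalCheckpoint(long globalCheckpoint) {
        return globalCheckpoint + 1;
    }

    // Number of operations the primary must still hold and send to the replica.
    static long operationsToTransfer(long startingSeqNo, long primaryMaxSeqNo) {
        return Math.max(0L, primaryMaxSeqNo - startingSeqNo + 1);
    }
}
```

With a safe commit at local checkpoint 40, a global checkpoint of 90, and a primary at sequence number 100, the first approach needs operations 41 to 100 from the primary while the second needs only 91 to 100. A shorter range is more likely to still be retained on the primary.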

Commits on Jul 24, 2019

  1. Do not load global checkpoint to ReplicationTracker in local recovery step (elastic#44781)
    
    If we force allocate an empty or stale primary, the global checkpoint on
    replicas might be higher than the primary's as the local recovery step
    (introduced in elastic#43463) loads the previous (stale) global checkpoint into
    ReplicationTracker. There's no issue with the retention leases, because a new
    lease with a higher term will supersede the stale one.
    
    Relates elastic#43463
    dnhatn committed Jul 24, 2019
    96d5ee7

Commits on Jul 29, 2019

  1. 6f6aaca

Commits on Jul 30, 2019

  1. Skip local recovery for closed or frozen indices (elastic#44887)

    For closed and frozen indices, we should not recover the shard locally up to
    the global checkpoint before performing peer recovery, because that copy
    might have been offline when the index was closed or frozen.
    
    Relates elastic#43463
    Closes elastic#44855
    dnhatn committed Jul 30, 2019
    f8bfcb3
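The gate added by elastic#44887 above might look roughly like the following. This is a sketch under assumptions: the method and its parameters are hypothetical, and falling back to the safe commit's local checkpoint when local recovery is skipped is an assumption about the behaviour, not something stated in the commit message.

```java
// Hypothetical model for illustration; this is not the Elasticsearch API.
final class LocalRecoveryGateSketch {

    /**
     * Picks the starting sequence number that a recovering copy reports to the primary.
     */
    static long startingSeqNo(boolean closedOrFrozen,
                              long safeCommitLocalCheckpoint,
                              long globalCheckpoint) {
        if (closedOrFrozen) {
            // Skip local recovery (assumption): rely only on what the safe commit holds.
            return safeCommitLocalCheckpoint + 1;
        }
        // Otherwise the copy replays its own operations up to the global checkpoint.
        return Math.max(safeCommitLocalCheckpoint, globalCheckpoint) + 1;
    }
}
```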

Commits on Jul 31, 2019

  1. d225b78

Commits on Aug 1, 2019

  1. e1b059b
  2. Recover peers using history from Lucene (elastic#44853)

    Thanks to peer recovery retention leases we now retain the history needed to
    perform peer recoveries from the index instead of from the translog. This
    commit adjusts the peer recovery process to do so, and also adjusts it to use
    the existence of a retention lease to decide whether or not to attempt an
    operations-based recovery.
    
    Reverts elastic#38904 and elastic#42211
    Relates elastic#41536
    DaveCTurner committed Aug 1, 2019
    6d73b9f
  3. Reset starting seqno if fail to read last commit (elastic#45106)

    Previously, if the metadata snapshot was empty (either no commit was found or
    an error occurred), we did not compute the starting sequence number and used
    -2 to opt out of operation-based recovery. With elastic#43463, we have a
    starting sequence number before reading the last commit. Thus, we need to
    reset it if we fail to snapshot the store.
    
    Closes elastic#45072
    dnhatn committed Aug 1, 2019
    513d155
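The decision described in elastic#44853 above, together with the -2 opt-out mentioned in elastic#45106, can be sketched as follows. The names and the `peer_recovery/<node>` lease ID format are illustrative. The extra check that the lease retains history from the starting sequence number onward is an assumption added to make the sketch self-consistent; the commit message only mentions the existence of the lease.

```java
import java.util.Map;

// Hypothetical model for illustration; this is not the Elasticsearch API.
final class RecoveryPlanSketch {
    enum Plan { OPERATIONS_BASED, FILE_BASED }

    static final long UNASSIGNED_SEQ_NO = -2L; // mirrors SequenceNumbers.UNASSIGNED_SEQ_NO

    static Plan choose(Map<String, Long> retainingSeqNoByLeaseId,
                       String targetLeaseId,
                       long startingSeqNo) {
        if (startingSeqNo == UNASSIGNED_SEQ_NO) {
            // The target opted out, for example because it could not read its last commit.
            return Plan.FILE_BASED;
        }
        Long retainingSeqNo = retainingSeqNoByLeaseId.get(targetLeaseId);
        if (retainingSeqNo == null) {
            // No lease for this copy, so nothing guarantees that history was kept.
            return Plan.FILE_BASED;
        }
        // Assumption: the lease must cover every operation from startingSeqNo onward.
        return retainingSeqNo <= startingSeqNo ? Plan.OPERATIONS_BASED : Plan.FILE_BASED;
    }
}
```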

Commits on Aug 2, 2019

  1. 14bba51
  2. a4c9d56