Support shard promotion with Segment Replication. #4135

Merged (5 commits) on Aug 17, 2022

mch2 (Member) commented Aug 4, 2022

Signed-off-by: Marc Handalian [email protected]

Description

This change adds basic failover support with segment replication. Once selected for promotion, a replica commits its SegmentInfos and reopens a writeable engine. The replica also removes all other commits so that this commit is the one selected when the writeable engine is opened. This commit may not be considered 'safe' by the primary, meaning its max seqNo is higher than the global checkpoint. Although this is an edge case, we never want replicas to reindex with segment replication enabled, so if the global checkpoint has not been updated yet we do not revert to a safe commit. This change also updates how SegmentReplicationCheckpointPublisher is wired up within IndexShard so that, once promoted, the new primary can publish checkpoints.

This PR does not handle edge cases of promotion while there are ongoing replication events; those will be covered in a separate issue.
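
To make the promotion sequence above concrete, here is a minimal sketch of it as a toy model. It is not the code in this PR: `PromotionSketch`, `ReplicaShard`, `Commit`, and `promote()` are hypothetical names invented for illustration, and commits are plain in-memory objects rather than Lucene commit points. It only shows the ordering described above (commit the current segments, drop every other commit, open a writeable engine) and that the remaining commit can sit above the global checkpoint.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of replica promotion; every name here is hypothetical, not OpenSearch code. */
final class PromotionSketch {

    /** A commit point: its generation and the highest sequence number it contains. */
    static final class Commit {
        final long generation;
        final long maxSeqNo;

        Commit(long generation, long maxSeqNo) {
            this.generation = generation;
            this.maxSeqNo = maxSeqNo;
        }
    }

    /** Stand-in for a replica shard holding on-disk commits plus in-memory segments. */
    static final class ReplicaShard {
        private final List<Commit> commits;
        private final long inMemoryMaxSeqNo; // highest seqNo in the in-memory SegmentInfos
        private final long globalCheckpoint;
        private boolean writeable = false;   // a replica starts with a read-only engine

        ReplicaShard(List<Commit> commits, long inMemoryMaxSeqNo, long globalCheckpoint) {
            this.commits = new ArrayList<>(commits);
            this.inMemoryMaxSeqNo = inMemoryMaxSeqNo;
            this.globalCheckpoint = globalCheckpoint;
        }

        /** Commit the current segments, drop every other commit, then open for writes. */
        Commit promote() {
            long nextGeneration = 1;
            for (Commit c : commits) {
                nextGeneration = Math.max(nextGeneration, c.generation + 1);
            }
            Commit latest = new Commit(nextGeneration, inMemoryMaxSeqNo);
            commits.clear();     // no older commit is left for the new engine to select
            commits.add(latest);
            writeable = true;
            return latest;
        }

        /** A commit is "safe" when it holds nothing above the global checkpoint. */
        boolean isSafe(Commit commit) {
            return commit.maxSeqNo <= globalCheckpoint;
        }

        List<Commit> commits() {
            return new ArrayList<>(commits);
        }

        boolean isWriteable() {
            return writeable;
        }
    }

    public static void main(String[] args) {
        // Two existing commits; in-memory segments are ahead of the global checkpoint.
        ReplicaShard replica = new ReplicaShard(
            List.of(new Commit(3, 10), new Commit(4, 20)), 25, 20);
        Commit promoted = replica.promote();
        System.out.println("commits left: " + replica.commits().size());
        System.out.println("opened on generation: " + promoted.generation);
        System.out.println("safe: " + replica.isSafe(promoted));
        System.out.println("writeable: " + replica.isWriteable());
    }
}
```

In this model the promoted commit is deliberately kept even when `isSafe` returns false, which mirrors the choice above not to revert to a safe commit.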

Issues Resolved

closes #3989

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot (Contributor) commented Aug 5, 2022

Gradle Check (Jenkins) Run Completed with:

codecov-commenter commented Aug 5, 2022

Codecov Report

Merging #4135 (28fea5f) into main (237f1a5) will decrease coverage by 0.00%.
The diff coverage is 76.92%.

@@             Coverage Diff              @@
##               main    #4135      +/-   ##
============================================
- Coverage     70.65%   70.64%   -0.01%     
- Complexity    57075    57145      +70     
============================================
  Files          4606     4606              
  Lines        274706   274737      +31     
  Branches      40228    40231       +3     
============================================
- Hits         194103   194099       -4     
- Misses        64280    64374      +94     
+ Partials      16323    16264      -59     
Impacted Files Coverage Δ
...c/main/java/org/opensearch/index/IndexService.java 73.86% <0.00%> (-0.23%) ⬇️
...nsearch/index/shard/CheckpointRefreshListener.java 88.88% <0.00%> (-11.12%) ⬇️
...in/java/org/opensearch/index/shard/IndexShard.java 69.07% <73.33%> (-0.58%) ⬇️
...rc/main/java/org/opensearch/index/store/Store.java 81.30% <81.25%> (-0.60%) ⬇️
.../opensearch/index/engine/NRTReplicationEngine.java 76.92% <100.00%> (+1.52%) ⬆️
...ation/OpenSearchIndexLevelReplicationTestCase.java 89.81% <100.00%> (+0.02%) ⬆️
...java/org/opensearch/client/indices/DataStream.java 0.00% <0.00%> (-76.09%) ⬇️
...n/indices/forcemerge/ForceMergeRequestBuilder.java 0.00% <0.00%> (-75.00%) ⬇️
...adonly/AddIndexBlockClusterStateUpdateRequest.java 0.00% <0.00%> (-75.00%) ⬇️
.../opensearch/client/indices/CloseIndexResponse.java 17.50% <0.00%> (-60.00%) ⬇️
... and 499 more

@mch2 mch2 requested review from Bukhtawar and nknize August 8, 2022 18:20
mch2 (Member, Author) commented Aug 8, 2022

I've added a commit ensuring that cancelling primary allocation succeeds, that the replica is promoted, and that the old primary is recreated as a replica. While testing that, I found we were failing to publish a replication checkpoint if the primary flushed during close. That is now fixed: the shard must be open for us to publish the replication checkpoint.
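
As a rough illustration of the guard described above, the sketch below models a primary that publishes a checkpoint on refresh only while the shard is open. It is a toy, not the change in this PR: `CheckpointPublishSketch`, `PrimaryShard`, `RecordingPublisher`, and `ShardState` are hypothetical names, and the assumption that a flush during close reaches the publisher through a refresh is made here for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the closed-shard publish guard; every name here is hypothetical. */
final class CheckpointPublishSketch {

    enum ShardState { STARTED, CLOSED }

    /** Records checkpoints instead of sending them to replicas. */
    static final class RecordingPublisher {
        final List<Long> published = new ArrayList<>();

        void publish(long segmentInfosVersion) {
            published.add(segmentInfosVersion);
        }
    }

    static final class PrimaryShard {
        private final RecordingPublisher publisher;
        private ShardState state = ShardState.STARTED;
        private long segmentInfosVersion = 0;

        PrimaryShard(RecordingPublisher publisher) {
            this.publisher = publisher;
        }

        /** A refresh yields a new segment view; publish it only while the shard is open. */
        void refresh() {
            segmentInfosVersion++;
            if (state != ShardState.CLOSED) {
                publisher.publish(segmentInfosVersion);
            }
        }

        /** Assumed for illustration: closing flushes, and the flush triggers a refresh. */
        void close() {
            state = ShardState.CLOSED;
            refresh();
        }
    }

    public static void main(String[] args) {
        RecordingPublisher publisher = new RecordingPublisher();
        PrimaryShard primary = new PrimaryShard(publisher);
        primary.refresh(); // open shard: checkpoint is published
        primary.close();   // closed shard: the refresh during close publishes nothing
        System.out.println("published checkpoints: " + publisher.published);
    }
}
```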

This change adds basic failover support with segment replication.  Once selected, a replica will commit and reopen a writeable engine.

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>

@mch2 mch2 merged commit f65e02d into opensearch-project:main Aug 17, 2022
@mch2 mch2 deleted the failover branch August 17, 2022 17:32
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Aug 17, 2022
opensearch-trigger-bot (Contributor) commented

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-4135-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 f65e02d1b910bd0a1990868bfa5d12ba829bbbd5
# Push it to GitHub
git push --set-upstream origin backport/backport-4135-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-4135-to-2.x.

mch2 added a commit to mch2/OpenSearch that referenced this pull request Aug 29, 2022
…nsearch-project#4135)

* Support shard promotion with Segment Replication.

This change adds basic failover support with segment replication.  Once selected, a replica will commit and reopen a writeable engine.

Signed-off-by: Marc Handalian <[email protected]>

* Add check to ensure a closed shard does not publish checkpoints.

Signed-off-by: Marc Handalian <[email protected]>

* Clean up in SegmentReplicationIT.

Signed-off-by: Marc Handalian <[email protected]>

* PR feedback.

Signed-off-by: Marc Handalian <[email protected]>

* Fix merge conflict.

Signed-off-by: Marc Handalian <[email protected]>

Signed-off-by: Marc Handalian <[email protected]>
mch2 added a commit that referenced this pull request Aug 30, 2022
…) (#4325)

Successfully merging this pull request may close these issues.

[Segment Replication] Swap replica to writeable engine during failover.
5 participants