Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation #8889

Merged
merged 8 commits into from
Aug 3, 2023

Conversation

mch2
Copy link
Member

@mch2 mch2 commented Jul 26, 2023

Description

This test is now occasionally failing with replicas having 0 documents while expecting to be caught up to the primary. This occurs in a couple of ways:

  1. After dropping the old primary the new primary is not publishing a checkpoint to replicas unless it indexes docs from translog after flipping to primary mode. If there is nothing to index, it will not publish a checkpoint, but the other replica could have never sync'd with the original primary and be left out of date.
  • This PR fixes this by force publishing a checkpoint after the new primary flips to primary mode.
  1. The replica receives a checkpoint post failover and cancels its sync with the former primary that is still active, recognizing a primary term bump. However this cancellation is async and immediately starting a new replication event could fail as its still replicating.
  • This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind.

This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primary & replicas.

  • Ensure the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed. This ensures checks for primary term are accurate and not using a locally computed operationPrimaryTerm.
  • Introduces a refresh listener for both primary & replica to update the ReplicationCheckpoint and store it in replicationTracker post refresh rather than redundantly computing when accessed.
  • Removes unnecessary onCheckpointPublished method used to start replication timers manually. This will happen automatically on primaries through the refresh listener.
  • Implements segmentInfosSnapshot method for NRTReplicationEngine, ensuring we do not remove required segments while computing ReplicationCheckpoint post refresh.

Related Issues

Resolves #8059

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@Poojita-Raj
Copy link
Contributor

#8850
Fixed: #8863
New failure: org.opensearch.index.shard.SegmentReplicationWithRemoteIndexShardTests.testReplicaSyncingFromRemoteStore -

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@codecov
Copy link

codecov bot commented Jul 26, 2023

Codecov Report

Merging #8889 (d0e15b8) into main (9720528) will decrease coverage by 0.03%.
Report is 2 commits behind head on main.
The diff coverage is 69.23%.

@@             Coverage Diff              @@
##               main    #8889      +/-   ##
============================================
- Coverage     71.01%   70.99%   -0.03%     
+ Complexity    57251    57223      -28     
============================================
  Files          4765     4765              
  Lines        270334   270357      +23     
  Branches      39538    39541       +3     
============================================
- Hits         191991   191950      -41     
- Misses        62176    62187      +11     
- Partials      16167    16220      +53     
Files Changed Coverage Δ
...search/index/shard/RemoteStoreRefreshListener.java 83.06% <0.00%> (-1.48%) ⬇️
...ckpoint/SegmentReplicationCheckpointPublisher.java 100.00% <ø> (ø)
...s/replication/SegmentReplicationTargetService.java 60.00% <57.69%> (+0.74%) ⬆️
.../opensearch/index/engine/NRTReplicationEngine.java 78.98% <66.66%> (-0.49%) ⬇️
.../indices/replication/SegmentReplicationTarget.java 89.71% <85.71%> (+0.39%) ⬆️
...in/java/org/opensearch/index/shard/IndexShard.java 69.49% <86.95%> (-0.20%) ⬇️

... and 495 files with indirect coverage changes

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@mch2
Copy link
Member Author

mch2 commented Jul 27, 2023

Gradle Check (Jenkins) Run Completed with:

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':distribution:bwc:staged:buildBwcLinuxTar'.
> Building 2.9.0 didn't generate expected file /var/jenkins/workspace/gradle-check/search/distribution/bwc/staged/build/bwc/checkout-2.9/distribution/archives/linux-tar/build/distributions/opensearch-min-2.9.0-SNAPSHOT-linux-x64.tar.gz

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@Poojita-Raj
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

* **RESULT:** FAILURE ❌

* **URL:** https://build.ci.opensearch.org/job/gradle-check/21068/

* **CommitID:** [8476d5d](https://github.com/opensearch-project/OpenSearch/commit/8476d5d11705d84da690bb8935aca70da593327e)
  Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
  Is the failure [a flaky test](https://github.com/opensearch-project/OpenSearch/blob/main/DEVELOPER_GUIDE.md#flaky-tests) unrelated to your change?
Execution failed for task ':distribution:bwc:staged:buildBwcLinuxTar'.
Building 2.9.0 didn't generate expected file /var/jenkins/workspace/gradle-check/search/distribution/bwc/staged/build/bwc/checkout-2.9/distribution/archives/linux-tar/build/distributions/opensearch-min-2.9.0-SNAPSHOT-linux-x64.tar.gz

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@mch2
Copy link
Member Author

mch2 commented Jul 27, 2023

Gradle Check (Jenkins) Run Completed with:

#8932

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@mch2
Copy link
Member Author

mch2 commented Jul 27, 2023

Gradle Check (Jenkins) Run Completed with:

#8932 again

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadRangeBlobWithRetries

@mch2 mch2 marked this pull request as ready for review July 27, 2023 23:08
…ckpoint validation.

This test is now occasionally failing with replicas having 0 documents. This occurs in a couple of ways:
1. After dropping the old primary the new primary is not publishing a checkpoint to replicas unless it indexes docs from translog after flipping to primary mode.
If there is nothing to index, it will not publish a checkpoint, but the other replica could have never sync'd with the original primary and be left out of date.
- This PR fixes this by force publishing a checkpoint after the new primary flips to primary mode.
2. The replica receives a checkpoint post failover and cancels its sync with the former primary that is still active, recognizing a primary term bump.
However this cancellation is async and immediately starting a new replication event could fail as its still replicating.
- This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind.

This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primary & replicas.
- Ensure the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed.  This ensures checks for primary term are accurate and not using a locally compued operationPrimaryTerm.
- Introduces a refresh listener for both primary & replica to update the ReplicationCheckpoint and store it in replicationTracker post refresh rather than redundantly computing when accessed.
- Removes unnecessary onCheckpointPublished method used to start replication timers manually.  This will happen automatically on primaries once its local cp is updated.

Signed-off-by: Marc Handalian <[email protected]>
To avoid divergent logic with remote store, we always incref/decref the segmentinfos.files(true) which includes the segments_n file.
Decref to 0 will attempt to delete the file from the store and its possible this _n file does not yet exist. This change will ignore if we get a noSuchFile while attempting to delete.

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
@github-actions

This comment was marked as outdated.

@opensearch-trigger-bot
Copy link
Contributor

Compatibility status:



> Task :checkCompatibility
Incompatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/security-analytics.git]
Compatible components: [https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

BUILD SUCCESSFUL in 25m 32s

@github-actions

This comment was marked as outdated.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 3, 2023

Gradle Check (Jenkins) Run Completed with:

@mch2
Copy link
Member Author

mch2 commented Aug 3, 2023

Gradle Check (Jenkins) Run Completed with:

Execution failed for task ':test:fixtures:krb5kdc-fixture:composeBuild'.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 3, 2023

Gradle Check (Jenkins) Run Completed with:

@mch2 mch2 merged commit c3acf47 into opensearch-project:main Aug 3, 2023
8 of 10 checks passed
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Aug 3, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-8889-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 c3acf47b4d643c3a3ab86dc3b07fe722ac6e4982
# Push it to GitHub
git push --set-upstream origin backport/backport-8889-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-8889-to-2.x.

VachaShah pushed a commit to VachaShah/OpenSearch that referenced this pull request Aug 3, 2023
…ckpoint validation (opensearch-project#8889)

* Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation.

This test is now occasionally failing with replicas having 0 documents. This occurs in a couple of ways:
1. After dropping the old primary the new primary is not publishing a checkpoint to replicas unless it indexes docs from translog after flipping to primary mode.
If there is nothing to index, it will not publish a checkpoint, but the other replica could have never sync'd with the original primary and be left out of date.
- This PR fixes this by force publishing a checkpoint after the new primary flips to primary mode.
2. The replica receives a checkpoint post failover and cancels its sync with the former primary that is still active, recognizing a primary term bump.
However this cancellation is async and immediately starting a new replication event could fail as its still replicating.
- This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind.

This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primary & replicas.
- Ensure the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed.  This ensures checks for primary term are accurate and not using a locally compued operationPrimaryTerm.
- Introduces a refresh listener for both primary & replica to update the ReplicationCheckpoint and store it in replicationTracker post refresh rather than redundantly computing when accessed.
- Removes unnecessary onCheckpointPublished method used to start replication timers manually.  This will happen automatically on primaries once its local cp is updated.

Signed-off-by: Marc Handalian <[email protected]>

* Handle NoSuchFileException when attempting to delete decref'd files.

To avoid divergent logic with remote store, we always incref/decref the segmentinfos.files(true) which includes the segments_n file.
Decref to 0 will attempt to delete the file from the store and its possible this _n file does not yet exist. This change will ignore if we get a noSuchFile while attempting to delete.

Signed-off-by: Marc Handalian <[email protected]>

* Add more unit tests.

Signed-off-by: Marc Handalian <[email protected]>

* Clean up IndexShardTests.testCheckpointReffreshListenerWithNull

Signed-off-by: Marc Handalian <[email protected]>

* Remove unnecessary catch for NoSuchFileException.

Signed-off-by: Marc Handalian <[email protected]>

* Add another test for non segrep.

Signed-off-by: Marc Handalian <[email protected]>

* PR Feedback.

Signed-off-by: Marc Handalian <[email protected]>

* re-compute replication checkpoint on primary promotion.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
mch2 added a commit to mch2/OpenSearch that referenced this pull request Aug 3, 2023
…ckpoint validation (opensearch-project#8889)

* Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation.

This test is now occasionally failing with replicas having 0 documents. This occurs in a couple of ways:
1. After dropping the old primary the new primary is not publishing a checkpoint to replicas unless it indexes docs from translog after flipping to primary mode.
If there is nothing to index, it will not publish a checkpoint, but the other replica could have never sync'd with the original primary and be left out of date.
- This PR fixes this by force publishing a checkpoint after the new primary flips to primary mode.
2. The replica receives a checkpoint post failover and cancels its sync with the former primary that is still active, recognizing a primary term bump.
However this cancellation is async and immediately starting a new replication event could fail as its still replicating.
- This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind.

This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primary & replicas.
- Ensure the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed.  This ensures checks for primary term are accurate and not using a locally compued operationPrimaryTerm.
- Introduces a refresh listener for both primary & replica to update the ReplicationCheckpoint and store it in replicationTracker post refresh rather than redundantly computing when accessed.
- Removes unnecessary onCheckpointPublished method used to start replication timers manually.  This will happen automatically on primaries once its local cp is updated.

Signed-off-by: Marc Handalian <[email protected]>

* Handle NoSuchFileException when attempting to delete decref'd files.

To avoid divergent logic with remote store, we always incref/decref the segmentinfos.files(true) which includes the segments_n file.
Decref to 0 will attempt to delete the file from the store and its possible this _n file does not yet exist. This change will ignore if we get a noSuchFile while attempting to delete.

Signed-off-by: Marc Handalian <[email protected]>

* Add more unit tests.

Signed-off-by: Marc Handalian <[email protected]>

* Clean up IndexShardTests.testCheckpointReffreshListenerWithNull

Signed-off-by: Marc Handalian <[email protected]>

* Remove unnecessary catch for NoSuchFileException.

Signed-off-by: Marc Handalian <[email protected]>

* Add another test for non segrep.

Signed-off-by: Marc Handalian <[email protected]>

* PR Feedback.

Signed-off-by: Marc Handalian <[email protected]>

* re-compute replication checkpoint on primary promotion.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
(cherry picked from commit c3acf47)
mch2 added a commit to mch2/OpenSearch that referenced this pull request Aug 3, 2023
…ckpoint validation (opensearch-project#8889)

* Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation.

This test is now occasionally failing with replicas having 0 documents. This occurs in a couple of ways:
1. After dropping the old primary the new primary is not publishing a checkpoint to replicas unless it indexes docs from translog after flipping to primary mode.
If there is nothing to index, it will not publish a checkpoint, but the other replica could have never sync'd with the original primary and be left out of date.
- This PR fixes this by force publishing a checkpoint after the new primary flips to primary mode.
2. The replica receives a checkpoint post failover and cancels its sync with the former primary that is still active, recognizing a primary term bump.
However this cancellation is async and immediately starting a new replication event could fail as its still replicating.
- This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind.

This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primary & replicas.
- Ensure the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed.  This ensures checks for primary term are accurate and not using a locally compued operationPrimaryTerm.
- Introduces a refresh listener for both primary & replica to update the ReplicationCheckpoint and store it in replicationTracker post refresh rather than redundantly computing when accessed.
- Removes unnecessary onCheckpointPublished method used to start replication timers manually.  This will happen automatically on primaries once its local cp is updated.

Signed-off-by: Marc Handalian <[email protected]>

* Handle NoSuchFileException when attempting to delete decref'd files.

To avoid divergent logic with remote store, we always incref/decref the segmentinfos.files(true) which includes the segments_n file.
Decref to 0 will attempt to delete the file from the store and its possible this _n file does not yet exist. This change will ignore if we get a noSuchFile while attempting to delete.

Signed-off-by: Marc Handalian <[email protected]>

* Add more unit tests.

Signed-off-by: Marc Handalian <[email protected]>

* Clean up IndexShardTests.testCheckpointReffreshListenerWithNull

Signed-off-by: Marc Handalian <[email protected]>

* Remove unnecessary catch for NoSuchFileException.

Signed-off-by: Marc Handalian <[email protected]>

* Add another test for non segrep.

Signed-off-by: Marc Handalian <[email protected]>

* PR Feedback.

Signed-off-by: Marc Handalian <[email protected]>

* re-compute replication checkpoint on primary promotion.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
(cherry picked from commit c3acf47)
mch2 added a commit to mch2/OpenSearch that referenced this pull request Aug 3, 2023
…ckpoint validation (opensearch-project#8889)

* Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation.

This test is now occasionally failing with replicas having 0 documents. This occurs in a couple of ways:
1. After dropping the old primary the new primary is not publishing a checkpoint to replicas unless it indexes docs from translog after flipping to primary mode.
If there is nothing to index, it will not publish a checkpoint, but the other replica could have never sync'd with the original primary and be left out of date.
- This PR fixes this by force publishing a checkpoint after the new primary flips to primary mode.
2. The replica receives a checkpoint post failover and cancels its sync with the former primary that is still active, recognizing a primary term bump.
However this cancellation is async and immediately starting a new replication event could fail as its still replicating.
- This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind.

This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primary & replicas.
- Ensure the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed.  This ensures checks for primary term are accurate and not using a locally compued operationPrimaryTerm.
- Introduces a refresh listener for both primary & replica to update the ReplicationCheckpoint and store it in replicationTracker post refresh rather than redundantly computing when accessed.
- Removes unnecessary onCheckpointPublished method used to start replication timers manually.  This will happen automatically on primaries once its local cp is updated.

Signed-off-by: Marc Handalian <[email protected]>

* Handle NoSuchFileException when attempting to delete decref'd files.

To avoid divergent logic with remote store, we always incref/decref the segmentinfos.files(true) which includes the segments_n file.
Decref to 0 will attempt to delete the file from the store and its possible this _n file does not yet exist. This change will ignore if we get a noSuchFile while attempting to delete.

Signed-off-by: Marc Handalian <[email protected]>

* Add more unit tests.

Signed-off-by: Marc Handalian <[email protected]>

* Clean up IndexShardTests.testCheckpointReffreshListenerWithNull

Signed-off-by: Marc Handalian <[email protected]>

* Remove unnecessary catch for NoSuchFileException.

Signed-off-by: Marc Handalian <[email protected]>

* Add another test for non segrep.

Signed-off-by: Marc Handalian <[email protected]>

* PR Feedback.

Signed-off-by: Marc Handalian <[email protected]>

* re-compute replication checkpoint on primary promotion.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
(cherry picked from commit c3acf47)
mch2 added a commit that referenced this pull request Aug 3, 2023
…ckpoint validation (#8889) (#9095)

* Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation.

This test is now occasionally failing with replicas having 0 documents. This occurs in a couple of ways:
1. After dropping the old primary the new primary is not publishing a checkpoint to replicas unless it indexes docs from translog after flipping to primary mode.
If there is nothing to index, it will not publish a checkpoint, but the other replica could have never sync'd with the original primary and be left out of date.
- This PR fixes this by force publishing a checkpoint after the new primary flips to primary mode.
2. The replica receives a checkpoint post failover and cancels its sync with the former primary that is still active, recognizing a primary term bump.
However this cancellation is async and immediately starting a new replication event could fail as its still replicating.
- This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind.

This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primary & replicas.
- Ensure the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed.  This ensures checks for primary term are accurate and not using a locally compued operationPrimaryTerm.
- Introduces a refresh listener for both primary & replica to update the ReplicationCheckpoint and store it in replicationTracker post refresh rather than redundantly computing when accessed.
- Removes unnecessary onCheckpointPublished method used to start replication timers manually.  This will happen automatically on primaries once its local cp is updated.

Signed-off-by: Marc Handalian <[email protected]>

* Handle NoSuchFileException when attempting to delete decref'd files.

To avoid divergent logic with remote store, we always incref/decref the segmentinfos.files(true) which includes the segments_n file.
Decref to 0 will attempt to delete the file from the store and its possible this _n file does not yet exist. This change will ignore if we get a noSuchFile while attempting to delete.

Signed-off-by: Marc Handalian <[email protected]>

* Add more unit tests.

Signed-off-by: Marc Handalian <[email protected]>

* Clean up IndexShardTests.testCheckpointReffreshListenerWithNull

Signed-off-by: Marc Handalian <[email protected]>

* Remove unnecessary catch for NoSuchFileException.

Signed-off-by: Marc Handalian <[email protected]>

* Add another test for non segrep.

Signed-off-by: Marc Handalian <[email protected]>

* PR Feedback.

Signed-off-by: Marc Handalian <[email protected]>

* re-compute replication checkpoint on primary promotion.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
(cherry picked from commit c3acf47)
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
…ckpoint validation (opensearch-project#8889)

* Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation.

This test is now occasionally failing with replicas having 0 documents. This occurs in a couple of ways:
1. After dropping the old primary the new primary is not publishing a checkpoint to replicas unless it indexes docs from translog after flipping to primary mode.
If there is nothing to index, it will not publish a checkpoint, but the other replica could have never sync'd with the original primary and be left out of date.
- This PR fixes this by force publishing a checkpoint after the new primary flips to primary mode.
2. The replica receives a checkpoint post failover and cancels its sync with the former primary that is still active, recognizing a primary term bump.
However this cancellation is async and immediately starting a new replication event could fail as its still replicating.
- This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind.

This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primary & replicas.
- Ensure the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed.  This ensures checks for primary term are accurate and not using a locally compued operationPrimaryTerm.
- Introduces a refresh listener for both primary & replica to update the ReplicationCheckpoint and store it in replicationTracker post refresh rather than redundantly computing when accessed.
- Removes unnecessary onCheckpointPublished method used to start replication timers manually.  This will happen automatically on primaries once its local cp is updated.

Signed-off-by: Marc Handalian <[email protected]>

* Handle NoSuchFileException when attempting to delete decref'd files.

To avoid divergent logic with remote store, we always incref/decref the segmentinfos.files(true) which includes the segments_n file.
Decref to 0 will attempt to delete the file from the store and its possible this _n file does not yet exist. This change will ignore if we get a noSuchFile while attempting to delete.

Signed-off-by: Marc Handalian <[email protected]>

* Add more unit tests.

Signed-off-by: Marc Handalian <[email protected]>

* Clean up IndexShardTests.testCheckpointReffreshListenerWithNull

Signed-off-by: Marc Handalian <[email protected]>

* Remove unnecessary catch for NoSuchFileException.

Signed-off-by: Marc Handalian <[email protected]>

* Add another test for non segrep.

Signed-off-by: Marc Handalian <[email protected]>

* PR Feedback.

Signed-off-by: Marc Handalian <[email protected]>

* re-compute replication checkpoint on primary promotion.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Kaushal Kumar <[email protected]>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
…ckpoint validation (opensearch-project#8889)

* Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation.

This test is now occasionally failing with replicas having 0 documents. This occurs in a couple of ways:
1. After dropping the old primary the new primary is not publishing a checkpoint to replicas unless it indexes docs from translog after flipping to primary mode.
If there is nothing to index, it will not publish a checkpoint, but the other replica could have never sync'd with the original primary and be left out of date.
- This PR fixes this by force publishing a checkpoint after the new primary flips to primary mode.
2. The replica receives a checkpoint post failover and cancels its sync with the former primary that is still active, recognizing a primary term bump.
However this cancellation is async and immediately starting a new replication event could fail as its still replicating.
- This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind.

This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primary & replicas.
- Ensure the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed.  This ensures checks for primary term are accurate and not using a locally compued operationPrimaryTerm.
- Introduces a refresh listener for both primary & replica to update the ReplicationCheckpoint and store it in replicationTracker post refresh rather than redundantly computing when accessed.
- Removes unnecessary onCheckpointPublished method used to start replication timers manually.  This will happen automatically on primaries once its local cp is updated.

Signed-off-by: Marc Handalian <[email protected]>

* Handle NoSuchFileException when attempting to delete decref'd files.

To avoid divergent logic with remote store, we always incref/decref the segmentinfos.files(true) which includes the segments_n file.
Decref to 0 will attempt to delete the file from the store and its possible this _n file does not yet exist. This change will ignore if we get a noSuchFile while attempting to delete.

Signed-off-by: Marc Handalian <[email protected]>

* Add more unit tests.

Signed-off-by: Marc Handalian <[email protected]>

* Clean up IndexShardTests.testCheckpointReffreshListenerWithNull

Signed-off-by: Marc Handalian <[email protected]>

* Remove unnecessary catch for NoSuchFileException.

Signed-off-by: Marc Handalian <[email protected]>

* Add another test for non segrep.

Signed-off-by: Marc Handalian <[email protected]>

* PR Feedback.

Signed-off-by: Marc Handalian <[email protected]>

* re-compute replication checkpoint on primary promotion.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Ivan Brusic <[email protected]>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…ckpoint validation (opensearch-project#8889)

* Fix test testDropPrimaryDuringReplication and clean up ReplicationCheckpoint validation.

This test is now occasionally failing with replicas having 0 documents. This occurs in a couple of ways:
1. After dropping the old primary the new primary is not publishing a checkpoint to replicas unless it indexes docs from translog after flipping to primary mode.
If there is nothing to index, it will not publish a checkpoint, but the other replica could have never sync'd with the original primary and be left out of date.
- This PR fixes this by force publishing a checkpoint after the new primary flips to primary mode.
2. The replica receives a checkpoint post failover and cancels its sync with the former primary that is still active, recognizing a primary term bump.
However this cancellation is async and immediately starting a new replication event could fail as its still replicating.
- This PR fixes this by attempting to process the latest received checkpoint on failure, if the shard is not failed and still behind.

This PR also introduces a few changes to ensure the accuracy of the ReplicationCheckpoint tracked on primary & replicas.
- Ensure the checkpoint stored in SegmentReplicationTarget is the checkpoint passed from the primary and not locally computed.  This ensures checks for primary term are accurate and not using a locally compued operationPrimaryTerm.
- Introduces a refresh listener for both primary & replica to update the ReplicationCheckpoint and store it in replicationTracker post refresh rather than redundantly computing when accessed.
- Removes unnecessary onCheckpointPublished method used to start replication timers manually.  This will happen automatically on primaries once its local cp is updated.

Signed-off-by: Marc Handalian <[email protected]>

* Handle NoSuchFileException when attempting to delete decref'd files.

To avoid divergent logic with remote store, we always incref/decref the segmentinfos.files(true) which includes the segments_n file.
Decref to 0 will attempt to delete the file from the store and its possible this _n file does not yet exist. This change will ignore if we get a noSuchFile while attempting to delete.

Signed-off-by: Marc Handalian <[email protected]>

* Add more unit tests.

Signed-off-by: Marc Handalian <[email protected]>

* Clean up IndexShardTests.testCheckpointReffreshListenerWithNull

Signed-off-by: Marc Handalian <[email protected]>

* Remove unnecessary catch for NoSuchFileException.

Signed-off-by: Marc Handalian <[email protected]>

* Add another test for non segrep.

Signed-off-by: Marc Handalian <[email protected]>

* PR Feedback.

Signed-off-by: Marc Handalian <[email protected]>

* re-compute replication checkpoint on primary promotion.

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
4 participants