Fix issue of red index on close for remote enabled clusters #15990

ashking94 · 2024-09-19T06:58:29Z

Description

The close index operation involves the following steps:

1. Start closing indices by adding a write block.
2. Wait for the operations on the shards to complete.
   1. Acquire all indexing operation permits to ensure that all operations have completed indexing.
3. After acquiring all indexing permits, closing an index involves 2 phases:
   a. Sync translog
   b. Flush Index Shard
4. Move index states from OPEN to CLOSE in cluster state for indices that are ready for closing.
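
The permit step in the list above can be pictured with a plain `java.util.concurrent.Semaphore`. This is a minimal sketch, not the OpenSearch implementation: the class `PermitSketch`, its method names, and the permit count are hypothetical and chosen only for illustration. Each indexing operation holds one permit, and the close path can take all permits only once every in-flight operation has released its own.

```java
import java.util.concurrent.Semaphore;

// Hypothetical toy model of "acquire all indexing operation permits".
final class PermitSketch {
    // Arbitrary permit count for the sketch.
    static final int TOTAL_PERMITS = 1024;

    private final Semaphore permits = new Semaphore(TOTAL_PERMITS);

    // An indexing operation takes a single permit while it runs.
    boolean startOperation() {
        return permits.tryAcquire();
    }

    // The operation returns its permit when it completes.
    void finishOperation() {
        permits.release();
    }

    // The close path succeeds only when no operation holds a permit.
    boolean tryBlockAllOperations() {
        return permits.tryAcquire(TOTAL_PERMITS);
    }

    public static void main(String[] args) {
        PermitSketch shard = new PermitSketch();
        shard.startOperation();
        System.out.println("block while indexing: " + shard.tryBlockAllOperations());
        shard.finishOperation();
        System.out.println("block after indexing: " + shard.tryBlockAllOperations());
    }
}
```

Once all permits are held, no new operation can start, which is what lets the sync and flush phases run against a quiescent shard.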

During a happy index close, we upload translog twice -

1st time, as part of the 3.a. Sync Translog step, the indexing operations are uploaded
2nd time, as part of the 3.b. Flush Index Shard step, the latest GCP is uploaded.

However, if a flush happens after the operation has landed in the Lucene buffer but before the buffered sync (for sync translog) or the periodic async sync (for async translog), then steps 3(a) and 3(b) become no-ops, and the GCP uploaded in the checkpoint file is the one from the last translog sync. This causes a discrepancy between maxSeqNo and GCP, which raises an exception while creating the ReadOnlyEngine and leads to a red index.
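
The failure mode can be illustrated with a toy check. This is a sketch under assumptions: `SeqNoCheckSketch` and `ensureMaxSeqNoMatchesGlobalCheckpoint` are hypothetical names, and the equality check stands in for whatever validation the read-only engine performs when it is created. The values model two indexed operations (sequence numbers 0 and 1) and a checkpoint file that still carries the GCP from the earlier sync.

```java
// Hypothetical toy model of the maxSeqNo vs. GCP mismatch; not OpenSearch code.
final class SeqNoCheckSketch {

    // Stand-in for the validation done when a closed index is opened read-only.
    static void ensureMaxSeqNoMatchesGlobalCheckpoint(long maxSeqNo, long globalCheckpoint) {
        if (maxSeqNo != globalCheckpoint) {
            throw new IllegalStateException(
                "max seq no [" + maxSeqNo + "] does not match global checkpoint [" + globalCheckpoint + "]"
            );
        }
    }

    public static void main(String[] args) {
        long maxSeqNo = 1L;            // two operations were indexed: seq no 0 and 1
        long staleGlobalCheckpoint = 0L; // checkpoint file written before the GCP advanced

        try {
            ensureMaxSeqNoMatchesGlobalCheckpoint(maxSeqNo, staleGlobalCheckpoint);
        } catch (IllegalStateException e) {
            // In the scenario described above, this is where the shard would fail.
            System.out.println("open failed: " + e.getMessage());
        }

        // With an up-to-date checkpoint file the check passes.
        ensureMaxSeqNoMatchesGlobalCheckpoint(maxSeqNo, 1L);
    }
}
```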

In this PR, changes are made to track the global checkpoint that has been updated as part of the successful translog upload to remote store. The new tracked global checkpoint is now also used in the RemoteFsTranslog.syncNeeded() method and checked against the current (translog writer) last synced global checkpoint.
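
A rough way to picture the change is a small state holder that remembers which global checkpoint the last successful upload carried. The sketch below is an assumption-laden toy, not the code in this PR: `CheckpointTrackerSketch` and all of its members are hypothetical names, and the real `RemoteFsTranslog.syncNeeded()` has more inputs than shown here. It contrasts a check that looks only at pending operations with one that also compares the writer's last synced global checkpoint against the last uploaded one.

```java
// Hypothetical toy model of tracking the uploaded global checkpoint; not OpenSearch code.
final class CheckpointTrackerSketch {
    private boolean pendingOperations = false;
    private long writerGlobalCheckpoint = -1L;   // last synced GCP known to the translog writer
    private long uploadedGlobalCheckpoint = -1L; // GCP carried by the last successful upload

    void indexOperation() {
        pendingOperations = true;
    }

    void advanceWriterGlobalCheckpoint(long globalCheckpoint) {
        writerGlobalCheckpoint = globalCheckpoint;
    }

    // Check that only looks at pending operations.
    boolean syncNeededOperationsOnly() {
        return pendingOperations;
    }

    // Check that also compares the two global checkpoints.
    boolean syncNeeded() {
        return pendingOperations || writerGlobalCheckpoint != uploadedGlobalCheckpoint;
    }

    // Called after an upload to the remote store succeeds.
    void onUploadSuccess() {
        pendingOperations = false;
        uploadedGlobalCheckpoint = writerGlobalCheckpoint;
    }

    long uploadedGlobalCheckpoint() {
        return uploadedGlobalCheckpoint;
    }

    public static void main(String[] args) {
        CheckpointTrackerSketch translog = new CheckpointTrackerSketch();
        translog.indexOperation();
        translog.onUploadSuccess();                  // a flush uploads the operation with the old GCP
        translog.advanceWriterGlobalCheckpoint(0L);  // the GCP advances only afterwards

        System.out.println("operations-only check: " + translog.syncNeededOperationsOnly());
        System.out.println("checkpoint-aware check: " + translog.syncNeeded());
    }
}
```

In this toy, the operations-only check reports nothing to sync even though the remote checkpoint is stale, while the checkpoint-aware check asks for one more upload that brings the remote GCP up to date.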

Related Issues

Resolves #15989

Check List

Functionality includes testing.
~~[ ] API changes companion pull request created, if applicable.~~
~~[ ] Public documentation issue/PR created, if applicable.~~

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2024-09-19T07:55:17Z

❌ Gradle check result for a1d5a87: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

ashking94 · 2024-09-19T09:00:55Z

> ❌ Gradle check result for a1d5a87: FAILURE
>
> Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

I have added tests for the edge case mentioned in the referenced issue. I added the tests first and then the main code changes -

[org.opensearch.remotestore.RemoteStoreIT.testCloseIndexWithNoOpSyncAndFlushForSyncTranslog](https://build.ci.opensearch.org/job/gradle-check/48058/testReport/junit/org.opensearch.remotestore/RemoteStoreIT/testCloseIndexWithNoOpSyncAndFlushForSyncTranslog/)
[org.opensearch.remotestore.RemoteStoreIT.testCloseIndexWithNoOpSyncAndFlushForSyncTranslog](https://build.ci.opensearch.org/job/gradle-check/48058/testReport/junit/org.opensearch.remotestore/RemoteStoreIT/testCloseIndexWithNoOpSyncAndFlushForSyncTranslog_2/)
[org.opensearch.remotestore.RemoteStoreIT.testCloseIndexWithNoOpSyncAndFlushForSyncTranslog](https://build.ci.opensearch.org/job/gradle-check/48058/testReport/junit/org.opensearch.remotestore/RemoteStoreIT/testCloseIndexWithNoOpSyncAndFlushForSyncTranslog_3/)
[org.opensearch.remotestore.RemoteStoreIT.testCloseIndexWithNoOpSyncAndFlushForSyncTranslog](https://build.ci.opensearch.org/job/gradle-check/48058/testReport/junit/org.opensearch.remotestore/RemoteStoreIT/testCloseIndexWithNoOpSyncAndFlushForSyncTranslog_4/)

github-actions · 2024-09-19T09:11:23Z

❌ Gradle check result for ac864f0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-19T10:41:13Z

❌ Gradle check result for 29cf87f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

ashking94 · 2024-09-19T11:30:23Z

> ❌ Gradle check result for 29cf87f: FAILURE
>
> Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Fixed this.

github-actions · 2024-09-19T11:34:03Z

❕ Gradle check result for 4420487: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov · 2024-09-19T11:34:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.90%. Comparing base (036f6bc) to head (27b6828).
Report is 2 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #15990      +/-   ##
============================================
- Coverage     71.92%   71.90%   -0.03%     
+ Complexity    64400    64384      -16     
============================================
  Files          5281     5281              
  Lines        300995   301000       +5     
  Branches      43479    43481       +2     
============================================
- Hits         216491   216428      -63     
- Misses        66793    66816      +23     
- Partials      17711    17756      +45


Bukhtawar

Wondering is this the problem with just remote translog or local translog as well?

ashking94 · 2024-09-23T15:48:13Z

> Wondering is this the problem with just remote translog or local translog as well?

The problem seems to exist only for the remote translog; the local version seems fine. When we close the index (in the case of remote translog), the translog is wiped out locally first and then rehydrated from the remote store. At that point, the most recent checkpoint file downloaded has the global checkpoint from the second-to-last translog sync.

Signed-off-by: Ashish Singh <[email protected]>

github-actions · 2024-09-23T17:15:59Z

✅ Gradle check result for 27b6828: SUCCESS

gbbafna

Changes look great.

gbbafna · 2024-09-24T11:21:29Z

server/src/internalClusterTest/java/org/opensearch/remotestore/RemoteStoreIT.java

+        latch.await();
+        // Sleep for some time for the next doc to be present in lucene buffer. If flush happens first before the doc #2
+        // gets indexed, then it goes into the happy case where the close index happens succefully.
+        Thread.sleep(1000);


Curious: why do we need the sleep here? We have already indexed the docs, so they should be in the Lucene buffer.

gbbafna · 2024-09-24T11:25:50Z

server/src/internalClusterTest/java/org/opensearch/remotestore/RemoteStoreIT.java

+        ensureGreen(INDEX_NAME);
+    }
+
+    public void testCloseIndexWithNoOpSyncAndFlushForAsyncTranslog() throws InterruptedException {


Can we write UTs for the same as well?

ashking94 added backport 2.x Backport to 2.x branch skip-changelog labels Sep 19, 2024

ashking94 force-pushed the fix-issue-15989 branch from ac864f0 to 29cf87f Compare September 19, 2024 09:58

ashking94 changed the title ~~[Remote Store] Emit correct global checkpoint during translog upload~~ Fix issue of red index on close for remote enabled clusters Sep 19, 2024

ashking94 marked this pull request as ready for review September 19, 2024 10:06

github-actions bot added bug Something isn't working Storage:Remote labels Sep 19, 2024

This was referenced Sep 19, 2024

[AUTOCUT] Gradle Check Flaky Test Report for ResourceAwareTasksTests #14293

Open

[AUTOCUT] Gradle Check Flaky Test Report for RemoteFsTimestampAwareTranslogTests #15818

Open

[AUTOCUT] Gradle Check Flaky Test Report for InternalEngineTests #15838

Open

Bukhtawar reviewed Sep 22, 2024

ashking94 force-pushed the fix-issue-15989 branch from 4420487 to 89789f4 Compare September 23, 2024 16:06

Fix red index on close for remote translog

27b6828

Signed-off-by: Ashish Singh <[email protected]>

ashking94 force-pushed the fix-issue-15989 branch from 89789f4 to 27b6828 Compare September 23, 2024 16:21

gbbafna reviewed Sep 24, 2024
