I have seen multiple clusters now that have turned red during an index close. Closing an index involves closing the IndexShard and reopening it with a ReadOnlyEngine. ReadOnlyEngine has an extra validation that checks that the maxSeqNo (which is obtained from the segments) is equal to the global checkpoint, a.k.a. GCP (which is obtained from translog.ckp). In the cases where the closed index turns red, I am seeing exceptions showing that the maxSeqNo is greater than the GCP.
This is the check that leads to the engine failure:
private static SeqNoStats buildSeqNoStats(EngineConfig config, SegmentInfos infos) {
    final SequenceNumbers.CommitInfo seqNoStats = SequenceNumbers.loadSeqNoInfoFromLuceneCommit(infos.userData.entrySet());
    long maxSeqNo = seqNoStats.maxSeqNo;
    long localCheckpoint = seqNoStats.localCheckpoint;
    return new SeqNoStats(maxSeqNo, localCheckpoint, config.getGlobalCheckpointSupplier().getAsLong());
}

protected void ensureMaxSeqNoEqualsToGlobalCheckpoint(final SeqNoStats seqNoStats) {
    if (requireCompleteHistory == false) {
        return;
    }
    // Before 8.0 the global checkpoint is not known and up to date when the engine is created after
    // peer recovery, so we only check the max seq no / global checkpoint coherency when the global
    // checkpoint is different from the unassigned sequence number value.
    // In addition to that we only execute the check if the index the engine belongs to has been
    // created after the refactoring of the Close Index API and its TransportVerifyShardBeforeCloseAction
    // that guarantee that all operations have been flushed to Lucene.
    final Version indexVersionCreated = engineConfig.getIndexSettings().getIndexVersionCreated();
    if (indexVersionCreated.onOrAfter(LegacyESVersion.V_7_2_0)
        || (seqNoStats.getGlobalCheckpoint() != SequenceNumbers.UNASSIGNED_SEQ_NO)) {
        assert assertMaxSeqNoEqualsToGlobalCheckpoint(seqNoStats.getMaxSeqNo(), seqNoStats.getGlobalCheckpoint());
        if (seqNoStats.getMaxSeqNo() != seqNoStats.getGlobalCheckpoint()) {
            throw new IllegalStateException(
                "Maximum sequence number ["
                    + seqNoStats.getMaxSeqNo()
                    + "] from last commit does not match global checkpoint ["
                    + seqNoStats.getGlobalCheckpoint()
                    + "]"
            );
        }
    }
}
Root cause
The close index operation involves the following steps:
1. Start closing the indices by adding a write block.
2. Wait for the operations on the shards to be completed.
3. Acquire all indexing operation permits to ensure that all operations have completed indexing. After acquiring all indexing permits, closing an index involves 2 phases:
   a. Sync translog
   b. Flush Index Shard
4. Move the index state from OPEN to CLOSE in the cluster state for indices that are ready for closing.
During a happy index close, we upload the translog twice:
1st time - as part of step 3.a (Sync Translog), the indexing operations are uploaded.
2nd time - as part of step 3.b (Flush Index Shard), the latest GCP is uploaded.
However, if a flush happens after an operation lands in the Lucene buffer but before the buffered sync (for sync translog) or the periodic async sync (for async translog), then steps 3.a and 3.b become no-ops and the GCP uploaded in the checkpoint file remains the one from the last translog sync. This creates the discrepancy between the maxSeqNo and the GCP, which causes the exception while creating the ReadOnlyEngine and leaves the index red.
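One way to watch these values on a live shard, before issuing the close, is the shard-level index stats API, which exposes a seq_no section with max_seq_no, local_checkpoint and global_checkpoint. This is only an approximation of what the engine check sees (the check compares the last commit's maxSeqNo with the GCP read from the translog checkpoint file, while the stats report in-memory values), and the index name index1 is taken from the repro below:
# Print max_seq_no, local_checkpoint and global_checkpoint side by side for each shard.
# The grep assumes the default pretty-printed JSON layout of the stats response.
curl -s "localhost:9200/index1/_stats?level=shards&pretty" | grep -E '"(max_seq_no|local_checkpoint|global_checkpoint)"'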
Related component
Storage:Remote
To Reproduce
We can reproduce this issue quite easily for both sync and async translog.
Sync translog
Step 1 - Set the final setting indices.memory.shard_inactive_time to 10h
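Since this is a node-level setting (the step above calls it a final setting), a minimal sketch of applying it, assuming a local config directory, is via opensearch.yml followed by a node restart; the large 10h value presumably keeps the inactive-shard flush from firing on its own during the repro:
# Sketch, assuming a local config directory: the setting is node-scoped, so it is
# applied in opensearch.yml rather than through the cluster settings API.
echo "indices.memory.shard_inactive_time: 10h" >> config/opensearch.yml
# Restart the node for the setting to take effect.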
for i in {1..1}; do
  echo $i
  curl -X PUT "localhost:9200/index${i}?pretty" -H 'Content-Type: application/json' -d'
  {
    "settings": {
      "index": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      }
    }
  }
  '
done
Step 4 -
for j in {1..1}; do
  for i in {1..1}; do
    echo $i
    curl --location --request POST "localhost:9200/index${i}/_doc?pretty" \
      --header 'Content-Type: application/json' \
      --data-raw "{
        \"name1\":\"Ashish\"
      }"
  done
done
Step 5 (run this immediately after step 4) -
curl "localhost:9201/*/_flush"
Step 6 (run this immediately after step 5) -
curl -XPOST localhost:9200/index1/_close
You should see an exception stack trace like the one below:
opensearch-master1 | [2024-09-18T10:54:33,058][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] failing shard [failed shard, shard [index1][0], node[A8Yi-Gj_RUGAdYdtQviB6w], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=K0EWkYOySu6Q44PI2dkXEg], unassigned_info[[reason=ALLOCATION_FAILED], at[2024-09-18T10:54:32.935Z], failed_attempts[4], failed_nodes[[A8Yi-Gj_RUGAdYdtQviB6w]], delayed=false, details[failed shard on node [A8Yi-Gj_RUGAdYdtQviB6w]: failed recovery, failure RecoveryFailedException[[index1][0]: Recovery failed on {opensearch-node1}{A8Yi-Gj_RUGAdYdtQviB6w}{JsaEQ1-KTkWXZb9P_o8g-w}{172.18.0.4}{172.18.0.4:9300}{dir}{shard_indexing_pressure_enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[Maximum sequence number [7] from last commit does not match global checkpoint [6]]; ], allocation_status[no_valid_shard_copy]], message [failed recovery], failure [RecoveryFailedException[[index1][0]: Recovery failed on {opensearch-node1}{A8Yi-Gj_RUGAdYdtQviB6w}{JsaEQ1-KTkWXZb9P_o8g-w}{172.18.0.4}{172.18.0.4:9300}{dir}{shard_indexing_pressure_enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[Maximum sequence number [7] from last commit does not match global checkpoint [6]]; ], markAsStale [true]]
opensearch-master1 | org.opensearch.indices.recovery.RecoveryFailedException: [index1][0]: Recovery failed on {opensearch-node1}{A8Yi-Gj_RUGAdYdtQviB6w}{JsaEQ1-KTkWXZb9P_o8g-w}{172.18.0.4}{172.18.0.4:9300}{dir}{shard_indexing_pressure_enabled=true}
opensearch-master1 | at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$33(IndexShard.java:3893) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[opensearch-core-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$10(StoreRecovery.java:627) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[opensearch-core-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:347) ~[opensearch-core-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:124) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2902) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
opensearch-master1 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
opensearch-master1 | at java.lang.Thread.run(Thread.java:1583) [?:?]
opensearch-master1 | Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
opensearch-master1 | ... 11 more
opensearch-master1 | Caused by: java.lang.IllegalStateException: Maximum sequence number [7] from last commit does not match global checkpoint [6]
opensearch-master1 | at org.opensearch.index.engine.ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint(ReadOnlyEngine.java:201) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:143) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.index.engine.NoOpEngine.<init>(NoOpEngine.java:77) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2612) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2519) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2491) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:751) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:126) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1 | ... 8 more
opensearch-master1 | [2024-09-18T10:54:33,059][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-master1 | [2024-09-18T10:54:33,061][INFO ][o.o.g.G.RemotePersistedState] [opensearch-master1] codec version is 4
opensearch-master1 | [2024-09-18T10:54:33,072][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-master1 | [2024-09-18T10:54:33,073][INFO ][o.o.g.G.RemotePersistedState] [opensearch-master1] codec version is 4
opensearch-master1 | [2024-09-18T10:54:33,084][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-master1 | [2024-09-18T10:54:33,084][INFO ][o.o.g.G.RemotePersistedState] [opensearch-master1] codec version is 4
opensearch-master1 | [2024-09-18T10:57:20,901][INFO ][o.o.g.r.RemoteClusterStateCleanupManager] [opensearch-master1] Cleaning up stale remote state files for cluster [opensearch-cluster] with uuid [lP5vHY8eTCmZxR3EVb5wGQ]. Last clean was done before 20 updates
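At this point the closed index should be red, which can be confirmed with the cluster health and cat indices APIs:
# The failed shard leaves the closed index red.
curl "localhost:9200/_cluster/health?pretty"
curl "localhost:9200/_cat/indices/index1?v"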
Async translog
This can be reproduced similarly for async translog as well.
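A minimal sketch of the async variant, assuming the index is created with async translog durability up front (the 10s sync interval is an arbitrary choice); the same indexing, flush and close sequence from steps 4-6 above then applies:
# Create the index with async translog durability so that translog syncs happen
# periodically in the background instead of on every indexing request.
curl -X PUT "localhost:9200/index1?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "translog.durability": "async",
      "translog.sync_interval": "10s"
    }
  }
}
'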
Expected behavior
In the happy flow, the close index operation works normally. In the edge case I described above, it should also work normally and must not lead to a red cluster.
Additional Details
NA