Ignore Lucene index in peer recovery if translog corrupted #49114

dnhatn · 2019-11-14T21:48:56Z

If the translog on a replica is corrupt, we should not perform an operation-based recovery or use sync_id, because we won't be able to open an engine in the next step. This change adds an extra validation that ensures the translog is intact when preparing a peer recovery request.
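To illustrate the idea only (this is not the code from this PR), here is a minimal sketch of the check, assuming a toy checkpoint format. The class `RecoveryRequestSketch`, the `Plan` type, the `prepare` method, and the text-based `translog.ckp` content are all hypothetical; the real checkpoint is a binary file read by Elasticsearch's own translog classes. The sentinel `-2` is assumed to mirror the "unassigned" sequence number.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: decide what a replica may offer when it asks a primary for recovery.
class RecoveryRequestSketch {
    // Assumed sentinel meaning "no usable local history; fall back to a file-based recovery".
    static final long UNASSIGNED_SEQ_NO = -2L;

    // What the replica would put into its recovery request in this toy model.
    static final class Plan {
        final long startingSeqNo;      // first operation to replay, or UNASSIGNED_SEQ_NO
        final boolean reuseLocalIndex; // whether local Lucene files (and sync_id) may be reused

        Plan(long startingSeqNo, boolean reuseLocalIndex) {
            this.startingSeqNo = startingSeqNo;
            this.reuseLocalIndex = reuseLocalIndex;
        }
    }

    // Toy stand-in for reading the translog checkpoint: the file holds a decimal number.
    // Throws IOException if the file is missing (NoSuchFileException) or unparseable.
    static long readGlobalCheckpoint(Path translogDir) throws IOException {
        Path checkpoint = translogDir.resolve("translog.ckp");
        String content = new String(Files.readAllBytes(checkpoint), StandardCharsets.UTF_8).trim();
        try {
            return Long.parseLong(content);
        } catch (NumberFormatException e) {
            throw new IOException("corrupt checkpoint: " + checkpoint, e);
        }
    }

    // Validate the translog up front instead of discovering the problem when opening the engine.
    static Plan prepare(Path translogDir) {
        try {
            long globalCheckpoint = readGlobalCheckpoint(translogDir);
            // Translog is readable: replay operations above the global checkpoint.
            return new Plan(globalCheckpoint + 1, true);
        } catch (IOException e) {
            // Translog is unusable: ignore the local index and request a full file copy.
            return new Plan(UNASSIGNED_SEQ_NO, false);
        }
    }
}
```

Calling `RecoveryRequestSketch.prepare(dir)` on a directory without `translog.ckp` yields a plan that neither replays operations nor reuses the local index, which is the behavior the description asks for when the translog is corrupt.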

elasticmachine · 2019-11-14T21:48:58Z

Pinging @elastic/es-distributed (:Distributed/Recovery)

dnhatn · 2019-11-14T22:08:32Z

An alternative is to mark the store as corrupted if we fail to recover locally up to the global checkpoint. But that seems harsh, and we would also need to handle the case where marking the store as corrupted itself fails.

dnhatn · 2019-11-14T22:35:43Z

run elasticsearch-ci/1

ywelsch

LGTM

henningandersen

LGTM2

dnhatn · 2019-11-18T16:29:13Z

Thanks Yannick and Henning.

hackerwin7 · 2020-09-21T10:45:03Z

Before this patch, peer recovery could fail with an exception like this:

Caused by: java.nio.file.NoSuchFileException: /data02/es_data/nodes/0/indices/2k_Oju9dRuqWx4tzbos53g/0/translog/translog.ckp
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:215) ~[?:?]
        at java.nio.file.Files.newByteChannel(Files.java:370) ~[?:?]
        at java.nio.file.Files.newByteChannel(Files.java:421) ~[?:?]
        at org.apache.lucene.store.SimpleFSDirectory.openInput(SimpleFSDirectory.java:77) ~[lucene-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:28]
        at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:188) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1847) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1865) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1860) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1483) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1455) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$prepareForTranslogOperations$0(RecoveryTarget.java:302) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:197) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:297) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:436) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:430) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1087) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.0.jar:6.8.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:834) ~[?:?]

Ignore Lucene index in peer recovery if translog corrupted

0acb4e1

dnhatn added the >bug, :Distributed/Recovery, v8.0.0, v7.5.0, v6.8.5, and v7.4.3 labels Nov 14, 2019

dnhatn requested review from ywelsch and henningandersen November 14, 2019 21:48

adjust assertion

634331d

use internal IOUtils

8320db6

ywelsch approved these changes Nov 15, 2019

henningandersen approved these changes Nov 15, 2019

dnhatn merged commit 5aa5d7b into elastic:master Nov 18, 2019

dnhatn deleted the ignore-index-if-tlog-corrupted branch November 18, 2019 16:29

dnhatn added the backport pending label Nov 18, 2019

jaymode added v6.8.6 and removed v6.8.5 labels Nov 19, 2019

dnhatn added the v7.6.0 label Nov 24, 2019

dnhatn removed the backport pending label Nov 25, 2019

russcam mentioned this pull request Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

russcam mentioned this pull request Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

mfussenegger mentioned this pull request Mar 26, 2020

ES Backports crate/crate#9796

Closed

37 tasks

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
