
[BUG] [remote store] FileAlreadyExistsException causes Replication failed #9556

Closed
tlfeng opened this issue Aug 25, 2023 · 2 comments · Fixed by #9630
Assignees
Labels
bug, distributed framework, Indexing:Replication, Indexing, v2.10.0

Comments

tlfeng (Collaborator) commented Aug 25, 2023

Describe the bug
Found a problem while running a performance test against OpenSearch built from code in the 2.x branch.
The index turns red during the indexing operation:

$ curl -XGET "opens-clust-xxx.elb.us-west-2.amazonaws.com/_cat/indices?v"
health status index       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
red    open   logs-1600gb qSnoT1VTSiuipllqbCJfTA  40   1 1596892515            3    244.2gb        140.5gb

Below are the error messages from the OpenSearch log on the data nodes.
Once the FileAlreadyExistsException occurs, the error keeps repeating indefinitely until the index turns red, causing permanent failure (a stand-alone sketch of the underlying copy conflict follows the log excerpt below).

[2023-08-25T04:38:33,032][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-5-197.us-west-2.compute.internal] [shardId 39] [replication id 344583] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=93, FILE_DIFF=3, REPLICATING=0}

org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:511) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-core-2.10.0.jar:2.10.0]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:171) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:494) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.doRun(SegmentReplicationTargetService.java:480) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

Caused by: java.lang.ArithmeticException: long overflow
	at java.lang.Math.addExact(Math.java:903) ~[?:?]
	at org.opensearch.repositories.s3.S3RetryingInputStream.openStream(S3RetryingInputStream.java:122) ~[?:?]
	at org.opensearch.repositories.s3.S3RetryingInputStream.reopenStreamOrFail(S3RetryingInputStream.java:236) ~[?:?]
	at org.opensearch.repositories.s3.S3RetryingInputStream.read(S3RetryingInputStream.java:192) ~[?:?]
	at org.opensearch.index.store.RemoteIndexInput.readBytes(RemoteIndexInput.java:59) ~[opensearch-2.10.0.jar:2.10.0]
	at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:289) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.store.Directory.copyFrom(Directory.java:182) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.opensearch.index.store.Store$StoreDirectory.copyFrom(Store.java:955) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.RemoteStoreReplicationSource.getSegmentFiles(RemoteStoreReplicationSource.java:119) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:168) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-core-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:164) ~[opensearch-2.10.0.jar:2.10.0]
	... 7 more
[2023-08-25T04:38:33,247][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-5-197.us-west-2.compute.internal] Exception replicating Id:[344584] Checkpoint [ReplicationCheckpoint{shardId=[logs-1600gb][39], primaryTerm=1, segmentsGen=3, version=4999, size=301229974, codec=Lucene95}] Shard:[[logs-1600gb][39]] Source:[RemoteStoreReplicationSource] marking as failed.

org.opensearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
	at org.opensearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:109) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:101) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:125) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:171) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:494) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.doRun(SegmentReplicationTargetService.java:480) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

Caused by: java.util.concurrent.ExecutionException: java.nio.file.FileAlreadyExistsException: /home/ec2-user/opensearch/data/nodes/0/indices/qSnoT1VTSiuipllqbCJfTA/39/index/_1xb.cfe
	at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:286) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:260) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:82) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:94) ~[opensearch-2.10.0.jar:2.10.0]
	... 14 more

Caused by: java.nio.file.FileAlreadyExistsException: /home/ec2-user/opensearch/data/nodes/0/indices/qSnoT1VTSiuipllqbCJfTA/39/index/_1xb.cfe
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:94) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:218) ~[?:?]
	at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:484) ~[?:?]
	at java.nio.file.Files.newOutputStream(Files.java:228) ~[?:?]
	at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:394) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:387) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:220) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:75) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.opensearch.index.store.ByteSizeCachingDirectory.createOutput(ByteSizeCachingDirectory.java:153) ~[opensearch-2.10.0.jar:2.10.0]
	at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:75) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.store.Directory.copyFrom(Directory.java:181) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.opensearch.index.store.Store$StoreDirectory.copyFrom(Store.java:955) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.RemoteStoreReplicationSource.getSegmentFiles(RemoteStoreReplicationSource.java:119) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:168) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-core-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:164) ~[opensearch-2.10.0.jar:2.10.0]
	... 7 more
	Suppressed: org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 495; received: 0)
		at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178) ~[?:?]
		at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198) ~[?:?]
		at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101) ~[?:?]
		at org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(ResponseEntityProxy.java:142) ~[?:?]
		at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228) ~[?:?]
		at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172) ~[?:?]
		at java.io.FilterInputStream.close(FilterInputStream.java:179) ~[?:?]
		at software.amazon.awssdk.services.s3.checksums.ChecksumValidatingInputStream.close(ChecksumValidatingInputStream.java:163) ~[?:?]
		at java.io.FilterInputStream.close(FilterInputStream.java:179) ~[?:?]
		at software.amazon.awssdk.core.io.SdkFilterInputStream.close(SdkFilterInputStream.java:83) ~[?:?]
		at org.opensearch.repositories.s3.S3RetryingInputStream.close(S3RetryingInputStream.java:243) ~[?:?]
		at org.opensearch.index.store.RemoteIndexInput.close(RemoteIndexInput.java:66) ~[opensearch-2.10.0.jar:2.10.0]
		at org.apache.lucene.store.Directory.copyFrom(Directory.java:180) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
		at org.opensearch.index.store.Store$StoreDirectory.copyFrom(Store.java:955) ~[opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.indices.replication.RemoteStoreReplicationSource.getSegmentFiles(RemoteStoreReplicationSource.java:119) ~[opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:168) ~[opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-core-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:164) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:494) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.doRun(SegmentReplicationTargetService.java:480) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
		at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-08-25T04:38:33,249][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-5-197.us-west-2.compute.internal] [shardId 39] [replication id 344584] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=98, FILE_DIFF=2, REPLICATING=0}

org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:511) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-core-2.10.0.jar:2.10.0]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:171) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:494) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.doRun(SegmentReplicationTargetService.java:480) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

Caused by: org.opensearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
	at org.opensearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:109) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:101) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:125) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.10.0.jar:2.10.0]
	... 12 more

Caused by: java.util.concurrent.ExecutionException: java.nio.file.FileAlreadyExistsException: /home/ec2-user/opensearch/data/nodes/0/indices/qSnoT1VTSiuipllqbCJfTA/39/index/_1xb.cfe
	at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:286) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:260) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:82) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:94) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:125) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.10.0.jar:2.10.0]
	... 12 more

Caused by: java.nio.file.FileAlreadyExistsException: /home/ec2-user/opensearch/data/nodes/0/indices/qSnoT1VTSiuipllqbCJfTA/39/index/_1xb.cfe
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:94) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:218) ~[?:?]
	at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:484) ~[?:?]
	at java.nio.file.Files.newOutputStream(Files.java:228) ~[?:?]
	at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:394) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:387) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:220) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:75) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.opensearch.index.store.ByteSizeCachingDirectory.createOutput(ByteSizeCachingDirectory.java:153) ~[opensearch-2.10.0.jar:2.10.0]
	at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:75) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.store.Directory.copyFrom(Directory.java:181) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.opensearch.index.store.Store$StoreDirectory.copyFrom(Store.java:955) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.RemoteStoreReplicationSource.getSegmentFiles(RemoteStoreReplicationSource.java:119) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:168) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-core-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) ~[opensearch-2.10.0.jar:2.10.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:164) ~[opensearch-2.10.0.jar:2.10.0]
	... 7 more
	Suppressed: org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 495; received: 0)
		at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178) ~[?:?]
		at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198) ~[?:?]
		at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101) ~[?:?]
		at org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(ResponseEntityProxy.java:142) ~[?:?]
		at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228) ~[?:?]
		at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172) ~[?:?]
		at java.io.FilterInputStream.close(FilterInputStream.java:179) ~[?:?]
		at software.amazon.awssdk.services.s3.checksums.ChecksumValidatingInputStream.close(ChecksumValidatingInputStream.java:163) ~[?:?]
		at java.io.FilterInputStream.close(FilterInputStream.java:179) ~[?:?]
		at software.amazon.awssdk.core.io.SdkFilterInputStream.close(SdkFilterInputStream.java:83) ~[?:?]
		at org.opensearch.repositories.s3.S3RetryingInputStream.close(S3RetryingInputStream.java:243) ~[?:?]
		at org.opensearch.index.store.RemoteIndexInput.close(RemoteIndexInput.java:66) ~[opensearch-2.10.0.jar:2.10.0]
		at org.apache.lucene.store.Directory.copyFrom(Directory.java:180) ~[lucene-core-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
		at org.opensearch.index.store.Store$StoreDirectory.copyFrom(Store.java:955) ~[opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.indices.replication.RemoteStoreReplicationSource.getSegmentFiles(RemoteStoreReplicationSource.java:119) ~[opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:168) ~[opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-core-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) ~[opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:164) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:494) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.doRun(SegmentReplicationTargetService.java:480) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.10.0.jar:2.10.0]
		at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
		at java.lang.Thread.run(Thread.java:833) [?:?]
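
For context on the copy failure itself: Lucene's Directory.copyFrom() creates the destination file with createOutput(), which refuses to overwrite an existing file. So if an earlier attempt to fetch _1xb.cfe died partway (here the S3 read failed with the long overflow shown in the first trace) and the file was left on disk, every later retry of getSegmentFiles() hits the same FileAlreadyExistsException. The stand-alone sketch below reproduces that conflict with plain Lucene; it is only an illustration (temporary directories and file names are made up), not OpenSearch code.

import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;

// Stand-alone illustration (hypothetical paths and file names, not OpenSearch code):
// Directory.copyFrom() calls createOutput() on the destination, which fails with
// FileAlreadyExistsException when an interrupted earlier copy left the file behind.
public class CopyFromConflictDemo {
    public static void main(String[] args) throws Exception {
        Path remotePath = Files.createTempDirectory("remote-segments");  // stands in for the remote store
        Path localPath = Files.createTempDirectory("local-shard-index"); // stands in for the replica's store

        try (Directory remote = FSDirectory.open(remotePath);
             Directory local = FSDirectory.open(localPath)) {

            // Put a segment file into the fake remote store.
            try (IndexOutput out = remote.createOutput("_1xb.cfe", IOContext.DEFAULT)) {
                out.writeString("segment bytes");
            }

            // Simulate a first copy attempt that died mid-download (e.g. the S3 stream
            // failing), leaving a partial _1xb.cfe behind in the local directory.
            try (IndexOutput partial = local.createOutput("_1xb.cfe", IOContext.DEFAULT)) {
                partial.writeString("partial");
            }

            // A retry of the same copy now fails exactly like the traces above.
            try {
                local.copyFrom(remote, "_1xb.cfe", "_1xb.cfe", IOContext.DEFAULT);
            } catch (FileAlreadyExistsException e) {
                System.out.println("retry failed: " + e);
                // A defensive retry would first remove the stale partial file.
                local.deleteFile("_1xb.cfe");
                local.copyFrom(remote, "_1xb.cfe", "_1xb.cfe", IOContext.DEFAULT);
            }
        }
    }
}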

To Reproduce
See the cluster deployment and benchmark commands under "Host/Environment" below.

Expected behavior
Segment replication from the remote store recovers from transient download failures, and the index does not turn permanently red.


Host/Environment
OpenSearch version: 2.10.0 (build date: 2023-08-17T01:55:16.349778010Z)
Test scenario: 10 nodes, 40 shards, 1 replica, ~2 GB per shard
The command for deploying the cluster:

cdk deploy "*" --require-approval never \
-c securityDisabled=true -c minDistribution=true -c region=us-west-2 \
-c distributionUrl='https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.10.0/latest/linux/x64/tar/builds/opensearch/dist/opensearch-min-2.10.0-linux-x64.tar.gz' \
-c cpuArch='x64' -c singleNodeCluster=false \
-c dataNodeCount=10 -c dataNodeStorage=210 -c distVersion=2.10.0 -c serverAccessType=prefixList -c restrictServerAccessTo=pl-f8a64391 \
-c additionalConfig='{ "opensearch.experimental.feature.segment_replication_experimental.enabled": true, "cluster.indices.replication.strategy": "SEGMENT", "opensearch.experimental.feature.remote_store.enabled": true, "cluster.remote_store.enabled": true, "s3.client.default.endpoint": "s3.us-west-2.amazonaws.com" }' \
-c vpcId='vpc-0648c0d077c3ea997' -c securityGroupId='sg-0d1ace406e4977a79' -c suffix='10nodes-1600gb-0132' \
-c use50PercentHeap=true -c enableRemoteStore=true -c dataInstanceType=m5.xlarge \
-c storageVolumeType=gp3

The command for generating the 1600gb workload data:
expand-data-corpus.py --corpus-size 1600 --output-file-suffix 1600gb
The command for triggering the benchmark test:

opensearch-benchmark execute-test --workload=http_logs \
--pipeline=benchmark-only --target-hosts=opens-clust-xxx.elb.us-west-2.amazonaws.com:80 \
--workload-params='{"index_settings":{"number_of_shards": 40, "number_of_replicas": 1 }, "generated_corpus": "t"}' \
--include-tasks=delete-index,create-index,check-cluster-health,index-append \
--telemetry=node-stats,segment-replication-stats \
--user-tag="replication_type:segment,remote_store:enabled,node_count:10,shard_count:40,replica_count:1,shard_size_in_gb:2"

Additional context
A similar symptom was previously reported in issue #8905.

tlfeng added the bug and untriaged labels on Aug 25, 2023
dreamer-89 added the Indexing:Replication, distributed framework, and Indexing labels and removed the untriaged label on Aug 25, 2023
dreamer-89 self-assigned this on Aug 29, 2023
dreamer-89 (Member) commented:

Looking into it

tlfeng (Collaborator, Author) commented Aug 29, 2023

More detail: @mch2 modified the code to catch the exception (tlfeng@0849ff0), and I re-ran the tests after deploying the modified OpenSearch to the cluster.

Eventually the cluster hit a "cluster-manager not discovered" exception instead of a red index.
The following warnings and errors were found:
[2023-08-26T06:23:06,697][WARN ][o.o.g.G.InternalReplicaShardAllocator] [ip-10-0-3-211.us-west-2.compute.internal] [logs-5000gb][1]: failed to list shard for shard_store on node [rL7b76fCR_CE9lOl-dvuPA]

@timestamp,@message
2023-08-26 06:12:15.980,"[2023-08-26T06:12:15,758][ERROR][o.o.c.a.s.ShardStateAction] [ip-10-0-3-211.us-west-2.compute.internal] [logs-5000gb][7] unexpected failure while failing shard [shard id [[logs-5000gb][7]], allocation id [s3i6cU7WTg229ikP9eAkkQ], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [logs-5000gb][7], node[cHjaxWfZSLq1CeMZfCcH0A], [R], s[STARTED], a[id=s3i6cU7WTg229ikP9eAkkQ]], failure [RemoteTransportException[[ip-10-0-4-246.us-west-2.compute.internal][10.0.4.246:9300][indices:data/write/bulk[s][validate_primary_term]]]; nested: IllegalStateException[[logs-5000gb][7] operation primary term [1] is too old (current [2])]; ], markAsStale [true]]"
2023-08-26 01:16:31.989,"[2023-08-26T01:16:31,762][ERROR][o.o.c.a.s.ShardStateAction] [seed] [logs-5000gb][5] unexpected failure while failing shard [shard id [[logs-5000gb][5]], allocation id [Xxjs8XCjRm-PlE11R7alsw], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [logs-5000gb][5], node[3B8PNposTMCZuxENJUrOdg], [R], s[STARTED], a[id=Xxjs8XCjRm-PlE11R7alsw]], failure [RemoteTransportException[[ip-10-0-3-153.us-west-2.compute.internal][10.0.3.153:9300][indices:data/write/bulk[s][validate_primary_term]]]; nested: IllegalStateException[[logs-5000gb][5] operation primary term [1] is too old (current [2])]; ], markAsStale [true]]"
2023-08-25 23:59:51.258,"[2023-08-25T23:59:51,033][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-3-153.us-west-2.compute.internal] [shardId [logs-5000gb][8]] [replication id 47167] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=65, FILE_DIFF=4, REPLICATING=0}"
2023-08-25 23:52:11.443,"[2023-08-25T23:52:11,397][ERROR][o.o.i.s.IndexShard       ] [ip-10-0-5-34.us-west-2.compute.internal] [logs-5000gb][1] skip local recovery as failed to find the safe commit"
2023-08-25 23:52:06.434,"[2023-08-25T23:52:06,245][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-5-34.us-west-2.compute.internal] [shardId [logs-5000gb][1]] [replication id 42379] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=88, FILE_DIFF=4, REPLICATING=0, GET_FILES=95}"
2023-08-25 23:52:06.434,"[2023-08-25T23:52:06,240][ERROR][o.o.i.r.RemoteStoreReplicationSource] [ip-10-0-5-34.us-west-2.compute.internal] File already exists?"
2023-08-25 23:52:06.183,"[2023-08-25T23:52:06,051][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-5-34.us-west-2.compute.internal] [shardId [logs-5000gb][1]] [replication id 42376] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=70, FILE_DIFF=5, REPLICATING=0}"
2023-08-25 23:43:02.184,"[2023-08-25T23:43:02,174][ERROR][o.o.i.s.IndexShard       ] [ip-10-0-5-34.us-west-2.compute.internal] [logs-5000gb][8] skip local recovery as failed to find the safe commit"
2023-08-25 23:42:54.819,"[2023-08-25T23:42:54,804][ERROR][o.o.i.s.IndexShard       ] [ip-10-0-4-14.us-west-2.compute.internal] [logs-5000gb][5] skip local recovery as failed to find the safe commit"
2023-08-25 23:41:53.717,"[2023-08-25T23:41:53,550][ERROR][o.o.i.s.IndexShard       ] [ip-10-0-4-14.us-west-2.compute.internal] [logs-5000gb][6] skip local recovery as failed to find the safe commit"
2023-08-25 23:41:23.416,"[2023-08-25T23:41:23,250][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-4-14.us-west-2.compute.internal] [shardId [logs-5000gb][5]] [replication id 41100] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=84, FILE_DIFF=3, REPLICATING=0, GET_FILES=112}"
2023-08-25 23:41:23.416,"[2023-08-25T23:41:23,245][ERROR][o.o.i.r.RemoteStoreReplicationSource] [ip-10-0-4-14.us-west-2.compute.internal] File already exists?"
2023-08-25 23:41:23.199,"[2023-08-25T23:41:23,046][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-4-14.us-west-2.compute.internal] [shardId [logs-5000gb][5]] [replication id 41097] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=109, FILE_DIFF=4, REPLICATING=0}"
2023-08-25 23:40:16.154,"[2023-08-25T23:40:15,908][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-5-34.us-west-2.compute.internal] [shardId [logs-5000gb][8]] [replication id 39920] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=81, FILE_DIFF=5, REPLICATING=0, GET_FILES=104}"
2023-08-25 23:40:16.154,"[2023-08-25T23:40:15,904][ERROR][o.o.i.r.RemoteStoreReplicationSource] [ip-10-0-5-34.us-west-2.compute.internal] File already exists?"
2023-08-25 23:40:15.903,"[2023-08-25T23:40:15,713][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-5-34.us-west-2.compute.internal] [shardId [logs-5000gb][8]] [replication id 39914] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=90, FILE_DIFF=3, REPLICATING=0}"
2023-08-25 23:40:14.651,"[2023-08-25T23:40:14,458][ERROR][o.o.i.s.IndexShard       ] [ip-10-0-5-34.us-west-2.compute.internal] [logs-5000gb][9] skip local recovery as failed to find the safe commit"
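
Judging only from the new "File already exists?" error lines in the table above, the instrumented build appears to catch the exception around the per-file copy rather than failing the whole replication event. A rough sketch of that kind of guard, purely hypothetical and not the actual change in tlfeng@0849ff0, could look like this:

import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;

// Hypothetical helper, not the actual patch: copy one segment file from the
// remote-store directory into the local store, tolerating a stale file left
// behind by a previously failed replication round.
final class SegmentCopyHelper {
    private SegmentCopyHelper() {}

    static void copyTolerantly(Directory remote, Directory local, String file) throws IOException {
        try {
            local.copyFrom(remote, file, file, IOContext.DEFAULT);
        } catch (FileAlreadyExistsException e) {
            // Corresponds to the "File already exists?" errors in the table above.
            // Drop the stale partial file and retry the copy once.
            local.deleteFile(file);
            local.copyFrom(remote, file, file, IOContext.DEFAULT);
        }
    }
}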
