[BUG] org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testShardAlreadyReplicating is flaky #8928

Closed
kotwanikunal opened this issue Jul 27, 2023 · 10 comments · Fixed by #8937, #10660 or #13248
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run Search:Remote Search >test-failure Test failure from CI, local build, etc.

Comments

@kotwanikunal
Member

Describe the bug

To Reproduce

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testShardAlreadyReplicating" -Dtests.seed=D8312943A799670E -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-AE -Dtests.timezone=America/Manaus -Druntime.java=20

Expected behavior

  • Test to pass
@kotwanikunal kotwanikunal added bug Something isn't working untriaged labels Jul 27, 2023
@kotwanikunal kotwanikunal added flaky-test Random test failure that succeeds on second run and removed untriaged labels Jul 27, 2023
@dreamer-89 dreamer-89 added the >test-failure Test failure from CI, local build, etc. label Jul 27, 2023
@dreamer-89 dreamer-89 self-assigned this Jul 27, 2023
@dreamer-89
Member

Tried with the given seed, but the test passes locally. Now running the test in repeat mode until it fails.

@dreamer-89
Member

No failure observed in 10k runs with the given seed.


Next step: run the whole test class on repeat.

@dreamer-89
Member

From the failure stack trace, the test fails on this assertion:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=486, name=opensearch[org.opensearch.indices.replication.SegmentReplicationTargetServiceTests][generic][T#2], state=RUNNABLE, group=TGRP-SegmentReplicationTargetServiceTests]
	at __randomizedtesting.SeedInfo.seed([D3ADB03B1A04B5E3:F074352090C37CEA]:0)
Caused by: java.lang.AssertionError: Should not succeed

The logs contain nothing that helps debug this failure:

[2023-07-25T15:44:46,986][INFO ][o.o.t.TransportService   ] [testShardAlreadyReplicating] Remote clusters initialized successfully.
[2023-07-25T15:44:46,994][INFO ][o.o.i.r.SegmentReplicationTargetServiceTests] [testShardAlreadyReplicating] before test
[2023-07-25T15:44:47,015][INFO ][o.o.i.r.SegmentReplicationTargetServiceTests] [testShardAlreadyReplicating] after test

@dreamer-89
Member

Not much success running the tests at the class level.

@dreamer-89
Member

dreamer-89 commented Jul 27, 2023

On a closer look, the test assumes that the first round of segment replication is still active when the second round starts. That assumption does not always hold: the first round can complete before control reaches the second startReplication call, in which case the second round succeeds and the listener hits "Should not succeed". An easy way to reproduce this locally is to add a wait after the first startReplication invocation. The test fails reliably with a 10ms wait.

    public void testShardAlreadyReplicating() {
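        // 1: first round; the test assumes it is still active when the second call runs
        sut.startReplication(replicaShard, mock(SegmentReplicationTargetService.SegmentReplicationListener.class));
        // 2: second round; expected to fail because the first round is in progress
        sut.startReplication(replicaShard, new SegmentReplicationTargetService.SegmentReplicationListener() {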
            @Override
            public void onReplicationDone(SegmentReplicationState state) {
                Assert.fail("Should not succeed");
            }

            @Override
            public void onReplicationFailure(SegmentReplicationState state, ReplicationFailedException e, boolean sendShardFailure) {
                assertEquals("Shard " + replicaShard.shardId() + " is already replicating", e.getMessage());
                assertFalse(sendShardFailure);
            }
        });
    }
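To illustrate the race outside of OpenSearch, here is a minimal sketch of one way to make the "first round is still active" assumption deterministic: hold the first round open with a latch until the second call has been made. Everything in it is hypothetical (the `AlreadyReplicatingSketch` class, the `ongoing` set, the `release` latch); it is a stand-in for the idea, not the real `SegmentReplicationTargetService` API and not necessarily the fix that was merged.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for the service under test; not the OpenSearch API. */
final class AlreadyReplicatingSketch {

    /** Shard ids that currently have a replication round in flight. */
    private final Set<String> ongoing = ConcurrentHashMap.newKeySet();
    private final ExecutorService generic = Executors.newSingleThreadExecutor();

    /**
     * Starts a round for shardId and returns false if one is already active.
     * The round stays active until the release latch is counted down.
     */
    boolean startReplication(String shardId, CountDownLatch release) {
        if (!ongoing.add(shardId)) {
            return false; // corresponds to "Shard ... is already replicating"
        }
        generic.execute(() -> {
            try {
                release.await(); // keep the round open until the test lets it finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                ongoing.remove(shardId);
            }
        });
        return true;
    }

    void close() {
        generic.shutdownNow();
    }

    /** Returns true when the second start is rejected while the first round is held open. */
    static boolean secondStartRejectedWhileFirstIsHeld() {
        AlreadyReplicatingSketch sut = new AlreadyReplicatingSketch();
        CountDownLatch release = new CountDownLatch(1);
        try {
            boolean first = sut.startReplication("[index][0]", release);
            // The first round cannot finish before this call, so there is no race.
            boolean second = sut.startReplication("[index][0]", release);
            return first && !second;
        } finally {
            release.countDown(); // let the first round complete
            sut.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(secondStartRejectedWhileFirstIsHeld());
    }
}
```

Without the latch, the background task could remove the shard id before the second call, which mirrors the flaky outcome described above.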

@dreamer-89
Member

This test is still failing; reopening.

@dreamer-89
Member

The test now fails because the expected call to processLatestReceivedCheckpoint is not observed.

From https://build.ci.opensearch.org/job/gradle-check/22784

Wanted but not invoked:
segmentReplicationTargetService.processLatestReceivedCheckpoint(
    org.opensearch.index.shard.IndexShard@35f0ec9a,
    <any>
);
-> at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testShardAlreadyReplicating(SegmentReplicationTargetServiceTests.java:295)

However, there were exactly 5 interactions with this mock:
segmentReplicationTargetService.startReplication(
    org.opensearch.indices.replication.SegmentReplicationTarget$MockitoMock$CY6zJxTo@54ccd0b4
);
-> at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testShardAlreadyReplicating(SegmentReplicationTargetServiceTests.java:291)

segmentReplicationTargetService.onNewCheckpoint(
    ReplicationCheckpoint{shardId=[index][0], primaryTerm=88, segmentsGen=3, version=8, size=0, codec=Lucene95},
    org.opensearch.index.shard.IndexShard@35f0ec9a
);
-> at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testShardAlreadyReplicating(SegmentReplicationTargetServiceTests.java:294)

segmentReplicationTargetService.updateLatestReceivedCheckpoint(
    ReplicationCheckpoint{shardId=[index][0], primaryTerm=88, segmentsGen=3, version=8, size=0, codec=Lucene95},
    org.opensearch.index.shard.IndexShard@35f0ec9a
);
-> at org.opensearch.indices.replication.SegmentReplicationTargetService.onNewCheckpoint(SegmentReplicationTargetService.java:216)

segmentReplicationTargetService.startReplication(
    org.opensearch.index.shard.IndexShard@35f0ec9a,
    ReplicationCheckpoint{shardId=[index][0], primaryTerm=88, segmentsGen=3, version=8, size=0, codec=Lucene95},
    org.opensearch.indices.replication.SegmentReplicationTargetService$1@4c9dfcab
);
-> at org.opensearch.indices.replication.SegmentReplicationTargetService.onNewCheckpoint(SegmentReplicationTargetService.java:245)

segmentReplicationTargetService.startReplication(
    org.opensearch.indices.replication.SegmentReplicationTarget@10dfa5d1
);
-> at org.opensearch.indices.replication.SegmentReplicationTargetService.startReplication(SegmentReplicationTargetService.java:419)
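
The output above shows startReplication being reached but no processLatestReceivedCheckpoint call at verification time. If that call is made on another thread after the verification runs (an assumption; the trace alone does not confirm it), waiting on a latch before checking removes the timing dependency. The sketch below uses only the JDK, and all of its names (`AwaitAsyncCallSketch`, `processedLatch`, the 10 second timeout) are hypothetical rather than taken from the OpenSearch code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of waiting for an asynchronous call before asserting on it. */
final class AwaitAsyncCallSketch {

    private final AtomicInteger processed = new AtomicInteger();
    private final CountDownLatch processedLatch = new CountDownLatch(1);

    /** Stand-in for the callback the test expects to be invoked. */
    void processLatestReceivedCheckpoint() {
        processed.incrementAndGet();
        processedLatch.countDown(); // signal the waiting test thread
    }

    /** Runs the callback on another thread, waits for it, then reports the call count. */
    static int callsObservedAfterWaiting() {
        AwaitAsyncCallSketch service = new AwaitAsyncCallSketch();
        ExecutorService generic = Executors.newSingleThreadExecutor();
        try {
            generic.execute(service::processLatestReceivedCheckpoint);
            // Block until the callback has run instead of checking immediately.
            if (!service.processedLatch.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("callback was not invoked in time");
            }
            return service.processed.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            generic.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(callsObservedAfterWaiting());
    }
}
```

In a Mockito-based test, `verify(mock, timeout(millis))` serves a similar purpose; whether it fits this particular test is not established here.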

@Rishikesh1159
Member

I am not able to reproduce the test failure, even when running the entire test class. The flakiness may have been fixed by changes made in the last 2 months. Since we cannot reproduce it, I will unmute the test for now with PR #10660, and I am adding trace logging to the test so that the logs give us more information if it fails again.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Segment Replication Oct 18, 2023
@reta reta reopened this Jan 20, 2024
@github-project-automation github-project-automation bot moved this from Done to In Progress in Segment Replication Jan 20, 2024
@reta
Collaborator

reta commented Jan 20, 2024

Wanted but not invoked:
segmentReplicationTargetService.processLatestReceivedCheckpoint(
    org.opensearch.index.shard.IndexShard@4876cf44,
    <any>
);
-> at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testShardAlreadyReplicating(SegmentReplicationTargetServiceTests.java:316)

However, there were exactly 5 interactions with this mock:
segmentReplicationTargetService.startReplication(
    org.opensearch.indices.replication.SegmentReplicationTarget$MockitoMock$8gnjoQCG@b2e2a95
);
-> at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testShardAlreadyReplicating(SegmentReplicationTargetServiceTests.java:312)

segmentReplicationTargetService.onNewCheckpoint(
    ReplicationCheckpoint{shardId=[index][0], primaryTerm=95, segmentsGen=3, version=8, size=0, codec=Lucene99},
    org.opensearch.index.shard.IndexShard@4876cf44
);
-> at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testShardAlreadyReplicating(SegmentReplicationTargetServiceTests.java:315)

segmentReplicationTargetService.updateLatestReceivedCheckpoint(
    ReplicationCheckpoint{shardId=[index][0], primaryTerm=95, segmentsGen=3, version=8, size=0, codec=Lucene99},
    org.opensearch.index.shard.IndexShard@4876cf44
);
-> at org.opensearch.indices.replication.SegmentReplicationTargetService.onNewCheckpoint(SegmentReplicationTargetService.java:273)

segmentReplicationTargetService.startReplication(
    org.opensearch.index.shard.IndexShard@4876cf44,
    ReplicationCheckpoint{shardId=[index][0], primaryTerm=95, segmentsGen=3, version=8, size=0, codec=Lucene99},
    org.opensearch.indices.replication.SegmentReplicationTargetService$1@aa0c9f5
);
-> at org.opensearch.indices.replication.SegmentReplicationTargetService.onNewCheckpoint(SegmentReplicationTargetService.java:302)

segmentReplicationTargetService.startReplication(
    org.opensearch.indices.replication.SegmentReplicationTarget@3fec1515
);
-> at org.opensearch.indices.replication.SegmentReplicationTargetService.startReplication(SegmentReplicationTargetService.java:498)

@getsaurabh02
Member

@dreamer-89 Are we still actively looking into this one?
