Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] [broker] Messages lost on the remote cluster when using topic level replication #22890

Merged
merged 7 commits into from
Jun 19, 2024

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Jun 11, 2024

Motivation

Background

  1. the method PersistentTopic.addReplicationCluster will remove orphan replication cursors, see [fix][broker] Ignore and remove the replicator cursor when the remote cluster is absent #19972
  2. when a topic is loading up, it will initialize replicators by the cursors named pulsar.repl.<remote cluster>

Issue

  • Mechanism 2 was called before topic-level policies initializing, so it will remove the replicator and cursor when a topic is loading up.
  • After the topic level policies are initialized, the broker will create a new replicator with a new cursor(start at the latest position), and all the messages in backlog are lost.

This issue may cause messages to be lost when using topic-level Geo-Replication. For example:

  • Enable topic level Geo-Replication
  • Send messages on this topic
    • There are 3 ledgers under the topic: {L1: 0~50000, L2: 0~50000, L3: 0~5000}
    • The replication cursor came to L1: 10000 now.
  • Unload the topic.
    • The Replication cursor was removed due to the remote cluster is not in the replicated cluster list(the topic level policies have not been loaded yet).
    • Create a new replication cursor after the topic-level policies are loaded. But the new cursor will start at the latest position(L3: 50001)
  • The messages L1: 10001 ~ L3: 50000 were lost

Modifications

Fix the issue: make Mechanism 2 being called after the topic-level policies were loaded.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode added type/bug The PR fixed a bug or issue reported a bug category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost release/2.11.5 release/3.3.1 release/3.0.6 labels Jun 11, 2024
@poorbarcode poorbarcode added this to the 3.4.0 milestone Jun 11, 2024
@poorbarcode poorbarcode self-assigned this Jun 11, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 11, 2024
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some questions.

Since this issue is related to topic policies, I'd like to bring up to attention also issue #21303. Topic policy mutations aren't thread safe and just wondering if there are cases that could lead to replication clusters being removed if the policies get mutated in a way where replication clusters are lost. Do we have plans to address #21303?

@poorbarcode poorbarcode requested a review from lhotari June 12, 2024 02:53
@nodece
Copy link
Member

nodece commented Jun 12, 2024

Related to #22504, I think we need to wait for the ns and topic policies to load before loading the replicator.

BTW, the ns and topic policies are asynchronous, and we also need to use a single thread for serial updates. #21303 should be fixed first before we can handle GEO replication correctly.

@poorbarcode
Copy link
Contributor Author

@nodece

BTW, the ns and topic policies are asynchronous, and we also need to use a single thread for serial updates. #21303 should be fixed first before we can handle GEO replication correctly.

Correct, that is what the current PR did

@poorbarcode poorbarcode requested a review from nodece June 13, 2024 07:43
@poorbarcode poorbarcode requested a review from hanmz June 18, 2024 01:54
@poorbarcode poorbarcode merged commit feae589 into apache:master Jun 19, 2024
48 of 51 checks passed
@poorbarcode poorbarcode deleted the fix/replication_data_lost branch June 19, 2024 07:40
poorbarcode added a commit that referenced this pull request Jun 19, 2024
poorbarcode added a commit that referenced this pull request Jun 19, 2024
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 21, 2024
…evel replication (apache#22890)

(cherry picked from commit feae589)
(cherry picked from commit 3f28fa3)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 24, 2024
…evel replication (apache#22890)

(cherry picked from commit feae589)
(cherry picked from commit 3f28fa3)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 25, 2024
…evel replication (apache#22890)

(cherry picked from commit feae589)
(cherry picked from commit 3f28fa3)
lhotari pushed a commit that referenced this pull request Jun 25, 2024
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Jul 1, 2024
…evel replication (apache#22890)

(cherry picked from commit feae589)
(cherry picked from commit 3f28fa3)
nodece pushed a commit to ascentstream/pulsar that referenced this pull request Sep 6, 2024
…evel replication (apache#22890)

(cherry picked from commit feae589)
Signed-off-by: Zixuan Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost cherry-picked/branch-3.0 cherry-picked/branch-3.2 cherry-picked/branch-3.3 doc-not-needed Your PR changes do not impact docs release/3.0.6 release/3.2.4 release/3.3.1 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants