Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix one potential to add duplicated consumer #20583

Conversation

TakaHiR07
Copy link
Contributor

@TakaHiR07 TakaHiR07 commented Jun 15, 2023

Fixes #20576

Motivation

There is a case resulted in duplicated consumer, if we add the same consumer twice continuously, then close consumer, the offline consumer would remain in consumerList, because AbstractDispatcherMultipleConsumers#consumerList is arrayList and AbstractDispatcherMultipleConsumers#consumerSet is set.

There is a concurrent problem in ServerCnx#handleSubscribe, which cause this case. Suppose client request handleSubscribe() three times and request handleCloseConsumer(). The execute order is as follow. since the order is addConsumer->addConsumer->close, but not addConsumer->close->addConsumer, duplicated consumer problem occur.

企业微信截图_1ea4ed11-3c2b-4e46-94a8-aad1be312b93

Modifications

  1. remove the step 9 "consumersLongHashmap.remove consumerFuture3".
  • Because if judge existingConsumerFuture is not null, it means there is an earlier request try to handleSubscribe. the remove operation should be executed in the earlier request.
  1. add a test to verify this concurrent problem and the fix

Verifying this change

  • Make sure that the change passes the CI checks.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: TakaHiR07#10

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 15, 2023
@eolivelli
Copy link
Contributor

You say in the linked issue and you have seen this twice.
Are you able to reproduce it manually and verify that with this fix the problem is really fixed?

@TakaHiR07
Copy link
Contributor Author

You say in the linked issue and you have seen this twice. Are you able to reproduce it manually and verify that with this fix the problem is really fixed?

I guess this is one of the possibilities for duplicated consumer. And this pr is corresponding to our production case. I can't affirm the duplicated consumer problem do not occur anymore.

@TakaHiR07
Copy link
Contributor Author

@poorbarcode @eolivelli @lhotari @congbobo184 Can you help review this pr? It is another possibilities besides #15051

@@ -1184,7 +1184,6 @@ protected void handleSubscribe(final CommandSubscribe subscribe) {
ServerError error = getErrorCodeWithErrorLog(existingConsumerFuture, true,
String.format("Consumer subscribe failure. remoteAddress: %s, subscription: %s",
remoteAddress, subscriptionName));
consumers.remove(consumerId, existingConsumerFuture);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three clients in your example, the pulsar client maintains its own connection pool, which means the client1 will not use the same connection as client2 or client3.

The code you changed is the code of ServerCnx(this object is not shared across multi clients), so it will not affect other consumers of other clients

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional background:

If you have two pulsar clients, the consumers registered will be like this:

  • pulsar-client-1
    • connection-1 (one-to-one relationship with ServerCnx)
      • consumer-1
      • consumer-2
    • connection-2
      • consumer-3
      • consumer-4
  • pulsar-client-2
    • connection-3
      • consumer-5
      • consumer-6
    • connection-4
      • consumer-7
      • consumer-8

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So only consumer-1 and consumer-2 can be in conflict since their consumer-id is not the same, so all things are OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it is one client request three times. Because in our server log, "Subscribing on topic" occur three times in a short time from the same remoteAddress host:ip. Therefore they are in the same connection ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it is one client request three times. Because in our server log, "Subscribing on topic" occur three times in a short time from the same remoteAddress host:ip. Therefore they are in the same connection ?

could you provide the logs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the log, we can not confirm if these three consumers are the same, because we can not know their consumer id. This log was been improved in the PR https://github.com/apache/pulsar/pull/20568/files#diff-1e0e8195fb5ec5a6d79acbc7d859c025a9b711f94e6ab37c94439e99b3202e84R1168

Copy link
Contributor

@poorbarcode poorbarcode Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time req-1 req-2 close connection
1 put a new consumerFuture
2 close connection
3 mark the consumerFuture as failed
4 close the connection
5 got consumerFuture(failed)
6 remove the failed future
7 add the consumer into the list and set(list.size = 1 and set.size = 1)
8 put the second consumerFuture
9 add the second consumer into the list and set(list.size = 2 and set.size = 1)
10 remove the consumer from the list and set(list.size = 1 and set.size = 0)

Do you want to say the three requests executed as above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, there is a PR trying to fix the concurrent call subscribe in the same client

I still think the current PR is trying to solve this issue is meaningful, and it would be nice to have a test that can reproduce it.

Copy link
Contributor

@poorbarcode poorbarcode Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TakaHiR07

Could you take a look at this PR #22283 , #22270 and modify the current one? If yes, I will close #22283, #22270 ❤️

@TakaHiR07 TakaHiR07 force-pushed the fix_potential_to_add_duplicated_consumer branch from 7bbfba5 to e4f561b Compare June 25, 2023 12:46
@TakaHiR07 TakaHiR07 force-pushed the fix_potential_to_add_duplicated_consumer branch from e4f561b to fbebced Compare June 25, 2023 12:48
@TakaHiR07
Copy link
Contributor Author

TakaHiR07 commented Jun 25, 2023

Add a test to verify this concurrent case and the fix. PTAL, thanks. @codelipenghui @eolivelli @lhotari @poorbarcode

@Technoboy- Technoboy- added this to the 3.2.0 milestone Jul 31, 2023
@github-actions
Copy link

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Aug 31, 2023
@codelipenghui codelipenghui added the category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost label Sep 19, 2023
@Technoboy- Technoboy- modified the milestones: 3.2.0, 3.3.0 Dec 22, 2023
@graysonzeng
Copy link
Contributor

graysonzeng commented Jan 31, 2024

I had the same problem on flink task and got fixed after trying this PR.

@TakaHiR07
Copy link
Contributor Author

@poorbarcode @codelipenghui It seems this problem is not fixed in the master branch.

@TakaHiR07
Copy link
Contributor Author

#22283 has fixed the same problem. Close this pr.

@TakaHiR07 TakaHiR07 closed this May 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost doc-not-needed Your PR changes do not impact docs Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] [broker] subscription of topic in a permanent fence state and consumerList remain offline consumer
7 participants