Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] [client] Messages lost when consumer reconnect #20591

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Jun 16, 2023

Motivation

Background of consumer reconnects

  • grab connection:
    • lookup the broker which owned the topic
    • get the existing connection by the broker; create one if it does not exist
  • clear the messages in memory
  • send CMD-subscribe to the broker
  • send flow permits to broker to increment availablePermits

Issue

If the consumer tries to reconnect[1] multi times, it will lose some messages due to a race condition, for example:

time reconnect-1 reconnect-2
1 Grab connection
2 Grab connection
3 Clear messages in memory
4 Do subscribe and increase availablePermits
5 receive 712 messages
6 Clear messages in memory. (Highlight) 712 messages were lost
7 Do subscribe and increase availablePermits(Since a subscription is there, so broker just response success)

We can see that the variable availablePermits of all the stuck consumers is large than 1000(by default, the max value is 1000), and ${availablePermits} + ${msgOutCounter} is an integer multiple of 1000. It means the broker received more than one CMD-flow of the same consumer[2].


Why does reconnection execute more than once: there are these scenarios that could trigger reconnection:


Modifications

Discard the task subscribe if there is an in-flight subscribe

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode self-assigned this Jun 16, 2023
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 16, 2023
@poorbarcode poorbarcode added type/bug The PR fixed a bug or issue reported a bug release/2.10.5 release/2.11.2 release/3.0.2 area/client and removed doc-not-needed Your PR changes do not impact docs labels Jun 16, 2023
@poorbarcode poorbarcode added this to the 3.1.0 milestone Jun 16, 2023
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 16, 2023
@poorbarcode poorbarcode reopened this Jun 16, 2023
admin.topics().createSubscription(topicName, subscriptionName, MessageId.earliest);
// Create producer and consumer.
ConsumerImpl<String> consumer = (ConsumerImpl<String>) pulsarClient.newConsumer(Schema.STRING)
.subscriptionType(SubscriptionType.Shared)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you use Shared subcription?

Copy link
Contributor Author

@poorbarcode poorbarcode Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you use Shared subcription?

Just to make this test exactly the same as the example in the motivation( if useExclusive, pulsar will not record the unackMessages of the consumers). see: https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/Consumer.java#L1067

(Highlight) Already add all subscription types by the parameter provider.

Copy link
Contributor

@BewareMyPower BewareMyPower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question that in which case could a consumer grab connections concurrently?

BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Jun 16, 2023
### Motivation

In `ConnectionHandler`, there is a `Connection` interface whose methods
will be called after the connection in `grabCnx` is established, the
implementation of `Connection` might send some requests after the
connection is established. For example, the consumer will send the
`CommandSubscribe` request in `connectionOpened`. However, the whole
process is not atomic, which leads to the message lost reported in
apache#20591.

### Modifications

Modify the `Connection` interface to have a single method:

```java
CompletableFuture<Void> handleNewConnection(ClientCnx cnx, PulsarClientException e);
```

The returned future should be completed once the implementation has done
everything, e.g. for the consumer, the future should only be completed
after receiving the response for `CommandSubscribe`.

In `grabCnx`, the `ConnectionHandler` could only connect to the broker
once the whole process is completed.

Add `ConnectionHandlerTest` to verify the behavior.
Copy link
Contributor

@BewareMyPower BewareMyPower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a good way to fix the message lost issue. The unit test never simulates the actual case. It only tests calling consumer.connectionOpened.

Each reconnection should be:

  1. ConnectionHandler get the connection asynchronously via PulsarClientImpl#getConnection
  2. The Connection implementation (e.g. the ConsumerImpl here) send some other requests asynchronously.

We should make the whole process atomic. Otherwise, similar issues might happen for other Connection implementations like ProducerImpl.

I have opened a PR to thoroughly fix the issue: #20595

@poorbarcode
Copy link
Contributor Author

Since @BewareMyPower created a new PR to use a good implement to fix this issue, Just like I said in the second step of optimization #20591 (comment), I closed this PR

BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Jun 29, 2023
### Motivation

In `ConnectionHandler`, there is a `Connection` interface whose methods
will be called after the connection in `grabCnx` is established, the
implementation of `Connection` might send some requests after the
connection is established. For example, the consumer will send the
`CommandSubscribe` request in `connectionOpened`. However, the whole
process is not atomic, which leads to the message lost reported in
apache#20591.

### Modifications

Modify the `Connection` interface to have a single method:

```java
CompletableFuture<Void> handleNewConnection(ClientCnx cnx, PulsarClientException e);
```

The returned future should be completed once the implementation has done
everything, e.g. for the consumer, the future should only be completed
after receiving the response for `CommandSubscribe`.

In `grabCnx`, the `ConnectionHandler` could only connect to the broker
once the whole process is completed.

Add `ConnectionHandlerTest` to verify the behavior.
@poorbarcode
Copy link
Contributor Author

poorbarcode commented Jun 30, 2023

Since the PR #20595 does not fix the issue "Messages lost when consumer reconnect", I reopen this PR. see the comment: #20595 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/client doc-not-needed Your PR changes do not impact docs ready-to-test release/2.10.5 release/2.11.2 release/3.0.2 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants