Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][client] Make the whole grabCnx() progress atomic #20595

Merged
merged 2 commits into from
Jun 30, 2023

Conversation

BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Jun 16, 2023

Motivation

In ConnectionHandler, there is a Connection interface whose methods will be called after the connection in grabCnx is established, the implementation of Connection might send some requests after the connection is established. For example, the consumer will send the CommandSubscribe request in connectionOpened. However, the whole process is not atomic, which leads to the message lost reported in #20591.

Modifications

Modify the Connection#connectionOpened interface to return a CompletableFuture.

The returned future should be completed once the implementation has done everything, e.g. for the consumer, the future should only be completed after receiving the response for CommandSubscribe.

In grabCnx, the ConnectionHandler could only connect to the broker once the whole process is completed.

Add ConnectionHandlerTest to verify the behavior.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: BewareMyPower#29

@BewareMyPower
Copy link
Contributor Author

With this patch, we don't need any synchronization when implementing handleNewConnection (the combination of connectionOpened and connectionFailed) in future.

@BewareMyPower BewareMyPower marked this pull request as draft June 16, 2023 16:26
@BewareMyPower
Copy link
Contributor Author

There are some failed tests, I will fix them ASAP.

@BewareMyPower BewareMyPower marked this pull request as ready for review June 18, 2023 13:39
@BewareMyPower BewareMyPower marked this pull request as draft June 19, 2023 07:24
@BewareMyPower BewareMyPower marked this pull request as ready for review June 19, 2023 08:33
@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

@BewareMyPower
Copy link
Contributor Author

PTAL @codelipenghui @poorbarcode

Copy link
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Technoboy-
Technoboy- previously approved these changes Jun 27, 2023
@BewareMyPower
Copy link
Contributor Author

I made new changes to avoid race condition described in #20595 (comment)

The root cause is that for a grabCnx operation, duringConnect.set(false) could be called twice and reconnectLater could be called by the connection implementation. So I removed the reconnectLater calls in the Connection#connectionOpened implementations and just complete the future exceptionally so that reconnectLater will be called later.

            cnxFuture.thenCompose(cnx -> connection.connectionOpened(cnx))
                    .thenAccept(__ -> duringConnect.set(false))
                    .exceptionally(this::handleConnectionError);

All possible cases are:

  1. cnxFuture completes with null
    1.1 connectionOpened completes with null, thenAccept will set duringConnect with false.
    1.2 connectionOpened completes exceptionally, exceptionally will call reconnectLater, which sets duringConnect with false.
  2. cnxFuture completes exceptionally, exceptionally will call reconnectLater, which sets duringConnect with false.

Ideally, reconnectLater should be private, but it seems hard to remove the reconnectLater calls in TransactionMetaStoreHandler.

PTAL again. @poorbarcode @lifepuzzlefun @tisonkun @Technoboy-

### Motivation

In `ConnectionHandler`, there is a `Connection` interface whose methods
will be called after the connection in `grabCnx` is established, the
implementation of `Connection` might send some requests after the
connection is established. For example, the consumer will send the
`CommandSubscribe` request in `connectionOpened`. However, the whole
process is not atomic, which leads to the message lost reported in
apache#20591.

### Modifications

Modify the `Connection` interface to have a single method:

```java
CompletableFuture<Void> handleNewConnection(ClientCnx cnx, PulsarClientException e);
```

The returned future should be completed once the implementation has done
everything, e.g. for the consumer, the future should only be completed
after receiving the response for `CommandSubscribe`.

In `grabCnx`, the `ConnectionHandler` could only connect to the broker
once the whole process is completed.

Add `ConnectionHandlerTest` to verify the behavior.
Copy link
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. LGTM.

When there are some reviewers you may try to do accumulative commits instead of force push; thus it's easy to review the changeset compared to the last review..

You can "merge master" instead of rebase and the final "squash and merge" would squash everything into one commit for you.

@BewareMyPower
Copy link
Contributor Author

BewareMyPower commented Jun 29, 2023

When there are some reviewers you may try to do accumulative commits instead of force push; thus it's easy to review the changeset compared to the last review..

I know it so I didn't push it by force before. But since there were many new commits from the approved commit before, I think it might be better to review for a completely new commit.

@BewareMyPower BewareMyPower dismissed Technoboy-’s stale review June 29, 2023 16:29

The PR changed a lot

@BewareMyPower
Copy link
Contributor Author

The failed test testMsgRateExpired will be fixed by #20629

@codecov-commenter
Copy link

Codecov Report

Merging #20595 (48540d0) into master (0e60340) will increase coverage by 0.53%.
The diff coverage is 77.85%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #20595      +/-   ##
============================================
+ Coverage     72.60%   73.13%   +0.53%     
- Complexity    32018    32076      +58     
============================================
  Files          1855     1869      +14     
  Lines        138569   138883     +314     
  Branches      15250    15273      +23     
============================================
+ Hits         100605   101578     +973     
+ Misses        29945    29262     -683     
- Partials       8019     8043      +24     
Flag Coverage Δ
inttests 24.11% <41.42%> (+<0.01%) ⬆️
systests 25.07% <25.71%> (?)
unittests 72.41% <77.50%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ava/org/apache/pulsar/admin/cli/CmdNamespaces.java 76.17% <0.00%> (-0.08%) ⬇️
...ulsar/client/impl/TransactionMetaStoreHandler.java 68.44% <44.44%> (-0.03%) ⬇️
...rg/apache/pulsar/client/impl/TopicListWatcher.java 65.21% <46.15%> (+0.82%) ⬆️
.../service/SystemTopicBasedTopicPoliciesService.java 76.48% <50.00%> (-1.19%) ⬇️
...ar/common/naming/TopicBundleAssignmentFactory.java 66.66% <66.66%> (ø)
...n/java/org/apache/pulsar/admin/cli/CliCommand.java 57.95% <71.42%> (-1.60%) ⬇️
...g/apache/pulsar/client/impl/ConnectionHandler.java 90.47% <76.19%> (+3.24%) ⬆️
...va/org/apache/pulsar/client/impl/ProducerImpl.java 82.76% <80.00%> (+0.59%) ⬆️
.../metadata/bookkeeper/PulsarRegistrationClient.java 80.85% <81.11%> (-4.11%) ⬇️
...va/org/apache/pulsar/client/impl/ConsumerImpl.java 77.25% <84.61%> (+0.75%) ⬆️
... and 10 more

... and 155 files with indirect coverage changes

@tisonkun
Copy link
Member

@poorbarcode Thanks for your update.

Merging...

Then you can rebase your PR onto this one :D

@liangyepianzhou
Copy link
Contributor

liangyepianzhou commented Jul 7, 2023

@BewareMyPower, the TopicListWatcher.java is not included in the branch 2.10. So this PR does not need to cherry-pick to branch 2.10, right?

@liangyepianzhou
Copy link
Contributor

@BewareMyPower, the TopicListWatcher.java is not included in the branch 2.10. So this PR does not need to cherry-pick to branch 2.10, right?

Oh, my fault. We can cherry-pick this PR to branch 2.10. TopicListWatcher.java is not part of the core modification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants