Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay the timing of setting reconnectionPending to false to avoid double attempt at reconnecting #328

Merged
merged 2 commits into from
Oct 16, 2023

Conversation

shustsud
Copy link
Contributor

Related Issue: #235

Motivation

A potential double scheduling of reconnection due to a broker shutdown was observed.

The reconnect can be scheduled with either of the following codes
https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209
or
https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350
-> https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121

If a second reconnection request is received during the first reconnection attempt, it triggers additional reconnection attempts. If the second reconnection is successful, the consumer is removed from cnx:
https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285
-> https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63
--> https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217

The problem is that the consumer will no longer be able to manage events coming from the broker.
To cope with this issue, a new flag reconnectionPending_ has been introduced via #310 .

However, while the above change reduces the likelihood of the problem occurring, it doesn't eliminate the problem entirely.
In fact, the double reconnects have been observed even after #310(I tried with b35ae1a):

# Consumer is connected to broker1, but broker1 shutdown closes Consumer and reconnection is scheduled.
...
2023-09-26 15:42:05.736 INFO  [140591970158336] ConsumerImpl:1207 | Broker notification of Closed consumer: 5
2023-09-26 15:42:05.736 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.1 s
...

# Consumer attempts to connect to broker1, but fails, and a reconnection is scheduled again.
...
2023-09-26 15:42:05.836 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
2023-09-26 15:42:05.837 WARN  [140591970158336] ClientConnection:1741 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Received error response from server: Retryable (Namespace is being unloaded, cannot add topic persistent://shustsud-test2/test/partitioned-topic-partition-5) -- req_id: 16
2023-09-26 15:42:05.837 WARN  [140591970158336] ConsumerImpl:317 | [0x7fde18046510, dummy_24, 5] Failed to reconnect consumer: Retryable
2023-09-26 15:42:05.837 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.194 s
...

# During the connection attempt, the connection to broker1 is closed and further reconnection is scheduled.
# After that, two subscribe requests are sent to broker2.
2023-09-26 15:42:06.034 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.515 ERROR [140591970158336] ClientConnection:1330 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Connection closed with ConnectError
2023-09-26 15:42:06.515 INFO  [140591970158336] ConnectionPool:122 | Remove connection for pulsar+ssl://<host(broker1)>:<prot>
2023-09-26 15:42:06.515 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.392 s
...
2023-09-26 15:42:06.907 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.912 INFO  [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>] 
...
2023-09-26 15:42:07.103 INFO  [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>] 
...

To completely eliminate the possibility of the double reconnects, I suggest adjusting the timing of when reconnectionPending_ is set to false. Ideally, this should be done after the handleCreateConsumer method or the handleCreateProducer method has been completed.

Modifications

The timing for setting reconnectionPending_ to false has been changed.

@nkurihar nkurihar added the bug Something isn't working label Oct 13, 2023
Copy link
Contributor

@BewareMyPower BewareMyPower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. It's similar to my fix for the Java client: apache/pulsar#20595

I just left two comments at the moment and will review it again when I'm back to work next week.

lib/Future.h Outdated
@@ -138,6 +138,8 @@ class Promise {

bool setFailed(Result result) const { return state_->complete(result, {}); }

bool setSuccess() const { return state_->complete({}, {}); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can replace this method with a setValue({}) call or passing a static variable to setValue.

connectionOpened(cnx).addListener([this, self](Result result, bool) {
// Do not use bool, only Result.
reconnectionPending_ = false;
if (result == ResultRetryable) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use isResultRetryable() from ResultUtil.h here. Though I admit using two Result enums as retryable enums seems weird.

@BewareMyPower BewareMyPower added this to the 3.4.0 milestone Oct 16, 2023
@shustsud
Copy link
Contributor Author

@BewareMyPower
Thank you for your comments.
I addressed your comments.

@BewareMyPower BewareMyPower merged commit ebae92e into apache:main Oct 16, 2023
11 checks passed
BewareMyPower pushed a commit to streamnative/pulsar-client-cpp that referenced this pull request Oct 20, 2023
…ble attempt at reconnecting (apache#328)

Related Issue: apache#235

### Motivation
A potential double scheduling of reconnection due to a broker shutdown was observed.

The reconnect can be scheduled with either of the following codes
[ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209)
or
[ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350)
-> [ https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121)

If a second reconnection request is received during the first reconnection attempt, it triggers additional reconnection attempts. If the second reconnection is successful, the consumer is removed from `cnx`:
[ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285)
-> [ https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63)
--> [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217)

The problem is that the consumer will no longer be able to manage events coming from the broker.
To cope with this issue, a new flag `reconnectionPending_` has been introduced via apache#310 .

However, while the above change reduces the likelihood of the problem occurring, it doesn't eliminate the problem entirely.
In fact, the double reconnects have been observed even after apache#310(I tried with apache@b35ae1a):
```
# Consumer is connected to broker1, but broker1 shutdown closes Consumer and reconnection is scheduled.
...
2023-09-26 15:42:05.736 INFO  [140591970158336] ConsumerImpl:1207 | Broker notification of Closed consumer: 5
2023-09-26 15:42:05.736 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.1 s
...

# Consumer attempts to connect to broker1, but fails, and a reconnection is scheduled again.
...
2023-09-26 15:42:05.836 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
2023-09-26 15:42:05.837 WARN  [140591970158336] ClientConnection:1741 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Received error response from server: Retryable (Namespace is being unloaded, cannot add topic persistent://shustsud-test2/test/partitioned-topic-partition-5) -- req_id: 16
2023-09-26 15:42:05.837 WARN  [140591970158336] ConsumerImpl:317 | [0x7fde18046510, dummy_24, 5] Failed to reconnect consumer: Retryable
2023-09-26 15:42:05.837 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.194 s
...

# During the connection attempt, the connection to broker1 is closed and further reconnection is scheduled.
# After that, two subscribe requests are sent to broker2.
2023-09-26 15:42:06.034 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.515 ERROR [140591970158336] ClientConnection:1330 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Connection closed with ConnectError
2023-09-26 15:42:06.515 INFO  [140591970158336] ConnectionPool:122 | Remove connection for pulsar+ssl://<host(broker1)>:<prot>
2023-09-26 15:42:06.515 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.392 s
...
2023-09-26 15:42:06.907 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.912 INFO  [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>]
...
2023-09-26 15:42:07.103 INFO  [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>]
...
```

To completely eliminate the possibility of the double reconnects, I suggest adjusting the timing of when reconnectionPending_ is set to false. Ideally, this should be done after the handleCreateConsumer method or the handleCreateProducer method has been completed.

### Modifications
The timing for setting `reconnectionPending_` to false has been changed.

(cherry picked from commit ebae92e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants