Delay the timing of setting reconnectionPending to false to avoid double attempt at reconnecting #328

shustsud · 2023-10-13T07:25:13Z

Related Issue: #235

Motivation

A potential double scheduling of reconnection due to a broker shutdown was observed.

The reconnect can be scheduled with either of the following codes
https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209
or
https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350
-> https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121

If a second reconnection request is received during the first reconnection attempt, it triggers additional reconnection attempts. If the second reconnection is successful, the consumer is removed from cnx:
https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285
-> https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63
--> https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217

The problem is that the consumer will no longer be able to manage events coming from the broker.
To cope with this issue, a new flag reconnectionPending_ has been introduced via #310 .

However, while the above change reduces the likelihood of the problem occurring, it doesn't eliminate the problem entirely.
In fact, the double reconnects have been observed even after #310(I tried with b35ae1a):

# Consumer is connected to broker1, but broker1 shutdown closes Consumer and reconnection is scheduled.
...
2023-09-26 15:42:05.736 INFO  [140591970158336] ConsumerImpl:1207 | Broker notification of Closed consumer: 5
2023-09-26 15:42:05.736 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.1 s
...

# Consumer attempts to connect to broker1, but fails, and a reconnection is scheduled again.
...
2023-09-26 15:42:05.836 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
2023-09-26 15:42:05.837 WARN  [140591970158336] ClientConnection:1741 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Received error response from server: Retryable (Namespace is being unloaded, cannot add topic persistent://shustsud-test2/test/partitioned-topic-partition-5) -- req_id: 16
2023-09-26 15:42:05.837 WARN  [140591970158336] ConsumerImpl:317 | [0x7fde18046510, dummy_24, 5] Failed to reconnect consumer: Retryable
2023-09-26 15:42:05.837 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.194 s
...

# During the connection attempt, the connection to broker1 is closed and further reconnection is scheduled.
# After that, two subscribe requests are sent to broker2.
2023-09-26 15:42:06.034 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.515 ERROR [140591970158336] ClientConnection:1330 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Connection closed with ConnectError
2023-09-26 15:42:06.515 INFO  [140591970158336] ConnectionPool:122 | Remove connection for pulsar+ssl://<host(broker1)>:<prot>
2023-09-26 15:42:06.515 INFO  [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.392 s
...
2023-09-26 15:42:06.907 INFO  [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.912 INFO  [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>] 
...
2023-09-26 15:42:07.103 INFO  [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>] 
...

To completely eliminate the possibility of the double reconnects, I suggest adjusting the timing of when reconnectionPending_ is set to false. Ideally, this should be done after the handleCreateConsumer method or the handleCreateProducer method has been completed.

Modifications

The timing for setting reconnectionPending_ to false has been changed.

…ble attempt at reconnecting

BewareMyPower

Good catch. It's similar to my fix for the Java client: apache/pulsar#20595

I just left two comments at the moment and will review it again when I'm back to work next week.

BewareMyPower · 2023-10-14T12:30:26Z

lib/Future.h

@@ -138,6 +138,8 @@ class Promise {

    bool setFailed(Result result) const { return state_->complete(result, {}); }

+    bool setSuccess() const { return state_->complete({}, {}); }


I think we can replace this method with a setValue({}) call or passing a static variable to setValue.

BewareMyPower · 2023-10-14T12:33:48Z

lib/HandlerBase.cc

+            connectionOpened(cnx).addListener([this, self](Result result, bool) {
+                // Do not use bool, only Result.
+                reconnectionPending_ = false;
+                if (result == ResultRetryable) {


Use isResultRetryable() from ResultUtil.h here. Though I admit using two Result enums as retryable enums seems weird.

shustsud · 2023-10-16T04:09:18Z

@BewareMyPower
Thank you for your comments.
I addressed your comments.

…ble attempt at reconnecting (apache#328) Related Issue: apache#235 ### Motivation A potential double scheduling of reconnection due to a broker shutdown was observed. The reconnect can be scheduled with either of the following codes [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209) or [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350) -> [ https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121) If a second reconnection request is received during the first reconnection attempt, it triggers additional reconnection attempts. If the second reconnection is successful, the consumer is removed from `cnx`: [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285) -> [ https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63](https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63) --> [ https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217](https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217) The problem is that the consumer will no longer be able to manage events coming from the broker. To cope with this issue, a new flag `reconnectionPending_` has been introduced via apache#310 . However, while the above change reduces the likelihood of the problem occurring, it doesn't eliminate the problem entirely. In fact, the double reconnects have been observed even after apache#310(I tried with apache@b35ae1a): ``` # Consumer is connected to broker1, but broker1 shutdown closes Consumer and reconnection is scheduled. ... 2023-09-26 15:42:05.736 INFO [140591970158336] ConsumerImpl:1207 | Broker notification of Closed consumer: 5 2023-09-26 15:42:05.736 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.1 s ... # Consumer attempts to connect to broker1, but fails, and a reconnection is scheduled again. ... 2023-09-26 15:42:05.836 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool 2023-09-26 15:42:05.837 WARN [140591970158336] ClientConnection:1741 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Received error response from server: Retryable (Namespace is being unloaded, cannot add topic persistent://shustsud-test2/test/partitioned-topic-partition-5) -- req_id: 16 2023-09-26 15:42:05.837 WARN [140591970158336] ConsumerImpl:317 | [0x7fde18046510, dummy_24, 5] Failed to reconnect consumer: Retryable 2023-09-26 15:42:05.837 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.194 s ... # During the connection attempt, the connection to broker1 is closed and further reconnection is scheduled. # After that, two subscribe requests are sent to broker2. 2023-09-26 15:42:06.034 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool ... 2023-09-26 15:42:06.515 ERROR [140591970158336] ClientConnection:1330 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Connection closed with ConnectError 2023-09-26 15:42:06.515 INFO [140591970158336] ConnectionPool:122 | Remove connection for pulsar+ssl://<host(broker1)>:<prot> 2023-09-26 15:42:06.515 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.392 s ... 2023-09-26 15:42:06.907 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool ... 2023-09-26 15:42:06.912 INFO [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>] ... 2023-09-26 15:42:07.103 INFO [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>] ... ``` To completely eliminate the possibility of the double reconnects, I suggest adjusting the timing of when reconnectionPending_ is set to false. Ideally, this should be done after the handleCreateConsumer method or the handleCreateProducer method has been completed. ### Modifications The timing for setting `reconnectionPending_` to false has been changed. (cherry picked from commit ebae92e)

Delay the timing of setting reconnectionPending to false to avoid dou…

722e6da

…ble attempt at reconnecting

nkurihar added the bug Something isn't working label Oct 13, 2023

BewareMyPower reviewed Oct 14, 2023

View reviewed changes

BewareMyPower added this to the 3.4.0 milestone Oct 16, 2023

Addressed PR comments

328d969

BewareMyPower approved these changes Oct 16, 2023

View reviewed changes

BewareMyPower merged commit ebae92e into apache:main Oct 16, 2023
11 checks passed

BewareMyPower mentioned this pull request Oct 16, 2023

[Bug] Some consumers not reconnecting properly #235

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Delay the timing of setting reconnectionPending to false to avoid double attempt at reconnecting #328

Delay the timing of setting reconnectionPending to false to avoid double attempt at reconnecting #328

shustsud commented Oct 13, 2023

BewareMyPower left a comment

BewareMyPower Oct 14, 2023

BewareMyPower Oct 14, 2023

shustsud commented Oct 16, 2023

		@@ -138,6 +138,8 @@ class Promise {

		bool setFailed(Result result) const { return state_->complete(result, {}); }

		bool setSuccess() const { return state_->complete({}, {}); }

Delay the timing of setting reconnectionPending to false to avoid double attempt at reconnecting #328

Delay the timing of setting reconnectionPending to false to avoid double attempt at reconnecting #328

Conversation

shustsud commented Oct 13, 2023

Motivation

Modifications

BewareMyPower left a comment

Choose a reason for hiding this comment

BewareMyPower Oct 14, 2023

Choose a reason for hiding this comment

BewareMyPower Oct 14, 2023

Choose a reason for hiding this comment

shustsud commented Oct 16, 2023