Delay the timing of setting reconnectionPending to false to avoid double attempt at reconnecting #328
Conversation
Good catch. It's similar to my fix for the Java client: apache/pulsar#20595
I've just left two comments for now and will review it again when I'm back to work next week.
lib/Future.h
Outdated
@@ -138,6 +138,8 @@ class Promise {

    bool setFailed(Result result) const { return state_->complete(result, {}); }

    bool setSuccess() const { return state_->complete({}, {}); }
I think we can replace this method with a setValue({}) call, or by passing a static variable to setValue.
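For illustration, here is a minimal sketch of that suggestion (assumed shapes only, not the actual lib/Future.h Promise): completing with a default-constructed value makes a dedicated setSuccess() wrapper unnecessary.

```cpp
// Minimal sketch, assuming a Promise that completes a shared state with a
// (Result, value) pair. Not the real pulsar-client-cpp Promise.
enum Result { ResultOk, ResultRetryable };

template <typename T>
class Promise {
   public:
    bool setValue(const T& value) const { return complete(ResultOk, value); }
    bool setFailed(Result result) const { return complete(result, T{}); }
    // No setSuccess() needed: callers can write promise.setValue({}) instead.

   private:
    bool complete(Result result, const T& value) const {
        (void)result;
        (void)value;
        return true;  // placeholder: the real shared state completes exactly once
    }
};

int main() {
    Promise<int> p;
    p.setValue({});  // same intent as a dedicated setSuccess()
}
```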
lib/HandlerBase.cc
Outdated
connectionOpened(cnx).addListener([this, self](Result result, bool) {
    // Do not use bool, only Result.
    reconnectionPending_ = false;
    if (result == ResultRetryable) {
Use isResultRetryable() from ResultUtil.h here. Though I admit using two Result enums as retryable enums seems weird.
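As a sketch of what that check could look like (only isResultRetryable and ResultRetryable come from the comments above; the helper body, and ResultDisconnected as the second retryable enum, are assumptions):

```cpp
// Sketch only: stands in for the helper in ResultUtil.h so the call site
// reads isResultRetryable(result) instead of comparing one enum value.
enum Result { ResultOk, ResultRetryable, ResultDisconnected };

// Assumed shape: returns true for every Result treated as retryable.
inline bool isResultRetryable(Result result) {
    return result == ResultRetryable || result == ResultDisconnected;
}

void onConnectionOpenedResult(Result result) {
    if (isResultRetryable(result)) {  // instead of: result == ResultRetryable
        // schedule another reconnection attempt
    }
}

int main() { onConnectionOpenedResult(ResultRetryable); }
```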
@BewareMyPower
Delay the timing of setting reconnectionPending to false to avoid double attempt at reconnecting (apache#328)
(cherry picked from commit ebae92e)
Related Issue: #235
Motivation
A potential double scheduling of reconnection due to a broker shutdown was observed.
The reconnect can be scheduled from either of the following code paths:
https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L1209
or
https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ClientConnection.cc#L1350
-> https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L121
If a second reconnection request is received during the first reconnection attempt, it triggers additional reconnection attempts. If the second reconnection is successful, the consumer is removed from cnx:
https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L285
-> https://github.com/apache/pulsar-client-cpp/blob/af45a54c10ec5b06e80b683010afd3531457ac64/lib/HandlerBase.cc#L63
--> https://github.com/apache/pulsar-client-cpp/blob/b35ae1aa4b9834886c0889635de81834f9b2f774/lib/ConsumerImpl.cc#L217
The problem is that the consumer will no longer be able to handle events coming from the broker.
To cope with this issue, a new flag reconnectionPending_ was introduced via #310. However, while that change reduces the likelihood of the problem occurring, it doesn't eliminate it entirely.
In fact, double reconnects have been observed even after #310 (I tried with b35ae1a):
```
# Consumer is connected to broker1, but broker1 shutdown closes Consumer and reconnection is scheduled.
...
2023-09-26 15:42:05.736 INFO [140591970158336] ConsumerImpl:1207 | Broker notification of Closed consumer: 5
2023-09-26 15:42:05.736 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.1 s
...
# Consumer attempts to connect to broker1, but fails, and a reconnection is scheduled again.
...
2023-09-26 15:42:05.836 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
2023-09-26 15:42:05.837 WARN [140591970158336] ClientConnection:1741 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Received error response from server: Retryable (Namespace is being unloaded, cannot add topic persistent://shustsud-test2/test/partitioned-topic-partition-5) -- req_id: 16
2023-09-26 15:42:05.837 WARN [140591970158336] ConsumerImpl:317 | [0x7fde18046510, dummy_24, 5] Failed to reconnect consumer: Retryable
2023-09-26 15:42:05.837 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.194 s
...
# During the connection attempt, the connection to broker1 is closed and further reconnection is scheduled.
# After that, two subscribe requests are sent to broker2.
2023-09-26 15:42:06.034 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.515 ERROR [140591970158336] ClientConnection:1330 | [<host(client)>:55304 -> <host(broker1)>:<prot>] Connection closed with ConnectError
2023-09-26 15:42:06.515 INFO [140591970158336] ConnectionPool:122 | Remove connection for pulsar+ssl://<host(broker1)>:<prot>
2023-09-26 15:42:06.515 INFO [140591970158336] HandlerBase:147 | [0x7fde18046510, dummy_24, 5] Schedule reconnection in 0.392 s
...
2023-09-26 15:42:06.907 INFO [140591970158336] HandlerBase:80 | [0x7fde18046510, dummy_24, 5] Getting connection from pool
...
2023-09-26 15:42:06.912 INFO [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>]
...
2023-09-26 15:42:07.103 INFO [140591970158336] ConsumerImpl:282 | [0x7fde18046510, dummy_24, 5] Created consumer on broker [<host(client)>:54582 -> <host(broker2)>:<prot>]
...
```
To completely eliminate the possibility of double reconnects, I suggest adjusting the timing of when reconnectionPending_ is set to false. Ideally, this should be done after the handleCreateConsumer or handleCreateProducer method has completed.
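To make the race concrete, here is a simplified model (names and structure are assumptions, not the actual HandlerBase code) of why clearing the flag in the connection-opened listener, before the subscribe response is handled, leaves a window for a second scheduled reconnection:

```cpp
// Simplified model of the guard described above; the real HandlerBase logic
// is more involved and these names are assumptions.
#include <atomic>
#include <functional>

class Handler {
   public:
    void scheduleReconnection() {
        bool expected = false;
        // Only one reconnection may be pending at a time.
        if (!reconnectionPending_.compare_exchange_strong(expected, true)) {
            return;  // a reconnection is already scheduled or in progress
        }
        connectionOpened([this](int /*Result*/ result) {
            // Clearing the flag here, before the subscribe (or producer-create)
            // response has been handled, re-opens the guard: a broker close
            // arriving in that window schedules a second reconnection, which
            // ends with two subscribe requests as in the log above.
            reconnectionPending_ = false;
            (void)result;
        });
    }

   private:
    // Stand-in for the asynchronous connection + subscribe flow.
    void connectionOpened(std::function<void(int)> listener) { listener(0); }

    std::atomic<bool> reconnectionPending_{false};
};
```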
Modifications
The timing for setting reconnectionPending_ to false has been changed.
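A sketch of the adjusted timing, under the same assumed names as the model above (not the actual patch): the flag is only cleared once the create-consumer/create-producer handling has finished, so a reconnection request arriving in between is still rejected by the guard.

```cpp
// Sketch of the changed timing (assumed names): the flag is cleared in a
// callback that runs after handleCreateConsumer / handleCreateProducer
// completes, not as soon as the connection is opened.
#include <atomic>
#include <functional>

class Handler {
   public:
    void scheduleReconnection() {
        bool expected = false;
        if (!reconnectionPending_.compare_exchange_strong(expected, true)) {
            return;  // a reconnection is already pending
        }
        connectionOpened([this](int result) {
            handleCreateConsumerOrProducer(result, [this](int /*finalResult*/) {
                // Cleared only after the subscribe / producer-create response
                // has been fully handled, closing the window for a duplicate
                // reconnection.
                reconnectionPending_ = false;
            });
        });
    }

   private:
    // Stand-ins for the asynchronous connection and create-consumer/producer steps.
    void connectionOpened(std::function<void(int)> listener) { listener(0); }
    void handleCreateConsumerOrProducer(int result, std::function<void(int)> done) { done(result); }

    std::atomic<bool> reconnectionPending_{false};
};
```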