[fix][broker] Avoid being stuck in 30+ seconds when closing the BrokerService #31
Fixes apache#22569
Motivation
`BrokerService#closeAsync` calls `unloadNamespaceBundlesGracefully` to unload namespaces gracefully. With the extensible load manager, it eventually calls `TableViewLoadDataStoreImpl#validateProducer`.

In `validateProducer`, if the producer is not connected, it recreates the producer synchronously. However, since the state of `PulsarService` has already been changed to `Closing`, all connect or lookup requests fail with `ServiceNotReady`, and the client then retries until it times out.

Besides, the unload operation itself could also trigger the reconnection, because the extensible load manager sends the unload event to the `loadbalancer-service-unit-state` topic.

Modifications
The major fix:
Before changing `PulsarService`'s state to `Closing`, call `BrokerService#unloadNamespaceBundlesGracefully` first so that the load manager completes the unload operations.

Minor fixes:
`LoadManager#disableBroker` is done.

Verifications
Add `ExtensibleLoadManagerCloseTest` to verify that closing `PulsarService` won't take too much time. Here are some test results locally:

As you can see, each broker takes only about 3 seconds to close due to the `OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS` value added in apache#20315.
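
For reviewers who want the ordering issue in isolation, here is a minimal toy model of it. This is only a sketch: `CloseOrderSketch`, `ToyBroker`, `connect`, `unloadGracefully` and `close` are hypothetical stand-ins and not Pulsar classes or methods. It assumes only what the description above states, namely that connect and lookup requests are rejected once the state is `Closing`, and that the graceful unload may need to reconnect a producer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every name here is hypothetical and none of it is Pulsar code.
public class CloseOrderSketch {
    enum State { STARTED, CLOSING }

    static class ToyBroker {
        State state = State.STARTED;
        final List<String> log = new ArrayList<>();

        // Stand-in for a connect/lookup request: rejected once the broker is closing.
        boolean connect() {
            boolean ok = state != State.CLOSING;
            log.add(ok ? "connect: ok" : "connect: ServiceNotReady");
            return ok;
        }

        // Stand-in for the graceful unload, which needs a connected producer.
        boolean unloadGracefully() {
            boolean ok = connect();
            log.add(ok ? "unload: done" : "unload: client keeps retrying until timeout");
            return ok;
        }

        // unloadFirst == true models the order proposed in this PR;
        // unloadFirst == false models the previous order.
        boolean close(boolean unloadFirst) {
            boolean unloaded;
            if (unloadFirst) {
                unloaded = unloadGracefully();
                state = State.CLOSING;
            } else {
                state = State.CLOSING;
                unloaded = unloadGracefully();
            }
            return unloaded;
        }
    }

    // Returns whether the unload completed without hitting the rejected-connect path.
    static boolean unloadSucceeds(boolean unloadFirst) {
        return new ToyBroker().close(unloadFirst);
    }

    public static void main(String[] args) {
        System.out.println("state changed first:  unload ok = " + unloadSucceeds(false));
        System.out.println("unload executed first: unload ok = " + unloadSucceeds(true));
    }
}
```

In this model the only difference between the two paths is whether the state flips before or after the unload, which is the change described under the major fix. The real close path involves asynchronous operations and timeouts that the sketch does not attempt to reproduce.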