[Bug] Broker could take 30+ seconds to close with extensible load manager #22569

Closed
BewareMyPower opened this issue Apr 23, 2024 · 0 comments · Fixed by #22573 · May be fixed by BewareMyPower/pulsar#31
Labels
type/bug The PR fixed a bug or issue reported a bug


BewareMyPower commented Apr 23, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

master (89b201e)

Minimal reproduce step

// Reproduce test built on pulsar-broker's MultiBrokerBaseTest. Imports are added
// for completeness and assume the usual package locations in the Pulsar source tree.
import lombok.Cleanup;
import org.apache.pulsar.broker.MultiBrokerBaseTest;
import org.apache.pulsar.broker.ServiceConfiguration;
import org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.impl.PartitionedProducerImpl;
import org.testng.annotations.Test;

public class ExtensibleLoadManagerCloseTest extends MultiBrokerBaseTest {

    @Override
    protected void startBroker() throws Exception {
        // Apply the custom configs to the first broker as well.
        addCustomConfigs(conf);
        super.startBroker();
    }

    @Override
    protected ServiceConfiguration createConfForAdditionalBroker(int additionalBrokerIndex) {
        return addCustomConfigs(getDefaultConf());
    }

    private static ServiceConfiguration addCustomConfigs(ServiceConfiguration config) {
        config.setLoadManagerClassName(ExtensibleLoadManagerImpl.class.getName());
        config.setDefaultNumberOfNamespaceBundles(16);
        config.setLoadBalancerAutoBundleSplitEnabled(false);
        config.setLoadBalancerDebugModeEnabled(true);
        return config;
    }

    @Test
    public void test() throws Exception {
        final var topic = "test";
        admin.topics().createPartitionedTopic(topic, 20);
        admin.lookups().lookupPartitionedTopic(topic);
        @Cleanup final var client = PulsarClient.builder().serviceUrl(pulsar.getBrokerServiceUrl()).build();
        @Cleanup final var producer = (PartitionedProducerImpl<byte[]>) client.newProducer().topic(topic).create();
    }
}

What did you expect to see?

It should not take too long to complete.

What did you see instead?

It sometimes took about 54 seconds to complete and sometimes 22 seconds.

Anything else?

With the default load manager, it only takes about 15 seconds to complete.

Are you willing to submit a PR?

  • I'm willing to submit a PR!
BewareMyPower added the type/bug label Apr 23, 2024
BewareMyPower self-assigned this Apr 23, 2024
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Apr 24, 2024
…rService

Fixes apache#22569

### Motivation

`BrokerService#closeAsync` calls `unloadNamespaceBundlesGracefully` to
unload namespaces gracefully. With the extensible load manager, this
eventually calls `TableViewLoadDataStoreImpl#validateProducer`:

```
BrokerService#unloadNamespaceBundlesGracefully
  ExtensibleLoadManagerWrapper#disableBroker
    ExtensibleLoadManagerImpl#disableBroker
      ServiceUnitStateChannelImpl#cleanOwnerships
        ServiceUnitStateChannelImpl#doCleanup
          TableViewLoadDataStoreImpl#removeAsync
            TableViewLoadDataStoreImpl#validateProducer
```

In `validateProducer`, if the producer is not connected, the producer is
recreated synchronously. However, since the state of `PulsarService` has
already been changed to `Closing`, all connect and lookup requests fail
with `ServiceNotReady`, so the client keeps retrying until the operation
times out.
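
For illustration, here is a minimal sketch of the problematic pattern. The class and field names are illustrative stand-ins for `TableViewLoadDataStoreImpl`, not the actual implementation:

```
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;

// Illustrative sketch of the pattern described above (not the real class).
class LoadDataStoreSketch {
    private final PulsarClient client;
    private final String topic;
    private Producer<byte[]> producer;

    LoadDataStoreSketch(PulsarClient client, String topic) {
        this.client = client;
        this.topic = topic;
    }

    private void validateProducer() throws PulsarClientException {
        if (producer == null || !producer.isConnected()) {
            // Blocking (re)creation: once PulsarService is already in the
            // Closing state, the lookup fails with ServiceNotReady and the
            // client keeps retrying until the operation timeout expires.
            producer = client.newProducer().topic(topic).create();
        }
    }
}
```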

In addition, the unload operation itself can trigger such a reconnection,
because the extensible load manager sends the unload event to the
`loadbalancer-service-unit-state` topic.

### Modifications

The major fix:
Before changing PulsarService's state to `Closing`, call
`BrokerService#unloadNamespaceBundlesGracefully` first, so that the load
manager completes its unload operations while the broker can still serve
lookups.
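
A rough sketch of the reordering (the class below is illustrative; the real change is in `PulsarService#closeAsync`):

```
// Illustrative sketch of the ordering change, not the real PulsarService.
class ShutdownOrderSketch {
    enum State { Started, Closing, Closed }

    private volatile State state = State.Started;

    // Stand-in for BrokerService#unloadNamespaceBundlesGracefully, which with the
    // extensible load manager ends up in ExtensibleLoadManagerImpl#disableBroker.
    private void unloadNamespaceBundlesGracefully() {
    }

    void closeSketch() {
        // Major fix: unload while the broker can still serve lookups ...
        unloadNamespaceBundlesGracefully();
        // ... and only then flip the state, after which lookups fail with
        // ServiceNotReady for any late reconnection attempt.
        state = State.Closing;
        // the rest of the shutdown sequence follows
    }
}
```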

Minor fixes:
- Record the time when `LoadManager#disableBroker` completes.
- Don't check whether the producer is disconnected, because a disconnected
  producer will retry on its own (see the sketch below).
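
A sketch of the relaxed check, continuing the illustrative `LoadDataStoreSketch` above (not the actual patch):

```
// Only create the producer if it does not exist yet. A merely disconnected
// producer is left alone: the client reconnects and retries on its own
// instead of blocking the shutdown path.
private void validateProducer() throws PulsarClientException {
    if (producer == null) {
        producer = client.newProducer().topic(topic).create();
    }
}
```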

### Verifications

Add `ExtensibleLoadManagerCloseTest` to verify that closing `PulsarService`
does not take too long. Here are some local test results:

```
2024-04-24T19:43:38,851 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3342, 3276, 3310]
2024-04-24T19:44:26,711 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3357, 3258, 3298]
2024-04-24T19:46:16,791 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3313, 3257, 3263]
2024-04-24T20:13:05,763 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3304, 3279, 3299]
2024-04-24T20:13:43,979 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3343, 3308, 3310]
```

As shown above, each broker now takes only about 3 seconds to close, due to
the `OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS` value added in
apache#20315.
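
For context, a minimal sketch of how per-broker close times like the ones logged above could be measured; the helper below is hypothetical, not the committed test code:

```
import java.util.ArrayList;
import java.util.List;
import org.apache.pulsar.broker.PulsarService;

// Hypothetical helper: time each PulsarService#close() call in milliseconds.
class BrokerCloseTimer {
    static List<Long> closeAll(List<PulsarService> brokers) throws Exception {
        List<Long> closeTimesMs = new ArrayList<>();
        for (PulsarService broker : brokers) {
            long start = System.currentTimeMillis();
            broker.close(); // blocks until the broker shutdown completes
            closeTimesMs.add(System.currentTimeMillis() - start);
        }
        return closeTimesMs;
    }
}
```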