[Bug] Broker could take 30+ seconds to close with extensible load manager #22569

Closed
BewareMyPower opened this issue Apr 23, 2024 · 0 comments · Fixed by #22573 · May be fixed by BewareMyPower/pulsar#31
Labels
type/bug The PR fixed a bug or issue reported a bug


BewareMyPower commented Apr 23, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

master (89b201e)

Minimal reproduce step

// Reproduce test built on pulsar-broker's MultiBrokerBaseTest. Imports are added
// for completeness and assume the usual package locations in the Pulsar source tree.
import lombok.Cleanup;
import org.apache.pulsar.broker.MultiBrokerBaseTest;
import org.apache.pulsar.broker.ServiceConfiguration;
import org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.impl.PartitionedProducerImpl;
import org.testng.annotations.Test;

public class ExtensibleLoadManagerCloseTest extends MultiBrokerBaseTest {

    @Override
    protected void startBroker() throws Exception {
        // Apply the custom configs to the first broker as well.
        addCustomConfigs(conf);
        super.startBroker();
    }

    @Override
    protected ServiceConfiguration createConfForAdditionalBroker(int additionalBrokerIndex) {
        return addCustomConfigs(getDefaultConf());
    }

    private static ServiceConfiguration addCustomConfigs(ServiceConfiguration config) {
        config.setLoadManagerClassName(ExtensibleLoadManagerImpl.class.getName());
        config.setDefaultNumberOfNamespaceBundles(16);
        config.setLoadBalancerAutoBundleSplitEnabled(false);
        config.setLoadBalancerDebugModeEnabled(true);
        return config;
    }

    @Test
    public void test() throws Exception {
        final var topic = "test";
        admin.topics().createPartitionedTopic(topic, 20);
        admin.lookups().lookupPartitionedTopic(topic);
        @Cleanup final var client = PulsarClient.builder().serviceUrl(pulsar.getBrokerServiceUrl()).build();
        @Cleanup final var producer = (PartitionedProducerImpl<byte[]>) client.newProducer().topic(topic).create();
    }
}

What did you expect to see?

It should not take too long to complete.

What did you see instead?

It sometimes took about 54 seconds to complete and sometimes 22 seconds.

Anything else?

With the default load manager, it only takes about 15 seconds to complete.

Are you willing to submit a PR?

  • I'm willing to submit a PR!
BewareMyPower added the type/bug label Apr 23, 2024
BewareMyPower self-assigned this Apr 23, 2024
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Apr 24, 2024
…rService

Fixes apache#22569

### Motivation

`BrokerService#closeAsync` calls `unloadNamespaceBundlesGracefully` to
unload namespaces gracefully. With the extensible load manager, this
eventually calls `TableViewLoadDataStoreImpl#validateProducer`:

```
BrokerService#unloadNamespaceBundlesGracefully
  ExtensibleLoadManagerWrapper#disableBroker
    ExtensibleLoadManagerImpl#disableBroker
      ServiceUnitStateChannelImpl#cleanOwnerships
        ServiceUnitStateChannelImpl#doCleanup
          TableViewLoadDataStoreImpl#removeAsync
            TableViewLoadDataStoreImpl#validateProducer
```

In `validateProducer`, if the producer is not connected, the producer is
recreated synchronously. However, since the state of `PulsarService` has
already been changed to `Closing`, all connect and lookup requests fail
with `ServiceNotReady`, so the client keeps retrying until the operation
times out.
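
For illustration, here is a minimal sketch of the problematic pattern. The class and field names are illustrative stand-ins for `TableViewLoadDataStoreImpl`, not the actual implementation:

```
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;

// Illustrative sketch of the pattern described above (not the real class).
class LoadDataStoreSketch {
    private final PulsarClient client;
    private final String topic;
    private Producer<byte[]> producer;

    LoadDataStoreSketch(PulsarClient client, String topic) {
        this.client = client;
        this.topic = topic;
    }

    private void validateProducer() throws PulsarClientException {
        if (producer == null || !producer.isConnected()) {
            // Blocking (re)creation: once PulsarService is already in the
            // Closing state, the lookup fails with ServiceNotReady and the
            // client keeps retrying until the operation timeout expires.
            producer = client.newProducer().topic(topic).create();
        }
    }
}
```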

In addition, the unload operation itself can trigger such a reconnection,
because the extensible load manager sends the unload event to the
`loadbalancer-service-unit-state` topic.

### Modifications

The major fix:
Before changing PulsarService's state to `Closing`, call
`BrokerService#unloadNamespaceBundlesGracefully` first, so that the load
manager completes its unload operations while the broker can still serve
lookups.
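
A rough sketch of the reordering (the class below is illustrative; the real change is in `PulsarService#closeAsync`):

```
// Illustrative sketch of the ordering change, not the real PulsarService.
class ShutdownOrderSketch {
    enum State { Started, Closing, Closed }

    private volatile State state = State.Started;

    // Stand-in for BrokerService#unloadNamespaceBundlesGracefully, which with the
    // extensible load manager ends up in ExtensibleLoadManagerImpl#disableBroker.
    private void unloadNamespaceBundlesGracefully() {
    }

    void closeSketch() {
        // Major fix: unload while the broker can still serve lookups ...
        unloadNamespaceBundlesGracefully();
        // ... and only then flip the state, after which lookups fail with
        // ServiceNotReady for any late reconnection attempt.
        state = State.Closing;
        // the rest of the shutdown sequence follows
    }
}
```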

Minor fixes:
- Record the time when `LoadManager#disableBroker` completes.
- Don't check whether the producer is disconnected, because a disconnected
  producer will retry on its own (see the sketch below).
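
A sketch of the relaxed check, continuing the illustrative `LoadDataStoreSketch` above (not the actual patch):

```
// Only create the producer if it does not exist yet. A merely disconnected
// producer is left alone: the client reconnects and retries on its own
// instead of blocking the shutdown path.
private void validateProducer() throws PulsarClientException {
    if (producer == null) {
        producer = client.newProducer().topic(topic).create();
    }
}
```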

### Verifications

Add `ExtensibleLoadManagerCloseTest` to verify that closing `PulsarService`
does not take too long. Here are some local test results:

```
2024-04-24T19:43:38,851 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3342, 3276, 3310]
2024-04-24T19:44:26,711 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3357, 3258, 3298]
2024-04-24T19:46:16,791 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3313, 3257, 3263]
2024-04-24T20:13:05,763 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3304, 3279, 3299]
2024-04-24T20:13:43,979 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3343, 3308, 3310]
```

As shown above, each broker now takes only about 3 seconds to close, due to
the `OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS` value added in
apache#20315.
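
For context, a minimal sketch of how per-broker close times like the ones logged above could be measured; the helper below is hypothetical, not the committed test code:

```
import java.util.ArrayList;
import java.util.List;
import org.apache.pulsar.broker.PulsarService;

// Hypothetical helper: time each PulsarService#close() call in milliseconds.
class BrokerCloseTimer {
    static List<Long> closeAll(List<PulsarService> brokers) throws Exception {
        List<Long> closeTimesMs = new ArrayList<>();
        for (PulsarService broker : brokers) {
            long start = System.currentTimeMillis();
            broker.close(); // blocks until the broker shutdown completes
            closeTimesMs.add(System.currentTimeMillis() - start);
        }
        return closeTimesMs;
    }
}
```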