Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve] [broker] Not close the socket if lookup failed caused by bundle unloading or metadata ex #21211

Merged
merged 2 commits into from
Sep 27, 2023

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Sep 20, 2023

Motivation

Background: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup.
see

private void checkServerError(ServerError error, String errMsg) {
if (ServerError.ServiceNotReady.equals(error)) {
log.error("{} Close connection because received internal-server error {}", ctx.channel(), errMsg);
ctx.close();
} else if (ServerError.TooManyRequests.equals(error)) {
incrementRejectsAndMaybeClose();
}
}

Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient.

There are two cases that should be improved:

  • If the broker gets a metadata read/write error, the broker responds with a ServiceNotReady error, but it should respond with a MetadataError
  • If the topic is unloading, the broker responds with a ServiceNotReady error.

Modifications

  • Respond to the client with a MetadataError if the broker gets a metadata read/write error.
  • Respond to the client with a MetadataError if the topic is unloading

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 20, 2023
@poorbarcode poorbarcode self-assigned this Sep 20, 2023
@poorbarcode poorbarcode added release/3.0.2 release/2.11.3 release/2.10.6 category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost labels Sep 20, 2023
@poorbarcode poorbarcode added this to the 3.2.0 milestone Sep 20, 2023
lookupFuture.complete(newLookupErrorResponse(ServerError.MetadataError, errorMsg, requestId));
} else {
log.warn("Failed to lookup {} for topic {} with error {}", clientAppId, topicName, errorMsg);
lookupFuture.complete(newLookupErrorResponse(ServerError.ServiceNotReady, errorMsg, requestId));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to return ServiceNotReady?

why not UnknownError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to guarantee that uncontrollable errprs continue the previous behavior

if (unwrapEx instanceof IllegalStateException) {
// Current broker still hold the bundle's lock, but the bundle is being unloading.
log.info("Failed to lookup {} for topic {} with error {}", clientAppId, topicName, errorMsg);
lookupFuture.complete(newLookupErrorResponse(ServerError.MetadataError, errorMsg, requestId));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we know IllegalStateException is always MetadataError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current case, the IllegalStateException is only throwing when the namespace bundle is unloading. See https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java#L453-L455C36

} else if (nsData.get().isDisabled()) {
    future.completeExceptionally(
            new IllegalStateException(String.format("Namespace bundle %s is being unloaded", bundle)));
}

I agree with you, We should clearly define this exception. Since there are so many places that rely on the method NamespaceService.findBrokerServiceUrl, such as PulsarWebResource.validateTopicOwnershipAsync. We need a separate PR to do focus on it.

if (unwrapEx instanceof IllegalStateException) {
// Current broker still hold the bundle's lock, but the bundle is being unloading.
log.info("Failed to lookup {} for topic {} with error {}", clientAppId, topicName, errorMsg);
lookupFuture.complete(newLookupErrorResponse(ServerError.MetadataError, errorMsg, requestId));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a side effect here.
Because the connection can be reset, the previous producer could fail fast when bundle unloaded and move to a new Broker.
This now causes each partition's producer to have to wait for a timeout.

Copy link
Contributor Author

@poorbarcode poorbarcode Sep 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@315157973

Because the connection can be reset, the previous producer could fail fast when bundle unloaded and move to a new Broker.

The previous producer will finally receive a CommandCloseProducer and try to reconnect even if the topic is closed without waiting for the client to disconnect, right?

This now causes each partition's producer to have to wait for a timeout.

The partition's producer will try to reconnect according to backoff's rules, which will not result in a timeout.

I also improve the test to ensure the producer and consumer are still working. See testLookupConnectionNotCloseIfGetUnloadingExOrMetadataEx.

Maybe I misunderstood what you meant, could you explain the details?

@poorbarcode
Copy link
Contributor Author

Rebase master

@poorbarcode poorbarcode merged commit 09a1720 into apache:master Sep 27, 2023
46 checks passed
@lhotari
Copy link
Member

lhotari commented Oct 1, 2023

Flaky test #21292 detected by the flaky test script. @poorbarcode do you have a chance to check that?

@poorbarcode
Copy link
Contributor Author

Flaky test #21292 detected by the flaky test script. @poorbarcode do you have a chance to check that?

Sure

poorbarcode added a commit that referenced this pull request Oct 7, 2023
…ndle unloading or metadata ex (#21211)

### Motivation

**Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup.
Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient.

There are two cases that should be improved:
- If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError`
- If the topic is unloading, the broker responds with a `ServiceNotReady` error.

### Modifications
- Respond to the client with a `MetadataError` if the broker gets a metadata read/write error.
- Respond to the client with a `MetadataError` if the topic is unloading

(cherry picked from commit 09a1720)
poorbarcode added a commit that referenced this pull request Oct 8, 2023
…ndle unloading or metadata ex (#21211)

### Motivation

**Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup.
Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient.

There are two cases that should be improved:
- If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError`
- If the topic is unloading, the broker responds with a `ServiceNotReady` error.

### Modifications
- Respond to the client with a `MetadataError` if the broker gets a metadata read/write error.
- Respond to the client with a `MetadataError` if the topic is unloading

(cherry picked from commit 09a1720)
poorbarcode added a commit that referenced this pull request Oct 8, 2023
…ndle unloading or metadata ex (#21211)

**Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup.
Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient.

There are two cases that should be improved:
- If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError`
- If the topic is unloading, the broker responds with a `ServiceNotReady` error.

- Respond to the client with a `MetadataError` if the broker gets a metadata read/write error.
- Respond to the client with a `MetadataError` if the topic is unloading

(cherry picked from commit 09a1720)
liangyuanpeng pushed a commit to liangyuanpeng/pulsar that referenced this pull request Oct 11, 2023
…ndle unloading or metadata ex (apache#21211)

### Motivation

**Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup. 
Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient.

There are two cases that should be improved:
- If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError`
- If the topic is unloading, the broker responds with a `ServiceNotReady` error.

### Modifications
- Respond to the client with a `MetadataError` if the broker gets a metadata read/write error.
- Respond to the client with a `MetadataError` if the topic is unloading
BewareMyPower added a commit to BewareMyPower/pulsar-client-cpp that referenced this pull request Jan 30, 2024
…or namespace bundle

### Motivation

When the broker failed to acquire the ownership of a namespace bundle by
`LockBusyException`. It means there is another broker that has acquired
the metadata store path and didn't release that path. For example:

Broker 1:

```
2024-01-24T23:35:36,626+0000 [metadata-store-10-1] WARN  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error org.apache.pulsar.broker.PulsarServerException: Failed to acquire ownership for namespace bundle <tenant>/<ns>/0x50000000_0x51000000
   Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/<tenant>/<ns>/0x50000000_0x51000000 is already locked
```

Broker 2:

```
2024-01-24T23:35:36,650+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO  org.apache.pulsar.broker.PulsarService - Loaded 1 topics on <tenant>/<ns>/0x50000000_0x51000000 -- time taken: 0.044 seconds
```

After broker 2 released the lock at 23:35:36,650, the lookup request to
broker 1 should tell the client that namespace bundle
0x50000000_0x51000000 is currently being unloaded and in the next retry
the client will connect to the new owner broker.

Here is another typical error:

```
2024-01-24T23:57:57,264+0000 [pulsar-io-4-5] INFO  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error Namespace bundle <tenant>/<ns>/0x0d000000_0x0e000000 is being unloaded
```

Though after apache/pulsar#21211, the server
error becomes `MetadataError` rather than `ServiceNotReady`.

However, since the `ServerError` is `ServiceNotReady`, the client will
close the connection. If there are many other producers or consumers on
the same connection, they will all reestablish connection to the broker,
which is unnecessary and brings much pressure to broker side.

### Modifications

In `checkServerError`, when the error code is `ServiceNotReady`, check
the error message as well, if it hit the case in `handleLookupError`, do
not close the connection.
BewareMyPower added a commit to BewareMyPower/pulsar-client-cpp that referenced this pull request Jan 31, 2024
…ception

### Motivation

When the broker failed to acquire the ownership of a namespace bundle by
`LockBusyException`. It means there is another broker that has acquired
the metadata store path and didn't release that path. For example:

Broker 1:

```
2024-01-24T23:35:36,626+0000 [metadata-store-10-1] WARN  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error org.apache.pulsar.broker.PulsarServerException: Failed to acquire ownership for namespace bundle <tenant>/<ns>/0x50000000_0x51000000
   Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/<tenant>/<ns>/0x50000000_0x51000000 is already locked
```

Broker 2:

```
2024-01-24T23:35:36,650+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO  org.apache.pulsar.broker.PulsarService - Loaded 1 topics on <tenant>/<ns>/0x50000000_0x51000000 -- time taken: 0.044 seconds
```

After broker 2 released the lock at 23:35:36,650, the lookup request to
broker 1 should tell the client that namespace bundle
0x50000000_0x51000000 is currently being unloaded and in the next retry
the client will connect to the new owner broker.

Here is another typical error:

```
2024-01-24T23:57:57,264+0000 [pulsar-io-4-5] INFO  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error Namespace bundle <tenant>/<ns>/0x0d000000_0x0e000000 is being unloaded
```

Though after apache/pulsar#21211, the server
error becomes `MetadataError` rather than `ServiceNotReady`.

However, since the `ServerError` is `ServiceNotReady`, the client will
close the connection. If there are many other producers or consumers on
the same connection, they will all reestablish connection to the broker,
which is unnecessary and brings much pressure to broker side.

### Modifications

In `checkServerError`, when the error code is `ServiceNotReady`, check
the error message as well, if it hit the case in `handleLookupError`, do
not close the connection.

Add `ConnectionTest` on a mocked `ClientConnection` object to verify
`close()` will not be called.
BewareMyPower added a commit to BewareMyPower/pulsar-client-cpp that referenced this pull request Jan 31, 2024
…ception

### Motivation

When the broker failed to acquire the ownership of a namespace bundle by
`LockBusyException`. It means there is another broker that has acquired
the metadata store path and didn't release that path. For example:

Broker 1:

```
2024-01-24T23:35:36,626+0000 [metadata-store-10-1] WARN  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error org.apache.pulsar.broker.PulsarServerException: Failed to acquire ownership for namespace bundle <tenant>/<ns>/0x50000000_0x51000000
   Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/<tenant>/<ns>/0x50000000_0x51000000 is already locked
```

Broker 2:

```
2024-01-24T23:35:36,650+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO  org.apache.pulsar.broker.PulsarService - Loaded 1 topics on <tenant>/<ns>/0x50000000_0x51000000 -- time taken: 0.044 seconds
```

After broker 2 released the lock at 23:35:36,650, the lookup request to
broker 1 should tell the client that namespace bundle
0x50000000_0x51000000 is currently being unloaded and in the next retry
the client will connect to the new owner broker.

Here is another typical error:

```
2024-01-24T23:57:57,264+0000 [pulsar-io-4-5] INFO  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error Namespace bundle <tenant>/<ns>/0x0d000000_0x0e000000 is being unloaded
```

Though after apache/pulsar#21211, the server
error becomes `MetadataError` rather than `ServiceNotReady`.

However, since the `ServerError` is `ServiceNotReady`, the client will
close the connection. If there are many other producers or consumers on
the same connection, they will all reestablish connection to the broker,
which is unnecessary and brings much pressure to broker side.

### Modifications

In `checkServerError`, when the error code is `ServiceNotReady`, check
the error message as well, if it hit the case in `handleLookupError`, do
not close the connection.

Add `ConnectionTest` on a mocked `ClientConnection` object to verify
`close()` will not be called.
BewareMyPower added a commit to apache/pulsar-client-cpp that referenced this pull request Feb 2, 2024
…ception (#390)

### Motivation

When the broker failed to acquire the ownership of a namespace bundle by
`LockBusyException`. It means there is another broker that has acquired
the metadata store path and didn't release that path. For example:

Broker 1:

```
2024-01-24T23:35:36,626+0000 [metadata-store-10-1] WARN  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error org.apache.pulsar.broker.PulsarServerException: Failed to acquire ownership for namespace bundle <tenant>/<ns>/0x50000000_0x51000000
   Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/<tenant>/<ns>/0x50000000_0x51000000 is already locked
```

Broker 2:

```
2024-01-24T23:35:36,650+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO  org.apache.pulsar.broker.PulsarService - Loaded 1 topics on <tenant>/<ns>/0x50000000_0x51000000 -- time taken: 0.044 seconds
```

After broker 2 released the lock at 23:35:36,650, the lookup request to
broker 1 should tell the client that namespace bundle
0x50000000_0x51000000 is currently being unloaded and in the next retry
the client will connect to the new owner broker.

Here is another typical error:

```
2024-01-24T23:57:57,264+0000 [pulsar-io-4-5] INFO  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error Namespace bundle <tenant>/<ns>/0x0d000000_0x0e000000 is being unloaded
```

Though after apache/pulsar#21211, the server
error becomes `MetadataError` rather than `ServiceNotReady`.

However, since the `ServerError` is `ServiceNotReady`, the client will
close the connection. If there are many other producers or consumers on
the same connection, they will all reestablish connection to the broker,
which is unnecessary and brings much pressure to broker side.

### Modifications

In `checkServerError`, when the error code is `ServiceNotReady`, check
the error message as well, if it hit the case in `handleLookupError`, do
not close the connection.

Add `ConnectionTest` on a mocked `ClientConnection` object to verify
`close()` will not be called.
Technoboy- pushed a commit that referenced this pull request Feb 27, 2024
…ndle unloading or metadata ex (#21211)

### Motivation

**Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup. 
Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient.

There are two cases that should be improved:
- If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError`
- If the topic is unloading, the broker responds with a `ServiceNotReady` error.

### Modifications
- Respond to the client with a `MetadataError` if the broker gets a metadata read/write error.
- Respond to the client with a `MetadataError` if the topic is unloading
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 1, 2024
…ndle unloading or metadata ex (apache#21211)

### Motivation

**Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup.
Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient.

There are two cases that should be improved:
- If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError`
- If the topic is unloading, the broker responds with a `ServiceNotReady` error.

### Modifications
- Respond to the client with a `MetadataError` if the broker gets a metadata read/write error.
- Respond to the client with a `MetadataError` if the topic is unloading

(cherry picked from commit 16349e6)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 6, 2024
…ndle unloading or metadata ex (apache#21211)

### Motivation

**Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup.
Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient.

There are two cases that should be improved:
- If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError`
- If the topic is unloading, the broker responds with a `ServiceNotReady` error.

### Modifications
- Respond to the client with a `MetadataError` if the broker gets a metadata read/write error.
- Respond to the client with a `MetadataError` if the topic is unloading

(cherry picked from commit 16349e6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants