-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[improve] [broker] Not close the socket if lookup failed caused by bundle unloading or metadata ex #21211
Conversation
lookupFuture.complete(newLookupErrorResponse(ServerError.MetadataError, errorMsg, requestId)); | ||
} else { | ||
log.warn("Failed to lookup {} for topic {} with error {}", clientAppId, topicName, errorMsg); | ||
lookupFuture.complete(newLookupErrorResponse(ServerError.ServiceNotReady, errorMsg, requestId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to return ServiceNotReady
?
why not UnknownError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to guarantee that uncontrollable errprs continue the previous behavior
if (unwrapEx instanceof IllegalStateException) { | ||
// Current broker still hold the bundle's lock, but the bundle is being unloading. | ||
log.info("Failed to lookup {} for topic {} with error {}", clientAppId, topicName, errorMsg); | ||
lookupFuture.complete(newLookupErrorResponse(ServerError.MetadataError, errorMsg, requestId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do we know IllegalStateException
is always MetadataError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current case, the IllegalStateException
is only throwing when the namespace bundle is unloading. See https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java#L453-L455C36
} else if (nsData.get().isDisabled()) {
future.completeExceptionally(
new IllegalStateException(String.format("Namespace bundle %s is being unloaded", bundle)));
}
I agree with you, We should clearly define this exception. Since there are so many places that rely on the method NamespaceService.findBrokerServiceUrl
, such as PulsarWebResource.validateTopicOwnershipAsync. We need a separate PR to do focus on it.
if (unwrapEx instanceof IllegalStateException) { | ||
// Current broker still hold the bundle's lock, but the bundle is being unloading. | ||
log.info("Failed to lookup {} for topic {} with error {}", clientAppId, topicName, errorMsg); | ||
lookupFuture.complete(newLookupErrorResponse(ServerError.MetadataError, errorMsg, requestId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be a side effect here.
Because the connection can be reset, the previous producer could fail fast when bundle unloaded and move to a new Broker.
This now causes each partition's producer to have to wait for a timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the connection can be reset, the previous producer could fail fast when bundle unloaded and move to a new Broker.
The previous producer will finally receive a CommandCloseProducer
and try to reconnect even if the topic is closed without waiting for the client to disconnect, right?
This now causes each partition's producer to have to wait for a timeout.
The partition's producer will try to reconnect according to backoff
's rules, which will not result in a timeout.
I also improve the test to ensure the producer and consumer are still working. See testLookupConnectionNotCloseIfGetUnloadingExOrMetadataEx
.
Maybe I misunderstood what you meant, could you explain the details?
…ndle unloading or metadata ex
b3f9e6e
to
f133a9a
Compare
Rebase master |
Flaky test #21292 detected by the flaky test script. @poorbarcode do you have a chance to check that? |
Sure |
…ndle unloading or metadata ex (#21211) ### Motivation **Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup. Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient. There are two cases that should be improved: - If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError` - If the topic is unloading, the broker responds with a `ServiceNotReady` error. ### Modifications - Respond to the client with a `MetadataError` if the broker gets a metadata read/write error. - Respond to the client with a `MetadataError` if the topic is unloading (cherry picked from commit 09a1720)
…ndle unloading or metadata ex (#21211) ### Motivation **Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup. Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient. There are two cases that should be improved: - If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError` - If the topic is unloading, the broker responds with a `ServiceNotReady` error. ### Modifications - Respond to the client with a `MetadataError` if the broker gets a metadata read/write error. - Respond to the client with a `MetadataError` if the topic is unloading (cherry picked from commit 09a1720)
…ndle unloading or metadata ex (#21211) **Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup. Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient. There are two cases that should be improved: - If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError` - If the topic is unloading, the broker responds with a `ServiceNotReady` error. - Respond to the client with a `MetadataError` if the broker gets a metadata read/write error. - Respond to the client with a `MetadataError` if the topic is unloading (cherry picked from commit 09a1720)
…ndle unloading or metadata ex (apache#21211) ### Motivation **Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup. Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient. There are two cases that should be improved: - If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError` - If the topic is unloading, the broker responds with a `ServiceNotReady` error. ### Modifications - Respond to the client with a `MetadataError` if the broker gets a metadata read/write error. - Respond to the client with a `MetadataError` if the topic is unloading
…or namespace bundle ### Motivation When the broker failed to acquire the ownership of a namespace bundle by `LockBusyException`. It means there is another broker that has acquired the metadata store path and didn't release that path. For example: Broker 1: ``` 2024-01-24T23:35:36,626+0000 [metadata-store-10-1] WARN org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error org.apache.pulsar.broker.PulsarServerException: Failed to acquire ownership for namespace bundle <tenant>/<ns>/0x50000000_0x51000000 Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/<tenant>/<ns>/0x50000000_0x51000000 is already locked ``` Broker 2: ``` 2024-01-24T23:35:36,650+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO org.apache.pulsar.broker.PulsarService - Loaded 1 topics on <tenant>/<ns>/0x50000000_0x51000000 -- time taken: 0.044 seconds ``` After broker 2 released the lock at 23:35:36,650, the lookup request to broker 1 should tell the client that namespace bundle 0x50000000_0x51000000 is currently being unloaded and in the next retry the client will connect to the new owner broker. Here is another typical error: ``` 2024-01-24T23:57:57,264+0000 [pulsar-io-4-5] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error Namespace bundle <tenant>/<ns>/0x0d000000_0x0e000000 is being unloaded ``` Though after apache/pulsar#21211, the server error becomes `MetadataError` rather than `ServiceNotReady`. However, since the `ServerError` is `ServiceNotReady`, the client will close the connection. If there are many other producers or consumers on the same connection, they will all reestablish connection to the broker, which is unnecessary and brings much pressure to broker side. ### Modifications In `checkServerError`, when the error code is `ServiceNotReady`, check the error message as well, if it hit the case in `handleLookupError`, do not close the connection.
…ception ### Motivation When the broker failed to acquire the ownership of a namespace bundle by `LockBusyException`. It means there is another broker that has acquired the metadata store path and didn't release that path. For example: Broker 1: ``` 2024-01-24T23:35:36,626+0000 [metadata-store-10-1] WARN org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error org.apache.pulsar.broker.PulsarServerException: Failed to acquire ownership for namespace bundle <tenant>/<ns>/0x50000000_0x51000000 Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/<tenant>/<ns>/0x50000000_0x51000000 is already locked ``` Broker 2: ``` 2024-01-24T23:35:36,650+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO org.apache.pulsar.broker.PulsarService - Loaded 1 topics on <tenant>/<ns>/0x50000000_0x51000000 -- time taken: 0.044 seconds ``` After broker 2 released the lock at 23:35:36,650, the lookup request to broker 1 should tell the client that namespace bundle 0x50000000_0x51000000 is currently being unloaded and in the next retry the client will connect to the new owner broker. Here is another typical error: ``` 2024-01-24T23:57:57,264+0000 [pulsar-io-4-5] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error Namespace bundle <tenant>/<ns>/0x0d000000_0x0e000000 is being unloaded ``` Though after apache/pulsar#21211, the server error becomes `MetadataError` rather than `ServiceNotReady`. However, since the `ServerError` is `ServiceNotReady`, the client will close the connection. If there are many other producers or consumers on the same connection, they will all reestablish connection to the broker, which is unnecessary and brings much pressure to broker side. ### Modifications In `checkServerError`, when the error code is `ServiceNotReady`, check the error message as well, if it hit the case in `handleLookupError`, do not close the connection. Add `ConnectionTest` on a mocked `ClientConnection` object to verify `close()` will not be called.
…ception ### Motivation When the broker failed to acquire the ownership of a namespace bundle by `LockBusyException`. It means there is another broker that has acquired the metadata store path and didn't release that path. For example: Broker 1: ``` 2024-01-24T23:35:36,626+0000 [metadata-store-10-1] WARN org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error org.apache.pulsar.broker.PulsarServerException: Failed to acquire ownership for namespace bundle <tenant>/<ns>/0x50000000_0x51000000 Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/<tenant>/<ns>/0x50000000_0x51000000 is already locked ``` Broker 2: ``` 2024-01-24T23:35:36,650+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO org.apache.pulsar.broker.PulsarService - Loaded 1 topics on <tenant>/<ns>/0x50000000_0x51000000 -- time taken: 0.044 seconds ``` After broker 2 released the lock at 23:35:36,650, the lookup request to broker 1 should tell the client that namespace bundle 0x50000000_0x51000000 is currently being unloaded and in the next retry the client will connect to the new owner broker. Here is another typical error: ``` 2024-01-24T23:57:57,264+0000 [pulsar-io-4-5] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error Namespace bundle <tenant>/<ns>/0x0d000000_0x0e000000 is being unloaded ``` Though after apache/pulsar#21211, the server error becomes `MetadataError` rather than `ServiceNotReady`. However, since the `ServerError` is `ServiceNotReady`, the client will close the connection. If there are many other producers or consumers on the same connection, they will all reestablish connection to the broker, which is unnecessary and brings much pressure to broker side. ### Modifications In `checkServerError`, when the error code is `ServiceNotReady`, check the error message as well, if it hit the case in `handleLookupError`, do not close the connection. Add `ConnectionTest` on a mocked `ClientConnection` object to verify `close()` will not be called.
…ception (#390) ### Motivation When the broker failed to acquire the ownership of a namespace bundle by `LockBusyException`. It means there is another broker that has acquired the metadata store path and didn't release that path. For example: Broker 1: ``` 2024-01-24T23:35:36,626+0000 [metadata-store-10-1] WARN org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error org.apache.pulsar.broker.PulsarServerException: Failed to acquire ownership for namespace bundle <tenant>/<ns>/0x50000000_0x51000000 Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/<tenant>/<ns>/0x50000000_0x51000000 is already locked ``` Broker 2: ``` 2024-01-24T23:35:36,650+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO org.apache.pulsar.broker.PulsarService - Loaded 1 topics on <tenant>/<ns>/0x50000000_0x51000000 -- time taken: 0.044 seconds ``` After broker 2 released the lock at 23:35:36,650, the lookup request to broker 1 should tell the client that namespace bundle 0x50000000_0x51000000 is currently being unloaded and in the next retry the client will connect to the new owner broker. Here is another typical error: ``` 2024-01-24T23:57:57,264+0000 [pulsar-io-4-5] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error Namespace bundle <tenant>/<ns>/0x0d000000_0x0e000000 is being unloaded ``` Though after apache/pulsar#21211, the server error becomes `MetadataError` rather than `ServiceNotReady`. However, since the `ServerError` is `ServiceNotReady`, the client will close the connection. If there are many other producers or consumers on the same connection, they will all reestablish connection to the broker, which is unnecessary and brings much pressure to broker side. ### Modifications In `checkServerError`, when the error code is `ServiceNotReady`, check the error message as well, if it hit the case in `handleLookupError`, do not close the connection. Add `ConnectionTest` on a mocked `ClientConnection` object to verify `close()` will not be called.
…ndle unloading or metadata ex (#21211) ### Motivation **Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup. Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient. There are two cases that should be improved: - If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError` - If the topic is unloading, the broker responds with a `ServiceNotReady` error. ### Modifications - Respond to the client with a `MetadataError` if the broker gets a metadata read/write error. - Respond to the client with a `MetadataError` if the topic is unloading
…ndle unloading or metadata ex (apache#21211) ### Motivation **Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup. Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient. There are two cases that should be improved: - If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError` - If the topic is unloading, the broker responds with a `ServiceNotReady` error. ### Modifications - Respond to the client with a `MetadataError` if the broker gets a metadata read/write error. - Respond to the client with a `MetadataError` if the topic is unloading (cherry picked from commit 16349e6)
…ndle unloading or metadata ex (apache#21211) ### Motivation **Background**: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup. Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient. There are two cases that should be improved: - If the broker gets a metadata read/write error, the broker responds with a `ServiceNotReady` error, but it should respond with a `MetadataError` - If the topic is unloading, the broker responds with a `ServiceNotReady` error. ### Modifications - Respond to the client with a `MetadataError` if the broker gets a metadata read/write error. - Respond to the client with a `MetadataError` if the topic is unloading (cherry picked from commit 16349e6)
Motivation
Background: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup.
see
pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java
Lines 1193 to 1200 in af20a8a
Closing the socket causes the other consumer or producer to reconnect and does not make the lookup more efficient.
There are two cases that should be improved:
ServiceNotReady
error, but it should respond with aMetadataError
ServiceNotReady
error.Modifications
MetadataError
if the broker gets a metadata read/write error.MetadataError
if the topic is unloadingDocumentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: x