[BUG] Spurious stack traces in log when EventHubProducerClient encounters onSessionRemoteClose #41584

Open
djarnis73 opened this issue Aug 21, 2024 · 32 comments
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@djarnis73

djarnis73 commented Aug 21, 2024

Describe the bug
We have a service that sends messages to Event Hubs using the com.azure:azure-messaging-eventhubs:5.18.6 artifact, with very low volume (fewer than 200 msg/day).

Recently (since Aug. 17) we started seeing regular exceptions in our logs, which indicate a bug in the library that leads to dropped messages.

Exception or Stack Trace
We see a batch of four log.error entries and one log.warn from the library. First we get a log.error from reactor.core.publisher.Operators:

reactor.core.Exceptions$ErrorCallbackNotImplemented: com.azure.core.amqp.exception.AmqpException: onSessionRemoteClose connectionId[MF_a5ecd4_1724187994893], entityName[cbs-session] condition[Error{condition=amqp:connection:forced, description='The connection was closed by container '69f22c69b0244cbe8b723d4b44f2efe7_G26' because it did not have any active links in the past 300000 milliseconds. TrackingId:69f22c69b0244cbe8b723d4b44f2efe7_G26, SystemTracker:gateway5, Timestamp:2024-08-20T22:42:17', info=null}], errorContext[NAMESPACE: evh-bibis-prod-001.servicebus.windows.net. ERROR CONTEXT: N/A, PATH: cbs-session]
Caused by: com.azure.core.amqp.exception.AmqpException: onSessionRemoteClose connectionId[MF_a5ecd4_1724187994893], entityName[cbs-session] condition[Error{condition=amqp:connection:forced, description='The connection was closed by container '69f22c69b0244cbe8b723d4b44f2efe7_G26' because it did not have any active links in the past 300000 milliseconds. TrackingId:69f22c69b0244cbe8b723d4b44f2efe7_G26, SystemTracker:gateway5, Timestamp:2024-08-20T22:42:17', info=null}], errorContext[NAMESPACE: evh-bibis-prod-001.servicebus.windows.net. ERROR CONTEXT: N/A, PATH: cbs-session]
	at com.azure.core.amqp.implementation.ExceptionUtil.toException(ExceptionUtil.java:90)
	at com.azure.core.amqp.implementation.handler.SessionHandler.onSessionRemoteClose(SessionHandler.java:139)
	at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:152)
	at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
	at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
	at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
	at com.azure.core.amqp.implementation.ReactorExecutor.run(ReactorExecutor.java:91)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

Then we get a log.warn from com.azure.core.amqp.implementation.RequestResponseChannel:

The connection was closed by container '34a13da5722a4a3eae91914bf10d8530_G16' because it did not have any active links in the past 300000 milliseconds. TrackingId:34a13da5722a4a3eae91914bf10d8530_G16, SystemTracker:gateway5, Timestamp:2024-08-21T08:27:26, errorContext[NAMESPACE: evh-bibis-prod-001.servicebus.windows.net. ERROR CONTEXT: N/A, PATH: $cbs, REFERENCE_ID: cbs:sender, LINK_CREDIT: 98]

Then another log.error from reactor.core.publisher.Operators with message Operator called default onErrorDropped and stacktrace:

reactor.core.Exceptions$ErrorCallbackNotImplemented: com.azure.core.amqp.exception.AmqpException: onSessionRemoteClose connectionId[MF_029ac1_1724193737281], entityName[cbs-session] condition[Error{condition=amqp:connection:forced, description='The connection was closed by container '34a13da5722a4a3eae91914bf10d8530_G16' because it did not have any active links in the past 300000 milliseconds. TrackingId:34a13da5722a4a3eae91914bf10d8530_G16, SystemTracker:gateway5, Timestamp:2024-08-21T08:27:26', info=null}], errorContext[NAMESPACE: evh-bibis-prod-001.servicebus.windows.net. ERROR CONTEXT: N/A, PATH: cbs-session]
Caused by: com.azure.core.amqp.exception.AmqpException: onSessionRemoteClose connectionId[MF_029ac1_1724193737281], entityName[cbs-session] condition[Error{condition=amqp:connection:forced, description='The connection was closed by container '34a13da5722a4a3eae91914bf10d8530_G16' because it did not have any active links in the past 300000 milliseconds. TrackingId:34a13da5722a4a3eae91914bf10d8530_G16, SystemTracker:gateway5, Timestamp:2024-08-21T08:27:26', info=null}], errorContext[NAMESPACE: evh-bibis-prod-001.servicebus.windows.net. ERROR CONTEXT: N/A, PATH: cbs-session]
	at com.azure.core.amqp.implementation.ExceptionUtil.toException(ExceptionUtil.java:90)
	at com.azure.core.amqp.implementation.handler.SessionHandler.onSessionRemoteClose(SessionHandler.java:139)
	at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:152)
	at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
	at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
	at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
	at com.azure.core.amqp.implementation.ReactorExecutor.run(ReactorExecutor.java:91)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

Then another log.error from com.azure.core.amqp.implementation.AmqpChannelProcessor with the exception message:

Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null

And finally an internal NPE, logged by reactor.core.publisher.Operators (which I suspect leads to the dropped message):

reactor.core.Exceptions$ErrorCallbackNotImplemented: java.lang.NullPointerException: Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null
Caused by: java.lang.NullPointerException: Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null
	at org.apache.qpid.proton.engine.impl.ConnectionImpl.session(ConnectionImpl.java:91)
	at org.apache.qpid.proton.engine.impl.ConnectionImpl.session(ConnectionImpl.java:39)
	at com.azure.core.amqp.implementation.ReactorConnection.lambda$createSession$15(ReactorConnection.java:311)
	at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(Unknown Source)
	at com.azure.core.amqp.implementation.ReactorConnection.lambda$createSession$16(ReactorConnection.java:308)
	at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:113)
	at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onNext(MonoPeekTerminal.java:180)
	at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:137)
	at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.request(MonoIgnoreThen.java:164)
	at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.request(FluxHide.java:152)
	at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.request(MonoPeekTerminal.java:139)
	at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
	at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
	at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
	at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
	at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.request(FluxPeekFuseable.java:144)
	at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.request(Operators.java:2331)
	at com.azure.core.amqp.implementation.AmqpChannelProcessor.requestUpstream(AmqpChannelProcessor.java:336)
	at com.azure.core.amqp.implementation.AmqpChannelProcessor.lambda$onError$4(AmqpChannelProcessor.java:230)
	at reactor.core.publisher.LambdaMonoSubscriber.onNext(LambdaMonoSubscriber.java:171)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:270)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:285)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

To Reproduce
I do not have steps to reproduce, but I can see we are not the only ones seeing the issue (it looks related to #41535 and perhaps 41489), but I do suspect a simple app that sends test messages at a very low frequency will trigger the issue.
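
For illustration, a minimal reproducer sketch of that hypothesis (the connection string is a placeholder, and the 300000 ms figure is taken from the error text above):

import com.azure.messaging.eventhubs.EventData;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerClient;

import java.util.List;

public class LowFrequencySender {
    public static void main(String[] args) throws InterruptedException {
        EventHubProducerClient producer = new EventHubClientBuilder()
                .connectionString("<event-hub-connection-string>") // placeholder
                .buildProducerClient();

        producer.send(List.of(new EventData("first message")));

        // Stay idle longer than the 300000 ms inactivity window mentioned in the error.
        Thread.sleep(6 * 60 * 1000);

        // The broker should have closed the connection by now; this send should force
        // a reconnect and, per this issue, produce the spurious stack traces in the log.
        producer.send(List.of(new EventData("second message")));

        producer.close();
    }
}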

Code Snippet
We instantiate our client very simply:

@Bean
public EventHubProducerClient eventHubProducerClient() {
    return new EventHubClientBuilder()
            .connectionString(properties.connectionString())
            .shareConnection()
            .buildProducerClient();
}

And the sending code is like:

public void sendEvent(String eventBody) {
    eventHubProducerClient.send(List.of(new EventData(eventBody)));
}

Expected behavior
If the server disconnects the client due to inactivity, the client should notice and reconnect on the next send attempt.

Setup (please complete the following information):

  • Library/Libraries: com.azure:azure-messaging-eventhubs:5.18.6 / com.azure:azure-identity:1.13.2
  • Java version: 21
  • App Server/Environment: Tomcat in spring boot running in a docker container
  • Frameworks: Spring Boot

Additional information
I suspect that an infrastructure change applied around August 17, perhaps lowering the timeout on the server side, has led to this surfacing.

@github-actions github-actions bot added Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs labels Aug 21, 2024

@anuchandy @conniey @lmolkova

@github-actions github-actions bot added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Aug 21, 2024

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@clevelandcs

We're also encountering this issue. Judging by the other two issues mentioned above, it sounds like it's pretty widespread for apps that have a low-volume period. The comment on #41535 about a V2 beta opt-in isn't very preferable as an immediate fix for production environments. As it stands, we're looking at whether a version downgrade helps the problem, or whether switching to the non-async client and initializing/disposing of the client periodically is a viable workaround. Having a fix for the async client would be very much preferred, and I'm surprised this made it through testing when it sounds like there are tests for low-frequency clients in place.

@anuchandy
Member

anuchandy commented Aug 22, 2024

To clarify, the cause of Service Bus issue 41489 is unrelated, though both logs show link disconnects; Service Bus 41489 was a thread bottleneck problem.

Service Bus has been running on the V2 stack engine for several months and it is generally available. Event Hubs is currently being integrated into the V2 stack engine and is in the testing and feature-parity phase, with a beta release in progress. From our local testing, Event Hubs with the V2 stack engine does not face this issue observed in lower-traffic environments.

By the way @djarnis73, I'll clean up your comment in 41489 to prevent any confusion for others running into issues similar to that Service Bus case or this Event Hubs issue. Hope that's alright with you.

@djarnis73
Author

Sure, go ahead. I'm not too deep into our application or the services it uses; I just happened to be the one investigating the issues our log monitoring found. So there is a lot of guesswork on my side.

@djarnis73
Author

djarnis73 commented Aug 22, 2024

I'm going to try to implement scheduled sending of keep-alive messages (once every minute) to see if that will stop the bleeding; this will of course require the receiving end to be able to cope with these messages. Will revert with my findings.
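
For reference, a minimal sketch of that keep-alive idea, assuming a Spring @Scheduled task (with @EnableScheduling on the application) and a placeholder "keep-alive" payload the consumer must tolerate:

import com.azure.messaging.eventhubs.EventData;
import com.azure.messaging.eventhubs.EventHubProducerClient;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.List;

@Component
public class EventHubKeepAlive {

    private final EventHubProducerClient eventHubProducerClient;

    public EventHubKeepAlive(EventHubProducerClient eventHubProducerClient) {
        this.eventHubProducerClient = eventHubProducerClient;
    }

    // Send once a minute so the AMQP connection never sits idle long enough
    // for the broker's 300000 ms inactivity timeout to kick in.
    @Scheduled(fixedRate = 60_000)
    public void sendKeepAlive() {
        eventHubProducerClient.send(List.of(new EventData("keep-alive")));
    }
}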

@djarnis73
Author

djarnis73 commented Aug 22, 2024

I think trying out the V2 beta would be an option for us. Where can I find the details on how to test it?

@djarnis73
Author

So I'm pretty clueless when it comes to reactive programming, but I did some googling and was wondering if adding something like this:

static {
    reactor.core.publisher.Hooks.onErrorDropped(t -> {
        log.error("default onErrorDropped received, rethrowing", t);
        throw new RuntimeException(t);
    });
}

would ensure that the exception travels up the reactive stack and is finally thrown to the caller, so that we get notified about the failure? In our case we are passing messages around, and if we got an exception we could put the original message on a dead-letter queue to avoid losing it.
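
For that dead-letter idea, a minimal sketch (the deadLetterQueue collaborator is hypothetical); note it only covers errors that actually surface from the synchronous send() call, not errors dropped inside the reactive pipeline as described in the reply below:

public void sendEvent(String eventBody) {
    try {
        eventHubProducerClient.send(List.of(new EventData(eventBody)));
    } catch (RuntimeException e) {
        // Only failures thrown by the blocking send() land here; errors dropped
        // by Reactor's onErrorDropped hook never reach this call site.
        log.error("Send failed, routing message to dead-letter queue", e);
        deadLetterQueue.put(eventBody); // hypothetical dead-letter collaborator
    }
}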

@anuchandy
Member

Hi @djarnis73, unfortunately re-throwing the error from the global Reactor hook will not bubble the error up to the actual send API call that the application made. If we look at the first call stack in the issue description, it does not have the send API call in the stack, so the error we re-throw from the hook will be signaled to whichever thread happens to invoke the hook, and likely ends up in the default uncaught exception handler associated with that thread.

@conniey
Member

conniey commented Aug 22, 2024

I think trying out the V2 beta would be an option for us. Where can I find the details on how to test it?

The beta is available as azure-messaging-eventhubs 5.19.0-beta.2. Setting the environment variable com.azure.messaging.eventhubs.v2=true when running your application will use the V2 stack; alternatively, it can be opted in via the configuration(..) API on the builder, as shown below:

new EventHubClientBuilder()
        .connectionString(properties.connectionString())
        .configuration(new com.azure.core.util.ConfigurationBuilder()
                .putProperty("com.azure.messaging.eventhubs.v2", "true")
                .build())
        .buildProducerClient();

@djarnis73
Author

I'm going to try to implement scheduled sending of keep-alive messages (once every minute) to see if that will stop the bleeding; this will of course require the receiving end to be able to cope with these messages. Will revert with my findings.

Initial results of this look good: no exceptions since it was deployed around 18 hours ago. But I do feel it is only a matter of time before another onSessionRemoteClose (perhaps triggered by something other than a timeout), so I still think I will try out the V2.

@djarnis73
Author

djarnis73 commented Aug 27, 2024

I deployed a version of our app to our test environment with V2 enabled and my manual keep-alive removed. It does not seem to resolve the issue (the errors in the logs are back). So it looks like V2 has not fixed the issue. Any clues on how to investigate this further (like enabling debug logging for a specific logger to ensure I have enabled V2 correctly)?

@anuchandy
Member

Hi @djarnis73 - regarding logging, a reference log4j2.xml and logback.xml with DEBUG logging enabled can be found here.

To verify the V2 stack is enabled, we can check if there are logs with the class name "com.azure.core.amqp.implementation.ReactorConnectionCache", which should be present at the start of the logs when a connection is established.

To give additional context, even with V2 we will still see the disconnect events - "onSessionRemoteClose", "the connection was closed by container…". The broker will disconnect if there is no activity, and the next send attempt by the application should force the library to reconnect.

Can we check:

  1. If V2 is enabled, by inspecting whether there are logs from the "ReactorConnectionCache" class,
  2. If we still see this error with V2 - "Error: java.lang.NullPointerException: Cannot invoke java.util.List.add(Object) because this._sessions is null",
  3. If the next send attempt reconnects and sends events even though there were session/connection closes/disconnects.

@clevelandcs

clevelandcs commented Aug 28, 2024

@anuchandy
We have also been testing with beta2.

  1. Confirmed ReactorConnectionCache appears in logs at startup
  2. Still seeing "Cannot invoke java.util.List.add(Object) because this._sessions is null"
  3. The observed behavior is:
    a. App startup
    b. Connection established
    c. Send events processed
    d. Idle timeout - connection closed by container

After the first idle-timeout "connection closed by container" we see the idle timeout exception 3 times over a period of about 1 minute, immediately followed by the stack trace related to "Cannot invoke java.util.List.add(Object) because this._sessions is null". After this point we see one final error message from the SDK and no further connection attempts: "az.sdk.message":"Error in AMQP channel processor."

@anuchandy
Member

@clevelandcs, thanks for getting back.

  1. Could you share the full DEBUG logs (using the log settings in the reference log4j2.xml/logback.xml) since app startup?

  2. After the final error, did you find that the next send API call gets processed successfully (the 3rd point to check in the previous comment)?

@djarnis73
Author

I'm a bit hung up due to some problems regarding deployment to our test environment. But I have been analyzing the number of events on the producer and consumer side and they are identical, so I do not think the library is dropping events (on the producer side at least). So I think we should change the title of this issue to something along the lines of "Spurious stack traces in log when EventHubProducerClient encounters onSessionRemoteClose".

@anuchandy
Member

Thanks for the update @djarnis73. We couldn't replicate the NPE locally or in our testing setup, but I think the rework in this PR should resolve the potential NPE edge cases.

(I'll update issue title per your latest comment)

@anuchandy anuchandy changed the title from "[BUG] When EventHubProducerClient encounters onSessionRemoteClose it starts dropping messages" to "[BUG] Spurious stack traces in log when EventHubProducerClient encounters onSessionRemoteClose" on Aug 30, 2024
@djarnis73
Author

Ok, so I managed to deploy V2 to our test environment on Friday, so it has now been running over the weekend. What I have observed so far is:

  • ReactorConnectionCache is logging (I see messages like: Waiting to connect and active. / Emitting the new active connection. / closing connection (Terminating the connection recovery support.). / closing connection (Terminating the connection recovery support.).)
  • I still see the NPE (it has happened 5 times since Friday); the full stack trace is below.
java.lang.NullPointerException: Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null
	at org.apache.qpid.proton.engine.impl.ConnectionImpl.session(ConnectionImpl.java:91)
	at org.apache.qpid.proton.engine.impl.ConnectionImpl.session(ConnectionImpl.java:39)
	at com.azure.core.amqp.implementation.ReactorConnection.lambda$createSession$15(ReactorConnection.java:311)
	at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(Unknown Source)
	at com.azure.core.amqp.implementation.ReactorConnection.lambda$createSession$16(ReactorConnection.java:308)
	at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:113)
	at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onNext(MonoPeekTerminal.java:180)
	at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:137)
	at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.request(MonoIgnoreThen.java:164)
	at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.request(FluxHide.java:152)
	at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.request(MonoPeekTerminal.java:139)
	at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
	at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
	at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
	at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
	at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.request(FluxPeekFuseable.java:144)
	at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.request(Operators.java:2331)
	at com.azure.core.amqp.implementation.AmqpChannelProcessor.requestUpstream(AmqpChannelProcessor.java:336)
	at com.azure.core.amqp.implementation.AmqpChannelProcessor.lambda$onError$4(AmqpChannelProcessor.java:230)
	at reactor.core.publisher.LambdaMonoSubscriber.onNext(LambdaMonoSubscriber.java:171)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:270)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:285)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

Let me know what else I can provide (I just now managed to enable full debug logging, so I can probably provide some of those logs tomorrow).

@anuchandy
Member

Thanks @djarnis73. Could you retrieve the logs for 20 minutes prior to and 20 minutes following one of the NPE instances? Please also keep the thread name and timestamp in the logs.

@djarnis73
Author

djarnis73 commented Sep 4, 2024

Hi

I have uploaded an extract from our Datadog logs ±20 minutes around the stack trace (nothing happens from 2 seconds after the exception onward, so it is mainly stuff from before). It is a CSV with a timestamp column and then a JSON object with the full structured log data.

Let me know if it is useful.

@anuchandy
Member

@djarnis73, I'm looking at the logs. Could you please keep the link names in the logs so we can correlate actions on the same links? (The link name is not sensitive data, in case you're worried about that.) Also, could you pull an additional 40 minutes of earlier logs, i.e. 60 minutes in total around the stack trace, so we can also trace a few token refresh operations.

@djarnis73
Author

djarnis73 commented Sep 5, 2024

Hi @anuchandy, the link names are all the same; I did a simple search-and-replace for a single word, so you can treat all the entries as being the same. If you need to correlate it with the server-side value, I can share it with you.

@anuchandy
Member

Thanks @djarnis73 for confirming that there is only one link; there is no need to correlate with the server traces at the moment.

  1. Could you share the additional 40 minutes of logs as requested, so 60 minutes of logs in total, in the range [2024-09-03T21:40:00.000Z to 2024-09-03T22:40:14.293Z]?
  2. Could you share the resources (CPU, memory) allocated for the test-env Docker instance from which these logs were extracted?

@djarnis73
Author

Hi @anuchandy

I have attached the extended log extract (I also added a few extra columns to try to work around the fact that Datadog does not handle the 'message' being logged in az.sdk.message very well).

I don't have direct access to CPU/memory info about the Docker instance, but I will reach out to our DevOps team to see if I can provide that information. I did check the metrics, and CPU usage for this service is never above 1% (low volume, very simple functionality).

@anuchandy
Member

Hi @djarnis73, thanks for the detailed log and for cleaning it up, it was very useful!

I did the RCA. The Connection supports various communication pipes: channel, send link, and receive link. One specific channel, CBS, handles authentication and can face transient disconnects. We use an AmqpChannelProcessor (ACP) to automatically recreate the CBS channel if it disconnects, provided the parent Connection is healthy. If the parent Connection disconnects, the ACP is shut down to prevent it from creating new CBS channels. The issue was that when the connection disconnected, the connection initiated the ACP shutdown, but a thread in the ACP attempted to recreate the channel before the shutdown process was fully completed. The attempt to use a disconnected connection caused the NPE. Depending on the timing, this issue may or may not occur; for instance, it does not occur if the thread in the ACP receives the shutdown signal (from the connection thread) before it retries channel creation.

We're introducing a new type called RequestResponseChannelCache to replace AmqpChannelProcessor, avoiding channel creation on disposed connections. These changes have already been merged and will be included in the September release, happening soon. In the Event Hubs September version, the RequestResponseChannelCache will be automatically selected when the V2 stack is opted in.

I'll update the thread once the new version is out so you can validate it in the test environment. Thank you!

@anuchandy anuchandy self-assigned this Sep 6, 2024
@yeah007

yeah007 commented Sep 9, 2024

This issue also happened in our environment. When will the release with this fix be published? Thanks.

@djarnis73
Author

Great news, I will test it out when it becomes available.

@anuchandy
Member

Hello @yeah007, assuming no unexpected issues, the updated Event Hubs library with the fix should be available before September 26th.

@anuchandy
Member

Great news, I will test it out when it becomes available.

Hello @djarnis73, sounds good. Also, if you hear from DevOps about the cores allocated to the impacted Docker instance, please share it; keeping environment details is useful for future reference, even after the RCA.

@s-vivien

s-vivien commented Sep 20, 2024

To give additional context, even with V2 we will still see the disconnect events - "onSessionRemoteClose", "the connection was closed by container…". The broker will disconnect if there is no activity, and the next send attempt by the application should force the library to reconnect.

Hello @anuchandy ,

Is the last sentence also true with the current version? Or will the auto-reconnect only happen in V2?
We're facing the same errors right now, and even though it seems to reconnect, I'm not 100% sure it will not end up crashing for good in the long run.

tl;dr: can we ignore these errors?

@anuchandy
Member

anuchandy commented Sep 20, 2024

Hello @s-vivien, regardless of V1 or V2, the reconnect will occur when the application calls send() again later. Shortly before this NPE event, the system is already in the process of disposing of an unusable connection and its state, which will subsequently be replaced with a new connection, putting the system back into a clean state. Since the NPE happens within the state that is about to be discarded, I believe it is safe to ignore the error.

@anuchandy
Member

Hello @djarnis73, @clevelandcs, we've just released beta.3 with the new AMQP session cache; we'd appreciate it if you could give it a try and see whether the experience is improved. To verify:

  1. Add the beta.3 dependency:
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-messaging-eventhubs</artifactId>
    <version>5.19.0-beta.3</version>
</dependency>
  2. Opt in to the V2 stack.

Use the configuration(..) API in the builder to opt in to the V2 stack as shown below:

new EventHubClientBuilder()
        .connectionString(properties.connectionString())
        .configuration(new com.azure.core.util.ConfigurationBuilder()
                .putProperty("com.azure.messaging.eventhubs.v2", "true")
                .build())
        .buildProducerClient();
  3. Verify the right dependencies are resolved.

Use mvn dependency:tree to confirm the expected dependency tree, which should look like the following (com.azure:azure-messaging-eventhubs:jar:5.19.0-beta.3 and com.azure:azure-core-amqp:jar:2.9.9):

[INFO] org.example:morgan-stanley:jar:1.0-SNAPSHOT
[INFO] +- com.azure:azure-messaging-eventhubs:jar:5.19.0-beta.3:compile
[INFO] |  +- com.azure:azure-core:jar:1.52.0:compile
[INFO] |  |  +- ..
[INFO] |  |  \- io.projectreactor:reactor-core:jar:3.4.38:compile
[INFO] |  |     \- org.reactivestreams:reactive-streams:jar:1.0.4:compile
[INFO] |  \- com.azure:azure-core-amqp:jar:2.9.9:compile
[INFO] |     +- com.microsoft.azure:qpid-proton-j-extensions:jar:1.2.5:compile
[INFO] |     \- org.apache.qpid:proton-j:jar:0.34.1:compile

Thanks,
