Describe the bug
We're running Java applications in Kubernetes with an OpenTelemetry Collector sidecar container that the application container exports to. During pod start-up we're occasionally seeing failed exports (based on otlp_exporter_exported{success="false"}). We do have OTEL_EXPERIMENTAL_EXPORTER_OTLP_RETRY_ENABLED enabled.
I plotted some of the SDK metrics for a pod where this happened.
Based on the error logs and a brief look at the gRPC exporter code, it doesn't look like these errors are retried. I tested this with the HTTP exporter as well.
[otel.javaagent 2024-07-31 21:19:36:628 +0000] [OkHttp http://localhost:4318/...] ERROR io.opentelemetry.exporter.internal.http.HttpExporter - Failed to export spans. The request could not be executed. Full error message: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4318
java.net.ConnectException: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4318
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:297)
at okhttp3.internal.connection.RealConnection.connect(RealConnection.kt:207)
at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.kt:226)
at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:106)
at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:74)
at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:255)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at io.opentelemetry.exporter.sender.okhttp.internal.RetryInterceptor.intercept(RetryInterceptor.java:83)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Suppressed: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4318
... 21 more
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method)
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
at java.base/java.net.Socket.connect(Socket.java:633)
at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:128)
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
... 20 more
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method)
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
at java.base/java.net.Socket.connect(Socket.java:633)
at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:128)
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
... 20 more
[otel.javaagent] [OkHttp http://localhost:4317/...] WARN io.opentelemetry.exporter.internal.grpc.GrpcExporter - Failed to export metrics. Server responded with gRPC status code 2. Error message: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4317
(forgot to capture stacktrace)
My guess is there's some sort of race condition where the sidecar collector isn't ready for the export. The queueSize metric isn't reported as hitting the max that triggers an export. It's really close to that maximum, though, so maybe the gauge just misses the moment it hits that threshold?
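For what it's worth, the timing can be confirmed (or crudely worked around) by polling the collector's OTLP port until it accepts TCP connections before the application starts doing real work. This is only a sketch of the idea, not something the SDK or agent provides; the host, port, and timeouts here are assumptions:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch: poll the sidecar's OTLP port until it accepts connections, so the first
// exports don't race with collector startup. Host/port/timeouts are assumptions.
public final class WaitForCollector {
  public static void main(String[] args) throws InterruptedException {
    boolean ready = waitForPort("localhost", 4318, 30_000);
    System.out.println(ready ? "collector is accepting connections" : "gave up waiting");
    // ... continue with normal application startup here
  }

  static boolean waitForPort(String host, int port, long timeoutMillis) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (System.currentTimeMillis() < deadline) {
      try (Socket socket = new Socket()) {
        socket.connect(new InetSocketAddress(host, port), 1_000); // 1s connect timeout
        return true; // connection accepted, collector is up
      } catch (IOException e) {
        Thread.sleep(500); // not up yet, retry shortly
      }
    }
    return false;
  }
}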
Steps to reproduce
This can probably be reproduced by exporting without a collector running, since I think it hits the same error.
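Something along these lines should hit it with the plain SDK (the agent wires the exporter up via autoconfiguration instead, and as far as I can tell the OTEL_EXPERIMENTAL_EXPORTER_OTLP_RETRY_ENABLED flag is only read there, not by a hand-built exporter). The endpoint and names below are just for illustration:

import io.opentelemetry.exporter.otlp.http.trace.OtlpHttpSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;
import java.util.concurrent.TimeUnit;

// Sketch of a standalone reproduction: export a span to a port with nothing
// listening on it and watch the exporter log a java.net.ConnectException.
public final class ReproduceFailedExport {
  public static void main(String[] args) {
    OtlpHttpSpanExporter exporter =
        OtlpHttpSpanExporter.builder()
            .setEndpoint("http://localhost:4318/v1/traces") // no collector running here
            .build();

    SdkTracerProvider tracerProvider =
        SdkTracerProvider.builder()
            .addSpanProcessor(SimpleSpanProcessor.create(exporter))
            .build();

    tracerProvider.get("repro").spanBuilder("test-span").startSpan().end();

    // The export should fail with "Failed to connect to localhost/...:4318".
    tracerProvider.forceFlush().join(10, TimeUnit.SECONDS);
    tracerProvider.shutdown().join(10, TimeUnit.SECONDS);
  }
}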
What did you expect to see?
It'd be nice if the SDK retried the request, since the sidecar comes up pretty quickly.
Maybe the SDK could retry on these errors for up to X seconds after application start-up. Always retrying on these errors might backfire, e.g. during pod termination when the OTel Collector is terminated before the application container.
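To make that idea concrete, a wrapper exporter along these lines would retry failed exports only within a fixed window after start-up. It's only a sketch of the suggestion above (the class name, window length, attempt count, and blocking behaviour are all made up), not how the SDK would necessarily implement it:

import io.opentelemetry.sdk.common.CompletableResultCode;
import io.opentelemetry.sdk.trace.data.SpanData;
import io.opentelemetry.sdk.trace.export.SpanExporter;
import java.time.Duration;
import java.time.Instant;
import java.util.Collection;
import java.util.concurrent.TimeUnit;

// Sketch of "retry only for a while after start-up": a decorator that re-attempts
// failed exports, but only inside a fixed window after the exporter is created, so a
// collector that is gone for good (e.g. during pod termination) isn't retried forever.
public final class StartupRetrySpanExporter implements SpanExporter {
  private final SpanExporter delegate;
  private final Instant retryUntil;

  public StartupRetrySpanExporter(SpanExporter delegate, Duration startupWindow) {
    this.delegate = delegate;
    this.retryUntil = Instant.now().plus(startupWindow);
  }

  @Override
  public CompletableResultCode export(Collection<SpanData> spans) {
    CompletableResultCode result = delegate.export(spans);
    result.join(10, TimeUnit.SECONDS); // wait for the attempt to settle (blocking, sketch only)
    int attempt = 0;
    while (!result.isSuccess() && attempt < 5 && Instant.now().isBefore(retryUntil)) {
      attempt++;
      try {
        Thread.sleep(500L * attempt); // crude linear backoff between attempts
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return result;
      }
      result = delegate.export(spans);
      result.join(10, TimeUnit.SECONDS);
    }
    return result;
  }

  @Override
  public CompletableResultCode flush() {
    return delegate.flush();
  }

  @Override
  public CompletableResultCode shutdown() {
    return delegate.shutdown();
  }
}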
What did you see instead?
Failed exports
What version and what artifacts are you using?
We're using version 2.0.0 of the OpenTelemetry Java agent.
I think the "Failed to connect to" message indicates a java.net.ConnectException. We'd know for sure if you could capture the full stack trace. It doesn't look like we're retrying on this, but we probably should. I suspect this situation is tricky to fake in a test environment, which is likely why we missed it.
@jack-berg you're right about java.net.ConnectException. I didn't realize there was a stacktrace earlier, but I just added it to the issue description for a failed http/protobuf export.
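For context, the OkHttp-based sender already runs exports through a RetryInterceptor (it shows up in the stack trace above), and the behaviour being asked for is roughly an interceptor that treats connection-level failures as retryable. The sketch below is only an illustration of that idea, not the SDK's actual implementation; the attempt count and backoff are made up:

import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import okhttp3.Interceptor;
import okhttp3.Request;
import okhttp3.Response;

// Illustrative sketch (not the SDK's actual RetryInterceptor): retry the call when the
// connection itself fails, e.g. java.net.ConnectException while the sidecar is still starting.
public final class ConnectRetryInterceptor implements Interceptor {
  private static final int MAX_ATTEMPTS = 5;

  @Override
  public Response intercept(Chain chain) throws IOException {
    Request request = chain.request();
    IOException lastException = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try {
        return chain.proceed(request); // success or an HTTP error response: hand it back unchanged
      } catch (ConnectException | SocketTimeoutException e) {
        lastException = e; // connection-level failure: the collector may simply not be up yet
      }
      try {
        Thread.sleep(100L * attempt); // crude backoff; a real implementation would use exponential backoff with jitter
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    throw lastException;
  }
}

With the plain SDK something like this could be registered via OkHttpClient.Builder#addInterceptor; in the agent the sender is wired up internally, so this is illustration only.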
Environment
- name: OTEL_EXPORTER_OTLP_PROTOCOL
  value: http/protobuf
- name: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
  value: cumulative
- name: OTEL_EXPERIMENTAL_METRICS_CARDINALITY_LIMIT
  value: "10000"
- name: OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT
  value: "3000"
- name: OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT
  value: "60"
- name: OTEL_BSP_MAX_QUEUE_SIZE
  value: "10000"
- name: OTEL_METRIC_EXPORT_INTERVAL
  value: "15000"
- name: OTEL_METRIC_EXPORT_TIMEOUT
  value: "10000"
- name: OTEL_METRICS_EXPORTER
  value: none
- name: OTEL_LOGS_EXPORTER
  value: none
- name: OTEL_EXPERIMENTAL_EXPORTER_OTLP_RETRY_ENABLED
  value: "true"