
Telemetry frequently lost during application start-up when sidecar OTEL collector isn't ready yet, even with retry feature enabled #6587

Closed
swar8080 opened this issue Jul 16, 2024 · 4 comments · Fixed by #6614
Labels: Bug (Something isn't working)

Comments

swar8080 commented Jul 16, 2024

Describe the bug
We're running Java applications in Kubernetes with an OpenTelemetry Collector sidecar container that the application container exports to. During pod start-up we occasionally see failed exports (based on otlp_exporter_exported{success="false"}), even though we have OTEL_EXPERIMENTAL_EXPORTER_OTLP_RETRY_ENABLED enabled.

I plotted some of the SDK metrics for a pod where this happened:

[screenshot: plot of SDK metrics for an affected pod during start-up]

Based on the error logs and a brief look at the gRPC exporter code, it doesn't look like these errors are retried. I tested with the HTTP exporter as well and saw the same behavior.

[otel.javaagent 2024-07-31 21:19:36:628 +0000] [OkHttp http://localhost:4318/...] ERROR io.opentelemetry.exporter.internal.http.HttpExporter - Failed to export spans. The request could not be executed. Full error message: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4318 
java.net.ConnectException: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4318
	at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:297)
	at okhttp3.internal.connection.RealConnection.connect(RealConnection.kt:207)
	at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.kt:226)
	at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:106)
	at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:74)
	at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:255)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at io.opentelemetry.exporter.sender.okhttp.internal.RetryInterceptor.intercept(RetryInterceptor.java:83)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
	Suppressed: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4318
		... 21 more
	Caused by: java.net.ConnectException: Connection refused
		at java.base/sun.nio.ch.Net.pollConnect(Native Method)
		at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
		at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
		at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
		at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
		at java.base/java.net.Socket.connect(Socket.java:633)
		at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:128)
		at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
		... 20 more
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.Net.pollConnect(Native Method)
	at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
	at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
	at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
	at java.base/java.net.Socket.connect(Socket.java:633)
	at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:128)
	at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
	... 20 more
[otel.javaagent] [OkHttp http://localhost:4317/...] WARN io.opentelemetry.exporter.internal.grpc.GrpcExporter - Failed to export metrics. Server responded with gRPC status code 2. Error message: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4317

(forgot to capture the stack trace for this gRPC case)

My guess is there's some sort of race condition where the sidecar collector isn't ready to receive the export. The queueSize metric isn't reported as hitting the maximum that triggers an export, although it gets really close, so maybe the gauge just misses the moment it crosses that threshold?

Steps to reproduce
This can probably be reproduced by exporting without any collector running, since I think that hits the same error.
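
A minimal repro sketch using the plain SDK (rather than the javaagent) might look something like this; the class name and timeouts are just placeholders, and nothing is listening on the port:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.exporter.otlp.http.trace.OtlpHttpSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import java.util.concurrent.TimeUnit;

public class ConnectRefusedRepro {
  public static void main(String[] args) {
    // Nothing is listening on localhost:4318, so the export should fail with
    // java.net.ConnectException: Connection refused, same as in the sidecar case.
    OtlpHttpSpanExporter exporter =
        OtlpHttpSpanExporter.builder()
            .setEndpoint("http://localhost:4318/v1/traces")
            .build();

    SdkTracerProvider tracerProvider =
        SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .build();

    Tracer tracer = tracerProvider.get("repro");
    Span span = tracer.spanBuilder("startup-span").startSpan();
    span.end();

    // Force the batch processor to export now; the "Failed to export spans"
    // error from HttpExporter should show up in the logs.
    tracerProvider.forceFlush().join(10, TimeUnit.SECONDS);
    tracerProvider.shutdown().join(10, TimeUnit.SECONDS);
  }
}
```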

What did you expect to see?
It'd be nice if the SDK retried the request, since the sidecar comes up pretty quickly.

Maybe the SDK could retry on these errors for up to X seconds after application start-up. Always retrying on these errors might backfire, e.g. during pod termination when the OTEL collector is shut down before the application container.
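
As a rough illustration of the kind of workaround we could bolt on ourselves in the meantime (purely hypothetical, the names and retry parameters are made up, and it blocks the batch worker thread while retrying):

```java
import io.opentelemetry.sdk.common.CompletableResultCode;
import io.opentelemetry.sdk.trace.data.SpanData;
import io.opentelemetry.sdk.trace.export.SpanExporter;
import java.util.Collection;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical workaround: wraps another SpanExporter and retries a failed
 * export a few times with a fixed backoff, to paper over the start-up window
 * before the sidecar collector is listening.
 */
final class StartupRetrySpanExporter implements SpanExporter {
  private final SpanExporter delegate;
  private final int maxAttempts;
  private final long backoffMillis;

  StartupRetrySpanExporter(SpanExporter delegate, int maxAttempts, long backoffMillis) {
    this.delegate = delegate;
    this.maxAttempts = maxAttempts;
    this.backoffMillis = backoffMillis;
  }

  @Override
  public CompletableResultCode export(Collection<SpanData> spans) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      // Block until this attempt completes (runs on the BatchSpanProcessor worker thread).
      CompletableResultCode result = delegate.export(spans).join(10, TimeUnit.SECONDS);
      if (result.isSuccess()) {
        return CompletableResultCode.ofSuccess();
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(backoffMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    return CompletableResultCode.ofFailure();
  }

  @Override
  public CompletableResultCode flush() {
    return delegate.flush();
  }

  @Override
  public CompletableResultCode shutdown() {
    return delegate.shutdown();
  }
}
```

With the javaagent this would need custom extension wiring, so it's more of an illustration of the behaviour we'd like than something we plan to ship.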

What did you see instead?
Failed exports

What version and what artifacts are you using?
We're using version 2.0.0 of the OpenTelemetry Java agent.

Environment
- name: OTEL_EXPORTER_OTLP_PROTOCOL
value: http/protobuf
- name: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
value: cumulative
- name: OTEL_EXPERIMENTAL_METRICS_CARDINALITY_LIMIT
value: 10000
- name: OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT
value: "3000"
- name: OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT
value: "60"
- name: OTEL_BSP_MAX_QUEUE_SIZE
value: "10000"
- name: OTEL_METRIC_EXPORT_INTERVAL
value: "15000"
- name: OTEL_METRIC_EXPORT_TIMEOUT
value: "10000"
- name: OTEL_METRICS_EXPORTER
value: none
- name: OTEL_LOGS_EXPORTER
value: none
- name: OTEL_EXPERIMENTAL_EXPORTER_OTLP_RETRY_ENABLED
value: "true"

swar8080 added the Bug label on Jul 16, 2024
@swar8080 (Author) commented

This also happens when the number of spans exported (yellow) is way below OTEL_BSP_MAX_EXPORT_BATCH_SIZE (orange)

[screenshot: exported span count (yellow) vs OTEL_BSP_MAX_EXPORT_BATCH_SIZE (orange)]

@jack-berg (Member) commented

I think the "Failed to connect to" message indicates a java.net.ConnectException. I'd know for sure if you could capture the full stack trace. It doesn't look like we retry on this, but we probably should. I suspect this situation is tricky to fake in a test environment, which is likely why we missed it.
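
To make the idea concrete, a retry on connection refusal could look roughly like the sketch below. This is only an illustration, not the actual change to the SDK's io.opentelemetry.exporter.sender.okhttp.internal.RetryInterceptor; the class name, attempt count, and backoff are invented here.

```java
import java.io.IOException;
import java.net.ConnectException;
import okhttp3.Interceptor;
import okhttp3.Request;
import okhttp3.Response;

/**
 * Illustration only: an OkHttp interceptor that retries a request when the
 * connection is refused, which is what happens while the sidecar collector
 * is still starting up.
 */
final class ConnectExceptionRetryInterceptor implements Interceptor {
  private final int maxAttempts;
  private final long backoffMillis;

  ConnectExceptionRetryInterceptor(int maxAttempts, long backoffMillis) {
    this.maxAttempts = maxAttempts;
    this.backoffMillis = backoffMillis;
  }

  @Override
  public Response intercept(Chain chain) throws IOException {
    Request request = chain.request();
    ConnectException lastException = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        // Nothing was sent yet when the connection is refused, so the
        // request can be safely re-issued.
        return chain.proceed(request);
      } catch (ConnectException e) {
        lastException = e;
        try {
          Thread.sleep(backoffMillis * attempt); // simple linear backoff
        } catch (InterruptedException interrupted) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    throw lastException; // non-null because maxAttempts >= 1
  }
}
```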

@swar8080 (Author) commented

@jack-berg you're right about java.net.ConnectException. I didn't realize there was a stack trace earlier, but I've just added one to the issue description for a failed http/protobuf export.

@jack-berg (Member) commented

Very helpful to have the stack trace. I'll see if I can reproduce this and get a fix out.
