
Telemetry frequently lost during application start-up when sidecar OTEL collector isn't ready yet, even with retry feature enabled #6587

Closed
swar8080 opened this issue Jul 16, 2024 · 4 comments · Fixed by #6614
Labels: Bug (Something isn't working)

Comments

swar8080 commented Jul 16, 2024

Describe the bug
We're running Java applications in Kubernetes with an OpenTelemetry Collector sidecar container that the application container exports to. During pod start-up we occasionally see failed exports (based on otlp_exporter_exported{success="false"}), even though we have OTEL_EXPERIMENTAL_EXPORTER_OTLP_RETRY_ENABLED enabled.

I plotted some of the SDK metrics for a pod where this happened:

[screenshot: plot of SDK metrics for an affected pod during start-up]

Based on the error logs and a brief look at the gRPC exporter code, it doesn't look like these errors are retried. I tested with the HTTP exporter as well and saw the same behavior.

[otel.javaagent 2024-07-31 21:19:36:628 +0000] [OkHttp http://localhost:4318/...] ERROR io.opentelemetry.exporter.internal.http.HttpExporter - Failed to export spans. The request could not be executed. Full error message: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4318 
java.net.ConnectException: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4318
	at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:297)
	at okhttp3.internal.connection.RealConnection.connect(RealConnection.kt:207)
	at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.kt:226)
	at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:106)
	at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:74)
	at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:255)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at io.opentelemetry.exporter.sender.okhttp.internal.RetryInterceptor.intercept(RetryInterceptor.java:83)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
	Suppressed: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4318
		... 21 more
	Caused by: java.net.ConnectException: Connection refused
		at java.base/sun.nio.ch.Net.pollConnect(Native Method)
		at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
		at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
		at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
		at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
		at java.base/java.net.Socket.connect(Socket.java:633)
		at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:128)
		at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
		... 20 more
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.Net.pollConnect(Native Method)
	at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
	at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
	at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
	at java.base/java.net.Socket.connect(Socket.java:633)
	at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:128)
	at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
	... 20 more
[otel.javaagent] [OkHttp http://localhost:4317/...] WARN io.opentelemetry.exporter.internal.grpc.GrpcExporter - Failed to export metrics. Server responded with gRPC status code 2. Error message: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:4317

(forgot to capture the stack trace for this gRPC case)

My guess is there's some sort of race condition where the sidecar collector isn't ready to receive the export. The queueSize metric isn't reported as hitting the maximum that triggers an export, although it gets really close, so maybe the gauge just misses the moment it crosses that threshold?

Steps to reproduce
This can probably be reproduced by exporting without any collector running, since I think that hits the same error.
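
A minimal repro sketch using the plain SDK (rather than the javaagent) might look something like this; the class name and timeouts are just placeholders, and nothing is listening on the port:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.exporter.otlp.http.trace.OtlpHttpSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import java.util.concurrent.TimeUnit;

public class ConnectRefusedRepro {
  public static void main(String[] args) {
    // Nothing is listening on localhost:4318, so the export should fail with
    // java.net.ConnectException: Connection refused, same as in the sidecar case.
    OtlpHttpSpanExporter exporter =
        OtlpHttpSpanExporter.builder()
            .setEndpoint("http://localhost:4318/v1/traces")
            .build();

    SdkTracerProvider tracerProvider =
        SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .build();

    Tracer tracer = tracerProvider.get("repro");
    Span span = tracer.spanBuilder("startup-span").startSpan();
    span.end();

    // Force the batch processor to export now; the "Failed to export spans"
    // error from HttpExporter should show up in the logs.
    tracerProvider.forceFlush().join(10, TimeUnit.SECONDS);
    tracerProvider.shutdown().join(10, TimeUnit.SECONDS);
  }
}
```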

What did you expect to see?
It'd be nice if the SDK retried the request, since the sidecar comes up pretty quickly.

Maybe the SDK could retry on these errors for up to X seconds after application start-up. Always retrying on these errors might backfire, e.g. during pod termination when the OTEL collector is shut down before the application container.
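
As a rough illustration of the kind of workaround we could bolt on ourselves in the meantime (purely hypothetical, the names and retry parameters are made up, and it blocks the batch worker thread while retrying):

```java
import io.opentelemetry.sdk.common.CompletableResultCode;
import io.opentelemetry.sdk.trace.data.SpanData;
import io.opentelemetry.sdk.trace.export.SpanExporter;
import java.util.Collection;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical workaround: wraps another SpanExporter and retries a failed
 * export a few times with a fixed backoff, to paper over the start-up window
 * before the sidecar collector is listening.
 */
final class StartupRetrySpanExporter implements SpanExporter {
  private final SpanExporter delegate;
  private final int maxAttempts;
  private final long backoffMillis;

  StartupRetrySpanExporter(SpanExporter delegate, int maxAttempts, long backoffMillis) {
    this.delegate = delegate;
    this.maxAttempts = maxAttempts;
    this.backoffMillis = backoffMillis;
  }

  @Override
  public CompletableResultCode export(Collection<SpanData> spans) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      // Block until this attempt completes (runs on the BatchSpanProcessor worker thread).
      CompletableResultCode result = delegate.export(spans).join(10, TimeUnit.SECONDS);
      if (result.isSuccess()) {
        return CompletableResultCode.ofSuccess();
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(backoffMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    return CompletableResultCode.ofFailure();
  }

  @Override
  public CompletableResultCode flush() {
    return delegate.flush();
  }

  @Override
  public CompletableResultCode shutdown() {
    return delegate.shutdown();
  }
}
```

With the javaagent this would need custom extension wiring, so it's more of an illustration of the behaviour we'd like than something we plan to ship.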

What did you see instead?
Failed exports

What version and what artifacts are you using?
We're using version 2.0.0 of the OpenTelemetry Java agent.

Environment
- name: OTEL_EXPORTER_OTLP_PROTOCOL
value: http/protobuf
- name: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
value: cumulative
- name: OTEL_EXPERIMENTAL_METRICS_CARDINALITY_LIMIT
value: 10000
- name: OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT
value: "3000"
- name: OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT
value: "60"
- name: OTEL_BSP_MAX_QUEUE_SIZE
value: "10000"
- name: OTEL_METRIC_EXPORT_INTERVAL
value: "15000"
- name: OTEL_METRIC_EXPORT_TIMEOUT
value: "10000"
- name: OTEL_METRICS_EXPORTER
value: none
- name: OTEL_LOGS_EXPORTER
value: none
- name: OTEL_EXPERIMENTAL_EXPORTER_OTLP_RETRY_ENABLED
value: "true"

swar8080 added the Bug label on Jul 16, 2024
@swar8080 (Author) commented

This also happens when the number of spans exported (yellow) is way below OTEL_BSP_MAX_EXPORT_BATCH_SIZE (orange)

[screenshot: exported span count (yellow) vs OTEL_BSP_MAX_EXPORT_BATCH_SIZE (orange)]

@jack-berg (Member) commented

I think the "Failed to connect to" message indicates a java.net.ConnectException. I'd know for sure if you could capture the full stack trace. It doesn't look like we retry on this, but we probably should. I suspect this situation is tricky to fake in a test environment, which is likely why we missed it.
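
To make the idea concrete, a retry on connection refusal could look roughly like the sketch below. This is only an illustration, not the actual change to the SDK's io.opentelemetry.exporter.sender.okhttp.internal.RetryInterceptor; the class name, attempt count, and backoff are invented here.

```java
import java.io.IOException;
import java.net.ConnectException;
import okhttp3.Interceptor;
import okhttp3.Request;
import okhttp3.Response;

/**
 * Illustration only: an OkHttp interceptor that retries a request when the
 * connection is refused, which is what happens while the sidecar collector
 * is still starting up.
 */
final class ConnectExceptionRetryInterceptor implements Interceptor {
  private final int maxAttempts;
  private final long backoffMillis;

  ConnectExceptionRetryInterceptor(int maxAttempts, long backoffMillis) {
    this.maxAttempts = maxAttempts;
    this.backoffMillis = backoffMillis;
  }

  @Override
  public Response intercept(Chain chain) throws IOException {
    Request request = chain.request();
    ConnectException lastException = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        // Nothing was sent yet when the connection is refused, so the
        // request can be safely re-issued.
        return chain.proceed(request);
      } catch (ConnectException e) {
        lastException = e;
        try {
          Thread.sleep(backoffMillis * attempt); // simple linear backoff
        } catch (InterruptedException interrupted) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    throw lastException; // non-null because maxAttempts >= 1
  }
}
```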

@swar8080 (Author) commented

@jack-berg you're right about java.net.ConnectException. I didn't realize there was a stack trace earlier, but I've just added one to the issue description for a failed http/protobuf export.

@jack-berg (Member) commented

Very helpful to have the stack trace. I'll see if I can reproduce this and get a fix out.
