Increase in tail_sampling_sampling_trace_dropped_too_early count #29024
Comments
Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.
I don't have much context here, so I'm going to write down my thought process as a reference. I might be missing something obvious as well.

Investigation

opentelemetry-collector-contrib/processor/tailsamplingprocessor/processor.go, lines 204 to 208 in b716a4d
This is simply saying that the span's trace ID can't be found, or was dropped from the cache. This could be a timing issue as you've referenced, but since you've already tested longer wait times, I'm wondering if it's related to routing, given that you have multiple collectors involved here. The tailsampling processor's README notes that all spans belonging to the same trace need to reach the same collector instance for the sampling decision to be made with complete data.
This led me to wonder if the load-balancing exporter could possibly be the problem here, but it does look like the exporter was written to accommodate this use case. One comment in the exporter's code may be relevant as well.
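For reference, a first-tier configuration for this kind of layout would look roughly like the sketch below; the hostnames are placeholders, and the loadbalancing exporter routes by trace ID (routing_key: traceID, its default), so all spans of a trace should land on the same sampling collector:

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - sampler-collector-1:4317
          - sampler-collector-2:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]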
I was thinking something similar at one point as well, @crobert-1. Although, from what I understand, even if the load-balancing layer is sending spans from a single trace ID to multiple TSP collectors, each span should still be stored when it is processed; you should just get bad sampling decisions.

opentelemetry-collector-contrib/processor/tailsamplingprocessor/processor.go, lines 384 to 394 in ae6d36b
So while the whole trace may not be present, the span that was sent to this specific collector should still be in the map.
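To make the failure mode we're describing concrete, here is a toy sketch (not the actual processor code; all names are made up for illustration) of a fixed-capacity trace store: when more traces arrive than num_traces can hold within the decision window, the oldest ones get evicted, and any span that shows up later for an evicted trace is what the trace_dropped_too_early counter is counting.

package main

import (
	"container/list"
	"fmt"
)

// traceCache is a toy fixed-capacity store of in-flight trace IDs,
// loosely mirroring the behaviour described above: when the cache is
// full, the oldest trace is evicted, and any span that later arrives
// for an evicted trace counts as "dropped too early".
type traceCache struct {
	capacity        int                      // analogous to num_traces
	order           *list.List               // oldest trace ID at the front
	traces          map[string]*list.Element // trace ID -> position in order
	droppedTooEarly int                      // analogous to the metric in question
}

func newTraceCache(capacity int) *traceCache {
	return &traceCache{
		capacity: capacity,
		order:    list.New(),
		traces:   make(map[string]*list.Element),
	}
}

// addSpan records a span for a trace. If the trace is unknown and the
// cache is full, the oldest trace is evicted to make room.
func (c *traceCache) addSpan(traceID string) {
	if _, ok := c.traces[traceID]; ok {
		return // trace already tracked; the span joins it
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Front()
		c.order.Remove(oldest)
		delete(c.traces, oldest.Value.(string))
	}
	c.traces[traceID] = c.order.PushBack(traceID)
}

// onLateSpan models a span arriving for a trace that was evicted before
// its decision window elapsed: the counter this issue is about goes up.
func (c *traceCache) onLateSpan(traceID string) {
	if _, ok := c.traces[traceID]; !ok {
		c.droppedTooEarly++
	}
}

func main() {
	c := newTraceCache(2)   // tiny capacity to force evictions
	c.addSpan("trace-A")
	c.addSpan("trace-B")
	c.addSpan("trace-C")    // evicts trace-A
	c.onLateSpan("trace-A") // late span for an evicted trace
	fmt.Println("dropped too early:", c.droppedTooEarly) // prints 1
}

Under that model, routing alone wouldn't trigger the counter, but a sustained increase in trace volume (or a longer decision_wait holding traces in memory longer) would.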
Increase the value of num_traces. The default value is 50000:

processors:
  tail_sampling:
    decision_wait: 120s
    num_traces: 50000
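If the collector is handling more concurrent traces than that within the decision window, num_traces needs to be raised; the right value depends on span volume, decision_wait, and available memory (the 200000 below is just an illustrative number, not a recommendation):

processors:
  tail_sampling:
    decision_wait: 120s
    num_traces: 200000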
Increasing the value of num_traces resolved the issue for us.
Thanks for following up @atulrautray, appreciate knowing the solution for future reference! Out of ignorance, were you setting the num_traces option before, or was it left at its default value?
Hi @crobert-1, yes, we had not set a custom value for num_traces before.
…ace dropped errors (#29513)

**Description:**

A [recent issue](#29024) was created when a user was getting a high value for the `otelcol_processor_tail_sampling_sampling_trace_dropped_too_early` metric. The solution was to increase the number of traces stored in memory to handle the load. This is done by using the `num_traces` configuration option. This was not clear from the documentation or from the error metric's description. I'm not sure what location is the best place to communicate the solution to this error. The two options in my mind are to either add information to the README, or to add more information to the metric description itself. Any guidance here is appreciated to know what would be most clear to users.

Co-authored-by: Juraci Paixão Kröhling <[email protected]>
Component(s)
cmd/otelcontribcol
Describe the issue you're reporting
We noticed that our OTel collector started dropping spans too early in huge numbers over the last couple of weeks (metric: otelcol_processor_tail_sampling_sampling_trace_dropped_too_early).

Our setup has an OTel collector with a loadbalancing exporter in front, which exports to another OTel collector that acts as a sampler.
We see that in both UAT and prod, spans started getting dropped (too early) in significant numbers at the same time.
We didn't make any changes to our sampling policies in the last month, and the span drops affect all apps, not any specific one.
We've hit a roadblock in troubleshooting this.
Things we tried so far-
We'd appreciate any input on how to debug this further and fix the issue.
Below is our sampling policy. (This policy has been in place for the last 3 months, and traces used to be sampled fine.)
These are the metrics-