Increase in tail_sampling_sampling_trace_dropped_too_early count #29024

Closed
atulrautray opened this issue Nov 8, 2023 · 7 comments
Labels: bug (Something isn't working), needs triage (New item requiring triage), processor/tailsampling (Tail sampling processor)

Comments

@atulrautray

Component(s)

cmd/otelcontribcol

Describe the issue you're reporting

We noticed that our OTel Collector started dropping a huge number of spans too early over the last couple of weeks
(metric: otelcol_processor_tail_sampling_sampling_trace_dropped_too_early).

Our setup has an OTel Collector running the load-balancing exporter in front, which exports to a second OTel Collector that acts as the sampler.
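For reference, a minimal sketch of a front-tier load-balancing exporter in this kind of two-tier setup (the hostname, resolver, and routing key below are illustrative assumptions, not the exact configuration in use):

    exporters:
      loadbalancing:
        routing_key: traceID          # route by trace ID so all spans of a trace land on one sampling collector
        protocol:
          otlp:
            tls:
              insecure: true
        resolver:
          dns:
            hostname: otel-sampler.internal   # hypothetical DNS name of the sampling-tier collectors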

We see that in both UAT and prod, spans started getting dropped (too early) in significant numbers at the same time.
We didn't make any changes to our sampling policies in the last month, and the drops affect all apps, not any specific one.
We've hit a roadblock troubleshooting this.

Things we have tried so far:

  1. Verified that no infrastructure/network changes were made in the running clusters
  2. Verified that there were no changes in the applications or in app performance (response time, latency)
  3. Increased decision_wait to 5 minutes; no effect on the span drops
  4. Updated the collector to the latest release (0.88.0); no effect

We'd appreciate any input on how to debug this further and fix the issue.
Below is our sampling policy. (This policy has been in place for the last 3 months, and traces used to be sampled fine.)

    processors:
      batch/1:
        timeout: 10s
      batch/2:
        timeout: 10s

      tail_sampling:
        decision_wait: 120s
        policies:
          [
              {
                name: service-inclusion-policy,
                type: string_attribute,
                string_attribute: {key: service.name, values: [app1, app2, app3]}
              },
              {
                name: otel-error-code-policy,
                type: status_code,
                status_code: {status_codes: [ERROR]}
              },
              {
                name: http-status-code-policy,
                type: numeric_attribute,
                numeric_attribute: { key: http.status_code, min_value: 400, max_value: 600 }
              },              
              {
                name: probabilistic-ok-status-inclusion-policy,
                type: and,
                and: {
                  and_sub_policy:
                  [
                    {
                      name: otel-ok-status-code-policy,
                      type: status_code,
                      status_code: { status_codes: [OK] }
                    },
                    {
                      name: probabilistic-policy,
                      type: probabilistic,
                      probabilistic: {sampling_percentage: 10}
                    }
                  ]
                }
              }
          ]
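For completeness, a minimal sketch of how a sampling-tier pipeline is typically wired around these processors (the receivers and exporters below are assumptions, since they are not shown in this config):

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [tail_sampling, batch/1]
          exporters: [otlp]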

These are the metrics (screenshots attached in the original issue showing the increase in dropped traces).

@atulrautray added the "needs triage" label on Nov 8, 2023
@crobert-1 added the "processor/tailsampling" and "bug" labels on Nov 8, 2023
Contributor

github-actions bot commented Nov 8, 2023

Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Member

I don't have much context here, so I'm going to write out my thought process as a reference. I might be missing something obvious.

Investigation
The failure is happening here in code:

    d, ok := tsp.idToTrace.Load(id)
    if !ok {
        metrics.idNotFoundOnMapCount++
        continue
    }

This is simply saying that the span's trace ID can't be found or was dropped from the cache. This could be a timing issue as you've referenced, but since you've already tested longer wait times, I'm wondering whether it's related to routing, given that multiple collectors are involved here.

From the tailsampling processor's README:

All spans for a given trace MUST be received by the same collector instance for effective sampling decisions.

This led me to wonder if the load balancing exporter could possibly be the problem here, but it does look like the exporter was written to accommodate this use case. One comment may be relevant:

if routing stability is important for your use case and your list of backends are constantly changing, consider using the groupbytrace processor. This way, traces are dispatched atomically to this exporter, and the same decision about the backend is made for the trace as a whole.
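For illustration, a minimal sketch of what placing groupbytrace in front of the load-balancing exporter on the first tier could look like (the values and pipeline components below are assumptions, not taken from this deployment):

    processors:
      groupbytrace:
        wait_duration: 10s      # buffer spans this long so a trace is dispatched as a unit
        num_traces: 100000      # upper bound on traces buffered in memory

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [groupbytrace]
          exporters: [loadbalancing]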

Request
Can you share the full configuration of the collector you're using the load balancing exporter in?

@bryan-aguilar
Contributor

bryan-aguilar commented Nov 9, 2023

I was thinking along similar lines at one point too, @crobert-1. Although, from what I understand, even if the load-balancing layer sends spans from a single trace ID to multiple TSP collectors, the spans should still be stored when processed; you would just get bad sampling decisions.

    d, loaded := tsp.idToTrace.Load(id)
    if !loaded {
        spanCount := &atomic.Int64{}
        spanCount.Store(lenSpans)
        d, loaded = tsp.idToTrace.LoadOrStore(id, &sampling.TraceData{
            Decisions:       initialDecisions,
            ArrivalTime:     time.Now(),
            SpanCount:       spanCount,
            ReceivedBatches: ptrace.NewTraces(),
        })
    }

So while the whole trace may not be present, the span that was sent to this specific collector should still be in the map.

@mizhexiaoxiao
Contributor

Increase the value of num_traces. The default value is 50000.

processors:
  tail_sampling:
    decision_wait: 120s
    num_traces: 50000
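As a rough sizing sketch (the rate and target value below are hypothetical): with decision_wait: 120s the processor has to hold every trace started within that window, so at roughly 1,000 new traces per second about 1,000 × 120 = 120,000 traces are in flight at once, well above the 50,000 default.

    processors:
      tail_sampling:
        decision_wait: 120s
        num_traces: 200000   # hypothetical value; size to roughly (new traces per second) x decision_wait, plus headroom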

@atulrautray
Author

Increasing the value of num_traces fixed the issue. It seems that there was more data than the default num_traces of 50,000 could store in memory.

@crobert-1
Member

Thanks for following up @atulrautray, appreciate knowing the solution for future reference!

Out of curiosity, were you setting num_traces to some non-default value before? I didn't see it in the configuration you shared, so I had assumed it was the default.

@atulrautray
Author

Hi @crobert-1, that's right, we had not set a custom value for num_traces.

jpkrohling added a commit that referenced this issue Dec 5, 2023
…ace dropped errors (#29513)

**Description:**
A [recent
issue](#29024)
was created when a user was getting a high value for the
`otelcol_processor_tail_sampling_sampling_trace_dropped_too_early`
metric. The solution was to increase the number of traces stored in
memory to handle the load. This is done by using the `num_traces`
configuration option. This was not clear from documentation or from the
error metric's description.

I'm not sure which location is the best place to communicate the
solution to this error. The two options in my mind are to either add
information to the README, or add more information to the metric
description itself. Any guidance on what would be most clear to users
is appreciated.

---------

Co-authored-by: Juraci Paixão Kröhling <[email protected]>