Increase in tail_sampling_sampling_trace_dropped_too_early count #29024

Closed
atulrautray opened this issue Nov 8, 2023 · 7 comments
Labels: bug (Something isn't working), needs triage (New item requiring triage), processor/tailsampling (Tail sampling processor)

Comments

@atulrautray

Component(s)

cmd/otelcontribcol

Describe the issue you're reporting

We noticed that our OTel Collector started dropping a huge number of spans too early over the last couple of weeks
(metric: otelcol_processor_tail_sampling_sampling_trace_dropped_too_early).

Our setup has an OTel Collector running the load-balancing exporter in front, which exports to a second OTel Collector that acts as the sampler.
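For reference, a minimal sketch of a front-tier load-balancing exporter in this kind of two-tier setup (the hostname, resolver, and routing key below are illustrative assumptions, not the exact configuration in use):

    exporters:
      loadbalancing:
        routing_key: traceID          # route by trace ID so all spans of a trace land on one sampling collector
        protocol:
          otlp:
            tls:
              insecure: true
        resolver:
          dns:
            hostname: otel-sampler.internal   # hypothetical DNS name of the sampling-tier collectors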

We see that in both UAT and prod, spans started getting dropped (too early) in significant numbers at the same time.
We didn't make any changes to our sampling policies in the last month, and the drops affect all apps, not any specific one.
We've hit a roadblock troubleshooting this.

Things we have tried so far:

  1. Verified that no infrastructure/network changes were made in the running clusters
  2. Verified that there were no changes in the applications or in app performance (response time, latency)
  3. Increased decision_wait to 5 minutes; no effect on the span drops
  4. Updated the collector to the latest release (0.88.0); no effect

We'd appreciate any input on how to debug this further and fix the issue.
Below is our sampling policy. (This policy has been in place for the last 3 months, and traces used to be sampled fine.)

    processors:
      batch/1:
        timeout: 10s
      batch/2:
        timeout: 10s

      tail_sampling:
        decision_wait: 120s
        policies:
          [
              {
                name: service-inclusion-policy,
                type: string_attribute,
                string_attribute: {key: service.name, values: [app1, app2, app3]}
              },
              {
                name: otel-error-code-policy,
                type: status_code,
                status_code: {status_codes: [ERROR]}
              },
              {
                name: http-status-code-policy,
                type: numeric_attribute,
                numeric_attribute: { key: http.status_code, min_value: 400, max_value: 600 }
              },              
              {
                name: probabilistic-ok-status-inclusion-policy,
                type: and,
                and: {
                  and_sub_policy:
                  [
                    {
                      name: otel-ok-status-code-policy,
                      type: status_code,
                      status_code: { status_codes: [OK] }
                    },
                    {
                      name: probabilistic-policy,
                      type: probabilistic,
                      probabilistic: {sampling_percentage: 10}
                    }
                  ]
                }
              }
          ]
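For completeness, a minimal sketch of how a sampling-tier pipeline is typically wired around these processors (the receivers and exporters below are assumptions, since they are not shown in this config):

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [tail_sampling, batch/1]
          exporters: [otlp]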

These are the metrics (screenshots attached in the original issue showing the increase in dropped traces).

@atulrautray added the "needs triage" label on Nov 8, 2023
@crobert-1 added the "processor/tailsampling" and "bug" labels on Nov 8, 2023
Contributor

github-actions bot commented Nov 8, 2023

Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Member

I don't have much context here, so I'm going to write out my thought process as a reference. I might be missing something obvious.

Investigation
The failure is happening here in code:

    d, ok := tsp.idToTrace.Load(id)
    if !ok {
        metrics.idNotFoundOnMapCount++
        continue
    }

This is simply saying that the span's trace ID can't be found or was dropped from the cache. This could be a timing issue as you've referenced, but since you've already tested longer wait times, I'm wondering whether it's related to routing, given that multiple collectors are involved here.

From the tailsampling processor's README:

All spans for a given trace MUST be received by the same collector instance for effective sampling decisions.

This led me to wonder if the load balancing exporter could possibly be the problem here, but it does look like the exporter was written to accommodate this use case. One comment may be relevant:

if routing stability is important for your use case and your list of backends are constantly changing, consider using the groupbytrace processor. This way, traces are dispatched atomically to this exporter, and the same decision about the backend is made for the trace as a whole.
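For illustration, a minimal sketch of what placing groupbytrace in front of the load-balancing exporter on the first tier could look like (the values and pipeline components below are assumptions, not taken from this deployment):

    processors:
      groupbytrace:
        wait_duration: 10s      # buffer spans this long so a trace is dispatched as a unit
        num_traces: 100000      # upper bound on traces buffered in memory

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [groupbytrace]
          exporters: [loadbalancing]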

Request
Can you share the full configuration of the collector you're using the load balancing exporter in?

@bryan-aguilar
Contributor

bryan-aguilar commented Nov 9, 2023

I was thinking along similar lines at one point too, @crobert-1. Although, from what I understand, even if the load-balancing layer sends spans from a single trace ID to multiple TSP collectors, the spans should still be stored when processed; you would just get bad sampling decisions.

    d, loaded := tsp.idToTrace.Load(id)
    if !loaded {
        spanCount := &atomic.Int64{}
        spanCount.Store(lenSpans)
        d, loaded = tsp.idToTrace.LoadOrStore(id, &sampling.TraceData{
            Decisions:       initialDecisions,
            ArrivalTime:     time.Now(),
            SpanCount:       spanCount,
            ReceivedBatches: ptrace.NewTraces(),
        })
    }

So while the whole trace may not be present, the span that was sent to this specific collector should still be in the map.

@mizhexiaoxiao
Contributor

Increase the value of num_traces. The default value is 50000.

processors:
  tail_sampling:
    decision_wait: 120s
    num_traces: 50000
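As a rough sizing sketch (the rate and target value below are hypothetical): with decision_wait: 120s the processor has to hold every trace started within that window, so at roughly 1,000 new traces per second about 1,000 × 120 = 120,000 traces are in flight at once, well above the 50,000 default.

    processors:
      tail_sampling:
        decision_wait: 120s
        num_traces: 200000   # hypothetical value; size to roughly (new traces per second) x decision_wait, plus headroom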

@atulrautray
Author

Increasing the value of num_traces fixed the issue. It seems that there was more data than the default num_traces of 50,000 could store in memory.

@crobert-1
Member

Thanks for following up @atulrautray, appreciate knowing the solution for future reference!

Out of curiosity, were you setting num_traces to some non-default value before? I didn't see it in the configuration you shared, so I had assumed it was the default.

@atulrautray
Author

Hi @crobert-1, that's right, we had not set a custom value for num_traces.

jpkrohling added a commit that referenced this issue Dec 5, 2023
…ace dropped errors (#29513)

**Description:**
A [recent
issue](#29024)
was created when a user was getting a high value for the
`otelcol_processor_tail_sampling_sampling_trace_dropped_too_early`
metric. The solution was to increase the number of traces stored in
memory to handle the load. This is done by using the `num_traces`
configuration option. This was not clear from documentation or from the
error metric's description.

I'm not sure which location is the best place to communicate the
solution to this error. The two options in my mind are to either add
information to the README, or add more information to the metric
description itself. Any guidance on what would be most clear to users
is appreciated.

---------

Co-authored-by: Juraci Paixão Kröhling <[email protected]>