
metricKeyToDimensions cache is printing many stack traces #9018

Closed
greut opened this issue Apr 4, 2022 · 25 comments · Fixed by #16601 or #17307
Labels
bug (Something isn't working) · processor/spanmetrics (Span Metrics processor)

Comments

@greut

greut commented Apr 4, 2022

Describe the bug

In #6503, the addition of an LRU cache treats a missing key as an error and prints a big stack trace in the logs.

Steps to reproduce

Have either high-cardinality dimensions or a small dimensions_cache_size.
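
For illustration, a minimal spanmetrics config along these lines should make cache evictions easy to trigger (the exporter name and extra dimension are placeholders, not our actual config):

```yaml
processors:
  spanmetrics:
    metrics_exporter: prometheus   # placeholder; any configured metrics exporter
    dimensions_cache_size: 2       # deliberately tiny so keys get evicted quickly
    dimensions:
      - name: http.method          # illustrative extra dimension
```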

What did you expect to see?

Either a metric for the cache hit/miss ratio, or a much lower log level such as debug.

What did you see instead?

A lot of stack traces:

2022-04-04T03:10:26.171Z        error   [email protected]/processor.go:330   value not found in metricKeyToDimensions cache by key "copernic\x00getUserInfo\x00SPAN_KIND_SERVER\x00STATUS_CODE_UNSET"      {"kind": "processor", "name": "spanmetrics", "pipeline": "traces"}                                                                                     
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).collectCallMetrics                                                                                                                                                                                                                                                  
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:330                                                                                                                                                                                                                                                    
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).buildMetrics                                                                                                                                                                                                                                                        
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:258                                                                                                                                                                                                                                                    
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces                                                                                                                                                                                                                                                       
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:232                                                                                                                                                                                                                                                    
go.opentelemetry.io/collector/processor/processorhelper.NewTracesProcessor.func1                                                                                                                                                                                                                                                                                             
        go.opentelemetry.io/[email protected]/processor/processorhelper/traces.go:70                                                                                                                                                                                                                                                                                         
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces                                                                                                                                                                                                                                                                                                       
        go.opentelemetry.io/[email protected]/consumer/traces.go:36                                                                                                                                                                                                                                                                                                          
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/jaegerreceiver.(*jReceiver).PostSpans                                                                                                                                                                                                                                                                     
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/trace_receiver.go:290                                                                                                                                                                                                                                                      
github.com/jaegertracing/jaeger/proto-gen/api_v2._CollectorService_PostSpans_Handler.func1                                                                                                                                                                                                                                                                                   
        github.com/jaegertracing/[email protected]/proto-gen/api_v2/collector.pb.go:208                                                                                                                                                                                                                                                                                         
go.opentelemetry.io/collector/config/configgrpc.enhanceWithClientInformation.func1                                                                                                                                                                                                                                                                                           
        go.opentelemetry.io/[email protected]/config/configgrpc/configgrpc.go:382                                                                                                     
google.golang.org/grpc.chainUnaryInterceptors.func1.1                                                                                                                                 
        google.golang.org/[email protected]/server.go:1116                                                                                                                                                                                                                                                                                                                        
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1                                                                              
        go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/interceptor.go:325                                                                                                                                                                                                                                                               
google.golang.org/grpc.chainUnaryInterceptors.func1.1                                                                                                                                 
        google.golang.org/[email protected]/server.go:1119                                                                                                                                 
google.golang.org/grpc.chainUnaryInterceptors.func1                                                                                                                                   
        google.golang.org/[email protected]/server.go:1121                                                                                                                                 
github.com/jaegertracing/jaeger/proto-gen/api_v2._CollectorService_PostSpans_Handler                                                                                                  
        github.com/jaegertracing/[email protected]/proto-gen/api_v2/collector.pb.go:210                                                                                                  
google.golang.org/grpc.(*Server).processUnaryRPC                                                                                                                                      
        google.golang.org/[email protected]/server.go:1282                                                                                                                                 
google.golang.org/grpc.(*Server).handleStream                                                                                                                                         
        google.golang.org/[email protected]/server.go:1619                                                                                                                                 
google.golang.org/grpc.(*Server).serveStreams.func1.2                                                                                                                                 
        google.golang.org/[email protected]/server.go:921  

What version did you use?
Version: v0.48.0

What config did you use?
Config: (e.g. the yaml config file)

Environment
OS: Debian 11
Compiler(if manually compiled): vanilla docker image

Additional context

💙

@greut greut added the bug Something isn't working label Apr 4, 2022
@jpkrohling
Member

cc @albertteoh

@mx-psi mx-psi added the processor/spanmetrics Span Metrics processor label Apr 7, 2022
@albertteoh
Contributor

For posterity, please refer to this related discussion: https://cloud-native.slack.com/archives/C01N6P7KR6W/p1649330166040589.

@gnozil

gnozil commented Apr 25, 2022

Also hit this issue when trying to use the Jaeger Monitor function in jaegertracing 1.33. @albertteoh, I cannot access the cloud-native Slack channel, could you paste the discussion here? Thanks.

@albertteoh
Contributor

Hi all 👋, I'd like to instrument an exporter component to emit a counter metric each time there is a cache miss, and avoid logging such an event (context: #9018).
Wondering if someone could please point me to an existing example where the otel-collector emits metrics about itself for a particular component?

Related issue: open-telemetry/opentelemetry-collector#4198

I think the proposal is to use otel-go to emit collector self-metrics, but the otel-go SDK is not quite ready for v1.x, which should contain the self-metrics feature; although there has been steady progress towards this goal it seems.
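
For illustration, a rough sketch of the kind of self-metric counter being discussed, written against the otel-go metric API; the package, meter, and counter names here are made up and are not the processor's actual instrumentation:

```go
package spanmetrics // hypothetical package, for illustration only

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// cacheMisses would count metricKeyToDimensions cache misses instead of logging them.
var cacheMisses metric.Int64Counter

func initSelfMetrics() error {
	meter := otel.Meter("spanmetricsprocessor") // illustrative instrumentation scope name
	var err error
	cacheMisses, err = meter.Int64Counter("processor_spanmetrics_cache_miss_total")
	return err
}

// recordCacheMiss increments the counter once per cache miss.
func recordCacheMiss(ctx context.Context) {
	cacheMisses.Add(ctx, 1)
}
```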

@jvilhuber
Contributor

jvilhuber commented May 10, 2022

I'm also experiencing a significant flood of these errors (at times > 600/sec). If these are harmless, can they be lowered to info or debug?

If I read the comment above correctly, there are some changes coming that would get rid of the code emitting these errors (right?), but if that is still a ways out, reducing the log level seems easy and quick enough. As is, this is preventing me from using this feature (errors in our logs that aren't errors will just cause lots of misunderstanding).

@arun-shopify
Contributor

I am seeing a bunch of these errors in our deployment as well; it would be good to know whether they are harmless. I am wondering if the spans that hit this error go unprocessed. Looking at that part of the code briefly, it seems like that may be the case (correct me if I am wrong).

@logamanig

Yes, @arun-shopify, the collector instance stops processing traces and we have to restart it to clear the error. It has become a blocker for migrating to OpenTelemetry, as it breaks a few times a day and needs a restart each time.

@gnozil

gnozil commented Jun 5, 2022

This is not only a logging issue; the otel-collector instance stops working and traces are lost. In our deployment, the otel-collector is used to extract trace metrics and store them in Prometheus, and the issue broke trace forwarding to the Jaeger collector.

@forestsword

This is our experience as well. This will cause the collector to stop sending traces further down the service pipeline. The healthcheck continues to return healthy, resulting in 'stuck' collectors.

This is much more than a log entry. @greut or @albertteoh, the issue doesn't convey the severity of this bug. Can we modify this issue to say that it stops processing, or should we create a new issue?

@greut
Author

greut commented Jun 15, 2022

We didn't experience any service disruption, only verbose logs.

@albertteoh
Contributor

Agree, the problem is more serious than just a high volume of logging. It involves the loss of trace data.

I haven't invested the time to root-cause this properly yet, but I suspect the problem is that the spanmetrics processor creates the key for the metrics dimensions, caches it, and then, quite reasonably, expects it to still exist in the cache so that it can build the metrics with those dimensions: processor.go#L233-L235.

However, because there's no lock between "setting" the key into cache and "getting" the key from cache, I think there's a potential race.

I suspect one issue is with the ContainsOrAdd method used here. Its documented comment reads:

// ContainsOrAdd checks if a key is in the cache without updating the
// recent-ness or deleting it for being stale, and if not, adds the value.
// Returns whether found and whether an eviction occurred.

I see two options so far:

  1. In cache(...), attempt to Get the key first (which updates recent-ness and should be a relatively cheap operation), and if it doesn't exist, then Add (sketched below).
    I'm not 100% confident that this resolves the root cause, but I feel it's the more correct thing to do regardless.

  2. Another option is to lock the two lines of code that should be synchronous. This comes at the cost of longer CPU time spent within a mutex lock; but it should simplify the code as it removes the need for a number of locks in that call path.

Would be happy to get any feedback from folks about this or if anyone has looked at this problem at more depth than I have. 😄
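
For illustration, here's a minimal sketch of option 1, assuming the hashicorp/golang-lru v1 API; the cacheDimensions helper and the value type are hypothetical, not the processor's actual code:

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

// cacheDimensions illustrates option 1: Get first (which bumps recent-ness),
// and only Add the entry on a miss.
func cacheDimensions(c *lru.Cache, key string, dims map[string]string) {
	if _, ok := c.Get(key); !ok {
		c.Add(key, dims)
	}
}

func main() {
	// Tiny cache, analogous to dimensions_cache_size: 2.
	c, err := lru.New(2)
	if err != nil {
		panic(err)
	}
	cacheDimensions(c, "svc\x00op", map[string]string{"service.name": "svc"})
	dims, ok := c.Get("svc\x00op")
	fmt.Println(ok, dims) // true map[service.name:svc]
}
```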

@logamanig

logamanig commented Jun 15, 2022

I agree with @albertteoh; it's the spanmetrics processor that's breaking the collector. Disabling spanmetrics is the only way to send traces without loss.

My environment:

Trace pipeline:
Otlp receiver
Spanmetrics and batch processor
Jaeger exporter

Metric pipeline:
Prometheus receiver that collects the collector's own metrics
Prometheus exporter (it is also the exporter for spanmetrics)

As mentioned by @forestsword, the collector's pod status is healthy even though it can't send traces. The only way to identify the problem is to check whether the trace backend has received any traces recently. When running multiple replicas, traces are lost only for the affected pods. So, to reproduce this issue, I suggest using a single replica and checking the trace backend (for me, Jaeger) to see if traces are still being received after the error is logged.
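
For illustration, a sketch of the pipelines described above (endpoints, ports, and scrape settings are placeholders, not my actual config):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector          # scrape the collector's own metrics
          static_configs:
            - targets: ["localhost:8888"]   # placeholder self-metrics endpoint
processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus
exporters:
  jaeger:
    endpoint: jaeger-collector:14250        # placeholder endpoint
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics, batch]
      exporters: [jaeger]
    metrics:
      receivers: [prometheus]
      exporters: [prometheus]
```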

@albertteoh
Contributor

PR to address this issue: #11149

@amoscatelli

We have the same issue; I have to restart the otel collector every day...

@gnozil

gnozil commented Jul 6, 2022

PR to address this issue: #11149

@albertteoh, when will this PR be resolved? Thanks.

@albertteoh
Contributor

@albertteoh, when will this PR be resolved? Thanks.

@gnozil this was merged 2 days ago.

@nguyenhoaibao

@albertteoh I'm using v0.59.0 but still seeing this issue.

@albertteoh
Contributor

@albertteoh I'm using v0.59.0 but still seeing this issue.

Are you able to share your otel collector config and the stack trace of the error please, @nguyenhoaibao?

@albertteoh
Contributor

Hi @greut (+ others), just following up to see if you've had a chance to upgrade and whether this issue still exists, as @nguyenhoaibao reported. I'd like to understand if further investigation is required and, if so, whether you're able to provide your configs and the error stack trace again, please.

@greut greut changed the title metriyKeyToDimensions cache is printing many stack traces metricKeyToDimensions cache is printing many stack traces Sep 16, 2022
@greut
Author

greut commented Sep 16, 2022

Since upgrading from 0.53 to 0.59 ten days ago, that error is gone from our logs.

[Screenshot: Discover - OpenSearch Dashboards]

EDIT btw thanks for the fixes!!

@forestsword

forestsword commented Sep 20, 2022

A partial workaround for us was to run with a 10k cache and to only process server and consumer spans using a filter (inspired by #12615).

I can confirm it still fails on 0.59.0 but with the work-around it takes about a day for it to happen.

I've just removed my workarounds to reproduce it. There are no error logs, but I do see CPU spiking when this happens.

[Screenshot: Grafana]

Once it hits its limit, it looks like it gets throttled pretty dramatically. Then, poof, nothing. Spans begin to be refused by the OTLP receiver and the accumulator stops logging anything. This line disappears afterwards:

2022-09-20T11:58:31.237Z  debug [email protected]/accumulator.go:288 metric expired: latency {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}

It continues to report healthy but refuses all incoming spans.

I've grabbed some pprof profiles if you're interested, @albertteoh; I could give them to you via Slack. Could this be an issue in the prometheusexporter itself?

@Bruno-DaSilva

Experiencing the exact same behaviour as the above post. Ran into value not found in metricKeyToDimensions cache by key logs, upgraded to the 'fix' in v0.59.0, and the log messages disappeared. But it still stops processing incoming spans at a certain point.

Is it possible the fix just papered over the log emission but didn't fix the root cause? I'm having to set up a job that restarts our deployment every day, or we lose all the spans that we want processed into span metrics.

What's worse is that it just silently fails. Nothing in the logs; it just doesn't process the spans. Not great!

@gnozil

gnozil commented Nov 14, 2022

We finally removed the otel-collector from our environment. Originally we assumed the trace metrics from the otel-collector would be useful for the Jaeger-UI Monitor tab; however, it turned out that the Monitor tab in Jaeger-UI is much less useful -- there is no filter function there, and metrics from different environments get mixed together, which makes the data meaningless. Since we stopped using the Jaeger-UI Monitor function, there was no need to run the otel-collector anymore. After removing the otel-collector we have not seen trace loss any more, so we suspect the original trace loss was also caused by the otel-collector.

@albertteoh
Contributor

albertteoh commented Nov 15, 2022

@Bruno-DaSilva @forestsword @greut the bug indeed hasn't yet been fixed and #11149 was just a speculative attempt at a fix without being able to reproduce the problem reliably, or perhaps there was a regression introduced. The spanmetrics processor was still propagating the error out of the processor, so I'm not sure why the error wasn't being logged anymore in v0.59.0.

Fortunately, I just found out a way to reliably reproduce this error by reducing the configurable dimensions_cache_size down to a small number (e.g. 2). I've opened a PR to address this (with thanks to @Frapschen).

@greut
Author

greut commented Nov 15, 2022

As we got rid of Jaeger-UI Monitor function, then no need to run otel-collector anymore

Same: otelcontrib instead of the jaeger-agent on the host, and then the jaeger-collector. Hoping for a better future with the fully integrated jaeger-collector 2.
