
metricKeyToDimensions cache is printing many stack traces #9018

Closed
greut opened this issue Apr 4, 2022 · 25 comments · Fixed by #16601 or #17307
Labels
bug (Something isn't working) · processor/spanmetrics (Span Metrics processor)

Comments

@greut

greut commented Apr 4, 2022

Describe the bug

In #6503, the addition of an LRU cache treats a missing key as an error and prints a big stack trace in the logs.

Steps to reproduce

Have either high-cardinality dimensions or a small dimensions_cache_size.
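
For illustration, a minimal spanmetrics config along these lines should make cache evictions easy to trigger (the exporter name and extra dimension are placeholders, not our actual config):

```yaml
processors:
  spanmetrics:
    metrics_exporter: prometheus   # placeholder; any configured metrics exporter
    dimensions_cache_size: 2       # deliberately tiny so keys get evicted quickly
    dimensions:
      - name: http.method          # illustrative extra dimension
```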

What did you expect to see?

Either a metric for the cache hit/miss ratio, or a much lower log level such as debug.

What did you see instead?

A lot of stack traces:

2022-04-04T03:10:26.171Z        error   [email protected]/processor.go:330   value not found in metricKeyToDimensions cache by key "copernic\x00getUserInfo\x00SPAN_KIND_SERVER\x00STATUS_CODE_UNSET"      {"kind": "processor", "name": "spanmetrics", "pipeline": "traces"}                                                                                     
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).collectCallMetrics                                                                                                                                                                                                                                                  
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:330                                                                                                                                                                                                                                                    
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).buildMetrics                                                                                                                                                                                                                                                        
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:258                                                                                                                                                                                                                                                    
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces                                                                                                                                                                                                                                                       
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:232                                                                                                                                                                                                                                                    
go.opentelemetry.io/collector/processor/processorhelper.NewTracesProcessor.func1                                                                                                                                                                                                                                                                                             
        go.opentelemetry.io/[email protected]/processor/processorhelper/traces.go:70                                                                                                                                                                                                                                                                                         
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces                                                                                                                                                                                                                                                                                                       
        go.opentelemetry.io/[email protected]/consumer/traces.go:36                                                                                                                                                                                                                                                                                                          
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/jaegerreceiver.(*jReceiver).PostSpans                                                                                                                                                                                                                                                                     
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/trace_receiver.go:290                                                                                                                                                                                                                                                      
github.com/jaegertracing/jaeger/proto-gen/api_v2._CollectorService_PostSpans_Handler.func1                                                                                                                                                                                                                                                                                   
        github.com/jaegertracing/[email protected]/proto-gen/api_v2/collector.pb.go:208                                                                                                                                                                                                                                                                                         
go.opentelemetry.io/collector/config/configgrpc.enhanceWithClientInformation.func1                                                                                                                                                                                                                                                                                           
        go.opentelemetry.io/[email protected]/config/configgrpc/configgrpc.go:382                                                                                                     
google.golang.org/grpc.chainUnaryInterceptors.func1.1                                                                                                                                 
        google.golang.org/[email protected]/server.go:1116                                                                                                                                                                                                                                                                                                                        
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1                                                                              
        go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/interceptor.go:325                                                                                                                                                                                                                                                               
google.golang.org/grpc.chainUnaryInterceptors.func1.1                                                                                                                                 
        google.golang.org/[email protected]/server.go:1119                                                                                                                                 
google.golang.org/grpc.chainUnaryInterceptors.func1                                                                                                                                   
        google.golang.org/[email protected]/server.go:1121                                                                                                                                 
github.com/jaegertracing/jaeger/proto-gen/api_v2._CollectorService_PostSpans_Handler                                                                                                  
        github.com/jaegertracing/[email protected]/proto-gen/api_v2/collector.pb.go:210                                                                                                  
google.golang.org/grpc.(*Server).processUnaryRPC                                                                                                                                      
        google.golang.org/[email protected]/server.go:1282                                                                                                                                 
google.golang.org/grpc.(*Server).handleStream                                                                                                                                         
        google.golang.org/[email protected]/server.go:1619                                                                                                                                 
google.golang.org/grpc.(*Server).serveStreams.func1.2                                                                                                                                 
        google.golang.org/[email protected]/server.go:921  

What version did you use?
Version: v0.48.0

What config did you use?
Config: (e.g. the yaml config file)

Environment
OS: Debian 11
Compiler(if manually compiled): vanilla docker image

Additional context

💙

@greut greut added the bug Something isn't working label Apr 4, 2022
@jpkrohling
Member

cc @albertteoh

@mx-psi mx-psi added the processor/spanmetrics Span Metrics processor label Apr 7, 2022
@albertteoh
Contributor

For posterity, please refer to this related discussion: https://cloud-native.slack.com/archives/C01N6P7KR6W/p1649330166040589.

@gnozil

gnozil commented Apr 25, 2022

Also hit this issue when trying to use the Jaeger Monitor function in jaegertracing 1.33. @albertteoh, I cannot access the cloud-native Slack channel, could you paste the discussion here? Thanks.

@albertteoh
Contributor

Hi all 👋, I'd like to instrument an exporter component to emit a counter metric each time there is a cache miss, and avoid logging such an event (context: #9018).
Wondering if someone could please point me to an existing example where the otel-collector emits metrics about itself for a particular component?

Related issue: open-telemetry/opentelemetry-collector#4198

I think the proposal is to use otel-go to emit collector self-metrics, but the otel-go SDK is not quite ready for v1.x, which should contain the self-metrics feature; although there has been steady progress towards this goal it seems.
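
For illustration, a rough sketch of the kind of self-metric counter being discussed, written against the otel-go metric API; the package, meter, and counter names here are made up and are not the processor's actual instrumentation:

```go
package spanmetrics // hypothetical package, for illustration only

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// cacheMisses would count metricKeyToDimensions cache misses instead of logging them.
var cacheMisses metric.Int64Counter

func initSelfMetrics() error {
	meter := otel.Meter("spanmetricsprocessor") // illustrative instrumentation scope name
	var err error
	cacheMisses, err = meter.Int64Counter("processor_spanmetrics_cache_miss_total")
	return err
}

// recordCacheMiss increments the counter once per cache miss.
func recordCacheMiss(ctx context.Context) {
	cacheMisses.Add(ctx, 1)
}
```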

@jvilhuber
Contributor

jvilhuber commented May 10, 2022

I'm also experiencing a significant flood of these errors (at times > 600/sec). If these are harmless, can they be lowered to info or debug?

If I read the comment above correctly, there are some changes coming that would get rid of the code emitting these errors (right?), but if that is still a ways out, reducing the log level seems easy and quick enough. As is, this is preventing me from using this feature (errors in our logs that aren't errors will just cause lots of misunderstanding).

@arun-shopify
Contributor

I am seeing a bunch of these errors in our deployment as well; it would be good to know whether they are harmless. I am wondering if the spans that hit this error go unprocessed. Looking at that part of the code briefly, it seems like that may be the case (correct me if I am wrong).

@logamanig

Yes, @arun-shopify, the collector instance stops processing traces and we have to restart it to clear the error. It has become a blocker for migrating to OpenTelemetry, as it breaks a few times a day and needs a restart each time.

@gnozil

gnozil commented Jun 5, 2022

This is not only a logging issue; the otel-collector instance stops working and traces are lost. In our deployment, the otel-collector is used to extract trace metrics and store them in Prometheus, and the issue broke trace forwarding to the Jaeger collector.

@forestsword

This is our experience as well. This will cause the collector to stop sending traces further down the service pipeline. The healthcheck continues to return healthy, resulting in 'stuck' collectors.

This is much more than a log entry. @greut or @albertteoh, the issue doesn't convey the severity of this bug. Can we modify this issue to say that it stops processing, or should we create a new issue?

@greut
Author

greut commented Jun 15, 2022

We didn't experience any service disruption, only verbose logs.

@albertteoh
Contributor

Agree, the problem is more serious than just a high volume of logging. It involves the loss of trace data.

I haven't invested the time to root-cause this properly yet, but I suspect the problem is that the spanmetrics processor creates the key for the metrics dimensions, caches it, and then, quite reasonably, expects it to still exist in the cache so that it can build the metrics with those dimensions: processor.go#L233-L235.

However, because there's no lock between "setting" the key into cache and "getting" the key from cache, I think there's a potential race.

I suspect one issue is with the ContainsOrAdd method used here. Its documented comment reads:

// ContainsOrAdd checks if a key is in the cache without updating the
// recent-ness or deleting it for being stale, and if not, adds the value.
// Returns whether found and whether an eviction occurred.

I see two options so far:

  1. In cache(...), attempt to Get the key first (which updates recent-ness and should be a relatively cheap operation), and if it doesn't exist, then Add (sketched below).
    I'm not 100% confident that this resolves the root cause, but I feel it's the more correct thing to do regardless.

  2. Another option is to lock the two lines of code that should be synchronous. This comes at the cost of longer CPU time spent within a mutex lock; but it should simplify the code as it removes the need for a number of locks in that call path.

Would be happy to get any feedback from folks about this or if anyone has looked at this problem at more depth than I have. 😄
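
For illustration, here's a minimal sketch of option 1, assuming the hashicorp/golang-lru v1 API; the cacheDimensions helper and the value type are hypothetical, not the processor's actual code:

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

// cacheDimensions illustrates option 1: Get first (which bumps recent-ness),
// and only Add the entry on a miss.
func cacheDimensions(c *lru.Cache, key string, dims map[string]string) {
	if _, ok := c.Get(key); !ok {
		c.Add(key, dims)
	}
}

func main() {
	// Tiny cache, analogous to dimensions_cache_size: 2.
	c, err := lru.New(2)
	if err != nil {
		panic(err)
	}
	cacheDimensions(c, "svc\x00op", map[string]string{"service.name": "svc"})
	dims, ok := c.Get("svc\x00op")
	fmt.Println(ok, dims) // true map[service.name:svc]
}
```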

@logamanig

logamanig commented Jun 15, 2022

I agree with @albertteoh; it's the spanmetrics processor that's breaking the collector. Disabling spanmetrics is the only way to send traces without loss.

My environment:

Trace pipeline:
Otlp receiver
Spanmetrics and batch processor
Jaeger exporter

Metric pipeline:
Prometheus receiver that collects the collector's own metrics
Prometheus exporter (it is also the exporter for spanmetrics)

As mentioned by @forestsword, the collector's pod status is healthy even though it can't send traces. The only way to identify the problem is to check whether the trace backend has received any traces recently. When running multiple replicas, traces are lost only for the affected pods. So, to reproduce this issue, I suggest using a single replica and checking the trace backend (for me, Jaeger) to see if traces are still being received after the error is logged.
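
For illustration, a sketch of the pipelines described above (endpoints, ports, and scrape settings are placeholders, not my actual config):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector          # scrape the collector's own metrics
          static_configs:
            - targets: ["localhost:8888"]   # placeholder self-metrics endpoint
processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus
exporters:
  jaeger:
    endpoint: jaeger-collector:14250        # placeholder endpoint
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics, batch]
      exporters: [jaeger]
    metrics:
      receivers: [prometheus]
      exporters: [prometheus]
```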

@albertteoh
Contributor

PR to address this issue: #11149

@amoscatelli

We have the same issue; I have to restart the otel collector every day...

@gnozil

gnozil commented Jul 6, 2022

PR to address this issue: #11149

@albertteoh, when will this PR be resolved? Thanks.

@albertteoh
Contributor

@albertteoh, when will this PR be resolved? Thanks.

@gnozil this was merged 2 days ago.

@nguyenhoaibao

@albertteoh I'm using v0.59.0 but still seeing this issue.

@albertteoh
Contributor

@albertteoh I'm using v0.59.0 but still seeing this issue.

Are you able to share your otel collector config and the stack trace of the error please, @nguyenhoaibao?

@albertteoh
Contributor

Hi @greut (+ others), just following up to see if you've had a chance to upgrade and whether this issue still exists, as @nguyenhoaibao reported. I'd like to understand if further investigation is required and, if so, whether you're able to provide your configs and the error stack trace again, please.

@greut greut changed the title metriyKeyToDimensions cache is printing many stack traces metricKeyToDimensions cache is printing many stack traces Sep 16, 2022
@greut
Author

greut commented Sep 16, 2022

Since upgrading from 0.53 to 0.59 ten days ago, that error is gone from our logs.

[Screenshot: Discover - OpenSearch Dashboards]

EDIT btw thanks for the fixes!!

@forestsword

forestsword commented Sep 20, 2022

A partial workaround for us was to run with a 10k cache and to only process server and consumer spans using a filter (inspired by #12615).

I can confirm it still fails on 0.59.0 but with the work-around it takes about a day for it to happen.

I've just removed my workarounds to reproduce it. There are no error logs, but I do see CPU spiking when this happens.

[Screenshot: Grafana]

Once it hits its limit, it looks like it gets throttled pretty dramatically. Then, poof, nothing. Spans begin to be refused by the OTLP receiver and the accumulator stops logging anything. This line disappears afterwards:

2022-09-20T11:58:31.237Z  debug [email protected]/accumulator.go:288 metric expired: latency {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}

It continues to report healthy but refuses all incoming spans.

I've grabbed some pprof profiles if you're interested, @albertteoh; I could give them to you via Slack. Could this be an issue in the prometheusexporter itself?

@Bruno-DaSilva

Experiencing the exact same behaviour as the above post. Ran into value not found in metricKeyToDimensions cache by key logs, upgraded to the 'fix' in v0.59.0, and the log messages disappeared. But it still stops processing incoming spans at a certain point.

Is it possible the fix just papered over the log emission but didn't fix the root cause? I'm having to set up a job that restarts our deployment every day, or we lose all the spans that we want processed into span metrics.

What's worse is that it just silently fails. Nothing in the logs; it just doesn't process the spans. Not great!

@gnozil

gnozil commented Nov 14, 2022

We finally removed the otel-collector from our environment. Originally we assumed the trace metrics from the otel-collector would be useful for the Jaeger-UI Monitor tab; however, it turned out that the Monitor tab in Jaeger-UI is much less useful -- there is no filter function there, and metrics from different environments get mixed together, which makes the data meaningless. Since we stopped using the Jaeger-UI Monitor function, there was no need to run the otel-collector anymore. After removing the otel-collector we have not seen trace loss any more, so we suspect the original trace loss was also caused by the otel-collector.

@albertteoh
Contributor

albertteoh commented Nov 15, 2022

@Bruno-DaSilva @forestsword @greut the bug indeed hasn't yet been fixed and #11149 was just a speculative attempt at a fix without being able to reproduce the problem reliably, or perhaps there was a regression introduced. The spanmetrics processor was still propagating the error out of the processor, so I'm not sure why the error wasn't being logged anymore in v0.59.0.

Fortunately, I just found out a way to reliably reproduce this error by reducing the configurable dimensions_cache_size down to a small number (e.g. 2). I've opened a PR to address this (with thanks to @Frapschen).

@greut
Author

greut commented Nov 15, 2022

As we got rid of Jaeger-UI Monitor function, then no need to run otel-collector anymore

Same: otelcontrib instead of the jaeger-agent on the host, and then the jaeger-collector. Hoping for a better future with the fully integrated jaeger-collector 2.
