metricKeyToDimensions cache is printing many stack traces #9018
Comments
cc @albertteoh
For posterity, please refer to this related discussion: https://cloud-native.slack.com/archives/C01N6P7KR6W/p1649330166040589.
Also hit this issue when trying to use the Jaeger Monitor function in jaegertracing 1.33. @albertteoh, I cannot access the cloud-native Slack channel; could you paste the discussion here? Thanks.
Related issue: open-telemetry/opentelemetry-collector#4198. I think the proposal is to use otel-go to emit the collector's self-metrics, but the otel-go SDK is not quite at v1.x yet, which is the release that should contain the self-metrics feature; there has been steady progress towards this goal, though.
I'm also experiencing a significant flood of these errors (at times > 600/sec). If these are harmless, can they be lowered to info or debug? If I read the comment above correctly, there are changes coming that would get rid of the code emitting these errors (right?), but if that is still a ways out, reducing the log level seems easy and quick enough. As is, this is preventing me from using this feature (errors in our logs that aren't errors will just cause lots of misunderstanding).
I am seeing a bunch of these errors in our deployment as well; it would be good to know whether they are harmless. I am wondering if the spans that hit this error go unprocessed. Looking at that part of the code briefly, it seems like that may be the case (correct me if I am wrong).
Yes, @arun-shopify, the collector instance stops processing traces and we have to restart it to clear the error. It has become a blocker to migrating to OpenTelemetry, as it breaks a few times a day and requires a restart each time.
This is not only a logging issue; the otel-collector instance stops working, resulting in trace loss. In our deployment the otel-collector is used to extract trace metrics and store them in Prometheus, and this issue broke the forwarding of traces to the Jaeger collector.
This is our experience as well. It causes the collector to stop sending traces further down the service pipeline, while the healthcheck continues to return healthy, resulting in 'stuck' collectors. This is much more than a log entry. @greut or @albertteoh, the issue doesn't indicate the severity of this bug. Can we modify this issue to say that it stops processing, or should we create a new issue?
We didn't experience any service disruption, only verbose logs.
Agree, the problem is more serious than just a high volume of logging: it involves the loss of trace data. I haven't invested the time to root-cause this properly yet, but I suspect the problem is that the spanmetrics processor creates the key for the metrics dimensions, caches it, and then, quite reasonably, expects it to still exist in the cache so that it can build the metrics with those dimensions: processor.go#L233-L235. However, because there's no lock between "setting" the key into the cache and "getting" the key from the cache, I think there's a potential race (a minimal sketch of this failure mode follows this comment). I suspect one issue is with the
I see two options so far:
Would be happy to get any feedback from folks about this, or to hear whether anyone has looked at this problem in more depth than I have. 😄
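For illustration only, here is a minimal, self-contained Go sketch of the suspected failure mode, not the actual processor code. It assumes an LRU cache along the lines of hashicorp/golang-lru; the cache size, key strings, and error text are invented for the example. The point is that keys cached during span aggregation can be evicted by later keys before the metrics-collection step reads them back, so the later lookup misses and an error is surfaced.

```go
// Minimal sketch of the suspected failure mode, NOT the actual spanmetrics
// processor code. Cache size, keys, and error text are illustrative.
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

func main() {
	// Deliberately tiny cache, mimicking a small dimensions_cache_size.
	cache, _ := lru.New(2)

	// "Aggregation" phase: each unique metric key stores its dimensions.
	for _, key := range []string{"svcA/opA/200", "svcB/opB/200", "svcC/opC/500"} {
		cache.Add(key, map[string]string{"metricKey": key}) // the third Add evicts the first key
	}

	// "Collection" phase: the processor expects every aggregated key to
	// still be in the cache, but the first one has already been evicted,
	// so the lookup fails and an error (with stack trace) would be emitted.
	if _, ok := cache.Get("svcA/opA/200"); !ok {
		fmt.Println("error: value not found in metricKeyToDimensions cache, key: svcA/opA/200")
	}
}
```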
I agree with @albertteoh; it's the spanmetrics processor that's breaking the collector. Disabling spanmetrics is the only option for sending traces without loss. My environment: Trace pipeline: Metric pipeline: As mentioned by @forestsword, the collector's pod status is healthy even though it can't send traces. The only way to identify the problem is to check whether the trace backend has received any traces recently. When running multiple replicas, traces are lost only for the affected pods. So, to reproduce this issue, I suggest using a single replica and checking the trace backend (for me, Jaeger) to see whether traces are still being received after the error is logged.
PR to address this issue: #11149
We have the same issue; I have to restart the otel collector every day...
@albertteoh, when will this PR be resolved? Thanks.
@gnozil this was merged 2 days ago.
@albertteoh I'm using v0.59.0 but still seeing this issue.
Are you able to share your otel collector config and the stack trace of the error, please, @nguyenhoaibao?
Hi @greut (+ others), just following up to see if you've had a chance to upgrade and whether this issue still exists, as @nguyenhoaibao reported. I'd like to understand whether further investigation is required and, if so, whether you could provide your configs and the error stack trace again, please.
A semi work-around for us was to run with a 10k cache and to only process server and consumer spans using a filter (inspiration from #12615); a rough sketch of this configuration follows this comment. I can confirm it still fails on 0.59.0, but with the work-around it takes about a day for it to happen. I've just removed my work-arounds to reproduce it. There are no error logs, but I do see CPU spiking when this happens. Once it hits its limit it looks like it gets throttled pretty dramatically, and then, poof, nothing: spans begin to be refused by the OTLP receiver and the accumulator stops logging anything. This line disappears after:
It continues to report healthy but refuses all incoming spans. I've grabbed some pprof profiles; if you're interested, @albertteoh, I could give them to you via Slack. Could this be an issue in the prometheusexporter itself?
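For reference, a rough YAML sketch of the work-around described above; this is an assumption-laden illustration, not the poster's actual config. Only the enlarged 10k cache and the idea of keeping just server and consumer spans come from the comment; the exporter name is a placeholder and the filter processor's exact syntax is deliberately omitted.

```yaml
processors:
  spanmetrics:
    metrics_exporter: prometheus   # placeholder; use whichever metrics exporter your pipeline has
    dimensions_cache_size: 10000   # enlarged cache, per the work-around above
  # In addition, a filter processor was placed before spanmetrics so that
  # only server and consumer spans are turned into metrics (see #12615 for
  # the span-kind filtering discussion); its configuration is not shown here.
```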
Experiencing the exact same behaviour as the above post. Ran into Is it possible the fix just band-aided the log emission but didn't fix the root cause? I'm having to set up a job that restarts our deployment every day, or we lose all the spans that we want processed into spanmetrics. What's worse is that it fails silently: nothing in the logs, the spans just aren't processed. Not great!
We finally removed the otel-collector from our environment. Originally we assumed the trace metrics from the otel-collector would be useful for the Jaeger UI Monitor tab; however, the Monitor tab turned out to be much less useful than expected: there is no filter function there, and metrics from different environments are mixed together, which makes the data meaningless. Once we stopped using the Jaeger UI Monitor function, there was no need to run the otel-collector anymore. After removing the otel-collector we have not seen trace loss any more, so we suspect the original trace loss was also caused by the otel-collector.
@Bruno-DaSilva @forestsword @greut the bug indeed hasn't been fixed yet; #11149 was just a speculative attempt at a fix without being able to reproduce the problem reliably, or perhaps a regression was introduced. The spanmetrics processor was still propagating the error out of the processor, so I'm not sure why the error wasn't being logged anymore in v0.59.0. Fortunately, I just found a way to reliably reproduce this error by reducing the configurable
Same here: otelcontrib instead of the jaeger-agent on the host, and then jaeger-collector. Hoping for a better future with the fully integrated jaeger-collector 2.
Describe the bug
#6503 added this LRU cache, which treats a missing key as an error and prints a big stack trace in the logs.
Steps to reproduce
Have either high-cardinality dimensions or a small dimensions_cache_size.
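For illustration, a minimal config sketch that should make the miss easy to trigger; the exporter name is a placeholder and dimensions_cache_size is the only knob that matters here.

```yaml
processors:
  spanmetrics:
    metrics_exporter: prometheus   # placeholder metrics exporter
    # Deliberately tiny cache: once more than two metric-key combinations
    # are active, earlier keys are evicted and the lookup error appears.
    dimensions_cache_size: 2
```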
What did you expect to see?
Either a metric for the cache hit ratio, or a much lower log level, such as debug.
What did you see instead?
A lot of stack traces.
What version did you use?
Version: v0.48.0
What config did you use?
Config: (e.g. the yaml config file)
Environment
OS: Debian 11
Compiler (if manually compiled): vanilla Docker image
Additional context
💙