Incorrect Behavior in OpenTelemetry Collector Spanmetrics #27472
Comments
Thanks for those details @lucasoares. I agree that this seems relatively straightforward to reproduce locally with just the spanmetrics connector and a Prometheus server. The objective would be to eliminate Mimir from the equation and confirm (or rule out) that the problem relates to the spanmetrics connector. You could use this working docker-compose setup with the spanmetrics connector + prometheus (+ jaeger) as a template: https://github.com/jaegertracing/jaeger/tree/main/docker-compose/monitor.
Thank you for the suggestion to set up a local test environment with the spanmetrics connector and Prometheus. We followed your instructions and the configuration worked perfectly in our local test environment. This helped us confirm that the spanmetrics connector and Prometheus configuration appear to be working as expected, and that the initial issue we were facing may not be directly related to these components.
We also have a Homologation (HMG) environment that is identical to the production environment, but we have not been able to observe the same erroneous behavior in it. Below are the configuration files for the HMG environment:
The loadbalancer:
nameOverride: ""
mode: "deployment"
configMap:
config:
connectors:
extensions:
image:
command:
nodeSelector:
ports:
podAnnotations:
replicaCount: 2
revisionHistoryLimit: 10
service:
podDisruptionBudget:
rollout:
clusterRole:
clusterRoleBinding:
The tail sampler:
nameOverride: ""
mode: "deployment"
configMap:
config:
image:
command:
nodeSelector:
ports:
podAnnotations:
replicaCount: 2
revisionHistoryLimit: 10
service:
podDisruptionBudget:
rollout:
Could that issue be related to this problem? I was on version 0.83.0 and have upgraded to 0.88.0 to check whether it is fixed. I'll come back here to report the results :D
@luistilingue Have you been able to test this yet? (Or @lucasoares)
@crobert-1 The issue still persists, even after updating to 0.91.0. Could it be related to cache pruning, as described in grafana/agent#5271 and #17306? That behavior is impacting our usage of otel-collectors :(
@nijave Could you tell us whether you were able to fix that behavior?
I got the problem solved. It's related to Mimir HA Dedup. So after adding the external_labels in
@lucasoares Can you confirm that what @luistilingue has suggested resolves your issue?
Yes
I'm going to close the issue for now as it appears to be resolved, but let me know if there's anything else required here.
Can you elaborate a bit more? We are experiencing similar issues.
https://grafana.com/docs/mimir/latest/configure/configure-high-availability-deduplication/
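To elaborate a little on why a labeling problem can inflate `increase()` and `rate()` so dramatically, here is a minimal Python sketch. It rests on an assumption that the thread does not confirm: that two collector replicas each kept their own `traces_spanmetrics_calls_total` counter and wrote it to the backend under an identical label set, so their samples were interleaved into one series. The function `reset_aware_increase` is a hypothetical simplification of Prometheus-style counter-reset handling (no range extrapolation), and all sample values are made up for illustration.

```python
from itertools import chain


def reset_aware_increase(samples):
    """Sum the growth of a counter, treating any drop as a reset to zero.

    Simplified model of Prometheus-style reset handling: when a sample is
    lower than the previous one, the counter is assumed to have restarted,
    so the whole new value counts as growth.
    """
    total = 0.0
    prev = samples[0]
    for cur in samples[1:]:
        if cur < prev:
            total += cur          # apparent reset: count from zero
        else:
            total += cur - prev   # normal monotonic growth
        prev = cur
    return total


def interleave(a, b):
    """Alternate samples from two writers, as if they shared one series."""
    return list(chain.from_iterable(zip(a, b)))


# Hypothetical counters held by two collector replicas (illustrative values).
replica_a = [1000, 1010, 1020, 1030]
replica_b = [400, 405, 410, 415]

true_increase = reset_aware_increase(replica_a) + reset_aware_increase(replica_b)
merged_increase = reset_aware_increase(interleave(replica_a, replica_b))

print("true increase:  ", true_increase)    # 45.0
print("merged increase:", merged_increase)  # 3475.0
```

In this toy model the two replicas together recorded 45 calls, but the merged series reports 3475, because every switch from the higher counter to the lower one looks like a reset. Giving each writer a distinguishing label keeps the series separate, which is consistent with the fix reported above involving external_labels and Mimir HA deduplication, although the thread does not spell out the exact configuration.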
Component(s)
connector/spanmetrics, exporter/prometheusremotewrite
What happened?
Subject: Issue Report: Incorrect Behavior in OpenTelemetry Collector Spanmetrics
Issue Description:
We're facing a peculiar issue with the OpenTelemetry Collector's Spanmetrics connector and could use some help sorting it out.
Here's a quick rundown:
Problem:
Spanmetrics Configuration:
Issue Details:
increase(traces_spanmetrics_calls_total{service_name="my-service"}[5m])
shows a continuously increasing line, reaching 600 executions, and never returning to 0, even after a trace-free period.
Observations:
The discrepancy is causing inflated values in application metrics, with rate showing over 100,000,000 spans/minute for an app generating 40,000 spans/minute.
We sought help on the Grafana Mimir Slack channel (link) without success. Since we haven't found issues with metrics generated by our own applications, this suggests the problem lies within the OpenTelemetry Collector.
Screenshots:
Another example of the metric being incorrect after the application no longer generates new spans:
If you need more details or logs, just let us know!
Collector version
0.83.0
Environment information
Environment
Kubernetes using official helm-chart:
OpenTelemetry Collector configuration
The loadbalancer:
The tail sampler: