
Inconsistent results between prometheusexporter and prometheusremotewrite #4975

Closed
albertteoh opened this issue May 8, 2021 · 11 comments

@albertteoh
Contributor

Describe the bug
When using prometheusremotewrite to export metrics to M3, I'm getting latencies that are over 200 years when queried from M3.

However, when scraping these metrics from prometheus, the latencies look correct.

Am I configuring something incorrectly?

Steps to reproduce

  1. Run opentelemetry-collector-contrib with the config attached in this issue.
  2. Send spans to the collector using omnition/synthetic-load-generator, and with the spanmetrics processor sending metrics to a prometheus exporter.
  3. Have a prometheus server running to scrape metrics from the prometheus exporter.
  4. Within the OpenTelemetry Collector, a prometheus receiver also scrapes metrics from the same prometheus exporter above and sends them to the configured prometheusremotewrite exporter, which writes to an M3 instance.
  5. Set up Grafana to query both data sources (Prometheus server and M3) and graph both time series with the same query.

What did you expect to see?
Identical 95th percentile latencies, or at least close enough to one another.

What did you see instead?
Latencies from M3 were over 200 years, whereas from Prometheus, they were a more sensible ~200ms.

Here are two screenshots of the same query executed against Prometheus and M3 data sources respectively:

Prometheus: [screenshot of the Grafana panel showing ~200ms p95 latencies]

M3: [screenshot of the Grafana panel showing p95 latencies of over 200 years]

To reduce the search space by ruling out M3 and spanmetrics processor as possible causes, I also checked the logs (these are from an earlier run):

Here, I log the total latency_count as well as the latency_bucket counts within the spanmetrics processor. I've taken logs from two different times, 10 seconds apart, and as you can see, the count is consistent with the sum of bucket counts:

2021-05-08T18:54:20.947+1000    debug   [email protected]/processor.go:258        Latency metrics {"kind": "processor", "name": "spanmetrics", "key": "frontend\u0000/checkout\u0000SPAN_KIND_CLIENT\u0000STATUS_CODE_UNSET", "count": 2, "bucket_count": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]}

2021-05-08T18:54:30.958+1000    debug   [email protected]/processor.go:258        Latency metrics {"kind": "processor", "name": "spanmetrics", "key": "frontend\u0000/checkout\u0000SPAN_KIND_CLIENT\u0000STATUS_CODE_UNSET", "count": 3, "bucket_count": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0]}

However, this is the log output from the last metrics pipeline in the config below, i.e.:

    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite, logging]

As you can see, the total count is 1 but the bucket counts sum to 2 + 1 = 3, so I believe the +Inf bucket tries to account for this discrepancy, resulting in -2 represented as its uint64 equivalent, 18446744073709551614. I have also seen logs where the total count > sum of bucket counts, leading to a "positive" spillover +Inf count.

HistogramDataPoints #12
Data point labels:
     -> operation: /checkout
     -> service_name: frontend
     -> span_kind: SPAN_KIND_CLIENT
     -> status_code: STATUS_CODE_UNSET
StartTimestamp: 2021-05-08 08:54:22.433 +0000 UTC
Timestamp: 2021-05-08 08:54:32.437 +0000 UTC
Count: 1
Sum: 1708.000000
ExplicitBounds #0: 2.000000
ExplicitBounds #1: 6.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 100.000000
ExplicitBounds #4: 250.000000
ExplicitBounds #5: 300.000000
ExplicitBounds #6: 400.000000
ExplicitBounds #7: 800.000000
ExplicitBounds #8: 1000.000000
ExplicitBounds #9: 1400.000000
ExplicitBounds #10: 2000.000000
ExplicitBounds #11: 5000.000000
ExplicitBounds #12: 15000.000000
ExplicitBounds #13: 30000.000000
ExplicitBounds #14: 120000.000000
ExplicitBounds #15: 9223372036854.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 2
Buckets #11, Count: 1
Buckets #12, Count: 0
Buckets #13, Count: 0
Buckets #14, Count: 0
Buckets #15, Count: 0
Buckets #16, Count: 18446744073709551614
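
For illustration, here is a minimal Go sketch of the suspected arithmetic (an assumption about the mechanism, not the exporter's actual code): computing the +Inf bucket as "total count minus the sum of the explicit buckets" on unsigned integers wraps around when the buckets add up to more than the count.

    package main

    import "fmt"

    func main() {
        // Values taken from the log output above.
        totalCount := uint64(1)
        bucketSum := uint64(2 + 1) // the two populated buckets

        // A naive "+Inf = count - sum(buckets)" underflows on uint64 when the
        // bucket sum exceeds the reported total count.
        infBucket := totalCount - bucketSum
        fmt.Println(infBucket) // 18446744073709551614, the uint64 representation of -2
    }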

What version did you use?
Version: opentelemetry-collector-contrib@master

What config did you use?
Config: (e.g. the yaml config file)

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'atm'
        scrape_interval: 10s
        static_configs:
        - targets: [ "0.0.0.0:8889" ]

  jaeger:
    protocols:
      thrift_http:
        endpoint: "0.0.0.0:14278"

  # Dummy receiver that's never used, because a pipeline is required to have one.
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:65535"

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

  logging:
    loglevel: debug

  jaeger:
    endpoint: "localhost:14250"
    insecure: true

  prometheusremotewrite:
    endpoint: "http://localhost:7201/api/v1/prom/remote/write"
    insecure: true

  otlp/spanmetrics:
    endpoint: "localhost:55677"
    insecure: true


processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus
    # default (in ms): [2, 4, 6, 8, 10, 50, 100, 200, 400, 800, 1000, 1400, 2000, 5000, 10_000, 15_000]
    latency_histogram_buckets: [2ms, 6ms, 10ms, 100ms, 250ms, 300ms, 400ms, 800ms, 1s, 1.4s, 2s, 5s, 15s, 30s, 120s]

extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [spanmetrics]
      exporters: [jaeger]
    # The exporter name must match the metrics_exporter name.
    # The receiver is just a dummy and never used; added to pass validation requiring at least one receiver in a pipeline.
    metrics/spanmetrics:
      receivers: [otlp/spanmetrics]
      exporters: [prometheus, logging]
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite, logging]

Environment
OS: macOS
Compiler(if manually compiled): go 1.16

Additional context
cc @bogdandrutu

@bogdandrutu bogdandrutu transferred this issue from open-telemetry/opentelemetry-collector Aug 30, 2021
@alolita alolita added the comp:prometheus Prometheus related issues label Sep 2, 2021
@ankitnayan

@albertteoh Were you able to make this work? Or is there any workaround?

@albertteoh
Contributor Author

Hi @ankitnayan, my workaround was to filter out any latencies > 24 hours, which isn't nice but it does the job for my use case at least.

@luistilingue

luistilingue commented Mar 10, 2022

@albertteoh I have a similar issue, but using Prometheus.

The spanmetrics processor is creating a bucket with le="9.223372036854775e+12":

latency_bucket{http_status_code="200",operation="/health/xxxxx/health/**",service_name="xxxxx",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="9.223372036854775e+12"} 1

I guess the spanmetrics processor code needs to handle Go number conversion when using float64.

Please have a look at these lines: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/spanmetricsprocessor/processor.go#L140

I've simulated it here and the behavior was the same.

So when Prometheus reads that "number", it behaves this way. I don't know if this doc can help.
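
That bound looks like a catch-all upper bucket equal to the maximum Go time.Duration expressed in milliseconds (about 292 years, which also matches the "over 200 years" latencies reported above). Here is a minimal Go sketch that reproduces a value of that magnitude (an assumption for illustration, not the processor's exact code):

    package main

    import (
        "fmt"
        "math"
        "time"
    )

    func main() {
        // Largest representable duration (~292 years), converted to milliseconds.
        maxDuration := time.Duration(math.MaxInt64)
        maxDurationMs := float64(maxDuration.Nanoseconds()) / float64(time.Millisecond.Nanoseconds())
        fmt.Println(maxDurationMs) // ≈ 9.223372036854776e+12
    }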

@ankitnayan

@luistilingue I upgraded to v0.43.0 and it seems to have been fixed. Which version are you using?

@luistilingue

@ankitnayan I was using 0.40.0 and upgraded my whole stack (OTel Collector to 0.46.0, Prometheus to 2.33.5, and the Java agent to 1.11.1), but the issue still occurs.

@balintzs
Contributor

balintzs commented Oct 19, 2022

Hello all, I believe this is caused by a bug we found in the Prometheus client library that causes the +Inf bucket to be added incorrectly, which in turn results in a negative number when converting cumulative datapoints to delta: prometheus/client_golang#1147

We faced an issue whereby New Relic dropped our datapoints because of this. The issue existed with 0.61.0 but became much worse with 0.62.0. We built a custom image updating github.com/prometheus/client_golang to the SHA version (dcea97eee2b3257f34fd3203cb922eedeabb42a6) that contained our fix and the issue disappeared:
[screenshot showing the dropped datapoints disappearing after the update]

cc @TylerHelmuth
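
As a rough illustration of the mechanism described in this comment (a sketch assuming the converter simply subtracts consecutive cumulative values; it is not the collector's actual code): if one scrape's +Inf bucket is inflated by the client bug, the cumulative series stops being monotonic and the computed delta can come out negative.

    package main

    import "fmt"

    // delta of a single cumulative bucket between two consecutive scrapes
    func delta(prev, curr uint64) int64 {
        return int64(curr) - int64(prev)
    }

    func main() {
        prevInf := uint64(12) // scrape N: +Inf count incorrectly bumped by the bug
        currInf := uint64(11) // scrape N+1: without the spurious observation

        // A negative "delta" like this is what downstream backends reject.
        fmt.Println(delta(prevInf, currInf)) // -1
    }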

@github-actions
Contributor

Pinging code owners: @Aneurysm9. See Adding Labels via Comments if you do not have permissions to add labels yourself.

1 similar comment

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Dec 19, 2022
@github-actions
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Mar 19, 2023