
panic: runtime error: index out of range in the ingester with native histograms #5576

Closed
aptomaKetil opened this issue Jul 24, 2023 · 18 comments · Fixed by #6107
@aptomaKetil

Describe the bug

Ingesting native histograms can cause a panic in counterResetInAnyBucket, which then corrupts the WAL.

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir (r248-c7fbeb7)
  2. Start OTEL Collector (8a4348cb009d1448dc647f2bdad27304b8ce53d3)
  3. Generate exponential histograms via OTEL client libs, push these to the collector and have the collector write them to Mimir via the prometheusremotewrite exporter as native histograms.

Expected behavior

No panics.

Environment

  • Infrastructure: kubernetes
  • Deployment tool: helm

Additional Context

I decided to kick the tires on the histogram downsampling (from open-telemetry/opentelemetry-collector-contrib#24026), as we'd like to start using exponential histograms from the OTEL libs. Reconfiguring the collector to write downsampled native histograms immediately led to the error below in the Mimir ingesters. After restarting, I got the same error during WAL replay and ended up having to wipe the WALs completely. I see there has been a previous PR touching counterResetInAnyBucket that looks very similar: prometheus/prometheus#12357.

Logs:

2023-07-24 10:37:12	panic: runtime error: index out of range [2] with length 2
2023-07-24 10:37:12
2023-07-24 10:37:12	goroutine 1212 [running]:
2023-07-24 10:37:12	github.com/prometheus/prometheus/tsdb/chunkenc.counterResetInAnyBucket({0x405595bd20?, 0xa624e0?, 0x4000e9bc48?}, {0x40584a9330?, 0x40584acd00?, 0x1898705c49c?}, {0x405595bd10?, 0x26e6ca0?, 0x400138e000?}, {0x40584a4558?, ...})
2023-07-24 10:37:12		/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/chunkenc/histogram.go:423 +0x130
2023-07-24 10:37:12	github.com/prometheus/prometheus/tsdb/chunkenc.(*HistogramAppender).Appendable(0x4056129c20, 0x40584d65a0)
2023-07-24 10:37:12		/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/chunkenc/histogram.go:309 +0x1c4
2023-07-24 10:37:12	github.com/prometheus/prometheus/tsdb.(*memSeries).appendHistogram(0x4055e5aea0, 0x1898705c49c, 0x40584d65a0, 0x0, {{0x26e6ca0?, 0x400138e000?}, 0x0?, 0x0?})
2023-07-24 10:37:12		/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/head_append.go:1197 +0x1d4
2023-07-24 10:37:12	github.com/prometheus/prometheus/tsdb.(*walSubsetProcessor).processWALSamples(0x40051ba158, 0x4000d6f000, 0x0?, 0x0?)
2023-07-24 10:37:12		/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/head_wal.go:627 +0x4e0
2023-07-24 10:37:12	github.com/prometheus/prometheus/tsdb.(*Head).loadWAL.func2(0x403ebf2d88?)
2023-07-24 10:37:12		/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/head_wal.go:106 +0x48
2023-07-24 10:37:12	created by github.com/prometheus/prometheus/tsdb.(*Head).loadWAL
2023-07-24 10:37:12		/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/head_wal.go:105 +0x23c
@aptomaKetil
Author

Can confirm that this also happens with mimir v2.9.0 and otel collector v0.82.0. Pushing downsampled native histograms to mimir eventually triggers this error, which causes the ingester to crash. It will then continue to crash on startup until the WAL is wiped. Every time I've tried upgrading to collector v0.82.0 to make use of the downscaling, this issue appears within 24hrs. We're producing native histograms via OTEL client libs (js/node), and the spanmetrics connector in the otel collector.

krajorama self-assigned this Sep 7, 2023
krajorama added a commit that referenced this issue Sep 7, 2023
Add experimental support for testing otel collector together with Mimir.

For reproducing #5576

Signed-off-by: György Krajcsovits <[email protected]>
@krajorama
Contributor

Hi, I've started to work on a reproduction. Is a very basic otel collector configuration, such as the one in https://github.com/grafana/mimir/pull/5968/files, sufficient? (I'll keep working at it, but I don't want to go down the wrong path.)

krajorama added a commit that referenced this issue Sep 8, 2023
* dev docker compose: add otel collector with prom remote write

Add experimental support for testing otel collector together with Mimir.

For reproducing #5576

Signed-off-by: György Krajcsovits <[email protected]>
@krajorama
Contributor

@aptomaKetil Hi, do you still have this problem? Is there any chance you could share a corrupt WAL I could analyze? I've got a simple setup going: a random observation generator (0m–5m) feeding a scale-20 exponential histogram -> otel collector -> remote write -> Mimir 2.9. But so far nothing.
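(For anyone trying to set up a similar generator: the sketch below is a rough, hypothetical Go version of the "random observations into a scale-20 exponential histogram" side of that pipeline, assuming a recent go.opentelemetry.io/otel SDK in which the exponential-bucket aggregation is exposed as sdkmetric.AggregationBase2ExponentialHistogram. Type and option names differ between SDK versions, and the endpoint and metric name are placeholders, so treat it as an illustration rather than the exact setup used in this thread.)

```go
package main

import (
	"context"
	"math/rand"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Push OTLP metrics to a local otel collector (assumed to listen on 4317),
	// which then remote-writes them to Mimir as native histograms.
	exp, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		panic(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp, sdkmetric.WithInterval(15*time.Second))),
		// Force the histogram instrument to use a base-2 exponential
		// aggregation (max scale 20) instead of explicit buckets.
		sdkmetric.WithView(sdkmetric.NewView(
			sdkmetric.Instrument{Name: "demo_duration_seconds"},
			sdkmetric.Stream{Aggregation: sdkmetric.AggregationBase2ExponentialHistogram{
				MaxSize:  160,
				MaxScale: 20,
			}},
		)),
	)
	defer provider.Shutdown(ctx)

	hist, err := provider.Meter("repro").Float64Histogram("demo_duration_seconds")
	if err != nil {
		panic(err)
	}

	// Random observations between 0 and 5 minutes, expressed in seconds.
	for {
		hist.Record(ctx, rand.Float64()*300)
		time.Sleep(100 * time.Millisecond)
	}
}
```

The collector's prometheusremotewrite exporter then converts those exponential histograms into native histograms on the Mimir side, as described in the reproduction steps above.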

@aptomaKetil
Author

Hi @krajorama I'll fire up a dev-cluster on 2.9 and see if I can still reproduce.

@aptomaKetil
Author

I can still reproduce with 2.9 and collector 0.84. I've kept the broken WAL (~135MB zipped) and can share it in private if you can suggest a sensible way of doing so. My collector config is pasted below; I've removed all trace-related config for brevity:

exporters:
  prometheusremotewrite:
    auth:
      authenticator: basicauth/mimir
    endpoint: <snip>
    target_info:
      enabled: false
extensions:
  basicauth/mimir:
    client_auth:
      password: ${mimir_password}
      username: ${mimir_username}
  health_check: {}
  memory_ballast:
    size_in_percentage: 40
processors:
  batch: {}
  filter:
    error_mode: ignore
    metrics:
      metric:
      - Len(data_points) == 0
  k8sattributes:
    extract:
      labels:
      - from: pod
        key: app.kubernetes.io/name
        tag_name: __app
      - from: pod
        key: app
        tag_name: __app
      - from: pod
        key: k8s-app
        tag_name: __app
      - from: pod
        key: app.kubernetes.io/version
        tag_name: __version
      metadata:
      - k8s.namespace.name
      - k8s.pod.name
      - k8s.node.name
    pod_association:
    - sources:
      - from: connection
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25
  resource:
    attributes:
    - action: upsert
      from_attribute: __app
      key: service.name
    - action: upsert
      from_attribute: __version
      key: service.version
    - action: insert
      from_attribute: k8s.pod.name
      key: service.instance.id
    - action: insert
      from_attribute: k8s.namespace.name
      key: service.namespace
    - action: insert
      from_attribute: k8s.node.name
      key: host.name
  transform:
    error_mode: ignore
    metric_statements:
    - context: datapoint
      statements:
      - set(attributes["cluster"], resource.attributes["deployment.environment"])
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
service:
  extensions:
  - health_check
  - basicauth/mimir
  pipelines:
    metrics:
      exporters:
      - prometheusremotewrite
      processors:
      - memory_limiter
      - filter
      - k8sattributes
      - resource
      - transform
      - batch
      receivers:
      - otlp

@krajorama
Contributor

krajorama commented Sep 13, 2023

Hi, if you have a Google account I can share a Drive folder you could upload to (or the other way around). Please drop me an email.

@krajorama
Contributor

I've just built Prometheus from main and tried to load your WAL:

ts=2023-09-13T09:55:38.993Z caller=main.go:539 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2023-09-13T09:55:38.993Z caller=main.go:583 level=info msg="Starting Prometheus Server" mode=server version="(version=2.47.0, branch=main, revision=4419399e4e831052bc9b8b4f26a8e7cf337091b0)"
ts=2023-09-13T09:55:38.993Z caller=main.go:588 level=info build_context="(go=go1.20.8, platform=linux/amd64, user=krajo@hpkrajo, date=20230913-09:47:33, tags=netgo,builtinassets,stringlabels)"
ts=2023-09-13T09:55:38.993Z caller=main.go:589 level=info host_details="(Linux 5.15.0-82-generic #91~20.04.1-Ubuntu SMP Fri Aug 18 16:24:39 UTC 2023 x86_64 hpkrajo (none))"
ts=2023-09-13T09:55:38.993Z caller=main.go:590 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2023-09-13T09:55:38.993Z caller=main.go:591 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-09-13T09:55:38.995Z caller=web.go:566 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2023-09-13T09:55:38.997Z caller=main.go:1024 level=info msg="Starting TSDB ..."
ts=2023-09-13T09:55:38.999Z caller=tls_config.go:274 level=info component=web msg="Listening on" address=[::]:9090
ts=2023-09-13T09:55:38.999Z caller=tls_config.go:277 level=info component=web msg="TLS is disabled." http2=false address=[::]:9090
ts=2023-09-13T09:55:39.003Z caller=head.go:600 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2023-09-13T09:55:39.003Z caller=head.go:681 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=4.26µs
ts=2023-09-13T09:55:39.003Z caller=head.go:689 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2023-09-13T09:55:39.568Z caller=head.go:725 level=info component=tsdb msg="WAL checkpoint loaded"
ts=2023-09-13T09:55:40.507Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=858 maxSegment=870
ts=2023-09-13T09:55:41.780Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=859 maxSegment=870
ts=2023-09-13T09:55:42.775Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=860 maxSegment=870
panic: runtime error: index out of range [26] with length 26

goroutine 494 [running]:
github.com/prometheus/prometheus/tsdb/chunkenc.counterResetInAnyBucket({0xc01625d380?, 0x0?, 0x7f05fd752d00?}, {0xc014367860, 0x1a, 0x80?}, {0xc0172a0c18, 0x3, 0x0?}, {0xc01590c2b0, ...})
	/home/krajo/go/github.com/krajorama/prometheus/tsdb/chunkenc/histogram.go:394 +0x23e
github.com/prometheus/prometheus/tsdb/chunkenc.(*HistogramAppender).appendable(0xc00a7f8000, 0xc017699b00)
	/home/krajo/go/github.com/krajorama/prometheus/tsdb/chunkenc/histogram.go:300 +0x225
github.com/prometheus/prometheus/tsdb/chunkenc.(*HistogramAppender).AppendHistogram(0xc00a7f8000, 0x18a8ad99576?, 0xc008f54f02?, 0xc017699b00, 0x0)
	/home/krajo/go/github.com/krajorama/prometheus/tsdb/chunkenc/histogram.go:637 +0x8e
github.com/prometheus/prometheus/tsdb.(*memSeries).appendHistogram(0xc007ed5380, 0x18a8ad99576, 0xc017699b00, 0x0, {0xc0001985a0?, 0x0?, 0x0?})
	/home/krajo/go/github.com/krajorama/prometheus/tsdb/head_append.go:1178 +0xdd
github.com/prometheus/prometheus/tsdb.(*walSubsetProcessor).processWALSamples(0xc00e631848, 0xc000053800, 0x0?, 0x0?)
	/home/krajo/go/github.com/krajorama/prometheus/tsdb/head_wal.go:623 +0x675
github.com/prometheus/prometheus/tsdb.(*Head).loadWAL.func2(0x0?)
	/home/krajo/go/github.com/krajorama/prometheus/tsdb/head_wal.go:106 +0x4d
created by github.com/prometheus/prometheus/tsdb.(*Head).loadWAL
	/home/krajo/go/github.com/krajorama/prometheus/tsdb/head_wal.go:105 +0x31f

As suspected, the bug is in Prometheus, not Mimir. I'm opening an issue in Prometheus to link this to.
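(Editorial note on the code path: counterResetInAnyBucket compares the bucket layouts, i.e. spans plus bucket counts, of two consecutive histogram samples to decide whether a counter reset happened. The sketch below is a heavily simplified stand-in, not the actual Prometheus implementation, but it shows one way an "index out of range" panic can arise in this kind of comparison: indexing into the new sample's bucket slice with a position derived from the old layout, without first checking that the bucket still exists, breaks as soon as spans or buckets disappear.)

```go
// Simplified illustration only (NOT the actual Prometheus code).
package main

import "fmt"

type span struct {
	offset int // gap from the previous span's end (absolute for the first span)
	length int // number of consecutive buckets in this span
}

// absoluteIndexes expands spans into the absolute index of every bucket.
func absoluteIndexes(spans []span) []int {
	var out []int
	idx := 0
	for _, s := range spans {
		idx += s.offset
		for i := 0; i < s.length; i++ {
			out = append(out, idx)
			idx++
		}
	}
	return out
}

// counterResetInAnyBucketSketch returns true if any bucket present in the old
// layout has disappeared or has a lower count in the new histogram.
func counterResetInAnyBucketSketch(oldCounts, newCounts []int64, oldSpans, newSpans []span) bool {
	oldIdx := absoluteIndexes(oldSpans)
	newIdx := absoluteIndexes(newSpans)

	// Map absolute bucket index -> position in newCounts.
	newPos := make(map[int]int, len(newIdx))
	for pos, ai := range newIdx {
		newPos[ai] = pos
	}

	for pos, ai := range oldIdx {
		np, ok := newPos[ai]
		if !ok {
			// Bucket disappeared in the new layout: treat as a reset.
			// Blindly advancing an index into newCounts here instead of
			// checking coverage is exactly the "index out of range" failure mode.
			return true
		}
		if newCounts[np] < oldCounts[pos] {
			return true
		}
	}
	return false
}

func main() {
	oldSpans := []span{{offset: 0, length: 3}}
	newSpans := []span{{offset: 0, length: 2}} // trailing bucket disappeared
	fmt.Println(counterResetInAnyBucketSketch(
		[]int64{1, 2, 3}, []int64{1, 2}, oldSpans, newSpans))
}
```

In the sketch, the coverage check (the `!ok` branch) is what stands between a clean "reset detected" answer and a slice index panic.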

@krajorama
Contributor

Attached debugger: [screenshot]

@krajorama
Contributor

Opened a PR with a unit test that triggers this case: prometheus/prometheus#12838
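(The PR's actual unit test is not reproduced here, but the shape of input that exercises this code path is roughly the following: two consecutive native-histogram samples for the same series where the second sample's positive spans no longer cover all of the buckets present in the first. The field values below are purely illustrative, using the prometheus/prometheus model/histogram types.)

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/histogram"
)

func main() {
	// First sample: two positive spans covering five buckets in total.
	h1 := &histogram.Histogram{
		Schema: 0,
		Count:  15,
		Sum:    42,
		PositiveSpans: []histogram.Span{
			{Offset: 0, Length: 3},
			{Offset: 2, Length: 2},
		},
		// PositiveBuckets are delta-encoded: absolute counts 2, 3, 4, 5, 1.
		PositiveBuckets: []int64{2, 1, 1, 1, -4},
	}

	// Second sample: the trailing span has disappeared, so it no longer
	// covers all buckets that the first sample had.
	h2 := &histogram.Histogram{
		Schema: 0,
		Count:  12,
		Sum:    40,
		PositiveSpans: []histogram.Span{
			{Offset: 0, Length: 3},
		},
		PositiveBuckets: []int64{3, 1, 1}, // absolute counts 3, 4, 5
	}

	fmt.Println(h1, h2)
}
```

Appending a sample shaped like h2 after h1 into the same chunk is the kind of transition the appendability check has to handle without reading past the shorter bucket slice.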

@krajorama
Contributor

@aptomaKetil I have a potential fix in Prometheus: prometheus/prometheus#12838. With the fix I could load the WAL into Prometheus. Next I'll have to create a Mimir build with the fix for you to test.

$ ~/go/github.com/krajorama/prometheus/prometheus --enable-feature native-histograms --config.file=prometheus.yml --storage.tsdb.path=`pwd`/db
ts=2023-09-13T13:10:13.310Z caller=main.go:206 level=info msg="Experimental native histogram support enabled."
ts=2023-09-13T13:10:13.310Z caller=main.go:539 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2023-09-13T13:10:13.310Z caller=main.go:583 level=info msg="Starting Prometheus Server" mode=server version="(version=2.47.0, branch=fix-disappearing-span-panic, revision=6d992106f61111424fdd8111d0e72f76c42939dc)"
ts=2023-09-13T13:10:13.310Z caller=main.go:588 level=info build_context="(go=go1.20.8, platform=linux/amd64, user=krajo@hpkrajo, date=20230913-13:07:27, tags=netgo,builtinassets,stringlabels)"
ts=2023-09-13T13:10:13.310Z caller=main.go:589 level=info host_details="(Linux 5.15.0-82-generic #91~20.04.1-Ubuntu SMP Fri Aug 18 16:24:39 UTC 2023 x86_64 hpkrajo (none))"
ts=2023-09-13T13:10:13.310Z caller=main.go:590 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2023-09-13T13:10:13.310Z caller=main.go:591 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-09-13T13:10:13.313Z caller=web.go:566 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2023-09-13T13:10:13.314Z caller=main.go:1024 level=info msg="Starting TSDB ..."
ts=2023-09-13T13:10:13.315Z caller=dir_locker.go:77 level=warn component=tsdb msg="A lockfile from a previous execution already existed. It was replaced" file=/home/krajo/Documents/ceges/native-histograms/issue-5576-panic/db/lock
ts=2023-09-13T13:10:13.319Z caller=tls_config.go:274 level=info component=web msg="Listening on" address=[::]:9090
ts=2023-09-13T13:10:13.320Z caller=tls_config.go:277 level=info component=web msg="TLS is disabled." http2=false address=[::]:9090
ts=2023-09-13T13:10:13.322Z caller=head.go:600 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2023-09-13T13:10:13.446Z caller=head.go:681 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=124.748151ms
ts=2023-09-13T13:10:13.446Z caller=head.go:689 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2023-09-13T13:10:14.077Z caller=head.go:725 level=info component=tsdb msg="WAL checkpoint loaded"
ts=2023-09-13T13:10:15.164Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=858 maxSegment=876
ts=2023-09-13T13:10:16.173Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=859 maxSegment=876
ts=2023-09-13T13:10:17.159Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=860 maxSegment=876
ts=2023-09-13T13:10:17.633Z caller=head_wal.go:425 level=warn component=tsdb msg="Unknown series references" samples=0 exemplars=5 histograms=0 metadata=0
ts=2023-09-13T13:10:17.633Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=861 maxSegment=876
ts=2023-09-13T13:10:17.633Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=862 maxSegment=876
ts=2023-09-13T13:10:17.634Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=863 maxSegment=876
ts=2023-09-13T13:10:17.634Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=864 maxSegment=876
ts=2023-09-13T13:10:17.635Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=865 maxSegment=876
ts=2023-09-13T13:10:17.635Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=866 maxSegment=876
ts=2023-09-13T13:10:17.635Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=867 maxSegment=876
ts=2023-09-13T13:10:17.635Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=868 maxSegment=876
ts=2023-09-13T13:10:17.635Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=869 maxSegment=876
ts=2023-09-13T13:10:17.635Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=870 maxSegment=876
ts=2023-09-13T13:10:17.636Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=871 maxSegment=876
ts=2023-09-13T13:10:17.636Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=872 maxSegment=876
ts=2023-09-13T13:10:17.636Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=873 maxSegment=876
ts=2023-09-13T13:10:17.636Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=874 maxSegment=876
ts=2023-09-13T13:10:17.636Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=875 maxSegment=876
ts=2023-09-13T13:10:17.637Z caller=head.go:760 level=info component=tsdb msg="WAL segment loaded" segment=876 maxSegment=876
ts=2023-09-13T13:10:17.637Z caller=head.go:797 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=630.328259ms wal_replay_duration=3.559738471s wbl_replay_duration=1.467µs total_replay_duration=4.314911334s
ts=2023-09-13T13:10:17.807Z caller=main.go:1045 level=info fs_type=EXT4_SUPER_MAGIC
ts=2023-09-13T13:10:17.807Z caller=main.go:1048 level=info msg="TSDB started"
ts=2023-09-13T13:10:17.807Z caller=main.go:1229 level=info msg="Loading configuration file" filename=prometheus.yml
ts=2023-09-13T13:10:17.807Z caller=main.go:1266 level=info msg="Completed loading of configuration file" filename=prometheus.yml totalDuration=547.916µs db_storage=2.934µs remote_storage=3.283µs web_handler=1.816µs query_engine=2.374µs scrape=249.828µs scrape_sd=9.01µs notify=2.864µs notify_sd=4.331µs rules=3.353µs tracing=9.08µs
ts=2023-09-13T13:10:17.807Z caller=main.go:1009 level=info msg="Server is ready to receive web requests."
ts=2023-09-13T13:10:17.807Z caller=manager.go:1009 level=info component="rule manager" msg="Starting rule manager..."

@krajorama
Contributor

@aptomaKetil I've pushed a test image under this name: grafana/mimir:krajo-test-prom-12838-on-2.9-cb50fed07
https://hub.docker.com/layers/grafana/mimir/krajo-test-prom-12838-on-2.9-cb50fed07/images/sha256-8943852fd732dd3518e65d3356956ffe6d38d5fc2be2a3748e3bcdbf85ecc195?context=repo

It is the same as 2.9, but with my proposed fix backported. From branch/PR: #6016
Please test.

@aptomaKetil
Author

@krajorama Thanks! I've rolled this out in our dev environment. Unfortunately it can take a while for the issue to show up (nearly 24hrs last time), but I'll keep you updated.

@aptomaKetil
Author

No issues to report so far, I've rolled this out for our prod metrics as well now.

@krajorama
Contributor

> No issues to report so far, I've rolled this out for our prod metrics as well now.

Thank you. I think we should wait a couple of days, which means this will probably not make it into Mimir 2.10. I'll negotiate a patch release if we can get more confirmation plus a merge into Prometheus upstream. cc release shepherds @colega @pstibrany

@aptomaKetil
Author

Still no issues to report.

@krajorama
Contributor

So far 2.10.1 is not on the cards, unless we have a security vulnerability that triggers it.

@aptomaKetil
Author

No problem, I can stay on the test-build until the next release. Thanks for looking into this!

krajorama added a commit that referenced this issue Oct 10, 2023
This brings in prometheus/prometheus#12838
Fixes #5576

Signed-off-by: György Krajcsovits <[email protected]>
krajorama added a commit that referenced this issue Oct 10, 2023
@krajorama
Contributor

We did have a security fix after all, so we are doing a 2.10.1 release, which will include this fix: https://github.com/grafana/mimir/releases/tag/mimir-2.10.1
