store: "invalid size" when decoding postings. #6545
Comments
@yeya24 maybe you've seen this on the Cortex side?
Was able to repro this consistently, let me try to fix it.
Added repro here with prod data #6575, not sure how this data got generated in the first place. In one instance …
@ryan-suther do you use Redis caching in prod perhaps?
@GiedriusS No, we do not. We use a normal PVC. Our …
Somehow fixing the following data race also fixed this problem for me: #6575.
Still reproducible, unfortunately, trying to fix it. Breaks long-term querying with index cache :'/
Snappy works at the byte level and can cut two different chunks in the middle of a varint. Thus, if there's an error from the Decbuf, fill up the buffer and try reading the varint again. Added repro test. Closes thanos-io#6545. Signed-off-by: Giedrius Statkevičius <[email protected]>
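For context, here is a minimal standalone sketch (not the actual store gateway code) of the failure mode that commit message describes: `binary.Uvarint` cannot decode a varint that is cut off at a chunk boundary, so the reader has to fill the buffer with the next chunk and retry. `readUvarint` and `next` are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// readUvarint reads one uvarint from buf, refilling buf from next() whenever
// the varint is cut off at a chunk boundary (binary.Uvarint returns n == 0).
func readUvarint(buf *[]byte, next func() ([]byte, bool)) (uint64, error) {
	for {
		v, n := binary.Uvarint(*buf)
		if n > 0 {
			*buf = (*buf)[n:] // consume the decoded bytes
			return v, nil
		}
		if n < 0 {
			return 0, fmt.Errorf("uvarint overflows 64 bits")
		}
		// n == 0: the buffer ends in the middle of a varint; fetch more bytes.
		chunk, ok := next()
		if !ok {
			return 0, fmt.Errorf("unexpected end of stream")
		}
		*buf = append(*buf, chunk...)
	}
}

func main() {
	// Encode a value whose varint spans two bytes, then split it across two
	// chunks, as a framed compressor can do.
	var enc [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(enc[:], 300) // 300 encodes as 0xAC 0x02
	chunks := [][]byte{enc[:1], enc[1:n]}

	i := 0
	next := func() ([]byte, bool) {
		if i >= len(chunks) {
			return nil, false
		}
		c := chunks[i]
		i++
		return c, true
	}

	var buf []byte
	v, err := readUvarint(&buf, next)
	fmt.Println(v, err) // prints: 300 <nil>
}
```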
Will finally be fixed in #6602
@ylazaar IIUC it is a different issue, as the error is from the compactor, not the store gateway. Would you mind creating a separate issue?
Thanos, Prometheus and Golang version used:
Thanos: main@dc337b23
Prometheus: v0.44.1
Golang: 1.20.2
thanos, version 0.32.0-dev (branch: HEAD, revision: dc337b2)
build user: rsuther@
build date: 20230720-18:24:06
go version: go1.20.2
platform: darwin/amd64
tags: netgo
Object Storage Provider:
Ceph (S3)
What happened:
We were on v0.30.2 and upgraded to v0.31.0, and experienced high memory usage while rolling out new deployments. We then upgraded to the commit above for the high-memory-usage fix and began experiencing the error below.
Multiple refreshes of the Grafana dashboard produced different block IDs.
What you expected to happen:
Thanos to operate normally.
How to reproduce it (as minimally and precisely as possible):
We are not able to reliably reproduce the error. I was able to download one of the reported blocks and run some e2e tests using the original query from the dashboard, and was not able to reproduce the issue. I did, however, get good results from the query.
In the test, I used a filesystem bucket with the reported block and an in-memory index cache. The test did the following (a rough sketch of the flow follows the list):
1. Start the store and run InitialSync().
2. Submit the query via a store.Series(...) call.
3. Run store.Sync(...).
4. Run store.Series(...) again with the same query.

Resulting series looked good, no warnings.
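For reference, here is a rough sketch of submitting the same Series query twice. It drives the Store gRPC API against an already-running store gateway rather than the in-process BucketStore (with a filesystem bucket and in-memory index cache) that the actual test used; the address, time range, and matcher are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	"github.com/thanos-io/thanos/pkg/store/storepb"
)

// querySeries submits one Series request and drains the response stream.
func querySeries(ctx context.Context, c storepb.StoreClient) error {
	stream, err := c.Series(ctx, &storepb.SeriesRequest{
		MinTime: time.Now().Add(-24 * time.Hour).UnixMilli(),
		MaxTime: time.Now().UnixMilli(),
		Matchers: []storepb.LabelMatcher{
			// Placeholder matcher; the real test used the original dashboard query.
			{Type: storepb.LabelMatcher_EQ, Name: "__name__", Value: "up"},
		},
	})
	if err != nil {
		return err
	}
	for {
		resp, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			// An "invalid size" decoding error from the store would surface here
			// or as a warning, depending on how the store reports it.
			return err
		}
		if w := resp.GetWarning(); w != "" {
			fmt.Println("warning:", w)
		}
	}
}

func main() {
	conn, err := grpc.Dial("localhost:10901", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := storepb.NewStoreClient(conn)
	ctx := context.Background()

	// Run the same query twice, mirroring the two Series calls in the test.
	for i := 0; i < 2; i++ {
		if err := querySeries(ctx, client); err != nil {
			log.Fatalf("Series call %d failed: %v", i+1, err)
		}
	}
	fmt.Println("both Series calls completed without error")
}
```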
Full logs to relevant components:
The error above is not logged. All other logging for the block(s) appeared normal.
Anything else we need to know: