Thanos-Store empty blocks in local storage #1610

Closed

KarstenSiemer opened this issue Oct 7, 2019 · 7 comments

Comments

@KarstenSiemer

Hi there!
I am using the quay.io/thanos/thanos:v0.7.0 container and I am experiencing problems with the store component.
The store is missing metadata for some of its blocks in its local storage, even though the metadata exists in the S3 bucket.
Store log:

 level=warn ts=2019-10-07T07:15:12.145791006Z caller=bucket.go:325 msg="error parsing block range" block=01DPJ5368THP909JKH2DW72JJM err="read meta.json: open /thanos-store-data/01DPJ5368THP909JKH2DW72JJM/meta.json: no such file or directory"

S3 bucket ls:

2019-10-07 03:47 536864293   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000001
2019-10-07 03:48 536857676   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000002
2019-10-07 03:48 536860520   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000003
2019-10-07 03:48 536864881   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000004
2019-10-07 03:48 536863844   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000005
2019-10-07 03:48 536865851   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000006
2019-10-07 03:49 536771867   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000007
2019-10-07 03:49 536685857   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000008
2019-10-07 03:49 536868988   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000009
2019-10-07 03:49 536868010   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000010
2019-10-07 03:49 536868033   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000011
2019-10-07 03:50 536869076   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000012
2019-10-07 03:50 536870302   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000013
2019-10-07 03:50 516198435   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/chunks/000014
2019-10-07 03:50 519912306   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/index
2019-10-07 03:50  13922915   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/index.cache.json
2019-10-07 03:50      1997   s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/meta.json

The meta.json and index files are actually missing when I look into the store component's data directory for that block. In the querier's web UI the store looks healthy and also reports the correct min and max time ranges. When I restart the store, it comes back up healthy, the metadata for the previously faulty blocks is present again, and the data is queryable.
But eventually the data goes missing again and holes show up in the graphs produced by the queriers.
A restart always fixes that. This only recently started after updating to version v0.7.0.
What might be important to note here is that I run a daily bucket verify job on the bucket while the compactor is still running.
But the bucket verify is always configured without the repair flag.
Restarting a store and then running the verifier does not cause holes.
I cannot manually reproduce the problem; it only eventually happens after some time. I'd be very thankful for any help.
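
For reference, this is roughly how I cross-check such a block (the block ULID and paths are taken from the output above; the S3 client is just an example and bucket.yaml stands in for my actual objstore config):

 # block contents in the bucket (complete, meta.json present)
 aws s3 ls s3://de.y6b.system.prometheus/01DPJ5368THP909JKH2DW72JJM/

 # same block in the store's local data directory (meta.json and index missing)
 ls /thanos-store-data/01DPJ5368THP909JKH2DW72JJM/

 # daily verification run, deliberately without --repair
 thanos bucket verify --objstore.config-file=bucket.yaml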

@bwplotka
Member

bwplotka commented Oct 7, 2019

Thanks for this. Do you have a persistent volume? It really looks like the issue we fixed recently, which will be released soon: https://github.com/thanos-io/thanos/blob/master/CHANGELOG.md#fixed

Can you try running master? E.g. master-2019-10-06-bb1ac398
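
Something along these lines should work (the deployment and container names below are only placeholders for your setup):

 docker pull quay.io/thanos/thanos:master-2019-10-06-bb1ac398

 # or on Kubernetes, e.g.:
 kubectl set image deployment/thanos-store thanos=quay.io/thanos/thanos:master-2019-10-06-bb1ac398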

@bwplotka
Member

bwplotka commented Oct 7, 2019

Next release is this week (:

@FUSAKLA
Member

FUSAKLA commented Oct 7, 2019

Hi, I wonder if this is related to this issue: #1504?

It's interesting that it gets fixed after a restart. Do you have persistent storage on that store? In my case it persisted after restart, so I added a check to erase malformed blocks. It got merged after 0.7.0 was released IIRC; could you try a recent master?

@FUSAKLA
Member

FUSAKLA commented Oct 7, 2019

Hah, Bartek was faster :)

I still wonder how those malformed blocks come to exist in the first place.

@bwplotka
Member

bwplotka commented Oct 7, 2019

It's quite straightforward. Check #1505 (review) for an explanation.

@KarstenSiemer
Author

Thanks for the quick response!
I do not use a persistent volume; the data is stored in an emptyDir.
Should I rather add a persistent volume to the store? I figured that it is unnecessary, since I have persistence inside S3. I have roughly 4 TB of metrics in total in my S3 bucket. Keeping data inside the store after a restart didn't seem like a good use of resources, since pods rarely restart in my cluster.
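
For context, my store runs roughly like this, with --data-dir pointing at the emptyDir mount (the path comes from the log above; the objstore config path is a placeholder):

 thanos store \
   --data-dir=/thanos-store-data \
   --objstore.config-file=/etc/thanos/bucket.yaml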
I will try master and come back to you guys if it happens again.
Thanks so much 👍

@KarstenSiemer
Author

Just for readers who have run into this problem: since upgrading to version v0.8.1, I have not experienced this problem again.
