compaction halt when "overlapping sources detected for plan" #5806

Closed
humblebundledore opened this issue Mar 7, 2024 · 7 comments · Fixed by #5854

@humblebundledore

Describe the bug
The compactor halts compaction when hitting an "overlapping sources detected for plan" level=error.

Since the plan is retried indefinitely, no new blocks get compacted, and the only solution is to mark the offending blocks as no-compact using thanos tools bucket (see the sketch below).

Although we are using the skip_blocks_with_out_of_order_chunks_enabled: true configuration, the block is not being marked as no-compact (possibly because the root cause is something other than out-of-order chunks).
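A minimal sketch of that workaround, assuming the no-compact-mark.json marker schema from the Thanos metadata package (id, version, details, no_compact_time, reason); the schema and the hypothetical marknocompact helper are assumptions worth double-checking against your Thanos/Cortex version:

```go
// marknocompact: print a no-compact-mark.json body for a block.
// Uploading the output to <block-ulid>/no-compact-mark.json in the bucket
// excludes the block from compaction; `thanos tools bucket mark
// --marker no-compact-mark.json --id <block-ulid>` does the same thing
// directly against the bucket.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// Assumed marker schema, mirroring Thanos' metadata.NoCompactMark.
type noCompactMark struct {
	ID            string `json:"id"`
	Version       int    `json:"version"`
	Details       string `json:"details,omitempty"`
	NoCompactTime int64  `json:"no_compact_time"`
	Reason        string `json:"reason"`
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: marknocompact <block-ulid>")
		os.Exit(1)
	}
	out, err := json.Marshal(noCompactMark{
		ID:            os.Args[1],
		Version:       1,
		Details:       "overlapping sources detected for plan",
		NoCompactTime: time.Now().Unix(),
		Reason:        "manual",
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(string(out))
}
```

Note this only unblocks compaction for the remaining blocks; the marked blocks themselves stay uncompacted.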

To Reproduce
Unable to reproduce for now; we simply noticed it in our Cortex environment.

ts=2024-03-07T09:35:55.633278609Z caller=compactor.go:712 level=error component=compactor msg="failed to compact user blocks" user=tenant-1 err="compaction: group 0@1434040103434464048: failed to run pre compaction callback for plan: [01HF36TN8MEB08EXJSK528JHNN (min time: 1699833600000, max time: 1699840800000) 01HF3C12751TH3HFAMKADKNQCX (min time: 1699833600000, max time: 1699840800000) 01HF3C149T2NT6YZ3MDCATMJHE (min time: 1699833600000, max time: 1699840800000) 01HF3C2V5DNFJ8KN93JK99XQSH (min time: 1699833600000, max time: 1699840800000) 01HF3C1238ZQAQD4B18GCZDE8A (min time: 1699833600000, max time: 1699840800000)]: overlapping sources detected for plan [01HF36TN8MEB08EXJSK528JHNN (min time: 1699833600000, max time: 1699840800000) 01HF3C12751TH3HFAMKADKNQCX (min time: 1699833600000, max time: 1699840800000) 01HF3C149T2NT6YZ3MDCATMJHE (min time: 1699833600000, max time: 1699840800000) 01HF3C2V5DNFJ8KN93JK99XQSH (min time: 1699833600000, max time: 1699840800000) 01HF3C1238ZQAQD4B18GCZDE8A (min time: 1699833600000, max time: 1699840800000)]"

Expected behavior
Unsure what the expected behavior should be, but a skip_blocks_ style option should be provided to continue compaction.

Environment:

  • Infrastructure: kubernetes
  • Deployment tool: helm
  • Cortex version: cortex:v1.16.0-rc.0

yeya24 commented Mar 12, 2024

Can you guys help take a look?
Maybe @danielblando @alexqyle

@alexqyle

The error is overlapping sources detected for plan, which cannot be skipped by the skip_blocks_with_out_of_order_chunks_enabled config.

Based on the log, all 5 blocks in this compaction plan have min time: 1699833600000 and max time: 1699840800000. They probably share some common source blocks, in which case they are considered overlapping blocks. Here is the code doing this overlap check: https://github.com/cortexproject/cortex/blob/master/vendor/github.com/thanos-io/thanos/pkg/compact/compact.go#L817

Could you please check the meta.json of those blocks to validate whether there are common source blocks among them? A quick way to do that is sketched below.
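In case it helps, a minimal sketch of that check, assuming the block directories (each containing its meta.json, with source ULIDs listed under compaction.sources as in the TSDB block format) have been downloaded locally; it prints every source that appears in more than one block, which is the condition the planner rejects as overlapping. The checksources name is just illustrative:

```go
// checksources: report source ULIDs shared by several blocks.
// Usage: checksources <block-dir> [<block-dir> ...]
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Only the meta.json fields needed for the overlap check.
type blockMeta struct {
	ULID       string `json:"ulid"`
	Compaction struct {
		Sources []string `json:"sources"`
	} `json:"compaction"`
}

func main() {
	seen := map[string][]string{} // source ULID -> blocks listing it
	for _, dir := range os.Args[1:] {
		raw, err := os.ReadFile(filepath.Join(dir, "meta.json"))
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		var meta blockMeta
		if err := json.Unmarshal(raw, &meta); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		for _, src := range meta.Compaction.Sources {
			seen[src] = append(seen[src], meta.ULID)
		}
	}
	for src, blocks := range seen {
		if len(blocks) > 1 {
			fmt.Printf("source %s shared by blocks %v\n", src, blocks)
		}
	}
}
```

Any output for the blocks in the failing plan would confirm the overlapping-sources condition.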

@friedrichg

I had this exact problem today; it happened because 2 compactors had been running against the same S3 bucket for the same user for hours.
ts=2024-03-14T17:21:42.234667096Z caller=compactor.go:712 level=error component=compactor msg="failed to compact user blocks" user=x err="compaction: group 0@4082378620489593290: failed to run pre compaction callback for plan: [01HRY6GWZ39ZXGPRYR18FV7YQ7 (min time: 1710396000000, max time: 1710403200000) 01HRY538CENVKAQ7G8BA77WJ2X (min time: 1710396000000, max time: 1710403200000)]: overlapping sources detected

yeya24 commented Mar 14, 2024

Just want to double-check. @friedrichg @AlexandreRoux, do you enable the out-of-order samples feature?

@friedrichg

@yeya24 No, we don't. We also don't use shuffle sharding in compactors yet. (Cortex v1.16.0)
It's literally caused by running 2 compactors for the same user. We did that by mistake.

yeya24 commented Mar 21, 2024

I think this might happen if out-of-order samples are enabled, because a single block might be compacted twice and get uploaded to the bucket.
I wonder if it might be related to the shuffle-sharding compactor, where two compactors compact the same block.

yeya24 commented Apr 25, 2024

https://github.com/cortexproject/cortex/releases/tag/v1.17.0-rc.0
The latest version of Cortex is out and it should fix the problem.
