Delete completed local blocks when replaying wal #939

Merged
mdisibio merged 5 commits into grafana:main from wal-replay-delete-complete-blocks on Sep 8, 2021

@mdisibio mdisibio commented Sep 7, 2021

What this PR does:
This PR fixes #937 by deleting completed local blocks during wal replay, and also fixes another race condition on ingester startup. Here is a description of the situation and why this change should fix it.

  1. An ingester may restart or crash while completing the wal file. It can happen after creating the new local block, but before deleting the wal file.
  2. On restart the ingester replays the wal and rediscovers local blocks. Because both exist for the same block, it ends up with 2 queue entries instead of 1. It will both recomplete the wal and reflush the local block. These operations run independently, and there are no consistency checks while processing a queue entry.
  3. This is a race condition in multiple ways.
    a. The wal complete can still be in progress, writing over the top of the local block's data while that block is being flushed to the backend. The flush will reach EOF and assume all went well. This creates a partially written block in the backend.
    b. The wal complete can entirely finish before the flush. It appends another entry into completedBlocks. When the flush occurs it reads the first entry from completedBlocks and saves it as the meta data. This is the actual condition for magic number errors and incorrect meta. It uses the meta as rediscovered in step 2, which may be a different encoding, etc.
    c. The wal complete can start on a new block before rediscoverLocalBlocks begins. rediscoverLocalBlocks sees a bad block (missing meta) and deletes it. I think this ends up in a situation similar to (a) but with a different cause. This is handled by not rediscovering the local block if there is still a wal for it.
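
The startup rule described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `startupPlan`, `planStartup`, and `queueEntries` are hypothetical names, and block IDs are plain strings. It assumes that when a wal file and a completed local block share an ID, the local block is deleted and only the wal is replayed, so each block ends up with exactly one queue entry.

```go
package main

// startupPlan is a hypothetical summary of the startup decisions.
// It is not a type from the Tempo codebase.
type startupPlan struct {
	Replay     []string // wal block IDs to replay and complete again
	Delete     []string // completed local block IDs to delete before replay
	Rediscover []string // completed local block IDs that are safe to flush
}

// planStartup reconciles wal files and completed local blocks by block ID.
// A local block that still has a wal is treated as possibly incomplete:
// it is deleted instead of rediscovered, and the wal is replayed.
func planStartup(walIDs, localIDs []string) startupPlan {
	hasWAL := make(map[string]bool, len(walIDs))
	for _, id := range walIDs {
		hasWAL[id] = true
	}

	// Every wal file is replayed; copy the slice so the caller's data is untouched.
	plan := startupPlan{Replay: append([]string(nil), walIDs...)}

	for _, id := range localIDs {
		if hasWAL[id] {
			// Both exist: drop the local block so only one queue entry remains.
			plan.Delete = append(plan.Delete, id)
			continue
		}
		plan.Rediscover = append(plan.Rediscover, id)
	}
	return plan
}

// queueEntries returns how many entries would be enqueued: one per block.
func (p startupPlan) queueEntries() int {
	return len(p.Replay) + len(p.Rediscover)
}
```

For example, with wal files for blocks `a` and `b` and completed local blocks `b` and `c`, the plan replays `a` and `b`, deletes the local copy of `b`, and rediscovers only `c`, giving three queue entries for three blocks.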

A couple more notes:

  • Enabling search increased the risk of these errors because it added more work between creating the local block and deleting the wal in step 1.
  • We've also seen issues with partially flushed blocks like 3a and 3b, but afaik there isn't an issue tracking them.

Which issue(s) this PR fixes:
Fixes #937

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

…corrupt/broken blocks after restarts/crashes

Signed-off-by: Martin Disibio <[email protected]>
…e a wal file in the middle of replay. Fix nit

Signed-off-by: Martin Disibio <[email protected]>
@mdisibio mdisibio merged commit f57d781 into grafana:main Sep 8, 2021
@mdisibio mdisibio added the type/bug Something isn't working label Sep 9, 2021
@mdisibio mdisibio deleted the wal-replay-delete-complete-blocks branch September 15, 2021 13:16
Successfully merging this pull request may close these issues.

"Magic number error"s after changing compression format on the ingesters
3 participants