Delete completed local blocks when replaying wal #939

mdisibio · 2021-09-07T19:35:29Z

What this PR does:
This PR fixes #937 by deleting locally complete blocks during wal replay. It also fixes another race condition on ingester startup. Here is a description of the situation and why this should fix it.

1. An ingester may restart or crash while completing the wal file. This can happen after creating the new local block, but before deleting the wal file.
2. On restart the ingester replays the wal and rediscovers local blocks. If both exist, it ends up with 2 queue entries instead of 1. It will both recomplete the wal and reflush the local block. These occur independently and there are no consistency checks while processing a queue entry.
3. This is a race condition in multiple ways.
   a. The wal complete is still in progress, writing over top of the data while it is being flushed to the backend. The flush will reach EOF and assume all went well. This creates a partially written block in the backend.
   b. The wal complete can entirely finish before the flush. It appends another entry into completedBlocks. When the flush occurs it reads the first entry from completedBlocks and saves it as the metadata. This is the actual condition for magic number errors and incorrect meta: the flush uses the meta as rediscovered in step 2, which may have a different encoding, etc.
   c. The wal complete can start on a new block before rediscoverLocalBlocks begins. rediscoverLocalBlocks sees a bad block (missing meta) and deletes it. I think this ends up in a situation similar to (a) but with a different cause. This is handled by not rediscovering the local block if there is still a wal for it.
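
The intended startup behavior can be illustrated with a minimal sketch. This is not the code from this PR: `startupPlan`, `planStartup` and the block IDs are hypothetical names used only for illustration, and the real ingester works with wal and block types rather than plain strings. The sketch assumes the rule described above: if a wal file and a complete local block exist for the same block ID, the local block is deleted and only the wal is replayed, so each block gets exactly one queue entry.

```go
package main

import "fmt"

// startupPlan is a hypothetical summary of what the ingester would do
// with the files it finds on disk after a restart.
type startupPlan struct {
	Recomplete  []string // wal block IDs to replay and complete again
	DeleteLocal []string // local blocks deleted because a wal still exists for them
	Reflush     []string // local blocks with no wal, rediscovered and flushed
}

// planStartup decides, per block ID, on a single course of action so that
// a block never ends up with both a "complete" and a "flush" queue entry.
func planStartup(walIDs, localIDs []string) startupPlan {
	hasWAL := make(map[string]bool, len(walIDs))
	for _, id := range walIDs {
		hasWAL[id] = true
	}

	// Every wal found on disk is replayed.
	plan := startupPlan{Recomplete: append([]string(nil), walIDs...)}

	for _, id := range localIDs {
		if hasWAL[id] {
			// The wal is still present, so the local block may be
			// incomplete or stale: delete it instead of rediscovering it.
			plan.DeleteLocal = append(plan.DeleteLocal, id)
			continue
		}
		// No wal left: the local block was fully completed before the restart.
		plan.Reflush = append(plan.Reflush, id)
	}
	return plan
}

func main() {
	// "block-a" crashed between creating the local block and deleting the wal.
	// "block-b" was completed cleanly and only needs to be flushed.
	plan := planStartup([]string{"block-a"}, []string{"block-a", "block-b"})
	fmt.Printf("%+v\n", plan)
}
```

In this sketch `block-a` is only recompleted from its wal and `block-b` is only reflushed, which removes the duplicate queue entries that make the races in (a) through (c) possible.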

A couple more notes:

Enabling search increased the risk of these errors because it added more work between creating the local block and deleting the wal in step 1.
We've also seen issues with partially flushed blocks like 3a and 3b, but afaik there isn't an issue tracking them.

Which issue(s) this PR fixes:
Fixes #937

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

mdisibio added 2 commits September 7, 2021 15:18

Delete completed local blocks when replaying wal, to fix issues with corrupt/broken blocks after restarts/crashes

5d01018

Signed-off-by: Martin Disibio <[email protected]>

changelog

f162538

Signed-off-by: Martin Disibio <[email protected]>

mdisibio requested review from annanay25, dgzlopes, joe-elliott, kvrhdn and mapno as code owners September 7, 2021 19:35

joe-elliott approved these changes Sep 7, 2021

modules/ingester/ingester.go (outdated, resolved)

lint

2d090ac

Signed-off-by: Martin Disibio <[email protected]>

annanay25 approved these changes Sep 8, 2021

mdisibio added 2 commits September 8, 2021 08:50

Fix additional race condition where rediscoverLocalBlocks could delete a wal file in the middle of replay. Fix nit

45b568d

Signed-off-by: Martin Disibio <[email protected]>

Fix race condition in flushqueues that appeared when calling ingester.stopping() in a test

c390fb9

Signed-off-by: Martin Disibio <[email protected]>

mdisibio requested review from annanay25 and joe-elliott September 8, 2021 12:52

mdisibio merged commit f57d781 into grafana:main Sep 8, 2021

mdisibio added the type/bug (Something isn't working) label Sep 9, 2021

mdisibio deleted the wal-replay-delete-complete-blocks branch September 15, 2021 13:16
