
Ingester readonly on startup until replay and rediscover is done to prevent broken head blocks #3358

Merged: 2 commits into grafana:main on Feb 2, 2024

Conversation

@mdisibio (Contributor) commented Feb 2, 2024

What this PR does:
This is one of those bugs that makes you wonder "how did this ever work?" This PR changes the ingester to enter a read-only state on startup (similar to shutdown) until WAL replay and local block rediscovery are complete. This prevents pushes from creating head blocks that get tangled up in (and broken by) the replay process. I really like having the startup error be different from the shutting-down error: it makes it easy to check the logs for "Ingester is starting" and verify the fix is working.
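
To illustrate the idea, here is a minimal sketch of the gate: refuse pushes while the ingester is still starting, and only flip to read-write once replay and rediscovery have finished. This is not the actual change; identifiers like ingesterState, errStarting, replayWAL, and rediscoverLocalBlocks are illustrative, not Tempo's.

```go
// Minimal sketch of gating pushes during startup. Not Tempo's actual code.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

type ingesterState int32

const (
	stateStarting ingesterState = iota // WAL replay / block rediscovery still in progress
	stateRunning                       // ready to accept pushes
	stateStopping                      // shutting down, already read-only today
)

var (
	errStarting = errors.New("Ingester is starting")      // new, distinct startup error
	errStopping = errors.New("Ingester is shutting down") // existing shutdown error
)

type ingester struct {
	state atomic.Int32 // holds an ingesterState; zero value is stateStarting
}

// push stands in for the PushBytes entry point: refuse writes until startup has
// finished, so no head block can be created while replay is still scanning the WAL.
func (i *ingester) push(data []byte) error {
	switch ingesterState(i.state.Load()) {
	case stateStarting:
		return errStarting
	case stateStopping:
		return errStopping
	}
	// ... append to the per-tenant head block here ...
	return nil
}

// starting runs WAL replay and local block rediscovery, then flips the
// ingester to read-write.
func (i *ingester) starting() error {
	if err := replayWAL(); err != nil {
		return err
	}
	if err := rediscoverLocalBlocks(); err != nil {
		return err
	}
	i.state.Store(int32(stateRunning))
	return nil
}

func replayWAL() error             { return nil } // placeholder
func rediscoverLocalBlocks() error { return nil } // placeholder

func main() {
	ing := &ingester{}
	fmt.Println(ing.push(nil)) // "Ingester is starting" until starting() completes
	_ = ing.starting()
	fmt.Println(ing.push(nil)) // <nil>: writes accepted after replay + rediscovery
}
```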

Description of the Bug
Here is a full sequence of the bug:

  1. An ingester unexpectedly terminates (OOM, panic, readiness probe failure, etc.)
  2. The ingester restarts and begins WAL replay
  3. Because it didn't leave the ring gracefully and propagating the new state takes some time, it still appears HEALTHY to some or all of the distributors
  4. The distributors push traffic to it while it is replaying
  5. PushBytes works (unexpectedly) and creates a head block <--- This is where we fix it
  6. The head block gets picked up by the replay process and deleted
  7. All pushes to that head block fail and the ingester never recovers

Steps to Reproduce

  1. Requires a distributed setup with separate distributors, ingesters, and replication factor >= 2
  2. While continuously pushing traffic, quickly kill and restart a single ingester (docker kill or similar)
  3. Only kill 1 ingester, so that the distributor still sees the minimum number of replicas and keeps pushing traffic
  4. Eventually this will trigger the bug

Which issue(s) this PR fixes:
Fixes #3346

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@joe-elliott (Member) left a comment

so during the period of startup/shutdown there will be one ingester always failing, meaning reduced availability. we're doing it now as well, but only during shutdown.

i am concerned that during a k8s rollout of a large cluster, if one ingester is starting up and one is shutting down, we'll start failing writes.

is there any way to just not raise the healthy flag in the ring until we've done WAL replay?

@mdisibio (Contributor, Author) commented Feb 2, 2024

is there any way to just not raise the healthy flag in the ring until we've done WAL replay?

It already doesn't raise the healthy flag, but the issue is the non-zero propagation time during which the distributors continue sending traffic. We should expect to see a short burst of "Ingester is starting" errors until the distributors catch up. But I can dig more into this area and double-check things.

Edit: Wanted to add a bit more: when an ingester dies and restarts quickly, it immediately receives writes on the same IP/port, because its ring state hasn't propagated to the distributors yet. I don't think any amount of ring manipulation would fix that. We'd have to do something else, like not starting gRPC.
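
For reference, a minimal sketch of that alternative (not opening the gRPC listener until replay is done), using a generic grpc-go server rather than Tempo's actual module wiring; the port and helper names here are assumptions:

```go
// Sketch of the "don't start gRPC until replay is done" alternative. This is not
// how Tempo wires its modules; the port and the replay helpers are placeholders.
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func replayWAL() error             { return nil } // placeholder
func rediscoverLocalBlocks() error { return nil } // placeholder

func main() {
	// Finish WAL replay and local block rediscovery before the port is even open,
	// so distributors that still think this ingester is HEALTHY get connection
	// errors instead of successful (and dangerous) pushes.
	if err := replayWAL(); err != nil {
		log.Fatal(err)
	}
	if err := rediscoverLocalBlocks(); err != nil {
		log.Fatal(err)
	}

	lis, err := net.Listen("tcp", ":9095") // assumed gRPC port
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	// ... register the ingester's gRPC service on srv here ...
	log.Fatal(srv.Serve(lis))
}
```

The PR itself takes the simpler route of gating pushes in the handler and returning the "Ingester is starting" error instead.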

@joe-elliott (Member)

It already doesn't raise the healthy flag, but the issue is the non-zero propagation time during which the distributors continue sending traffic.

Oh, I see. It's explained in your original steps. In a normal rollout this shouldn't even occur.

@mdisibio merged commit c1f9fd9 into grafana:main on Feb 2, 2024
14 checks passed
Successfully merging this pull request may close: Ingesters can start in broken state and cannot write traces to disk for one or more tenants