
Ingesters can start in broken state and cannot write traces to disk for one or more tenants #3346

Closed
mdisibio opened this issue Jan 29, 2024 · 1 comment · Fixed by #3358
Labels
type/bug Something isn't working

Comments

@mdisibio
Contributor

Describe the bug
After a crash or unclean restart (OOM or readiness probe failure), an ingester can enter a bad state in which it is unable to write traces to the headblock. It continually logs the following error and cannot recover on its own. The error means that all incoming data for the affected tenant is lost because it cannot be persisted to disk.

msg="failed to cut traces" err="error writing meta json: open /var/tempo/wal/<guid>+<tenant>+vParquet3/meta.json: no such file or directory"

RF=3 helps mitigate this issue because it prevents data loss when a single ingester fails. However, it does not help when 2 or more ingesters are affected.
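The shape of the error can be reproduced in isolation. The following is a minimal standalone sketch (not Tempo code; the directory name is hypothetical) showing that writing meta.json into a WAL block directory that has been removed out from under the writer yields exactly this "no such file or directory" error:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical WAL block directory in the <guid>+<tenant>+vParquet3 layout.
	blockDir := filepath.Join(os.TempDir(), "wal", "0000-guid+tenant-a+vParquet3")
	if err := os.MkdirAll(blockDir, 0o755); err != nil {
		panic(err)
	}

	// Simulate the block directory being deleted out from under the flush path.
	if err := os.RemoveAll(blockDir); err != nil {
		panic(err)
	}

	// Simulate the flush path persisting meta.json afterwards.
	meta := []byte(`{"version":"vParquet3"}`)
	err := os.WriteFile(filepath.Join(blockDir, "meta.json"), meta, 0o644)
	fmt.Println(err) // open .../meta.json: no such file or directory
}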

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo main
  2. Ingesters get killed for various reasons: OOMs or readiness probe failures.
  3. Large numbers of tenants and pressure on the read path (such as search and autocomplete lookups) increase the likelihood of this.
  4. Eventually an ingester will start in this state for 1 or more tenants.

Expected behavior
Ingesters are able to recover from all crashes and unclean restarts. Existing data in memory or partially written blocks might be lost, but the ingester should always be able to receive new data.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: jsonnet

Additional Context

@mdisibio mdisibio added the type/bug Something isn't working label Jan 29, 2024
@mdisibio
Contributor Author

mdisibio commented Feb 1, 2024

Updating findings: In one case the issue was caused by the ingester replaying (and deleting) the active headblock (it logs "failed to replay block. removing."). The "no such file or directory" error then happens on the next flush because the headblock directory is gone. Receiving traffic and flushing should not happen until replay is complete, so the issue seems to be related to timing on startup.
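For illustration only, here is a hypothetical sketch (not Tempo's actual implementation; all names are assumptions) of the kind of startup-ordering guard this implies: the flush loop waits until WAL replay has finished before it is allowed to cut traces, so a block removed during replay cannot be flushed to afterwards.

package main

import (
	"log"
	"time"
)

type ingester struct {
	replayDone chan struct{} // closed once WAL replay has completed
}

func (i *ingester) replayWAL() {
	// ... replay (and possibly remove) on-disk blocks here ...
	time.Sleep(100 * time.Millisecond) // stand-in for real replay work
	close(i.replayDone)
}

func (i *ingester) flushLoop() {
	// Block until replay has completed before doing any flush work.
	<-i.replayDone
	for range time.Tick(time.Second) {
		log.Println("cutting traces to headblock (replay finished)")
		// ... cut traces / write meta.json here ...
	}
}

func main() {
	i := &ingester{replayDone: make(chan struct{})}
	go i.replayWAL()
	go i.flushLoop()
	time.Sleep(2 * time.Second)
}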
