
Ingesters can start in broken state and cannot write traces to disk for one or more tenants #3346

Closed
mdisibio opened this issue Jan 29, 2024 · 1 comment · Fixed by #3358
Labels
type/bug Something isn't working

Comments

@mdisibio
Contributor

Describe the bug
After a crash or unclean restart (OOM or readiness probe failure), an ingester can enter a bad state in which it is unable to write traces to the headblock. It continually logs the following error and cannot recover on its own. The error means that all incoming data for the affected tenant is lost because it cannot be persisted to disk.

msg="failed to cut traces" err="error writing meta json: open /var/tempo/wal/<guid>+<tenant>+vParquet3/meta.json: no such file or directory"

RF=3 helps mitigate this issue because it prevents data loss when a single ingester fails. However, it does not help when 2 or more ingesters are affected.
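The shape of the error can be reproduced in isolation. The following is a minimal standalone sketch (not Tempo code; the directory name is hypothetical) showing that writing meta.json into a WAL block directory that has been removed out from under the writer yields exactly this "no such file or directory" error:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical WAL block directory in the <guid>+<tenant>+vParquet3 layout.
	blockDir := filepath.Join(os.TempDir(), "wal", "0000-guid+tenant-a+vParquet3")
	if err := os.MkdirAll(blockDir, 0o755); err != nil {
		panic(err)
	}

	// Simulate the block directory being deleted out from under the flush path.
	if err := os.RemoveAll(blockDir); err != nil {
		panic(err)
	}

	// Simulate the flush path persisting meta.json afterwards.
	meta := []byte(`{"version":"vParquet3"}`)
	err := os.WriteFile(filepath.Join(blockDir, "meta.json"), meta, 0o644)
	fmt.Println(err) // open .../meta.json: no such file or directory
}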

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo main
  2. Ingesters get killed for various reasons: OOMs or readiness probe failures.
  3. Large numbers of tenants and pressure on the read path (such as search and autocomplete lookups) increase the likelihood of this.
  4. Eventually an ingester will start in this state for 1 or more tenants.

Expected behavior
Ingesters are able to recover from all crashes and unclean restarts. Existing data in memory or partially written blocks might be lost, but the ingester should always be able to receive new data.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: jsonnet

Additional Context

@mdisibio mdisibio added the type/bug Something isn't working label Jan 29, 2024
@mdisibio
Contributor Author

mdisibio commented Feb 1, 2024

Updating findings: In one case the issue was caused by the ingester replaying (and deleting) the active headblock (it logs "failed to replay block. removing."). The "no such file or directory" error then happens on the next flush because the headblock directory is gone. Receiving traffic and flushing should not happen until replay is complete, so the issue seems to be related to timing on startup.
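For illustration only, here is a hypothetical sketch (not Tempo's actual implementation; all names are assumptions) of the kind of startup-ordering guard this implies: the flush loop waits until WAL replay has finished before it is allowed to cut traces, so a block removed during replay cannot be flushed to afterwards.

package main

import (
	"log"
	"time"
)

type ingester struct {
	replayDone chan struct{} // closed once WAL replay has completed
}

func (i *ingester) replayWAL() {
	// ... replay (and possibly remove) on-disk blocks here ...
	time.Sleep(100 * time.Millisecond) // stand-in for real replay work
	close(i.replayDone)
}

func (i *ingester) flushLoop() {
	// Block until replay has completed before doing any flush work.
	<-i.replayDone
	for range time.Tick(time.Second) {
		log.Println("cutting traces to headblock (replay finished)")
		// ... cut traces / write meta.json here ...
	}
}

func main() {
	i := &ingester{replayDone: make(chan struct{})}
	go i.replayWAL()
	go i.flushLoop()
	time.Sleep(2 * time.Second)
}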
