Describe the bug
After a crash or unclean restart (e.g. an OOM kill or a readiness probe failure), an ingester can enter a bad state where it is unable to write traces to the head block. It continually logs the following error and cannot recover on its own. While in this state, all incoming data for the affected tenant is lost because it cannot be persisted to disk.
msg="failed to cut traces" err="error writing meta json: open /var/tempo/wal/<guid>+<tenant>+vParquet3/meta.json: no such file or directory"
RF=3 helps mitigate this issue because replicas on the other two ingesters prevent data loss when a single ingester fails, but it does not help when two or more fail at once.
To Reproduce
Steps to reproduce the behavior:
Start Tempo from main.
Ingesters get killed for various reasons: OOMs or readiness probe failures.
A large number of tenants and pressure on the read path (such as search and autocomplete lookups) increase the likelihood of this.
Eventually an ingester will restart into this state for one or more tenants.
Expected behavior
Ingesters are able to recover from all crashes and unclean restarts. Existing data in memory or partially written blocks might be lost, but the ingester should always be able to receive new data.
Environment:
Infrastructure: Kubernetes
Deployment tool: jsonnet
Additional Context
Updating findings: in one case the issue was caused by the ingester replaying (and deleting) the active head block (it logs "failed to replay block. removing."). The "no such file or directory" error then occurs on the next flush because the head block directory is gone. Receiving traffic and flushing should not happen until replay is complete, so the issue seems to be related to timing on startup.
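A hedged sketch of the ordering that would avoid this race (the ingester type, fields, and method names here are hypothetical, not Tempo's actual API): gate the flush loop on a channel that replay closes when it finishes, so a flush can never run against a head block that replay might still delete.

```go
package main

import (
	"fmt"
	"time"
)

type ingester struct {
	replayDone chan struct{} // closed once WAL replay has finished
}

// replayWAL rebuilds in-memory state from disk; corrupt blocks may be
// deleted here, which is why nothing must write to them concurrently.
func (i *ingester) replayWAL() {
	time.Sleep(100 * time.Millisecond) // stand-in for actual replay work
	close(i.replayDone)                // now safe to cut/flush traces
}

// flushLoop refuses to cut traces until replay has completed, removing
// the startup race described above.
func (i *ingester) flushLoop() {
	<-i.replayDone
	fmt.Println("replay complete; cutting traces to head block")
}

func main() {
	i := &ingester{replayDone: make(chan struct{})}
	go i.replayWAL()
	i.flushLoop()
}
```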