log-groomer container crashing with .Values.logs.persistence enabled #37220
Labels
area:core
kind:bug
This is a clearly a bug
needs-triage
label for new issues that we didn't triage yet
Apache Airflow version
2.8.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
With logs persistence enabled (logs.persistence.enabled), all airflow components are storing their logs on a shared volume.
Default behavior is to clean up old logs after an amount of days.
When 2 or more containers attempt to clean the same logs, log-groomer containers crash with the following messages:
This is a pretty similar bug than solved with this pull request: https://github.com/apache/airflow/pull/36050/files
The issue arises right on the previous command when either
find
orrm
command fail.Error message pointing to no such file or directory indicates another container has already removed the file.
Error message with device busy points that another container is performing an operation.
What you think should happen instead?
Failures on both find/rm commands can be safely ignored since the cleanup has already been done by another container.
How to reproduce
Install airflow via helm official chart and set
logs.persistence.enabled
true.Then, it's just a matter of waiting few days generating logs (tasks running) until the race condition appears.
On my environment with multiple replicas per component, 5 dags and 72 tasks, this is happening every 1 or 2 days randomly:
I guess this will happen less with single replica and less tasks.
Operating System
Debian GNU/Linux 12 (bookworm)
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
Code of Conduct
The text was updated successfully, but these errors were encountered: