log-groomer container crashing with .Values.logs.persistence enabled #37220

Closed
2 tasks done
arovira opened this issue Feb 7, 2024 · 1 comment · Fixed by #37222
Labels
area:core, kind:bug (This is a clearly a bug), needs-triage (label for new issues that we didn't triage yet)

Comments

Contributor

arovira commented Feb 7, 2024

Apache Airflow version

2.8.1

If "Other Airflow 2 version" selected, which one?

No response

What happened?

With log persistence enabled (logs.persistence.enabled), all Airflow components store their logs on a shared volume.

By default, old logs are cleaned up after a configured number of days.
When two or more containers attempt to clean up the same logs, the log-groomer containers crash with the following messages:

find: ‘/opt/airflow/logs/dag_id=findings_sync/run_id=scheduled__2024-01-19T00:00:00+00:00’: No such file or directory
rm: cannot remove '/opt/airflow/logs/dag_id=findings_sync_clean/run_id=scheduled__2024-01-19T00:00:00+00:00/task_id=start_sync_clean/attempt=1.log': Device or resource busy

This bug is very similar to the one fixed by this pull request: https://github.com/apache/airflow/pull/36050/files

The crash occurs in the command just before the one addressed by that pull request, when either the find or the rm command fails.
The "No such file or directory" error indicates that another container has already removed the file.
The "Device or resource busy" error indicates that another container is operating on the file at that moment.
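
To illustrate why a single failed command can take the whole container down, here is a minimal sketch. It assumes the cleanup script runs under bash strict mode (`set -euo pipefail`), which this report does not state. The path in `missing_dir` is a made-up stand-in for a directory that another container has just deleted.

```shell
#!/usr/bin/env bash
# Sketch only: strict mode is an assumption about how the cleanup script runs.
missing_dir="/tmp/no-such-log-dir-demo.$$"   # hypothetical path that does not exist

# Child bash in strict mode: find fails on the missing path, so the shell
# exits before it reaches the echo.
strict_status=0
strict_out=$(bash -c 'set -euo pipefail; find "$1" -type f 2>/dev/null; echo "still running"' _ "$missing_dir") || strict_status=$?

# Same command with the failure ignored: the shell carries on.
tolerant_status=0
tolerant_out=$(bash -c 'set -euo pipefail; find "$1" -type f 2>/dev/null || true; echo "still running"' _ "$missing_dir") || tolerant_status=$?

echo "strict:   exit=$strict_status output='$strict_out'"
echo "tolerant: exit=$tolerant_status output='$tolerant_out'"
```

The child shells are started with `bash -c` rather than a `( ... )` subshell because bash ignores `set -e` inside a compound command that sits on the left-hand side of `||`.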

What you think should happen instead?

Failures of both the find and rm commands can be safely ignored, since the cleanup has already been done, or is being done, by another container.
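
One way to express this tolerance is sketched below. This is not the chart's actual script or the fix that was merged: `clean_old_logs`, its arguments, and the 15-day retention used in the example are hypothetical names and values chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: clean_old_logs and its arguments are hypothetical,
# not the Helm chart's real cleanup script.
clean_old_logs() {
  local log_dir="$1" retention_days="$2"

  # Delete *.log files older than the retention period. Errors are discarded
  # and the exit status is forced to success, because a file that has vanished
  # or is busy is being handled by another container.
  find "$log_dir" -type f -name '*.log' -mtime +"$retention_days" -print0 2>/dev/null \
    | xargs -0 rm -f 2>/dev/null || true

  # Remove directories left empty, with the same tolerance.
  find "$log_dir" -mindepth 1 -type d -empty -delete 2>/dev/null || true
}

# Usage example on a throwaway directory.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/dag_id=demo/run_id=old" "$demo_dir/dag_id=demo/run_id=new"
touch -t 200001010000 "$demo_dir/dag_id=demo/run_id=old/attempt=1.log"   # very old file
touch "$demo_dir/dag_id=demo/run_id=new/attempt=1.log"                    # fresh file
clean_old_logs "$demo_dir" 15
clean_old_logs "$demo_dir/already-gone" 15   # missing directory: still succeeds
```

Discarding stderr also hides genuine problems such as permission errors, so a real script might keep the messages visible and ignore only the exit status.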

How to reproduce

Install Airflow via the official Helm chart and set logs.persistence.enabled to true.

Then it is just a matter of waiting a few days while tasks run and generate logs, until the race condition appears.
In my environment, with multiple replicas per component, 5 DAGs and 72 tasks, this happens at random every 1 or 2 days:
[image]

I guess this will happen less often with a single replica and fewer tasks.
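
The installation step can be sketched with the commands below. The release name `airflow` and the namespace `airflow` are assumptions chosen for illustration. The `run` helper is hypothetical: it only prints each command unless `APPLY=1` is set, so the snippet can be dry-run without a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command unless APPLY=1, then execute it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Official chart repository; release name and namespace are assumptions.
run helm repo add apache-airflow https://airflow.apache.org
run helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow --create-namespace \
  --set logs.persistence.enabled=true
```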

Operating System

Debian GNU/Linux 12 (bookworm)

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

arovira added the area:core, kind:bug, and needs-triage labels Feb 7, 2024

boring-cyborg bot commented Feb 7, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
