[AIRFLOW-4956] Fix LocalTaskJob heartbeat log spamming #5589
airflow/jobs/base_job.py
Outdated
@@ -192,7 +192,7 @@ def heartbeat(self):
 self.heartbeat_callback(session=session)
 self.log.debug('[heartbeat]')
 except OperationalError as e:
-self.log.error("Scheduler heartbeat got an exception: %s", str(e))
+self.log.exception("%s heartbeat got an exception: %s", self.__class__, str(e))
This will also include the stack trace of the exception - was that intended?
Ya, it was already being printed previously; this is just a small touch-up to use log.exception (I also removed the redundant str(e) in the draft).
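For readers following the review question above, here is a minimal standalone sketch of how `log.error` and `log.exception` differ in Python's standard `logging` module. It is not Airflow code: the logger name `heartbeat_demo`, the helper `log_failure`, and the simulated `RuntimeError` are all hypothetical and exist only for illustration.

```python
import io
import logging

# Capture log output in memory so the two calls can be compared.
stream = io.StringIO()
logger = logging.getLogger("heartbeat_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))


def log_failure(use_exception: bool) -> str:
    """Log a caught error one of two ways and return what was written."""
    stream.seek(0)
    stream.truncate(0)
    try:
        raise RuntimeError("db connection lost")  # simulated failure
    except RuntimeError as e:
        if use_exception:
            # Logs at ERROR level and appends the active traceback,
            # so passing str(e) as an argument is redundant.
            logger.exception("heartbeat got an exception")
        else:
            # Logs only the formatted message, without a traceback.
            logger.error("heartbeat got an exception: %s", str(e))
    return stream.getvalue()


error_output = log_failure(use_exception=False)
exception_output = log_failure(use_exception=True)
```

With the default formatter, `exception_output` contains the traceback text (which already ends with the exception message), while `error_output` holds just the single formatted line.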
Codecov Report
@@ Coverage Diff @@
## master #5589 +/- ##
=========================================
Coverage ? 79.06%
=========================================
Files ? 489
Lines ? 30681
Branches ? 0
=========================================
Hits ? 24259
Misses ? 6422
Partials ? 0
@ashb Updated accordingly, PTAL 🙏
Looks good once you add an example to UPDATING.md, or do something to maintain the existing metric format.
@ashb Updated with a backward-compatible approach, PTAL
(Will merge once we fix master tests)
@ashb It seems the two most recent commits are passing master CI. Is master CI fixed already?
(cherry picked from commit e07e304)
Make sure you have checked all steps below.
Jira
Description
If there's an exception in LocalTaskJob, e.g. DB connectivity error, the job will not wait before attempting the next heartbeat, causing the job to spam retry attempts and task logging.
This PR will unify the usage of BaseJob.heartbeat() so that exceptions are all handled inside the method instead of by the caller. This PR also replaces the local_task_job_heartbeat_fail metric with lower(JobClassName)_heartbeat_failure metrics so all jobs can have StatsD metrics on heartbeat failures.
Tests
tests/jobs/test_local_task_job.py: test_heartbeat_failed_fast
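To make the Description concrete, the following is a rough sketch of the idea rather than the actual change in this PR: the heartbeat method itself catches the failure, records a metric named after the lowercased job class, and waits before the caller tries again. The classes `SketchBaseJob` and `SketchLocalTaskJob`, the `Counter` standing in for a StatsD client, the injectable `sleep`, and the use of `ConnectionError` in place of a database `OperationalError` are all assumptions made for illustration.

```python
import time
from collections import Counter
from typing import Callable


class SketchBaseJob:
    """Hypothetical stand-in for a job with a heartbeat; not Airflow's BaseJob."""

    def __init__(self, heartrate: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.heartrate = heartrate
        self.sleep = sleep          # injectable so the sketch can run without waiting
        self.stats = Counter()      # stand-in for a StatsD client

    def heartbeat_callback(self) -> None:
        """Subclasses override this; it may raise, e.g. on a DB error."""

    def heartbeat(self) -> None:
        try:
            self.heartbeat_callback()
        except ConnectionError:
            # Metric name follows the lower(JobClassName)_heartbeat_failure pattern.
            metric = "{}_heartbeat_failure".format(type(self).__name__.lower())
            self.stats[metric] += 1
            # Wait inside the method so callers cannot spin in a tight retry loop.
            self.sleep(self.heartrate)


class SketchLocalTaskJob(SketchBaseJob):
    def heartbeat_callback(self) -> None:
        raise ConnectionError("simulated DB connectivity error")


slept = []  # records requested waits instead of actually sleeping
job = SketchLocalTaskJob(heartrate=5.0, sleep=slept.append)
for _ in range(3):
    job.heartbeat()
```

Because the exception handling and the wait live inside `heartbeat()`, every subclass gets the same failure metric and back-off without each caller reimplementing them, which is the kind of unification the Description refers to.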
Commits
Documentation
Code Quality
flake8