Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-4956] Fix LocalTaskJob heartbeat log spamming #5589

Merged
merged 1 commit into from
Aug 8, 2019

Conversation

KevinYang21
Copy link
Member

@KevinYang21 KevinYang21 commented Jul 15, 2019

Make sure you have checked all steps below.

Jira

Description

  • Here are some details about my PR, including screenshots of any UI changes:

If there's an exception in LocalTaskJob, e.g. DB connectivity error, the job will not wait before attempting the next heartbeat, causing the job to spam retry attempts and task logging.

This PR will unify the usage of BaseJob.heartbeat() so that exceptions are handled all inside the method instead of caller.

This PR also replaces local_task_job_heartbeat_fail metric with lower(JobClassName)_heartbeat_failure metrics so all jobs can have StatsD metrics on heartbeat failures.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    tests/jobs/test_local_task_job.py. test_heartbeat_failed_fast

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

  • Passes flake8

@KevinYang21 KevinYang21 force-pushed the kevin_yang_log_spam branch 2 times, most recently from 954f068 to dfe2bf0 Compare July 15, 2019 10:51
@@ -192,7 +192,7 @@ def heartbeat(self):
self.heartbeat_callback(session=session)
self.log.debug('[heartbeat]')
except OperationalError as e:
self.log.error("Scheduler heartbeat got an exception: %s", str(e))
self.log.exception("%s heartbeat got an exception: %s", self.__class__, str(e))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also include the stack trace of the exception - was that intended?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, already printing it previously and just a small touch up to use log.exception( also removed the redundant str(e) in the draft).

airflow/jobs/local_task_job.py Show resolved Hide resolved
@KevinYang21 KevinYang21 force-pushed the kevin_yang_log_spam branch 4 times, most recently from 7dd3752 to 8d4ddff Compare July 15, 2019 19:34
@codecov-io
Copy link

codecov-io commented Jul 15, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@3ee2dcb). Click here to learn what that means.
The diff coverage is 75%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #5589   +/-   ##
=========================================
  Coverage          ?   79.06%           
=========================================
  Files             ?      489           
  Lines             ?    30681           
  Branches          ?        0           
=========================================
  Hits              ?    24259           
  Misses            ?     6422           
  Partials          ?        0
Impacted Files Coverage Δ
airflow/jobs/base_job.py 87.05% <0%> (ø)
airflow/jobs/local_task_job.py 90% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ee2dcb...1be47ef. Read the comment docs.

@KevinYang21 KevinYang21 changed the title [WIP][AIRFLOW-4956] Fix LocalTaskJob heartbeat log spamming [AIRFLOW-4956] Fix LocalTaskJob heartbeat log spamming Jul 17, 2019
@KevinYang21
Copy link
Member Author

@ashb Updated accordingly, PTAL 🙏

UPDATING.md Outdated Show resolved Hide resolved
Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good once you add example to updating, or do something to maintain the existing metric format.

@KevinYang21 KevinYang21 force-pushed the kevin_yang_log_spam branch 2 times, most recently from c760c02 to 2744669 Compare July 31, 2019 20:17
@KevinYang21
Copy link
Member Author

@ashb Updated with backward compatible approach, PTAL

@ashb
Copy link
Member

ashb commented Aug 1, 2019

(Will merge once we fix master tests)

@KevinYang21
Copy link
Member Author

@ashb seems like recent two commits are passing master CI. Is master CI fixed already?

@potiuk potiuk merged commit e07e304 into apache:master Aug 8, 2019
ashb pushed a commit to ashb/airflow that referenced this pull request Oct 10, 2019
cong-zhu pushed a commit to cong-zhu/airflow that referenced this pull request Feb 6, 2020
@kaxil kaxil modified the milestone: Airflow 1.10.11 Jun 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants