Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-6175] fixes scheduler queue bug #6732

Merged
merged 8 commits into from
Dec 10, 2019

Conversation

dimberman
Copy link
Contributor

@dimberman dimberman commented Dec 5, 2019

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-6175
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

There is a bug caused by scheduler_jobs refactor which leads to task failure and scheduler locking.

Essentially when a there is an overflow of tasks going into the scheduler, the tasks are set back to scheduled, but are not removed from the executor's queued_tasks queue.

This means that the executor will attempt to run tasks that are in the scheduled state, but those tasks will fail dependency checks. Eventually the queue is filled with scheduled tasks, and the scheduler can no longer run.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

@ashb
Copy link
Member

ashb commented Dec 5, 2019

Need a jira for this please @dimberman

tests/jobs/test_scheduler_job.py Outdated Show resolved Hide resolved
executor=executor,
subdir=os.path.join(settings.DAGS_FOLDER,
"no_dags.py"))
mock.patch.object(scheduler, '_change_state_for_tis_without_dagrun').start()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to patch this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Otherwise the tasks won't run because they don't have an associated DagRun.

tests/jobs/test_scheduler_job.py Outdated Show resolved Hide resolved
@dimberman
Copy link
Contributor Author

@ashb this bug has a jira https://issues.apache.org/jira/browse/AIRFLOW-6175

@dimberman dimberman changed the title fixes scheduler queue bug [AIRFLOW-6175] fixes scheduler queue bug Dec 5, 2019
@kaxil
Copy link
Member

kaxil commented Dec 5, 2019

We have merge conflicts :( - can you fix that please

@dimberman dimberman force-pushed the AIRFLOW-6175_scheduler-queue-bug branch from 96c24aa to eb42612 Compare December 5, 2019 18:35
@dimberman
Copy link
Contributor Author

@kaxil fixed conflicts. @ashb addressed comments.

Copy link
Member

@KevinYang21 KevinYang21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this! I'm wondering why the executor starts to run task in the scheduled state. Thought any task sent to the executor would be removed from queue_tasks and be moved to running_tasks and thus the state won't be set back to scheduled.

ti.state = State.SCHEDULED
tis.append(ti)
session.merge(ti)
session.commit()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
session.commit()

create_session would commit the session when exiting the contextmanager

"no_dags.py"))
mock.patch.object(scheduler, '_change_state_for_tis_without_dagrun').start()
scheduler._process_dags(simple_dag_bag)
[ti.refresh_from_db() for ti in tis]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[ti.refresh_from_db() for ti in tis]
for ti in tis:
ti.refresh_from_db()

as we don't need to create list

Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion, otherwise LGTM

@KevinYang21
Copy link
Member

KevinYang21 commented Dec 5, 2019

Let me answer my own question... Since we change from SCHEDULED to QUEUED only when the dag is in the simple_dagbag, thus those left over tasks in queued_tasks won't be set back to QUEUED until the next parse but stlll gonna be picked up by the executor, then cause the deps check to fail.

Thanks again for fixing my bad 🙏

And to understand the impact completely, the bug was causing scheduler/executor to slow down quite a lot when traffic >> executor capacity, but not completely blocked right? As those SCHEDULED tasks in queued_tasks would be slowly moved out by _execute_task_instances.

@kaxil
Copy link
Member

kaxil commented Dec 6, 2019

Following tests are failing:

FAILED tests/jobs/test_backfill_job.py::TestBackfillJob::test_backfill_examples_0_example_branch_operator
FAILED tests/jobs/test_backfill_job.py::TestBackfillJob::test_backfill_examples_1_example_bash_operator
FAILED tests/jobs/test_backfill_job.py::TestBackfillJob::test_backfill_examples_2_example_skip_dag
FAILED tests/jobs/test_backfill_job.py::TestBackfillJob::test_backfill_multi_dates
FAILED tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_still_in_executor
FAILED tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_scheduler_executor_overflow
FAILED tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_scheduler_reschedule
==== 7 failed, 3999 passed, 117 skipped, 16 warnings in 1481.02s (0:24:41) =====

@codecov-io
Copy link

codecov-io commented Dec 7, 2019

Codecov Report

Merging #6732 into master will decrease coverage by 0.22%.
The diff coverage is 78.26%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6732      +/-   ##
==========================================
- Coverage   84.45%   84.23%   -0.23%     
==========================================
  Files         672      672              
  Lines       38081    38089       +8     
==========================================
- Hits        32163    32084      -79     
- Misses       5918     6005      +87
Impacted Files Coverage Δ
airflow/jobs/scheduler_job.py 89.24% <78.26%> (-0.32%) ⬇️
airflow/kubernetes/volume_mount.py 44.44% <0%> (-55.56%) ⬇️
airflow/kubernetes/volume.py 52.94% <0%> (-47.06%) ⬇️
airflow/kubernetes/pod_launcher.py 45.25% <0%> (-46.72%) ⬇️
airflow/kubernetes/refresh_config.py 50.98% <0%> (-23.53%) ⬇️
...rflow/contrib/operators/kubernetes_pod_operator.py 78.2% <0%> (-20.52%) ⬇️
airflow/models/taskinstance.py 94.39% <0%> (-0.16%) ⬇️
airflow/jobs/backfill_job.py 90.72% <0%> (+0.28%) ⬆️
airflow/hooks/postgres_hook.py 94.36% <0%> (+1.4%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 69629a5...db3c0f4. Read the comment docs.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one! Let's get it merged. We have some other "global" updates to this code which aim to increase readability (pylint + type annotations) so we should - I think - merge those fairly quickly.

@potiuk potiuk added the area:Scheduler including HA (high availability) scheduler label Dec 9, 2019
@dimberman
Copy link
Contributor Author

@potiuk @KevinYang21 wanna give a final lookover before merging?

@KevinYang21
Copy link
Member

KevinYang21 commented Dec 10, 2019

oops, just found a debug line that I forgot to remove c0ea272#diff-d7e6d11bfaeca087149ad577dd7704daR68. Otherwise LGTM. Will push a small commit to fix it. If we all good I can merge it after the CI passes.

@ashb ashb merged commit f3bb4c3 into apache:master Dec 10, 2019
kaxil pushed a commit to astronomer/airflow that referenced this pull request Dec 10, 2019
kaxil added a commit that referenced this pull request Dec 17, 2019
…6732)

(cherry-picked from commit f3bb4c3)

Co-Authored-By: Kaxil Naik <[email protected]>
Co-Authored-By: Kevin Yang <[email protected]>
kaxil added a commit that referenced this pull request Dec 17, 2019
…6732)

(cherry-picked from commit f3bb4c3)

Co-Authored-By: Kaxil Naik <[email protected]>
Co-Authored-By: Kevin Yang <[email protected]>
ashb pushed a commit that referenced this pull request Dec 19, 2019
…6732)

(cherry-picked from commit f3bb4c3)

Co-Authored-By: Kaxil Naik <[email protected]>
Co-Authored-By: Kevin Yang <[email protected]>
galuszkak pushed a commit to FlyrInc/apache-airflow that referenced this pull request Mar 5, 2020
…pache#6732)

There is a bug caused by scheduler_jobs refactor which leads to task failure
and scheduler locking.

Essentially when a there is an overflow of tasks going into the scheduler, the
tasks are set back to scheduled, but are not removed from the executor's
queued_tasks queue.

This means that the executor will attempt to run tasks that are in the scheduled
state, but those tasks will fail dependency checks. Eventually the queue is
filled with scheduled tasks, and the scheduler can no longer run.

Co-Authored-By: Kaxil Naik <[email protected]>, Kevin Yang <[email protected]>
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Apr 7, 2020
Original PR: apache/airflow#6732

Change-Id: I1f8d8c277c54a9b1e2546cf1b37502b64f6c0b5f
GitOrigin-RevId: 2651459c7994aae7dd3ef4808f00bb39293ea8a3
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jun 10, 2021
Original PR: apache/airflow#6732

Change-Id: I1f8d8c277c54a9b1e2546cf1b37502b64f6c0b5f
GitOrigin-RevId: 2651459c7994aae7dd3ef4808f00bb39293ea8a3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants