FIX Slow cleared tasks would be adopted by Celery. #16718
Conversation
I was wondering if we could improve the query somehow by checking whether the orphaned task came from one of the executors that died. What do you think?
```diff
@@ -172,6 +172,7 @@ def clear_task_instances(
         # original max_tries or the last attempted try number.
         ti.max_tries = max(ti.max_tries, ti.prev_attempted_tries)
         ti.state = State.NONE
+        ti.external_executor_id = None
```
Could you check this in the unit tests so we don't regress?
Good one! Will add it.
Updated the current test to also verify that the external_executor_id is reset.
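For reference, a minimal pytest-style sketch of the kind of assertion added here. This is a simplified illustration, not the real test: the `dag`, `session`, and `ti` fixtures are placeholders for setup that creates a DagRun with one executed task instance against a working Airflow metadata DB.

```python
# Sketch only: assumes fixtures that provide a DAG, a SQLAlchemy session,
# and a TaskInstance that already ran (not shown here).
from airflow.models.taskinstance import clear_task_instances
from airflow.utils.state import State


def test_clear_resets_external_executor_id(dag, session, ti):
    # Simulate a task that was picked up by an executor at some point.
    ti.external_executor_id = "some-celery-task-id"
    ti.state = State.SUCCESS
    session.merge(ti)
    session.commit()

    clear_task_instances([ti], session, dag=dag)
    session.refresh(ti)

    # After clearing, the task must no longer look adoptable.
    assert ti.state == State.NONE
    assert ti.external_executor_id is None
```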
We could if we need to, but the reason I didn't is double adoption: if a scheduler adopts tasks and then dies, we want all those tasks (its own and the ones it adopted) to be adopted again.
Yes, alright. It sounds like we should just make sure that cleared tasks are not being picked up.
Force-pushed from 6763996 to 0281ed7.
The test failures seem random. Can someone rerun the failing parts? :)
@Jorricks We seem to have a general problem with kind tests in GitHub public runners (#16736); for the other failures, you can re-run them yourself.
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it onto the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
Well done 👏
Have rebased the PR on main 🤞
Force-pushed from 0281ed7 to 73cd253.
The Celery executor currently adopts anything that has ever run before and has been cleared since then.

**Example of the issue:** We have a DAG that runs over 150 sensor tasks and 50 ETL tasks while having a concurrency of 3 and max_active_runs of 16. This setup is required because we want to divide the resources and don't want this DAG to take up all of them. Many tasks therefore sit in the scheduled state for a while, since they can't be queued due to the concurrency of 3. However, with the current implementation, if these tasks have ever run before, they get adopted by the scheduler's executor instance and become stuck forever [without this PR](#16550). They should never have been adopted in the first place.

**Contents of the PR:**
1. Tasks in the scheduled state should never have reached an executor, so the scheduled state is removed from the states eligible for adoption.
2. Given that a task instance's `external_executor_id` is important in deciding whether it is adopted, it is also reset whenever the state of the TaskInstance is reset.

(cherry picked from commit 554a239)
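To illustrate the intent of both changes, here is a simplified, hypothetical adoption rule. This is a sketch only, not the actual scheduler or executor code; the real logic lives in the scheduler/executor modules and differs in detail.

```python
# Simplified illustration of the adoption rule this PR aims for.
from airflow.utils.state import State


def is_adoptable(ti) -> bool:
    """Return True if a lost task instance should be adopted by a live executor."""
    # 1. Scheduled tasks never reached an executor, so there is nothing to adopt.
    if ti.state not in (State.QUEUED, State.RUNNING):
        return False
    # 2. Cleared tasks have their external_executor_id reset to None (see the
    #    clear_task_instances change above), so they are not adopted either.
    return ti.external_executor_id is not None
```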