
FIX Slow cleared tasks would be adopted by Celery. #16718

Merged
merged 1 commit into apache:main on Jul 2, 2021

Conversation

Jorricks
Contributor

@Jorricks Jorricks commented Jun 29, 2021

The Celery executor currently adopts any task instance that has ever run before and has been cleared since then.

Example of the issue:
We have a DAG that runs over 150 sensor tasks and 50 ETL tasks, with a concurrency of 3 and max_active_runs of 16. This setup is required because we want to divide resources and do not want this DAG to take them all up. As a result, many tasks sit in the scheduled state for a while, because the concurrency of 3 prevents them from being queued. However, with the current implementation, if these tasks have ever run before, they get adopted by the scheduler's executor instance and, without this PR, become stuck forever. They should never have been adopted in the first place.

Contents of the PR:

  1. Tasks that are in the scheduled state should never have arrived at an executor. Hence, we remove the scheduled state from the set of states eligible for adoption.
  2. Because a task instance's `external_executor_id` is quite important in deciding whether it is adopted, we also reset it when we reset the state of the TaskInstance (a simplified sketch of both changes follows below).
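
As referenced above, here is a minimal sketch of the intent behind both changes. The names below (`adoptable_states`, `should_adopt`, `clear_task_instance`) are simplified placeholders for illustration, not the actual Airflow internals:

```python
# Illustrative sketch only -- a simplification of the ideas in this PR,
# not the exact Airflow source.
from airflow.utils.state import State

# 1. Only tasks that actually reached an executor should be adoptable;
#    SCHEDULED tasks never left the scheduler, so they are excluded.
adoptable_states = [State.QUEUED, State.RUNNING]  # State.SCHEDULED removed


def should_adopt(ti) -> bool:
    """A task instance is a candidate for adoption only if it is in an
    adoptable state and still carries an external executor id."""
    return ti.state in adoptable_states and ti.external_executor_id is not None


# 2. When a task instance is cleared, also wipe the stale external_executor_id
#    so a later run cannot be matched against an old Celery task id.
def clear_task_instance(ti):
    ti.state = State.NONE
    ti.external_executor_id = None
```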

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Jun 29, 2021
@Jorricks Jorricks marked this pull request as draft June 29, 2021 19:40
@Jorricks
Contributor Author

Jorricks commented Jun 29, 2021

I was wondering if we could improve the query somehow by checking whether the orphaned task came from one of the executors that died. What do you think?

@Jorricks Jorricks marked this pull request as ready for review June 29, 2021 20:25
@Jorricks Jorricks changed the title from "Slow (cleared) tasks would be adopted by Celery." to "FIX Slow cleared tasks would be adopted by Celery." Jun 29, 2021
@@ -172,6 +172,7 @@ def clear_task_instances(
# original max_tries or the last attempted try number.
ti.max_tries = max(ti.max_tries, ti.prev_attempted_tries)
ti.state = State.NONE
ti.external_executor_id = None
Member

Could you check this in the unit tests so we don't regress?

Contributor Author

Good one! Will add it.

Contributor Author

Updated the current test to also verify the external_executor_id is reset.
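
For reference, a minimal sketch of such a regression check, assuming a pytest-style session and a hypothetical `create_task_instance` helper (the actual updated test lives in the Airflow test suite and differs in detail):

```python
# Sketch of a regression check; create_task_instance is a hypothetical
# helper fixture here, not necessarily the one used in the real test.
from airflow.models.taskinstance import clear_task_instances
from airflow.utils.state import State


def test_clear_resets_external_executor_id(session, create_task_instance):
    ti = create_task_instance(state=State.SUCCESS, session=session)
    ti.external_executor_id = "celery-task-id-from-a-previous-run"
    session.merge(ti)
    session.flush()

    clear_task_instances([ti], session)
    session.refresh(ti)

    # Clearing should reset both the state and the external executor id,
    # so the Celery executor cannot adopt the cleared task instance.
    assert ti.state == State.NONE
    assert ti.external_executor_id is None
```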

@ashb
Member

ashb commented Jun 30, 2021

I was wondering if we could improve the query somehow by checking whether the orphaned task came from one of the executors that died. What do you think?

We could if we need to, but the reason I didn't is double adoption: if a scheduler adopts tasks and then dies, we want all those tasks (its own and the ones it adopted) to be adopted again.
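
As an abstract illustration of that point (this is not Airflow code, and the owner-tracking dictionary below is invented purely for this example): adoption has to be keyed on whoever currently owns the task, not on the executor that originally queued it, otherwise tasks that were already adopted once would be missed when their new owner dies.

```python
# Toy model of "double adoption": every running task records which scheduler
# currently owns it, and adoption re-assigns everything the dead owner held,
# regardless of who queued the task originally.
running_tasks = {
    "task_a": {"owner": "scheduler_1"},  # queued by scheduler_1 itself
    "task_b": {"owner": "scheduler_1"},  # previously adopted by scheduler_1
}


def adopt_orphans(dead_scheduler: str, new_scheduler: str) -> None:
    for task in running_tasks.values():
        if task["owner"] == dead_scheduler:
            task["owner"] = new_scheduler


adopt_orphans("scheduler_1", "scheduler_2")
assert all(t["owner"] == "scheduler_2" for t in running_tasks.values())
```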

@Jorricks
Contributor Author

I was wondering if we could improve the query somehow by checking whether the orphaned task came from one of the executors that died. What do you think?

We could if we need to, but the reason I didn't is double adoption: if a scheduler adopts tasks and then dies, we want all those tasks (its own and the ones it adopted) to be adopted again.

Yes, alright. It sounds like we should just make sure that cleared tasks are not picked up.

@Jorricks
Contributor Author

Jorricks commented Jul 1, 2021

The test failures seem random. Can someone rerun the failing parts :)?
I don't fully grasp the failing test. Can anyone explain why we are checking for orphaned processes?

@potiuk
Member

potiuk commented Jul 1, 2021

@Jorricks We seem to have a general problem with the kind tests on GitHub public runners (#16736). For the other failures, you can trigger a rerun yourself:

  1. The best way: rebase onto the latest main and `git push --force-with-lease` - this makes sure you are using the latest main code.
  2. Alternatively (if you are already based on the latest main), `git commit --amend` and force-push again.
  3. Finally, close and reopen the PR. This will trigger the rebuild.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jul 1, 2021
@github-actions

github-actions bot commented Jul 1, 2021

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

Member

@kaxil kaxil left a comment

Well done 👏

@kaxil kaxil added this to the Airflow 2.1.2 milestone Jul 1, 2021
@kaxil
Member

kaxil commented Jul 1, 2021

Have rebased the PR on Main 🤞

@kaxil kaxil merged commit 554a239 into apache:main Jul 2, 2021
@ashb ashb modified the milestones: Airflow 2.1.2, Airflow 2.1.3 Jul 7, 2021
jhtimmins pushed a commit that referenced this pull request Aug 9, 2021
jhtimmins pushed a commit that referenced this pull request Aug 13, 2021
kaxil pushed a commit that referenced this pull request Aug 17, 2021
jhtimmins pushed a commit that referenced this pull request Aug 17, 2021