-
Notifications
You must be signed in to change notification settings - Fork 14.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure we use ti.queued_by_job_id #14795
Ensure we use ti.queued_by_job_id #14795
Conversation
Paging @dimberman |
It would be great to get a steer on whether this change is acceptable/safe, even if the PR isn't approved at this stage. |
The celery_executor also uses airflow/airflow/executors/celery_executor.py Lines 469 to 470 in 2a2adb3
Does it also experience these problems? |
The usage in Celery is correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I understand the problem, but this doesn't quite feel like the right fix. I'm digging in a problem to look at what a better fix might be.
Additionally you will need to add unit tests the cover this case please.
Hi @ashb, Not sure we want to spend time adding unit tests if you don't feel this is the right fix. Do we think a better solution would be to merge |
Hi @samwedge There could be value in the unit tests anyway, as they should show the problem is fixed (i.e. that the task is not adopted tiwce/orphaned) without getting in to implementation details. But it's also cool if you want to hold off on doing this for now. |
Thanks. I agree if we can get a nice generic test in place that doesn't have any implementation detail, then it's worth adding in. That said, I've been looking at where to make the change. My first thought was in Alternatively, do you think this is a candidate for a system test? |
@ashb Just wondered if you had any further thoughts on my previous message. In particular, where the unit test might live. It will need to test the integration between |
@samwedge are you able to reproduce the error in a system test? I think if you can make it fail then a system test should be sufficient here. |
Sorry for the silence @dimberman, trying to find some time to work on this. I'm fine with the unit and integration tests, but have never run the system tests before. I'll take a look and drop a message on Slack if I have any issues. |
I was so wrong on this. I didn't realise that we are already re-setting ti.queued_by_job_id on adoption. (Given I wrote that code I probably should do. But, well, 2020 was looooong.) |
@samwedge @atrbgithub Can we merge autotraderuk#2 (from @jedcunningham ) in your branch and rebase the PR on master please so that we can merge this PR |
apache#14795 Ensure that we use ti.queued_by_job_id when searching for pods. The queued_by_job_id is used by adopt_launched_task when updating the labels. Without this, after restarting the scheduler a third time, the scheduler does not find the pods as it is still searching for the id of the original scheduler (ti.external_executor_id) Co-Authored-By: samwedge <[email protected]> Co-Authored-By: philip-hope <[email protected]> Co-Authored-By: Jed Cunningham <[email protected]>
Ensure that we use ti.queued_by_job_id when searching for pods. The queued_by_job_id is used by adopt_launched_task when updating the labels. Without this, after restarting the scheduler a third time, the scheduler does not find the pods as it is still searching for the id of the original scheduler (ti.external_executor_id) Co-Authored-By: samwedge <[email protected]> Co-Authored-By: philip-hope <[email protected]>
d3059bd
to
d5493df
Compare
The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease. |
Ensure that we use ti.queued_by_job_id when searching for pods. The queued_by_job_id is used by adopt_launched_task when updating the labels. Without this, after restarting the scheduler a third time, the scheduler does not find the pods as it is still searching for the id of the original scheduler (ti.external_executor_id) Co-Authored-By: samwedge <[email protected]> Co-Authored-By: philip-hope <[email protected]> Co-authored-by: Jed Cunningham <[email protected]> (cherry picked from commit 344e829)
Ensure that we use ti.queued_by_job_id when searching for pods.
When a task is adopted by a new scheduler, the id of the current task is used:
airflow/airflow/executors/kubernetes_executor.py
Lines 620 to 623 in feb6b81
When this is successful the task instance is updated, and
queued_by_job_id
is updated with the id of the current scheduler:airflow/airflow/jobs/scheduler_job.py
Lines 1877 to 1878 in feb6b81
Therefore when we search for the pod labels on subsequent scheduler relaunches, we must search for pods using the
queued_by_job_id
and notexternal_executor_id
, as we are currently doing:airflow/airflow/executors/kubernetes_executor.py
Line 592 in feb6b81
external_executor_id
is static and never appears to be updated when tasks are adopted.This relates to #13808