Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure we use ti.queued_by_job_id #14795

Merged
merged 3 commits into from
Apr 26, 2021

Conversation

atrbgithub
Copy link
Contributor

Ensure that we use ti.queued_by_job_id when searching for pods.

When a task is adopted by a new scheduler, the id of the current task is used:

self.log.info("attempting to adopt pod %s", pod.metadata.name)
pod.metadata.labels['airflow-worker'] = pod_generator.make_safe_label_value(
str(self.scheduler_job_id)
)

When this is successful the task instance is updated, and queued_by_job_id is updated with the id of the current scheduler:

for ti in set(tis_to_reset_or_adopt) - set(to_reset):
ti.queued_by_job_id = self.id

Therefore when we search for the pod labels on subsequent scheduler relaunches, we must search for pods using the queued_by_job_id and not external_executor_id, as we are currently doing:

scheduler_job_ids = [ti.external_executor_id for ti in tis]

external_executor_id is static and never appears to be updated when tasks are adopted.

This relates to #13808

@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes provider related issues area:Scheduler including HA (high availability) scheduler labels Mar 15, 2021
@ashb
Copy link
Member

ashb commented Mar 15, 2021

Paging @dimberman

@samwedge
Copy link
Contributor

It would be great to get a steer on whether this change is acceptable/safe, even if the PR isn't approved at this stage.

@eejbyfeldt
Copy link
Contributor

The celery_executor also uses external_executor_id in its implementation of try_adopt_task_instances:

if ti.external_executor_id is not None:
celery_tasks[ti.external_executor_id] = (AsyncResult(ti.external_executor_id), ti)

Does it also experience these problems?

@ashb
Copy link
Member

ashb commented Mar 17, 2021

The usage in Celery is correct. ti.external_exeuctor_id there is the Celery Task ID (a UUID) and it's how we keep track of what the celery task ID is.

ashb
ashb previously requested changes Mar 17, 2021
Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I understand the problem, but this doesn't quite feel like the right fix. I'm digging in a problem to look at what a better fix might be.

Additionally you will need to add unit tests the cover this case please.

@samwedge
Copy link
Contributor

samwedge commented Mar 19, 2021

Hi @ashb,

Not sure we want to spend time adding unit tests if you don't feel this is the right fix.

Do we think a better solution would be to merge queued_by_job_id and external_executor_id? As far as I can see, they serve the same purpose.

@ashb
Copy link
Member

ashb commented Mar 19, 2021

Hi @samwedge There could be value in the unit tests anyway, as they should show the problem is fixed (i.e. that the task is not adopted tiwce/orphaned) without getting in to implementation details.

But it's also cool if you want to hold off on doing this for now.

@samwedge
Copy link
Contributor

samwedge commented Mar 19, 2021

Thanks. I agree if we can get a nice generic test in place that doesn't have any implementation detail, then it's worth adding in.

That said, I've been looking at where to make the change. My first thought was in test_kubernetes_executor.py as this is the executor we have changed. But really, the bug only shows itself when calling SchedulerJob.adopt_or_reset_orphaned_tasks(). Adding a test here (in test_scheduler_job.py) means setting up a KubernetesExecutor with a lot of mocking/patching. And I'm not sure this is the best place for it.

Alternatively, do you think this is a candidate for a system test?

@samwedge
Copy link
Contributor

@ashb Just wondered if you had any further thoughts on my previous message. In particular, where the unit test might live. It will need to test the integration between SchedulerJob and the KubernetesExecutor. I don't want to muddy the existing test files, which seem to test each in isolation.

@dimberman
Copy link
Contributor

@samwedge are you able to reproduce the error in a system test? I think if you can make it fail then a system test should be sufficient here.

@samwedge
Copy link
Contributor

samwedge commented Apr 6, 2021

@samwedge are you able to reproduce the error in a system test? I think if you can make it fail then a system test should be sufficient here.

Sorry for the silence @dimberman, trying to find some time to work on this. I'm fine with the unit and integration tests, but have never run the system tests before. I'll take a look and drop a message on Slack if I have any issues.

@ashb
Copy link
Member

ashb commented Apr 22, 2021

I was so wrong on this. I didn't realise that we are already re-setting ti.queued_by_job_id on adoption.

(Given I wrote that code I probably should do. But, well, 2020 was looooong.)

@kaxil
Copy link
Member

kaxil commented Apr 23, 2021

@samwedge @atrbgithub Can we merge autotraderuk#2 (from @jedcunningham ) in your branch and rebase the PR on master please so that we can merge this PR

kaxil pushed a commit to astronomer/airflow that referenced this pull request Apr 23, 2021
apache#14795

Ensure that we use ti.queued_by_job_id when searching for pods. The queued_by_job_id is used by
adopt_launched_task when updating the labels. Without this, after restarting the scheduler
a third time, the scheduler does not find the pods as it is still searching for the id of
the original scheduler (ti.external_executor_id)

Co-Authored-By: samwedge <[email protected]>
Co-Authored-By: philip-hope <[email protected]>
Co-Authored-By: Jed Cunningham <[email protected]>
atrbgithub and others added 3 commits April 26, 2021 09:17
Ensure that we use ti.queued_by_job_id when searching for pods. The queued_by_job_id is used by
adopt_launched_task when updating the labels. Without this, after restarting the scheduler
a third time, the scheduler does not find the pods as it is still searching for the id of
the original scheduler (ti.external_executor_id)

Co-Authored-By: samwedge <[email protected]>
Co-Authored-By: philip-hope <[email protected]>
@atrbgithub atrbgithub force-pushed the fix-incorrectly-orphaned-tasks branch from d3059bd to d5493df Compare April 26, 2021 08:34
@kaxil kaxil linked an issue Apr 26, 2021 that may be closed by this pull request
@kaxil kaxil requested a review from ashb April 26, 2021 11:08
@kaxil kaxil dismissed ashb’s stale review April 26, 2021 21:18

Stale review

@kaxil kaxil merged commit 344e829 into apache:master Apr 26, 2021
@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Apr 26, 2021
@github-actions
Copy link

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@potiuk potiuk added this to the Airflow 2.0.3 milestone May 9, 2021
potiuk pushed a commit that referenced this pull request May 9, 2021
Ensure that we use ti.queued_by_job_id when searching for pods. The queued_by_job_id is used by
adopt_launched_task when updating the labels. Without this, after restarting the scheduler
a third time, the scheduler does not find the pods as it is still searching for the id of
the original scheduler (ti.external_executor_id)

Co-Authored-By: samwedge <[email protected]>
Co-Authored-By: philip-hope <[email protected]>
Co-authored-by: Jed Cunningham <[email protected]>
(cherry picked from commit 344e829)
@ashb ashb modified the milestones: Airflow 2.0.3, Airflow 2.1 May 18, 2021
@samwedge samwedge deleted the fix-incorrectly-orphaned-tasks branch June 18, 2021 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler full tests needed We need to run full set of tests for this PR to merge provider:cncf-kubernetes Kubernetes provider related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Task incorrectly marked as orphaned when using 2 schedulers
8 participants