Ensure we use ti.queued_by_job_id #14795

atrbgithub · 2021-03-15T09:30:26Z

Ensure that we use ti.queued_by_job_id when searching for pods.

When a task is adopted by a new scheduler, the pod's airflow-worker label is set to the id of the current scheduler job:

airflow/airflow/executors/kubernetes_executor.py

Lines 620 to 623 in feb6b81

    
    self.log.info("attempting to adopt pod %s", pod.metadata.name)
    pod.metadata.labels['airflow-worker'] = pod_generator.make_safe_label_value(
        str(self.scheduler_job_id)
    )

When this is successful, the task instance is updated and its queued_by_job_id is set to the id of the current scheduler:

airflow/airflow/jobs/scheduler_job.py

Lines 1877 to 1878 in feb6b81

    
    for ti in set(tis_to_reset_or_adopt) - set(to_reset):
        ti.queued_by_job_id = self.id

Therefore, on subsequent scheduler relaunches we must search for pods by label using queued_by_job_id, and not external_executor_id as we are currently doing:

airflow/airflow/executors/kubernetes_executor.py

Line 592 in feb6b81

    scheduler_job_ids = [ti.external_executor_id for ti in tis]

external_executor_id is static and never appears to be updated when tasks are adopted.
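The effect can be illustrated with a small, self-contained simulation. This is only a sketch: `TaskInstance`, `Pod`, `adopt` and `simulate` are hypothetical stand-ins, not Airflow classes, and it assumes that external_executor_id keeps the id of the scheduler that originally launched the task while the pod label and queued_by_job_id are rewritten on each successful adoption.

```python
from dataclasses import dataclass


@dataclass
class TaskInstance:
    """Hypothetical stand-in for a task instance; only the two ids matter here."""
    external_executor_id: str  # assumed to stay fixed after the first launch
    queued_by_job_id: int      # rewritten by the scheduler on each adoption


@dataclass
class Pod:
    """Hypothetical stand-in for a worker pod; worker_label mimics 'airflow-worker'."""
    worker_label: str


def adopt(pod, ti, new_scheduler_job_id, lookup_field):
    """Search for the pod by the id stored in lookup_field and adopt it if found."""
    searched_id = str(getattr(ti, lookup_field))
    if pod.worker_label != searched_id:
        return False  # pod not found under that label, so the task looks orphaned
    # Successful adoption: relabel the pod and re-stamp the task instance.
    pod.worker_label = str(new_scheduler_job_id)
    ti.queued_by_job_id = new_scheduler_job_id
    return True


def simulate(lookup_field):
    """Launch under scheduler 1, then let schedulers 2 and 3 try to adopt in turn."""
    ti = TaskInstance(external_executor_id="1", queued_by_job_id=1)
    pod = Pod(worker_label="1")
    return [adopt(pod, ti, new_id, lookup_field) for new_id in (2, 3)]


print(simulate("external_executor_id"))  # [True, False]
print(simulate("queued_by_job_id"))      # [True, True]
```

Under these assumptions the first adoption succeeds either way, but the second one only succeeds when the search uses queued_by_job_id, because the pod label no longer carries the original scheduler's id.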

This relates to #13808

ashb · 2021-03-15T14:57:45Z

Paging @dimberman

samwedge · 2021-03-16T16:01:29Z

It would be great to get a steer on whether this change is acceptable/safe, even if the PR isn't approved at this stage.

eejbyfeldt · 2021-03-17T15:24:09Z

The celery_executor also uses external_executor_id in its implementation of try_adopt_task_instances:

airflow/airflow/executors/celery_executor.py

Lines 469 to 470 in 2a2adb3

    
           if ti.external_executor_id is not None: 
        
               celery_tasks[ti.external_executor_id] = (AsyncResult(ti.external_executor_id), ti)

Does it also experience these problems?

ashb · 2021-03-17T16:37:47Z

The usage in Celery is correct. ti.external_executor_id there is the Celery Task ID (a UUID) and it's how we keep track of what the Celery task ID is.

ashb · review (changes requested) · 2021-03-17

Okay, I understand the problem, but this doesn't quite feel like the right fix. I'm digging into the problem to see what a better fix might be.

Additionally, you will need to add unit tests that cover this case, please.

samwedge · 2021-03-19T10:08:45Z

Hi @ashb,

Not sure we want to spend time adding unit tests if you don't feel this is the right fix.

Do we think a better solution would be to merge queued_by_job_id and external_executor_id? As far as I can see, they serve the same purpose.

ashb · 2021-03-19T10:58:01Z

Hi @samwedge, there could be value in the unit tests anyway, as they should show the problem is fixed (i.e. that the task is not adopted twice/orphaned) without getting into implementation details.

But it's also cool if you want to hold off on doing this for now.

samwedge · 2021-03-19T15:55:06Z

Thanks. I agree if we can get a nice generic test in place that doesn't have any implementation detail, then it's worth adding in.

That said, I've been looking at where to make the change. My first thought was in test_kubernetes_executor.py as this is the executor we have changed. But really, the bug only shows itself when calling SchedulerJob.adopt_or_reset_orphaned_tasks(). Adding a test here (in test_scheduler_job.py) means setting up a KubernetesExecutor with a lot of mocking/patching. And I'm not sure this is the best place for it.

Alternatively, do you think this is a candidate for a system test?
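For illustration, a behaviour-level test along those lines might look roughly like the sketch below. Everything in it is hypothetical: `FakeExecutor` and `adopt_or_reset` are simplified stand-ins written for this example, not the real KubernetesExecutor or SchedulerJob.adopt_or_reset_orphaned_tasks(), and task instances are modelled as plain dicts. It only asserts the observable outcome (the task is never reset across successive schedulers), not how the lookup is implemented.

```python
import unittest


class FakeExecutor:
    """Hypothetical stand-in for an executor that tracks pods by worker label."""

    def __init__(self, job_id, pod_labels):
        self.job_id = job_id
        self.pod_labels = pod_labels  # shared dict: task key -> label value

    def try_adopt_task_instances(self, tis):
        """Relabel the pods that can be found; return the tasks that could not be adopted."""
        not_adopted = []
        for ti in tis:
            if self.pod_labels.get(ti["key"]) == str(ti["queued_by_job_id"]):
                self.pod_labels[ti["key"]] = str(self.job_id)
            else:
                not_adopted.append(ti)
        return not_adopted


def adopt_or_reset(job_id, executor, tis):
    """Simplified scheduler step: re-stamp adopted tasks, reset the rest."""
    to_reset = executor.try_adopt_task_instances(tis)
    reset_keys = {ti["key"] for ti in to_reset}
    for ti in tis:
        if ti["key"] in reset_keys:
            ti["state"] = None
        else:
            ti["queued_by_job_id"] = job_id
    return len(to_reset)


class TestRepeatedAdoption(unittest.TestCase):
    def test_task_survives_successive_scheduler_restarts(self):
        pod_labels = {"dag.task": "1"}
        tis = [{"key": "dag.task", "queued_by_job_id": 1, "state": "queued"}]
        # Three new schedulers take over one after another.
        for job_id in (2, 3, 4):
            executor = FakeExecutor(job_id, pod_labels)
            self.assertEqual(adopt_or_reset(job_id, executor, tis), 0)
        self.assertEqual(tis[0]["state"], "queued")
        self.assertEqual(pod_labels["dag.task"], "4")
```

The sketch can be run with `python -m unittest` against the file that contains it. A real test would still need to wire an actual executor into the scheduler job, which is where the mocking cost mentioned above comes in.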

samwedge · 2021-03-29T11:24:25Z

@ashb Just wondered if you had any further thoughts on my previous message. In particular, where the unit test might live. It will need to test the integration between SchedulerJob and the KubernetesExecutor. I don't want to muddy the existing test files, which seem to test each in isolation.

dimberman · 2021-03-29T18:56:09Z

@samwedge are you able to reproduce the error in a system test? I think if you can make it fail then a system test should be sufficient here.

samwedge · 2021-04-06T12:41:45Z

> @samwedge are you able to reproduce the error in a system test? I think if you can make it fail then a system test should be sufficient here.

Sorry for the silence @dimberman, trying to find some time to work on this. I'm fine with the unit and integration tests, but have never run the system tests before. I'll take a look and drop a message on Slack if I have any issues.

ashb · 2021-04-22T16:41:56Z

I was so wrong on this. I didn't realise that we are already re-setting ti.queued_by_job_id on adoption.

(Given I wrote that code I probably should do. But, well, 2020 was looooong.)

kaxil · 2021-04-23T21:43:35Z

@samwedge @atrbgithub Can we merge autotraderuk#2 (from @jedcunningham) into your branch and rebase the PR on master, please, so that we can merge this PR?

apache#14795 Ensure that we use ti.queued_by_job_id when searching for pods.

The queued_by_job_id is used by adopt_launched_task when updating the labels. Without this, after restarting the scheduler a third time, the scheduler does not find the pods as it is still searching for the id of the original scheduler (ti.external_executor_id)

Co-Authored-By: samwedge <[email protected]>
Co-Authored-By: philip-hope <[email protected]>
Co-Authored-By: Jed Cunningham <[email protected]>


Stale review

github-actions · 2021-04-26T21:19:03Z

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

Ensure that we use ti.queued_by_job_id when searching for pods.

The queued_by_job_id is used by adopt_launched_task when updating the labels. Without this, after restarting the scheduler a third time, the scheduler does not find the pods as it is still searching for the id of the original scheduler (ti.external_executor_id)

Co-Authored-By: samwedge <[email protected]>
Co-Authored-By: philip-hope <[email protected]>
Co-authored-by: Jed Cunningham <[email protected]>

(cherry picked from commit 344e829)

atrbgithub requested review from ashb, kaxil, turbaszek and XD-DENG as code owners March 15, 2021 09:30

boring-cyborg bot added the provider:cncf-kubernetes (Kubernetes provider related issues) and area:Scheduler (including HA (high availability) scheduler) labels Mar 15, 2021

atrbgithub mentioned this pull request Mar 15, 2021

Task incorrectly marked as orphaned when using 2 schedulers #13808

Closed

ashb previously requested changes Mar 17, 2021


jedcunningham mentioned this pull request Apr 23, 2021

Fix pod adoption tests autotraderuk/airflow#2

Merged

atrbgithub and others added 3 commits April 26, 2021 09:17

Only try to adopt from a scheduler_job_id once

838ecb4

Add test coverage for KubernetesExecutor.test_try_adopt_task_instances

d5493df

atrbgithub force-pushed the fix-incorrectly-orphaned-tasks branch from d3059bd to d5493df Compare April 26, 2021 08:34

kaxil linked an issue Apr 26, 2021 that may be closed by this pull request

Task incorrectly marked as orphaned when using 2 schedulers #13808

Closed

kaxil requested a review from ashb April 26, 2021 11:08

kaxil approved these changes Apr 26, 2021


kaxil merged commit 344e829 into apache:master Apr 26, 2021

github-actions bot added the "full tests needed" label (We need to run full set of tests for this PR to merge) Apr 26, 2021

potiuk added this to the Airflow 2.0.3 milestone May 9, 2021

ashb modified the milestones: Airflow 2.0.3, Airflow 2.1 May 18, 2021

samwedge deleted the fix-incorrectly-orphaned-tasks branch June 18, 2021 21:38

bparhy mentioned this pull request Dec 10, 2021

Airflow scheduler with Kubernetes executor trying to adopt pod from other deployment #20203

Closed

2 tasks

sdseaton mentioned this pull request Aug 28, 2022

backport of fix for scheduler adoption github/incubator-airflow#67

Merged
