
Adding retries when starting kubernetes pods #15137

Merged

Conversation


@SamWheating SamWheating commented Apr 1, 2021

closes: #15097

I'll rebase + move this into the CNCF provider package once #15165 is merged.

The Kubernetes API server relies on optimistic concurrency for simultaneous API requests, so 409 errors are to be expected and should be handled by the application (there's a good explanation here under "Optimistically concurrent updates"). This PR adds retries to handle Kubernetes API exceptions while trying to start a pod.

I've opted for a random exponential backoff since the issues we've been seeing have been the result of too many simultaneous requests, so using a retry without jitter could lead to just repeating the same race conditions over and over.
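To illustrate that point, here's a toy simulation (not part of the PR — the client count and delay values are made up): with a deterministic backoff, every conflicting client retries at the same instant, while jitter spreads the retries out.

```python
import random

# Toy simulation: 5 clients all hit a 409 and compute their first retry delay.
clients = 5
fixed = {2 ** 1 for _ in range(clients)}                        # deterministic exponential backoff
jittered = {random.uniform(0, 2 ** 1) for _ in range(clients)}  # random exponential backoff

print(len(fixed))     # 1: every client retries at the same moment and can collide again
print(len(jittered))  # almost certainly 5: the retries are spread out
```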

This will only retry in the event of HTTP 409 responses from the Kubernetes API, but it can be easily extended if there are more exceptions that should be handled.
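A minimal stdlib-only sketch of that behaviour (the PR itself uses tenacity; `ApiException` here is a stand-in for `kubernetes.client.rest.ApiException`, and the helper names are hypothetical):

```python
import random
import time


class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def run_with_conflict_retries(func, max_attempts=3, base_delay=0.01):
    """Retry func on 409 Conflict with random exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ApiException as exc:
            # Only 409s are retried; anything else propagates immediately.
            if exc.status != 409 or attempt == max_attempts:
                raise
            # Jittered backoff: a random delay in [0, base_delay * 2**attempt).
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))


calls = []

def flaky_create_pod():
    calls.append(1)
    if len(calls) < 3:
        raise ApiException(409)  # simulated optimistic-concurrency conflict
    return "pod created"

print(run_with_conflict_retries(flaky_create_pod))  # succeeds on the third attempt
```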

I'll try to run some experiments this morning to replicate the original issue and confirm that this fixes things.

Experiments / Validation:

(All run on Airflow 2.0.1, Kubernetes version 1.18.15-gke.1500)

I wrote a simple DAG to launch 50 pods at the same time, and let it run for a few hours:

from datetime import timedelta

from airflow import models
from airflow.utils.dates import days_ago
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

NUM_TASKS = 50

dag = models.DAG(
    'data-infrastructure-examples.launch-many-pods',
    start_date=days_ago(1),
    max_active_runs=1,
    dagrun_timeout=timedelta(minutes=10),
    schedule_interval='*/5 * * * *',
    concurrency=NUM_TASKS,
)

for i in range(NUM_TASKS):
    task = KubernetesPodOperator(
        namespace='data-infrastructure-examples',
        image="busybox",
        name="hello-world-pod",
        in_cluster=True,
        task_id=f"TASK_{i}",
        get_logs=True,
        dag=dag,
        arguments=['sleep', '60'],
        is_delete_operator_pod=True,
    )

I let this run for a few hours and was able to replicate the sporadic failures due to 409s observed in the original issue:

(screenshot: task instance grid showing sporadic failures due to 409 errors)

At the same time I ran a different version with the fixes from this PR applied to the pod_launcher. No failures were observed out of the 1000+ containers launched:

(screenshot: task instance grid with the fix applied, all tasks successful)

@boring-cyborg boring-cyborg bot added the provider:cncf-kubernetes Kubernetes provider related issues label Apr 1, 2021
@SamWheating
Contributor Author

Since this requires changes to the pod_launcher and not the CNCF provider package, would it be possible to get this into the 2.0.2 release @ashb ?

stop=tenacity.stop_after_attempt(3),
wait=tenacity.wait_random_exponential(),
reraise=True,
retry=tenacity.retry_if_exception_type(ApiException),
Member

What does ApiException cover? Would this also be the same error if you send an invalid payload?

Contributor Author

I believe so, and I think it would also cover things like errors from invalid credentials, in which case there's definitely no point in retrying.

With this in mind, should we scope this retry down to only cover 409 errors?

Member

I think so, yeah probably.

Contributor

@ashb @SamWheating this might be a good reason to move the pod_launching code into the cncf.kubernetes package. I don't think K8sPodOperators should require dependencies on Airflow for these kinds of fixes.

Contributor Author

@SamWheating SamWheating Apr 1, 2021

Ok, I just pushed a commit to only retry on 409 ApiExceptions, let me know what you think.
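The shape of that change can be sketched as follows (an illustration, not the PR's exact diff; `ApiException` stands in for `kubernetes.client.rest.ApiException` and the predicate name is hypothetical): replace the broad `retry_if_exception_type` with a predicate that also checks the status code.

```python
class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def should_retry_start_pod(exception):
    """Return True only for 409 Conflict responses from the Kubernetes API server."""
    return isinstance(exception, ApiException) and exception.status == 409


# With tenacity, this plugs in as:
#   retry=tenacity.retry_if_exception(should_retry_start_pod)

print(should_retry_start_pod(ApiException(409)))           # True
print(should_retry_start_pod(ApiException(500)))           # False
print(should_retry_start_pod(ValueError("bad payload")))   # False
```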

Re: the location of this code - it looks like the pod launching code is also used by the KubernetesExecutor, so if you wanted to move the pod_launcher to the CNCF provider package you would likely need a copy within the Airflow package as well?

Contributor

I think there are functions that are only used by the KubernetesPodOperator and others only used by the KubernetesExecutor, so it wouldn't be TOO bad.

Contributor Author

Yeah, I just looked a little more and it appears you're right. It's probably not too big a change, then, to move a subset of the pod_launcher file into the cncf provider package and update some imports, and I'd definitely be interested in helping with / taking that on. Would you like to write up an issue for that, or shall I?

Regarding the issue in question, would you be OK with reviewing/merging this PR in the meantime? A larger refactor might take a while to get properly reviewed and this issue is causing a lot of failures on our end.

Contributor

@SamWheating I've created an issue here #15164. Please comment on it and I will assign it to you :).

I'd say you should attack this quickly, as otherwise we won't be able to release this fix until the next Airflow release (and it will require an Airflow upgrade).

That said, I'm glad to make this PR a high priority, so once it's ready I can be fast with PR reviews to get it through sooner than later.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Apr 3, 2021
@github-actions

github-actions bot commented Apr 3, 2021

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

dimberman added a commit to astronomer/airflow that referenced this pull request Apr 5, 2021
Currently, the KubernetesPodOperator uses the pod_launcher class in airflow core. This means that if we need to fix a bug in the KubernetesPodOperator such as apache#15137 then the new cncf.kubernetes package will require an Airflow upgrade. Since we hope to release providers in a much faster cadence than Airflow core releases, we should separate this dependency.
dimberman added a commit that referenced this pull request Apr 5, 2021
* Separate pod_launcher from core airflow

Currently, the KubernetesPodOperator uses the pod_launcher class in airflow core. This means that if we need to fix a bug in the KubernetesPodOperator such as #15137 then the new cncf.kubernetes package will require an Airflow upgrade. Since we hope to release providers in a much faster cadence than Airflow core releases, we should separate this dependency.

* fix podlauncher

* remove warnings from pod_launcher

* fix tests

* add deprecated class

* fix

* fix import

* one more nit

* fix docs

* fix docs again
@SamWheating SamWheating force-pushed the sw-backoff-on-409-error-k8s-pod-operator branch from 9d27b0c to fb5f077 on April 6, 2021 20:24
@SamWheating
Contributor Author

@dimberman - thanks for getting that refactor in so quickly 🎉

I've rebased on master and moved my fix into the CNCF provider so this should be ready to re-review.

@potiuk
Member

potiuk commented Apr 6, 2021

should we merge that one and include it in the provider release :P ?

Labels
full tests needed We need to run full set of tests for this PR to merge provider:cncf-kubernetes Kubernetes provider related issues
Development

Successfully merging this pull request may close these issues.

Errors when launching many pods simultaneously on GKE
5 participants