
Adding retries when starting kubernetes pods #15137

Merged

Conversation


@SamWheating SamWheating commented Apr 1, 2021

closes: #15097

I'll rebase + move this into the CNCF provider package once #15165 is merged.

The Kubernetes API server relies on optimistic concurrency for simultaneous API requests, so 409 errors are to be expected and should be handled by the application (there's a good explanation here under "Optimistically concurrent updates"). This PR adds retries to handle Kubernetes API exceptions while trying to start a pod.

I've opted for a random exponential backoff since the issues we've been seeing have been the result of too many simultaneous requests, so using a retry without jitter could lead to just repeating the same race conditions over and over.
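To illustrate that point, here's a toy simulation (not part of the PR — the client count and delay values are made up): with a deterministic backoff, every conflicting client retries at the same instant, while jitter spreads the retries out.

```python
import random

# Toy simulation: 5 clients all hit a 409 and compute their first retry delay.
clients = 5
fixed = {2 ** 1 for _ in range(clients)}                        # deterministic exponential backoff
jittered = {random.uniform(0, 2 ** 1) for _ in range(clients)}  # random exponential backoff

print(len(fixed))     # 1: every client retries at the same moment and can collide again
print(len(jittered))  # almost certainly 5: the retries are spread out
```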

This will only retry in the event of HTTP 409 responses from the Kubernetes API, but it can be easily extended if there are more exceptions that should be handled.
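A minimal stdlib-only sketch of that behaviour (the PR itself uses tenacity; `ApiException` here is a stand-in for `kubernetes.client.rest.ApiException`, and the helper names are hypothetical):

```python
import random
import time


class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def run_with_conflict_retries(func, max_attempts=3, base_delay=0.01):
    """Retry func on 409 Conflict with random exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ApiException as exc:
            # Only 409s are retried; anything else propagates immediately.
            if exc.status != 409 or attempt == max_attempts:
                raise
            # Jittered backoff: a random delay in [0, base_delay * 2**attempt).
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))


calls = []

def flaky_create_pod():
    calls.append(1)
    if len(calls) < 3:
        raise ApiException(409)  # simulated optimistic-concurrency conflict
    return "pod created"

print(run_with_conflict_retries(flaky_create_pod))  # succeeds on the third attempt
```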

I'll try to run some experiments this morning to replicate the original issue and confirm that this fixes things.

Experiments / Validation:

(All run on Airflow 2.0.1, Kubernetes version 1.18.15-gke.1500)

I wrote a simple DAG to launch 50 pods at the same time, and let it run for a few hours:

from datetime import timedelta

from airflow import models
from airflow.utils.dates import days_ago
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

NUM_TASKS = 50

dag = models.DAG(
    'data-infrastructure-examples.launch-many-pods',
    start_date=days_ago(1),
    max_active_runs=1,
    dagrun_timeout=timedelta(minutes=10),
    schedule_interval='*/5 * * * *',
    concurrency=NUM_TASKS,
)

for i in range(NUM_TASKS):
    task = KubernetesPodOperator(
        namespace='data-infrastructure-examples',
        image="busybox",
        name="hello-world-pod",
        in_cluster=True,
        task_id=f"TASK_{i}",
        get_logs=True,
        dag=dag,
        arguments=['sleep', '60'],
        is_delete_operator_pod=True,
    )

I let this run for a few hours and was able to replicate the sporadic failures due to 409s observed in the original issue:

(screenshot: task instance grid showing sporadic failures due to 409 errors)

At the same time I ran a different version with the fixes from this PR applied to the pod_launcher. No failures were observed out of the 1000+ containers launched:

(screenshot: task instance grid with the fix applied, all tasks successful)

@boring-cyborg boring-cyborg bot added the provider:cncf-kubernetes Kubernetes provider related issues label Apr 1, 2021
@SamWheating
Contributor Author

Since this requires changes to the pod_launcher and not the CNCF provider package, would it be possible to get this into the 2.0.2 release @ashb ?

stop=tenacity.stop_after_attempt(3),
wait=tenacity.wait_random_exponential(),
reraise=True,
retry=tenacity.retry_if_exception_type(ApiException),
Member

What does ApiException cover? Would this also be the same error if you send an invalid payload?

Contributor Author

I believe so, and I think it would also cover things like errors from invalid credentials, in which case there's definitely no point in retrying.

With this in mind, should we scope this retry down to only cover 409 errors?

Member

I think so, yeah probably.

Contributor

@ashb @SamWheating this might be a good reason to move the pod_launching code into the cncf.kubernetes package. I don't think K8sPodOperators should require dependencies on Airflow for these kinds of fixes.

Contributor Author

@SamWheating SamWheating Apr 1, 2021

Ok, I just pushed a commit to only retry on 409 ApiExceptions, let me know what you think.
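The shape of that change can be sketched as follows (an illustration, not the PR's exact diff; `ApiException` stands in for `kubernetes.client.rest.ApiException` and the predicate name is hypothetical): replace the broad `retry_if_exception_type` with a predicate that also checks the status code.

```python
class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def should_retry_start_pod(exception):
    """Return True only for 409 Conflict responses from the Kubernetes API server."""
    return isinstance(exception, ApiException) and exception.status == 409


# With tenacity, this plugs in as:
#   retry=tenacity.retry_if_exception(should_retry_start_pod)

print(should_retry_start_pod(ApiException(409)))           # True
print(should_retry_start_pod(ApiException(500)))           # False
print(should_retry_start_pod(ValueError("bad payload")))   # False
```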

Re: the location of this code - it looks like the pod launching code is also used by the KubernetesExecutor, so if you wanted to move the pod_launcher to the CNCF provider package you would likely need a copy within the Airflow package as well?

Contributor

I think there are functions that are only used by the KubernetesPodOperator and others only used by the KubernetesExecutor, so it wouldn't be TOO bad.

Contributor Author

Yeah, I just looked a little more and it appears you're right. It's probably not too big a change, then, to move a subset of the pod_launcher file into the cncf provider package and update some imports, and I'd definitely be interested in helping with / taking that on. Would you like to write up an issue for that, or shall I?

Regarding the issue in question, would you be OK with reviewing/merging this PR in the meantime? A larger refactor might take a while to get properly reviewed and this issue is causing a lot of failures on our end.

Contributor

@SamWheating I've created an issue here #15164. Please comment on it and I will assign it to you :).

I'd say you should attack this quickly, as otherwise we won't be able to release this fix until the next Airflow release (and it will require an Airflow upgrade).

That said, I'm glad to make this PR a high priority, so once it's ready I can be fast with PR reviews to get it through sooner than later.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Apr 3, 2021
@github-actions

github-actions bot commented Apr 3, 2021

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

dimberman added a commit to astronomer/airflow that referenced this pull request Apr 5, 2021
Currently, the KubernetesPodOperator uses the pod_launcher class in airflow core. This means that if we need to fix a bug in the KubernetesPodOperator such as apache#15137 then the new cncf.kubernetes package will require an Airflow upgrade. Since we hope to release providers in a much faster cadence than Airflow core releases, we should separate this dependency.
dimberman added a commit that referenced this pull request Apr 5, 2021
* Separate pod_launcher from core airflow

Currently, the KubernetesPodOperator uses the pod_launcher class in airflow core. This means that if we need to fix a bug in the KubernetesPodOperator such as #15137 then the new cncf.kubernetes package will require an Airflow upgrade. Since we hope to release providers in a much faster cadence than Airflow core releases, we should separate this dependency.

* fix podlauncher

* remove warnings from pod_launcher

* fix tests

* add deprecated class

* fix

* fix import

* one more nit

* fix docs

* fix docs again
@SamWheating SamWheating force-pushed the sw-backoff-on-409-error-k8s-pod-operator branch from 9d27b0c to fb5f077 on April 6, 2021 20:24
@SamWheating
Contributor Author

@dimberman - thanks for getting that refactor in so quickly 🎉

I've rebased on master and moved my fix into the CNCF provider so this should be ready to re-review.

@potiuk
Member

potiuk commented Apr 6, 2021

should we merge that one and include it in the provider release :P ?

Labels
full tests needed We need to run full set of tests for this PR to merge provider:cncf-kubernetes Kubernetes provider related issues
Development

Successfully merging this pull request may close these issues.

Errors when launching many pods simultaneously on GKE
5 participants