Adding retries when starting kubernetes pods #15137
Conversation
Since this requires changes to the pod_launcher and not the CNCF provider package, would it be possible to get this into the 2.0.2 release @ashb?
airflow/kubernetes/pod_launcher.py (Outdated)

    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_random_exponential(),
    reraise=True,
    retry=tenacity.retry_if_exception_type(ApiException),
What does ApiException cover? Would you also get the same error if you sent an invalid payload?
I believe so. I think it would also cover errors like invalid credentials (4xx), in which case there definitely isn't any point in retrying.
With this in mind, should we scope this retry down to only cover 409 errors?
I think so, yeah probably.
@ashb @SamWheating this might be a good reason to move the pod_launching code into the cncf.kubernetes package. I don't think K8sPodOperators should require dependencies on Airflow for these kinds of fixes.
Ok, I just pushed a commit to only retry on 409 ApiExceptions, let me know what you think.
Re: the location of this code - it looks like the pod launching code is also used by the KubernetesExecutor, so if you wanted to move the pod_launcher to the CNCF provider package you would likely need a copy within the Airflow package as well?
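The 409-only scoping discussed above can be sketched with a retry predicate plus jittered backoff. This is a hedged, self-contained sketch rather than the actual Airflow code: `ApiException` here is a stand-in for `kubernetes.client.rest.ApiException` (which carries an HTTP `status` attribute), and the loop mirrors what `tenacity.retry(...)` with `retry_if_exception` and `wait_random_exponential` would do:

```python
import random
import time


class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException, which carries an HTTP status."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def should_retry_start_pod(exc):
    """Retry only on 409 Conflict; auth errors or bad payloads won't heal on retry."""
    return isinstance(exc, ApiException) and exc.status == 409


def run_pod_with_retries(start_pod, attempts=3, max_wait=10.0):
    """Call start_pod(), retrying 409s with random exponential backoff, reraising otherwise."""
    for attempt in range(attempts):
        try:
            return start_pod()
        except Exception as exc:
            if attempt == attempts - 1 or not should_retry_start_pod(exc):
                raise
            # Jittered exponential backoff, like tenacity.wait_random_exponential()
            time.sleep(random.uniform(0, min(max_wait, 2 ** attempt)))
```

A 401 (or any non-409) is reraised immediately, while a transient 409 is retried up to the attempt limit.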
I think there are functions that are only used by the KubernetesPodOperator and others only used by the KubernetesExecutor, so it wouldn't be TOO bad.
Yeah, I just looked a little more and it appears that you're right. It's probably not too big of a change then to move a subset of the pod_launcher file into the cncf provider package and update some imports and such, and I'd definitely be interested in helping with that / taking that on. Would you like to write up an issue for that, or shall I?
Regarding the issue in question, would you be OK with reviewing/merging this PR in the meantime? A larger refactor might take a while to get properly reviewed and this issue is causing a lot of failures on our end.
@SamWheating I've created an issue here #15164. Please comment on it and I will assign it to you :).
I'd say you should tackle that one quickly, as otherwise we won't be able to release this fix until the next Airflow release (and it will require an Airflow upgrade).
That said, I'm glad to make this PR a high priority, so once it's ready I can be fast with PR reviews to get it through sooner than later.
The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
Currently, the KubernetesPodOperator uses the pod_launcher class in airflow core. This means that if we need to fix a bug in the KubernetesPodOperator such as apache#15137 then the new cncf.kubernetes package will require an Airflow upgrade. Since we hope to release providers in a much faster cadence than Airflow core releases, we should separate this dependency.
* Separate pod_launcher from core airflow

  Currently, the KubernetesPodOperator uses the pod_launcher class in airflow core. This means that if we need to fix a bug in the KubernetesPodOperator such as #15137 then the new cncf.kubernetes package will require an Airflow upgrade. Since we hope to release providers in a much faster cadence than Airflow core releases, we should separate this dependency.

* fix podlauncher
* remove warnings from pod_launcher
* fix tests
* add deprecated class
* fix
* fix import
* one more nit
* fix docs
* fix docs again
Force-pushed 9d27b0c to fb5f077
@dimberman - thanks for getting that refactor in so quickly 🎉 I've rebased on master and moved my fix into the CNCF provider so this should be ready to re-review.
Should we merge that one and include it in the provider release :P?
closes: #15097
I'll rebase + move this into the CNCF provider package once #15165 is merged.
The Kubernetes API server relies on optimistic concurrency for simultaneous API requests, so 409 errors are to be expected and should be handled by the application (there's a good explanation here under "Optimistically concurrent updates"). This PR adds retries to handle Kubernetes API exceptions while trying to start a pod.
I've opted for a random exponential backoff since the issues we've been seeing have been the result of too many simultaneous requests, so using a retry without jitter could lead to just repeating the same race conditions over and over.
This will only retry in the event of HTTP 409 responses from the Kubernetes API, but can be easily extended if there are more exceptions that should be handled.
I'll try to run some experiments this morning to replicate the original issue and confirm that this fixes things.
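The jitter argument above can be made concrete with a small sketch of the wait computation. This mirrors what `tenacity.wait_random_exponential()` does (the `multiplier` and `max_wait` defaults here are assumptions for illustration, not tenacity's exact defaults): the upper bound grows exponentially with the attempt number, but the actual wait is drawn uniformly from `[0, upper]`, so two clients that conflicted at the same instant almost surely retry at different times instead of racing again.

```python
import random


def random_exponential_wait(attempt, multiplier=1.0, max_wait=60.0):
    """Return a random wait in [0, min(max_wait, multiplier * 2**attempt)].

    With a fixed (jitter-free) backoff, clients that collided once would all
    retry on the same schedule and collide again; the uniform draw spreads
    their retries out over the growing window.
    """
    upper = min(max_wait, multiplier * (2 ** attempt))
    return random.uniform(0, upper)
```

For example, on attempt 3 the wait is drawn from `[0, 8]` seconds, and once `2**attempt` exceeds `max_wait` the window stops growing at 60 seconds.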
Experiments / Validation:
(All run on Airflow 2.0.1, Kubernetes version 1.18.15-gke.1500)
I wrote a simple DAG to launch 50 pods at the same time:
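(The original DAG isn't reproduced in this thread; the following is a hypothetical equivalent. The DAG id, task names, image, and schedule are all assumptions for illustration, using the Airflow 2.0-era `cncf.kubernetes` provider import path.)

```python
# Hypothetical stress-test DAG: 50 KubernetesPodOperator tasks launched at once.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="pod_launch_stress_test",   # assumed name
    start_date=datetime(2021, 4, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    for i in range(50):
        # All 50 tasks have no upstream dependencies, so the scheduler
        # starts the pods at roughly the same time, reproducing the
        # concurrent-request load that triggers 409s.
        KubernetesPodOperator(
            task_id=f"launch_pod_{i}",
            name=f"stress-test-pod-{i}",
            namespace="default",
            image="busybox",               # assumed image
            cmds=["sh", "-c", "sleep 30"],
        )
```

This is DAG configuration rather than a standalone script; it only runs inside an Airflow deployment with the `cncf.kubernetes` provider installed.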
I let this run for a few hours and was able to replicate the sporadic failures due to 409s observed in the original issue:
At the same time I ran a different version with the fixes from this PR applied to the pod_launcher. No failures were observed out of the 1000+ containers launched: