Fix KubernetesPodOperator reattachment #10230
Conversation
In 1.10.11 we introduced a bug where the KubernetesPodOperator was not properly reattaching to existing pods due to implementation errors. This fix allows users to control reattachment via the `reattach_on_restart` config
cc: @danccooper |
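For context, this is roughly how the flag is used on the operator (a minimal sketch; the task values here are illustrative, only `reattach_on_restart` is the point):

```python
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Illustrative sketch: with reattach_on_restart=True the operator should try to
# re-attach to an already-running pod for this task instance after a restart,
# instead of launching a duplicate pod. Set it to False to always start fresh.
k = KubernetesPodOperator(
    task_id="example-task",        # hypothetical task id
    name="example-pod",            # hypothetical pod name
    namespace="default",
    image="ubuntu:16.04",
    cmds=["bash", "-cx"],
    arguments=["sleep 100"],
    reattach_on_restart=True,
)
```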
LGTM, logic is much clearer, thank you. One thing to consider is the comment by @dakov here: #6377 (comment) Perhaps on line 280 where we check for 0 or 1 existing pod & raise otherwise, we should only raise if |
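For reference, the "0 or 1 existing pods" check being discussed would look roughly like this (a hypothetical sketch, not the actual diff; the helper name and the label-selector handling are assumptions):

```python
from kubernetes import client, config
from airflow.exceptions import AirflowException


def find_existing_pod(namespace, label_selector):
    """Hypothetical sketch: return a pod to re-attach to, or None if there is none."""
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    if len(pods.items) > 1:
        # More than one matching pod is ambiguous, so raise.
        raise AirflowException(
            "More than one pod running with labels %s" % label_selector)
    return pods.items[0] if pods.items else None
```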
Co-authored-by: Kaxil Naik <[email protected]>
Yeah that makes sense. Tbh I'll be surprised if many people turn off |
Thanks @dimberman LGTM 👍 |
(cherry picked from commit 8cd2be9)
Hi @dimberman, I was doing more airflow testing and I think this PR also addresses issue #10325 (which I was having on an older Airflow version). Which is pretty great (we had issues in production with this the other day)! Unfortunately, I can still experience issues with the KubernetesPodOperator (with the latest 1.10.12rc):
|
Hi @FloChehab, Can you please post the scheduler logs from when the task is up for retry, plus the DAG code? It seems odd that on the second restart it would come out as a success, and I just want to make sure. |
Just to be clear @FloChehab, were these issues introduced in 1.10.11 or the 1.10.12rcs? If not, we will still definitely fix it, but will continue releasing 1.10.12 |
Here you go:

Dag:

```python
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.kubernetes.secret import Secret
from airflow.models import DAG
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'Airflow',
    'start_date': days_ago(2),
    'retries': 3
}

with DAG(
    dag_id='bug_kuberntes_pod_operator',
    default_args=default_args,
    schedule_interval=None
) as dag:
    k = KubernetesPodOperator(
        namespace='dev-airflow-helm',
        image="ubuntu:16.04",
        cmds=["bash", "-cx"],
        arguments=["sleep 100"],
        name="airflow-test-pod",
        task_id="task",
        get_logs=True,
        is_delete_operator_pod=True,
    )
```

Logs during first restart (stuck in up_for_retry in DB and UI):
For some reason I didn't get the |
Sure, that's what I was thinking too. Regarding the version it was introduced in, I'd say that before 1.10.12 we had a way bigger problem, so I definitely don't see a blocker for releasing 1.10.12. |
(But I can't say if the issue was present or not in airflow >1.10.2&<1.10.12 as I haven't tested those versions) |
Same phenomenon with the LocalExecutor (I've cleaned all the Persistent Volume Claims before testing with the LocalExecutor). State in the DB just before the scheduler (after the second restart) picks up the task:

```
postgres=# select * from task_instance where state = 'up_for_retry';
-[ RECORD 1 ]---+------------------------------
task_id | task
dag_id | bug_kuberntes_pod_operator
execution_date | 2020-08-24 19:02:51.616716+00
start_date | 2020-08-24 19:02:57.048154+00
end_date | 2020-08-24 19:05:56.199493+00
duration | 179.151339
state | up_for_retry
try_number | 1
hostname |
unixname | airflow
job_id | 2
pool | default_pool
queue | celery
priority_weight | 1
operator | KubernetesPodOperator
queued_dttm | 2020-08-24 19:02:53.814477+00
pid | 628
max_tries | 3
executor_config | \x80057d942e
pool_slots | 1
```

EDIT: in the case of the LocalExecutor I am starting the webserver with the scheduler, so don't get confused by what the logs say sometimes. |
Thank you @FloChehab. I think since this feature was already broken in 1.10.11 we're not going to block the 1.10.12 release for this, though this should be a necessary fix for 1.10.13 |
👍 I have not encountered the case where 2 pods end up running the same task simultaneously while testing the latest 1.10.12rc (which can cause some real inconsistencies in our case -- we had this issue on Friday on an old 1.10.2 airflow). So no blocker for me here. I will add some comments in the issue tomorrow. |
@dimberman Do you have instructions on how to install airflow reproducibly? (so that we compare the same thing -- I am not familiar with installing airflow manually) I am using the official helm chart + LocalExecutor + latest |
@FloChehab what I did was the following:
|
So I have something a bit magical going on:
However I don't even need to restart the webserver or the scheduler:
I don't really get what is going on, nor how a pod in a remote cluster can talk to my local airflow db (which shouldn't be what is happening). I must have some airflow process in the background monitoring the pod, but I can't seem to find it... Too much weird stuff going on. |
Hmm... this might have to do with airflow leaving behind a zombie process, so it's harder to get a real interruption when running locally. Will test that now. |
So ok, funny enough, I think because we added an on_kill to the KubernetesPodOperator, it now kills the pod if the process dies. Not sure if that counts as a solution or not, gonna need to think about this. |
Oh wait it's not the on_kill. It's these lines
|
So if it receives an error from a SIGTERM, it deletes the pod because of |
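In other words, the behaviour being described is roughly this pattern (a simplified sketch with stand-in names, not the actual operator source):

```python
def run_pod(launcher, pod, is_delete_operator_pod):
    """Simplified sketch of the flow under discussion (launcher/pod are stand-ins)."""
    try:
        # A SIGTERM delivered to the local Airflow process surfaces here as an
        # exception while we are waiting on the remote pod...
        final_state = launcher.monitor_pod(pod)
    finally:
        # ...and because is_delete_operator_pod=True, the remote pod gets deleted,
        # even though it may still be doing useful work.
        if is_delete_operator_pod:
            launcher.delete_pod(pod)
    return final_state
```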
Hmm, I am not sure I would do that. I think that the life of the worker / "object" that is starting / monitoring / etc. the pod shouldn't impact the pod itself (we have use cases with very long jobs started from airflow on kubernetes, and I don't think it would play nicely with this). |
Yeah agreed. For now if you set is_delete_operator_pod to false it fixes it. |
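Using the DAG posted above, the workaround is just the one argument change (inside the same `with DAG(...)` block):

```python
    k = KubernetesPodOperator(
        namespace='dev-airflow-helm',
        image="ubuntu:16.04",
        cmds=["bash", "-cx"],
        arguments=["sleep 100"],
        name="airflow-test-pod",
        task_id="task",
        get_logs=True,
        is_delete_operator_pod=False,  # workaround: keep the pod so it can be re-attached
    )
```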
So, with |
@FloChehab what happens if you run this with the helm chart, get to the "up_for_retry" state, and then manually rerun the task with "ignore all deps"? |
Ok let's see :) |
(just need a bit more time to build the production image for 1.10-test) |
Just tested with 1.10.12 (while the image is building) and is_delete_operator_pod=false. This time the task seemed stuck in running on the first scheduler restart, and I got this when I tried the suggested action. I guess I am going to test with a Celery setup:
|
So this time with image from v1-10-test + helm + KEDA:
|
And the scheduler logs on restart:
@dimberman I have to stop my investigations for today, but I'll be more than happy to help tomorrow. |
@FloChehab Ok that's a good sign (thank you btw). One more question: have you tried leaving the task in
@ashb @kaxil this seems like it might just be the scheduler retry_timeout, yeah? Like the clock to retry a failed task starts when the scheduler restarts and just takes a few minutes? |
Ok, I'll try that last one (in 1.10.12 + LocalExecutor + is_delete_operator_pod=True -- otherwise I won't get the stuck-in-up_for_retry behaviour), take a swim and come back. |
@dimberman You were right! After ~10 minutes it got picked out of the "up_for_retry" state. I guess I was a bit confused by the logs showing that the scheduler is running but not picking up "up_for_retry" tasks. EDIT: must have been 5 minutes actually. |
And the default |
So, I've set retry_delay to 10s. On scheduler restart the task is stuck in the "running" state for ~4 minutes (while being "completed" on the kubernetes side before the scheduler restart), then it switches to up_for_retry, and finally, 10s later, everything is fine. |
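For reference, the retry settings used in that test would look something like this in the DAG's `default_args` (a sketch based on the DAG posted earlier; the built-in default for `retry_delay` is 5 minutes):

```python
from datetime import timedelta

from airflow.utils.dates import days_ago

default_args = {
    'owner': 'Airflow',
    'start_date': days_ago(2),
    'retries': 3,
    # Default is timedelta(minutes=5); shortening it makes the
    # up_for_retry -> running transition after a scheduler restart visible sooner.
    'retry_delay': timedelta(seconds=10),
}
```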
Hi, So to sum up:
|
I am facing the same issue with KubernetesExecutor + KubernetesPodOperator. The only error I can see is in the scheduler log, where it says:

```
[2020-09-11 05:36:13,724] {scheduler_job.py:1351} ERROR - Executor reports task instance <TaskInstance: kubernetes_sample.passing-task 2020-09-11 05:30:59.027877+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
```

Did anyone face this issue and have a solution? |