Monitor pods by labels instead of names #6377
Conversation
labels = {
    'dag_id': context['dag'].dag_id,
    'task_id': context['task'].task_id,
    'exec_date': context['ts']
}
We should include the try_number in here too (i.e. if try 1 is somehow still running but we want to launch try 2, we don't want to monitor try 1 again).
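A minimal sketch of that suggestion, with try_number added to the labels. This is illustrative, not the PR's code: the Airflow template context is mocked here with simple stand-in objects so the example is self-contained.

```python
# Hypothetical sketch: labels including try_number, as suggested above.
# The Airflow template context is mocked with stand-in objects.
class _Stub:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

context = {
    "dag": _Stub(dag_id="example_dag"),
    "task": _Stub(task_id="example_task"),
    "ts": "2020-01-01T00:00:00+00:00",
    "ti": _Stub(try_number=1),
}

labels = {
    "dag_id": context["dag"].dag_id,
    "task_id": context["task"].task_id,
    "exec_date": context["ts"],
    # Kubernetes label values must be strings, so cast the try number
    "try_number": str(context["ti"].try_number),
}
```

With try_number in the labels, the pod for try 2 would never match a selector built for try 1.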
Is that something that could happen? Would the execution date be the same?
Yeah, if the task fails and it retries :) It's unlikely to still be running, but some error/odd behaviour could make it happen. The key that uniquely identifies a Task Instance is (dag_id, task_id, execution_date, try_number).
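A hedged sketch of turning that four-part key into a Kubernetes label selector string. The function name is mine, not Airflow's, and note that real label values may not contain characters like `:` or `+`, so the execution date would need sanitising first.

```python
def build_label_selector(dag_id, task_id, execution_date, try_number):
    """Join the Task Instance key into a comma-separated label selector."""
    labels = {
        "dag_id": dag_id,
        "task_id": task_id,
        "execution_date": execution_date,  # must already be label-safe
        "try_number": str(try_number),
    }
    # Kubernetes accepts equality selectors of the form "key=value,key=value"
    return ",".join(f"{key}={value}" for key, value in labels.items())

selector = build_label_selector("my_dag", "my_task", "2020-07-20T0000000000", 2)
# selector == "dag_id=my_dag,task_id=my_task,execution_date=2020-07-20T0000000000,try_number=2"
```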
Hi guys,
I think we should consider the desired behaviour with this. If we're including the try_number then it opens a situation where multiple tries could be running alongside each other. I don't think that is ever desired & would certainly cause problems with my jobs (back to the original cause of this one 😄).
If we want to avoid monitoring the previous try, then should we check for it & kill it if exists before starting the next?
@danccooper @ashb Question about this, I think retry_number would actually be good here... Especially if you are debugging and [kubernetes][delete_worker_pods] is "False". Seems like it would delete the original pod under @danccooper's suggestion and you'd lose that for debugging purposes.
On our k8s cluster, we keep pods in the namespace for some time no matter the state they finished in (success, OOMKilled, evicted,....) -- I do not maintain the cluster and the decision is not up to me. Not having try_number as an identifying label of a pod makes airflow fail to launch another attempt.
Hey @dakov, I was going to suggest you just set the reattach_on_restart flag to False; however, from checking the code I'm not sure this will work as expected, which is probably the behaviour you're reporting? After discussing with @kaxil, he has raised #10021 to cover this, so I suggest we continue the discussion there.
/cc @danccooper. We'll give you Co-Authored-By on this PR when we merge it.
Thanks for moving this along @ashb @dimberman 👍
Codecov Report

@@            Coverage Diff             @@
##           master    #6377      +/-   ##
==========================================
- Coverage   80.59%   80.45%   -0.14%
==========================================
  Files         626      626
  Lines       36243    36271      +28
==========================================
- Hits        29211    29183      -28
- Misses       7032     7088      +56

Continue to review full report at Codecov.
@danccooper - FYI @dimberman is on holiday for a week, so I guess this will have to wait till he is back and has time to look at it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
* Monitor k8sPodOperator pods by labels

  To prevent situations where the scheduler starts a second k8sPodOperator pod after a restart, we now check for existing pods using kubernetes labels

* Update airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py

  Co-authored-by: Kaxil Naik <[email protected]>

* Update airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py

  Co-authored-by: Kaxil Naik <[email protected]>

* add docs

* Update airflow/kubernetes/pod_launcher.py

  Co-authored-by: Kaxil Naik <[email protected]>

Co-authored-by: Daniel Imberman <[email protected]>
Co-authored-by: Kaxil Naik <[email protected]>
(cherry picked from commit 8985df0)
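The "check for existing pods using kubernetes labels" step could be sketched as below. This is not the PR's actual code: in practice the lookup would go through the Kubernetes API (e.g. CoreV1Api.list_namespaced_pod with a label_selector), but the filtering idea is the same. Pods are plain dicts here so the sketch is self-contained.

```python
def pods_matching_labels(pods, wanted_labels):
    """Return the pods whose labels contain every wanted key/value pair."""
    return [
        pod for pod in pods
        if all(pod.get("labels", {}).get(k) == v for k, v in wanted_labels.items())
    ]

# If a matching pod already exists, the operator can reattach to it instead
# of launching a duplicate after a scheduler restart.
existing = pods_matching_labels(
    [
        {"name": "pod-a", "labels": {"dag_id": "d1", "task_id": "t1"}},
        {"name": "pod-b", "labels": {"dag_id": "d1", "task_id": "t2"}},
    ],
    {"dag_id": "d1", "task_id": "t1"},
)
# existing contains only pod-a
```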
Make sure you have checked all steps below.
Jira
Description
To prevent situations where the scheduler starts a second k8sPodOperator pod after a restart, we now check for existing pods using kubernetes labels.
Tests
Commits
Documentation