Monitor pods by labels instead of names #6377

Merged 5 commits into apache:master on May 16, 2020

Conversation

@dimberman (Contributor) commented Oct 21, 2019

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-5589
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

To prevent situations where the scheduler starts a second k8sPodOperator pod after a restart, we now check for existing pods using Kubernetes labels (see the sketch below).
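
(For illustration only, a minimal sketch of the lookup-by-labels idea using the standard kubernetes Python client; the helper name find_existing_pods, the label keys, and the namespace handling are assumptions for this example, not the exact code added by this PR.)

# Sketch, not the PR's implementation: find pods belonging to a task instance
# via a label selector instead of relying on a fixed pod name.
from kubernetes import client, config

def find_existing_pods(namespace, dag_id, task_id, exec_date):
    # Hypothetical helper; label keys mirror the ones discussed in this PR.
    # Note: real label values must be sanitized to Kubernetes' allowed characters.
    config.load_incluster_config()  # or config.load_kube_config() outside a cluster
    core_v1 = client.CoreV1Api()
    selector = "dag_id={},task_id={},exec_date={}".format(dag_id, task_id, exec_date)
    # Any pod that survived a scheduler restart still matches the selector,
    # so the operator can re-attach to it instead of launching a duplicate.
    return core_v1.list_namespaced_pod(namespace, label_selector=selector).items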

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to an appropriate release

@dimberman dimberman requested a review from ashb October 21, 2019 12:01
@dimberman dimberman changed the title from "AIRFLOW-5589 monitor pods by labels instead of names" to "[AIRFLOW-5589] monitor pods by labels instead of names" Oct 21, 2019
Review thread on airflow/contrib/operators/kubernetes_pod_operator.py (outdated, resolved):
labels = {
    'dag_id': context['dag'].dag_id,
    'task_id': context['task'].task_id,
    'exec_date': context['ts'],
}

Member:

We should include the try_number in here too (i.e. if try 1 is somehow still running but we want to launch try 2, we don't want to monitor try 1 again).

Contributor (Author):

Is that something that could happen? Would the execution date be the same?

Member:

Yeah, if the task fails and it retries :) It's unlikely to still be running, but some error or odd behaviour could make it happen. The "key" that uniquely identifies a Task Instance is (dag_id, task_id, execution_date, try_number).
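
(Illustrative only: with try_number added, the label set from the snippet above might look like the following; context['ti'].try_number is the standard Airflow accessor for the current attempt, and the exact key names here are an assumption.)

labels = {
    'dag_id': context['dag'].dag_id,
    'task_id': context['task'].task_id,
    'exec_date': context['ts'],
    # try_number distinguishes retries of the same task instance
    'try_number': str(context['ti'].try_number),
}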


Hi guys,

I think we should consider the desired behaviour with this. If we're including the try_number then it opens a situation where multiple tries could be running alongside each other. I don't think that is ever desired & would certainly cause problems with my jobs (back to the original cause of this one 😄).

If we want to avoid monitoring the previous try, then should we check for it and kill it if it exists before starting the next?


@danccooper @ashb Question about this, I think retry_number would actually be good here... Especially if you are debugging and [kubernetes][delete_worker_pods] is "False". Seems like it would delete the original pod under @danccooper's suggestion and you'd lose that for debugging purposes.


On our k8s cluster, we keep pods in the namespace for some time no matter the state they finished in (success, OOMKilled, evicted, ...) -- I do not maintain the cluster and the decision is not up to me. Not having try_number as an identifying label of a pod makes Airflow fail to launch another attempt.


Hey @dakov, I was going to suggest you just set the reattach_on_restart flag to False; however, from checking the code I'm not sure this will work as expected, which is probably the behaviour you're reporting? After discussing with @kaxil he has raised #10021 to cover this; I suggest we continue the discussion there.
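
(For context, a minimal usage sketch of the reattach_on_restart flag mentioned above; the task_id, image, and other argument values are placeholders, and the import path follows the providers module referenced in the commits below.)

# Illustrative only: force a fresh pod for every attempt instead of
# re-attaching to a pod left over from a previous try.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

run_job = KubernetesPodOperator(
    task_id="run_job",            # placeholder task id
    name="run-job",               # placeholder pod name
    namespace="default",
    image="python:3.8-slim",
    cmds=["python", "-c", "print('hello')"],
    reattach_on_restart=False,    # do not monitor/re-attach to an existing pod
)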

@ashb (Member) commented Oct 21, 2019

/cc @danccooper. We'll give you Co-Authored-By on this PR when we merge it.

@danccooper commented:

Thanks for moving this along @ashb @dimberman 👍

@dimberman dimberman force-pushed the duplicate-pods-pod-operator branch 2 times, most recently from 5299ab2 to 53c8469, on October 23, 2019 10:35
@codecov-io commented Oct 23, 2019

Codecov Report

Merging #6377 into master will decrease coverage by 0.13%.
The diff coverage is 78.84%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6377      +/-   ##
==========================================
- Coverage   80.59%   80.45%   -0.14%     
==========================================
  Files         626      626              
  Lines       36243    36271      +28     
==========================================
- Hits        29211    29183      -28     
- Misses       7032     7088      +56
Impacted Files Coverage Δ
airflow/kubernetes/pod_launcher.py 91.97% <100%> (ø) ⬆️
airflow/executors/kubernetes_executor.py 58.19% <100%> (-0.81%) ⬇️
airflow/kubernetes/pod_generator.py 95.03% <100%> (+0.33%) ⬆️
...rflow/contrib/operators/kubernetes_pod_operator.py 88.29% <72.5%> (-10.21%) ⬇️
airflow/executors/sequential_executor.py 47.61% <0%> (-52.39%) ⬇️
airflow/utils/log/colored_log.py 81.81% <0%> (-11.37%) ⬇️
airflow/utils/sqlalchemy.py 86.44% <0%> (-6.78%) ⬇️
airflow/executors/__init__.py 63.26% <0%> (-4.09%) ⬇️
airflow/utils/dag_processing.py 56.23% <0%> (-2.67%) ⬇️
airflow/jobs/scheduler_job.py 73.72% <0%> (-1.21%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 74d2a0d...53c8469. Read the comment docs.

@kaxil (Member) commented Nov 13, 2019

@danccooper - FYI @dimberman is on holiday for a week, so I guess this will have to wait till he is back and has time to look at it.

stale bot commented Dec 28, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the "stale" label (Stale PRs per the .github/workflows/stale.yml policy file) Dec 28, 2019
@stale stale bot closed this Jan 4, 2020
@kaxil kaxil added the "pinned" label (Protect from Stalebot auto closing) and removed the "stale" label Jan 4, 2020
@boring-cyborg boring-cyborg bot added the "area:dev-tools" and "area:Scheduler" (including HA scheduler) labels Jan 13, 2020
@dimberman dimberman dismissed ashb's stale review May 16, 2020 21:13 ("issues addressed")

@dimberman dimberman merged commit 8985df0 into apache:master May 16, 2020
@dimberman dimberman deleted the duplicate-pods-pod-operator branch May 16, 2020 21:14
@dimberman dimberman added this to the Airflow 1.10.11 milestone May 16, 2020
dimberman added a commit that referenced this pull request Jun 24, 2020
* Monitor k8sPodOperator pods by labels

To prevent situations where the scheduler starts a
second k8sPodOperator pod after a restart, we now check
for existing pods using kubernetes labels

* Update airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py

Co-authored-by: Kaxil Naik <[email protected]>

* Update airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py

Co-authored-by: Kaxil Naik <[email protected]>

* add docs

* Update airflow/kubernetes/pod_launcher.py

Co-authored-by: Kaxil Naik <[email protected]>

Co-authored-by: Daniel Imberman <[email protected]>
Co-authored-by: Kaxil Naik <[email protected]>
(cherry picked from commit 8985df0)
dimberman pushed a commit that referenced this pull request Jun 24, 2020 (same commit message as above, cherry picked from commit 8985df0)
dimberman pushed a commit that referenced this pull request Jun 24, 2020 (same commit message as above, cherry picked from commit 8985df0)
dimberman pushed a commit that referenced this pull request Jun 24, 2020 (same commit message as above, cherry picked from commit 8985df0)
potiuk pushed a commit that referenced this pull request Jun 29, 2020 (same commit message as above, cherry picked from commit 8985df0)
@kaxil kaxil added the "type:improvement" label (Changelog: Improvements) Jul 1, 2020
kaxil pushed a commit that referenced this pull request Jul 1, 2020 (same commit message as above, cherry picked from commit 8985df0)
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021 (same commit message as above, cherry picked from commit 8985df0)
Labels
area:dev-tools, area:Scheduler (including HA scheduler), pinned (Protect from Stalebot auto closing), type:improvement (Changelog: Improvements)