Monitor pods by labels instead of names #6377

Merged 5 commits into apache:master on May 16, 2020

Conversation

@dimberman (Contributor) commented Oct 21, 2019

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-5589
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

To prevent situations where the scheduler starts a second k8sPodOperator pod after a restart, we now check for existing pods using Kubernetes labels (see the sketch below).
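
(For illustration only, a minimal sketch of the lookup-by-labels idea using the standard kubernetes Python client; the helper name find_existing_pods, the label keys, and the namespace handling are assumptions for this example, not the exact code added by this PR.)

# Sketch, not the PR's implementation: find pods belonging to a task instance
# via a label selector instead of relying on a fixed pod name.
from kubernetes import client, config

def find_existing_pods(namespace, dag_id, task_id, exec_date):
    # Hypothetical helper; label keys mirror the ones discussed in this PR.
    # Note: real label values must be sanitized to Kubernetes' allowed characters.
    config.load_incluster_config()  # or config.load_kube_config() outside a cluster
    core_v1 = client.CoreV1Api()
    selector = "dag_id={},task_id={},exec_date={}".format(dag_id, task_id, exec_date)
    # Any pod that survived a scheduler restart still matches the selector,
    # so the operator can re-attach to it instead of launching a duplicate.
    return core_v1.list_namespaced_pod(namespace, label_selector=selector).items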

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to an appropriate release

@dimberman dimberman requested a review from ashb October 21, 2019 12:01
@dimberman dimberman changed the title from "AIRFLOW-5589 monitor pods by labels instead of names" to "[AIRFLOW-5589] monitor pods by labels instead of names" Oct 21, 2019
Review thread on airflow/contrib/operators/kubernetes_pod_operator.py (outdated, resolved):
labels = {
    'dag_id': context['dag'].dag_id,
    'task_id': context['task'].task_id,
    'exec_date': context['ts'],
}

Member:

We should include the try_number in here too (i.e. if try 1 is somehow still running but we want to launch try 2, we don't want to monitor try 1 again).

Contributor (Author):

Is that something that could happen? Would the execution date be the same?

Member:

Yeah, if the task fails and it retries :) It's unlikely to still be running, but some error or odd behaviour could make it happen. The "key" that uniquely identifies a Task Instance is (dag_id, task_id, execution_date, try_number).
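
(Illustrative only: with try_number added, the label set from the snippet above might look like the following; context['ti'].try_number is the standard Airflow accessor for the current attempt, and the exact key names here are an assumption.)

labels = {
    'dag_id': context['dag'].dag_id,
    'task_id': context['task'].task_id,
    'exec_date': context['ts'],
    # try_number distinguishes retries of the same task instance
    'try_number': str(context['ti'].try_number),
}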


Hi guys,

I think we should consider the desired behaviour with this. If we're including the try_number then it opens a situation where multiple tries could be running alongside each other. I don't think that is ever desired & would certainly cause problems with my jobs (back to the original cause of this one 😄).

If we want to avoid monitoring the previous try, then should we check for it and kill it if it exists before starting the next?


@danccooper @ashb Question about this, I think retry_number would actually be good here... Especially if you are debugging and [kubernetes][delete_worker_pods] is "False". Seems like it would delete the original pod under @danccooper's suggestion and you'd lose that for debugging purposes.


On our k8s cluster, we keep pods in the namespace for some time no matter the state they finished in (success, OOMKilled, evicted, ...) -- I do not maintain the cluster and the decision is not up to me. Not having try_number as an identifying label of a pod makes Airflow fail to launch another attempt.


Hey @dakov, I was going to suggest you just set the reattach_on_restart flag to False; however, from checking the code I'm not sure this will work as expected, which is probably the behaviour you're reporting? After discussing with @kaxil he has raised #10021 to cover this; I suggest we continue the discussion there.
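
(For context, a minimal usage sketch of the reattach_on_restart flag mentioned above; the task_id, image, and other argument values are placeholders, and the import path follows the providers module referenced in the commits below.)

# Illustrative only: force a fresh pod for every attempt instead of
# re-attaching to a pod left over from a previous try.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

run_job = KubernetesPodOperator(
    task_id="run_job",            # placeholder task id
    name="run-job",               # placeholder pod name
    namespace="default",
    image="python:3.8-slim",
    cmds=["python", "-c", "print('hello')"],
    reattach_on_restart=False,    # do not monitor/re-attach to an existing pod
)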

@ashb (Member) commented Oct 21, 2019

/cc @danccooper. We'll give you Co-Authored-By on this PR when we merge it.

@danccooper commented:

Thanks for moving this along @ashb @dimberman 👍

@dimberman dimberman force-pushed the duplicate-pods-pod-operator branch 2 times, most recently from 5299ab2 to 53c8469, on October 23, 2019 10:35
@codecov-io commented Oct 23, 2019

Codecov Report

Merging #6377 into master will decrease coverage by 0.13%.
The diff coverage is 78.84%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6377      +/-   ##
==========================================
- Coverage   80.59%   80.45%   -0.14%     
==========================================
  Files         626      626              
  Lines       36243    36271      +28     
==========================================
- Hits        29211    29183      -28     
- Misses       7032     7088      +56
Impacted Files Coverage Δ
airflow/kubernetes/pod_launcher.py 91.97% <100%> (ø) ⬆️
airflow/executors/kubernetes_executor.py 58.19% <100%> (-0.81%) ⬇️
airflow/kubernetes/pod_generator.py 95.03% <100%> (+0.33%) ⬆️
...rflow/contrib/operators/kubernetes_pod_operator.py 88.29% <72.5%> (-10.21%) ⬇️
airflow/executors/sequential_executor.py 47.61% <0%> (-52.39%) ⬇️
airflow/utils/log/colored_log.py 81.81% <0%> (-11.37%) ⬇️
airflow/utils/sqlalchemy.py 86.44% <0%> (-6.78%) ⬇️
airflow/executors/__init__.py 63.26% <0%> (-4.09%) ⬇️
airflow/utils/dag_processing.py 56.23% <0%> (-2.67%) ⬇️
airflow/jobs/scheduler_job.py 73.72% <0%> (-1.21%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 74d2a0d...53c8469. Read the comment docs.

@kaxil (Member) commented Nov 13, 2019

@danccooper - FYI @dimberman is on holiday for a week, so I guess this will have to wait till he is back and has time to look at it.

stale bot commented Dec 28, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the "stale" label (Stale PRs per the .github/workflows/stale.yml policy file) Dec 28, 2019
@stale stale bot closed this Jan 4, 2020
@kaxil kaxil added the "pinned" label (Protect from Stalebot auto closing) and removed the "stale" label Jan 4, 2020
@boring-cyborg boring-cyborg bot added the "area:dev-tools" and "area:Scheduler" (including HA scheduler) labels Jan 13, 2020
@dimberman dimberman dismissed ashb's stale review May 16, 2020 21:13 ("issues addressed")

@dimberman dimberman merged commit 8985df0 into apache:master May 16, 2020
@dimberman dimberman deleted the duplicate-pods-pod-operator branch May 16, 2020 21:14
@dimberman dimberman added this to the Airflow 1.10.11 milestone May 16, 2020
dimberman added a commit that referenced this pull request Jun 24, 2020
* Monitor k8sPodOperator pods by labels

To prevent situations where the scheduler starts a
second k8sPodOperator pod after a restart, we now check
for existing pods using kubernetes labels

* Update airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py

Co-authored-by: Kaxil Naik <[email protected]>

* Update airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py

Co-authored-by: Kaxil Naik <[email protected]>

* add docs

* Update airflow/kubernetes/pod_launcher.py

Co-authored-by: Kaxil Naik <[email protected]>

Co-authored-by: Daniel Imberman <[email protected]>
Co-authored-by: Kaxil Naik <[email protected]>
(cherry picked from commit 8985df0)
dimberman pushed a commit that referenced this pull request Jun 24, 2020 (same commit message as above, cherry picked from commit 8985df0)
dimberman pushed a commit that referenced this pull request Jun 24, 2020 (same commit message as above, cherry picked from commit 8985df0)
dimberman pushed a commit that referenced this pull request Jun 24, 2020 (same commit message as above, cherry picked from commit 8985df0)
potiuk pushed a commit that referenced this pull request Jun 29, 2020 (same commit message as above, cherry picked from commit 8985df0)
@kaxil kaxil added the "type:improvement" label (Changelog: Improvements) Jul 1, 2020
kaxil pushed a commit that referenced this pull request Jul 1, 2020 (same commit message as above, cherry picked from commit 8985df0)
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021 (same commit message as above, cherry picked from commit 8985df0)
Labels
area:dev-tools, area:Scheduler (including HA scheduler), pinned (Protect from Stalebot auto closing), type:improvement (Changelog: Improvements)