Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-5660] Attempt to find the task in DB from Kubernetes pod labels #6340

Merged
merged 4 commits into from
Dec 12, 2019

Conversation

adityav
Copy link
Contributor

@adityav adityav commented Oct 15, 2019

…ing every task

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-XXX
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:
  1. Updates function to first do a task_id/dag_id search of the DB to significantly reduce search time
  2. In the _make_safe_label_value function we can add a warning if a task_id or dag_id will require hashing (which will slow down processing
  3. IF there is no database match we do a full scan.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

@codecov-io
Copy link

Codecov Report

Merging #6340 into master will increase coverage by 0.37%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6340      +/-   ##
==========================================
+ Coverage   79.68%   80.06%   +0.37%     
==========================================
  Files         616      616              
  Lines       35798    35803       +5     
==========================================
+ Hits        28527    28666     +139     
+ Misses       7271     7137     -134
Impacted Files Coverage Δ
airflow/executors/kubernetes_executor.py 58.29% <0%> (-0.7%) ⬇️
airflow/models/taskinstance.py 93.79% <0%> (+0.5%) ⬆️
airflow/hooks/dbapi_hook.py 88.13% <0%> (+0.84%) ⬆️
airflow/models/connection.py 65% <0%> (+1.11%) ⬆️
airflow/jobs/scheduler_job.py 74.55% <0%> (+1.19%) ⬆️
airflow/hooks/hive_hooks.py 77.6% <0%> (+1.78%) ⬆️
airflow/utils/dag_processing.py 58.9% <0%> (+2.66%) ⬆️
airflow/executors/__init__.py 67.34% <0%> (+4.08%) ⬆️
airflow/utils/sqlalchemy.py 93.22% <0%> (+15.25%) ⬆️
airflow/utils/log/colored_log.py 93.18% <0%> (+20.45%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1de210b...90d2553. Read the comment docs.

@dimberman
Copy link
Contributor

LGTM. Rerunning tests.

@adityav
Copy link
Contributor Author

adityav commented Oct 15, 2019

Looks like there is a problem with CI? I have no idea why the build failed.

@ashb
Copy link
Member

ashb commented Oct 29, 2019

Have you run this in a Kube cluster? I have a feeling that every task will hit the bad path because of the characters in the execution date

@adityav
Copy link
Contributor Author

adityav commented Oct 29, 2019

Have you run this in a Kube cluster? I have a feeling that every task will hit the bad path because of the characters in the execution date

We are running airflow in EKS with this patch applied and it works. We can finally scale to 100,000+ tasks with this. Previously it would choke with 5k-10k tasks.
Only dag_id / task_id are being stored in the labels. The execution date isn't being stored there so it shouldn't be a problem.

Ideally I would prefer to eliminate the bad path altogether. Currently, it requires the dag writer to write good dag id / task id which isn't a good design. I can only think of 2 solns:

  1. Use task_id / dag_id stored in env variables. Values stored in env variables don't have any label specific restrictions. However, I am not familiar with kube api to know if this is a possible soln or not.
  2. have a mapping table of (exec_date, dag_id, task_id, safe_dag_id, safe_task_id) in airflow metadb.

@ashb
Copy link
Member

ashb commented Oct 30, 2019

Only dag_id / task_id are being stored in the labels. The execution date isn't being stored there so it shouldn't be a problem.

ex_time = self._label_safe_datestring_to_datetime(labels['execution_date'])
begs to differ. But _label_safe_datestring_to_datetime and it's reverse means it's probably okay:)

Final question: which log file does this warning show up in?

@ashb ashb changed the title [Airflow-5660] Try to find the task in DB before regressing to search… [AIRFLOW-5660] Attempt to find the task in DB from Kubernetes pod labels Oct 30, 2019
@adityav
Copy link
Contributor Author

adityav commented Oct 31, 2019

Only dag_id / task_id are being stored in the labels. The execution date isn't being stored there so it shouldn't be a problem.

ex_time = self._label_safe_datestring_to_datetime(labels['execution_date'])

begs to differ. But _label_safe_datestring_to_datetime and it's reverse means it's probably okay:)

My bad, didn't see that.

Final question: which log file does this warning show up in?

scheduler logs. They are in the console output of the scheduler pod.

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(sorry for leaving this languishing for so long)

airflow/executors/kubernetes_executor.py Show resolved Hide resolved
@ashb ashb merged commit 0f9983f into apache:master Dec 12, 2019
kaxil pushed a commit that referenced this pull request Dec 13, 2019
…els (#6340)

Try to find the task in DB before regressing to searching every task, 
and explicitly warn about the performance regressions.

Co-Authored-By: Ash Berlin-Taylor <[email protected]>
(cherry picked from commit 0f9983f)
kaxil pushed a commit that referenced this pull request Dec 13, 2019
…els (#6340)

Try to find the task in DB before regressing to searching every task, 
and explicitly warn about the performance regressions.

Co-Authored-By: Ash Berlin-Taylor <[email protected]>
(cherry picked from commit 0f9983f)
kaxil pushed a commit that referenced this pull request Dec 17, 2019
…els (#6340)

Try to find the task in DB before regressing to searching every task, 
and explicitly warn about the performance regressions.

Co-Authored-By: Ash Berlin-Taylor <[email protected]>
(cherry picked from commit 0f9983f)
dimberman pushed a commit to astronomer/airflow that referenced this pull request Dec 17, 2019
…els (apache#6340)

Try to find the task in DB before regressing to searching every task, 
and explicitly warn about the performance regressions.

Co-Authored-By: Ash Berlin-Taylor <[email protected]>
dimberman pushed a commit to astronomer/airflow that referenced this pull request Dec 17, 2019
…els (apache#6340)

Try to find the task in DB before regressing to searching every task, 
and explicitly warn about the performance regressions.

Co-Authored-By: Ash Berlin-Taylor <[email protected]>
(cherry picked from commit d8f7d25)
kaxil pushed a commit to astronomer/airflow that referenced this pull request Dec 17, 2019
…els (apache#6340)

Try to find the task in DB before regressing to searching every task,
and explicitly warn about the performance regressions.

Co-Authored-By: Ash Berlin-Taylor <[email protected]>
(cherry picked from commit 66e2c22e1615c0999747d0c38355163e877872e7)
dimberman pushed a commit to astronomer/airflow that referenced this pull request Dec 18, 2019
…els (apache#6340)

Try to find the task in DB before regressing to searching every task, 
and explicitly warn about the performance regressions.

Co-Authored-By: Ash Berlin-Taylor <[email protected]>
(cherry picked from commit d8f7d25)
dimberman pushed a commit to astronomer/airflow that referenced this pull request Dec 18, 2019
…els (apache#6340)

Try to find the task in DB before regressing to searching every task,
and explicitly warn about the performance regressions.

Co-Authored-By: Ash Berlin-Taylor <[email protected]>
(cherry picked from commit 66e2c22e1615c0999747d0c38355163e877872e7)
galuszkak pushed a commit to FlyrInc/apache-airflow that referenced this pull request Mar 5, 2020
…els (apache#6340)

Try to find the task in DB before regressing to searching every task, 
and explicitly warn about the performance regressions.

Co-Authored-By: Ash Berlin-Taylor <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants