[AIRFLOW-4074] Cannot put labels on Cloud Dataproc jobs #185

Closed
wants to merge 8 commits into from

turbaszek
Member

Add option to add labels to Dataproc jobs.
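As a rough illustration of what this option amounts to at the API level, the sketch below builds a job submission body that carries user-supplied labels. The helper name `build_pyspark_job` and the example project, cluster, and bucket values are hypothetical and not taken from this PR; the field names (`reference`, `placement`, `pysparkJob`, `labels`) follow the Dataproc REST `Job` resource.

```python
from typing import Dict, Optional


def build_pyspark_job(
    project_id: str,
    cluster_name: str,
    main_python_file_uri: str,
    labels: Optional[Dict[str, str]] = None,
) -> dict:
    """Build a request body for a Dataproc job submission (hypothetical helper)."""
    job = {
        "reference": {"projectId": project_id},
        "placement": {"clusterName": cluster_name},
        "pysparkJob": {"mainPythonFileUri": main_python_file_uri},
    }
    if labels:
        # Copy so later changes to the caller's dict do not leak into the request.
        job["labels"] = dict(labels)
    return {"job": job}


# Example values are placeholders, not real resources.
body = build_pyspark_job(
    "my-project", "my-cluster", "gs://my-bucket/job.py", labels={"team": "data-eng"}
)
print(body["job"]["labels"])
```

When no labels are passed, the `labels` key is simply omitted from the body, so existing callers would see no change in the request.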

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-4074
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX]; code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to an appropriate release

Code Quality

  • Passes flake8

@codecov-io

Codecov Report

❗ No coverage uploaded for pull request base (master@6d78d7f). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master     #185   +/-   ##
=========================================
  Coverage          ?   79.11%           
=========================================
  Files             ?      489           
  Lines             ?    30691           
  Branches          ?        0           
=========================================
  Hits              ?    24280           
  Misses            ?     6411           
  Partials          ?        0

Impacted Files                                   Coverage Δ
airflow/contrib/hooks/gcp_dataproc_hook.py       61.84% <100%> (ø)
airflow/contrib/operators/dataproc_operator.py   86.18% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6d78d7f...4229a62. Read the comment docs.

@codecov-io

codecov-io commented Jul 17, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@6d78d7f). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master     #185   +/-   ##
=========================================
  Coverage          ?   79.11%           
=========================================
  Files             ?      489           
  Lines             ?    30692           
  Branches          ?        0           
=========================================
  Hits              ?    24281           
  Misses            ?     6411           
  Partials          ?        0

Impacted Files                                   Coverage Δ
airflow/contrib/hooks/gcp_dataproc_hook.py       62.06% <100%> (ø)
airflow/contrib/operators/dataproc_operator.py   86.18% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6d78d7f...a1a1530. Read the comment docs.

@mik-laj
Member

mik-laj commented Jul 17, 2019

To tag and track GCP resources spawned from Airflow, we have been adding airflow specific label(s) to GCP API service calls whenever possible and applicable. For example, we add “airflow-version” label to the dataproc cluster created. We expect new GCP service operators will also add Airflow top-level label “airflow-version” in addition to any service specific labels.

https://docs.google.com/document/d/1_rTdJSLCt0eyrAylmmgYc3yZr-_h51fVlnvMmWqhCkY/edit?ts=5bb72dfd#

WDYT?

@turbaszek
Member Author

> To tag and track GCP resources spawned from Airflow, we have been adding airflow specific label(s) to GCP API service calls whenever possible and applicable. For example, we add “airflow-version” label to the dataproc cluster created. We expect new GCP service operators will also add Airflow top-level label “airflow-version” in addition to any service specific labels.
>
> https://docs.google.com/document/d/1_rTdJSLCt0eyrAylmmgYc3yZr-_h51fVlnvMmWqhCkY/edit?ts=5bb72dfd#
>
> WDYT?

I can see no info about client_info in the REST API (we currently use the discovery API for Dataproc). However, once we use the Python API this should be possible, but I think the transition to the native API belongs in a separate PR.

@mik-laj
Member

mik-laj commented Jul 18, 2019

In the case of the Discovery API, we could use code similar to Cloud Composer's:
https://github.com/mik-laj/cloud-composer-releases/blob/0a75d37def26fb42e6a8a450b600e91efff5c915/airflow/contrib/hooks/gcp_api_base_hook.py#L134-L157
but that should not be part of this PR.
I would just like to say that we should add additional labels with the Airflow version.
Example:
https://github.com/apache/airflow/blob/master/airflow/contrib/hooks/gcp_container_hook.py#L218
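
A minimal sketch of that idea, assuming the version string is made label-safe by replacing `.` and `+` with `-` and prefixing `v` (similar in spirit to the linked examples, not a copy of them). The function names `airflow_version_label` and `with_airflow_label` are hypothetical, and the version is passed in explicitly so the snippet runs without Airflow installed:

```python
from typing import Dict, Optional


def airflow_version_label(version: str) -> str:
    """Turn a version such as '1.10.4' into a label-friendly value such as 'v1-10-4'."""
    # Dots and plus signs are replaced because label values only allow a
    # restricted character set (assumed here: lowercase letters, digits, '_' and '-').
    return "v" + version.replace(".", "-").replace("+", "-")


def with_airflow_label(labels: Optional[Dict[str, str]], version: str) -> Dict[str, str]:
    """Return a copy of the user labels with the 'airflow-version' label added."""
    merged = dict(labels or {})  # never mutate the caller's dict
    merged["airflow-version"] = airflow_version_label(version)
    return merged


print(with_airflow_label({"team": "data-eng"}, "1.10.4"))
# → {'team': 'data-eng', 'airflow-version': 'v1-10-4'}
```

User-supplied labels are preserved; only the `airflow-version` key is set by the helper, so a user value under that key would be overwritten.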

@mik-laj
Member

mik-laj commented Jul 18, 2019

Another example:
https://github.com/apache/airflow/blob/master/airflow/contrib/operators/gcs_operator.py#L116-L119

@turbaszek turbaszek force-pushed the dataproc-pyspark-labels branch 2 times, most recently from 0f419f0 to b79bca0 Compare July 18, 2019 08:25
jmcarp and others added 2 commits July 19, 2019 08:40
…ltiple jobs (apache#4633)

* AIRFLOW-3791: Dataflow
Support to check if job is already running before starting java job
In case dataflow creates more than one job, we need to track all jobs for status

* Update airflow/contrib/hooks/gcp_dataflow_hook.py

Co-Authored-By: Fokko Driesprong <[email protected]>

* Update gcp_dataflow_hook.py

* Update dataflow_operator.py

* Merge branch 'AIRFLOW-3791_Dataflow' of github.com:chaimt/airflow into AIRFLOW-3791_Dataflow

* Merge branch 'AIRFLOW-3791_Dataflow' of github.com:chaimt/airflow into AIRFLOW-3791_Dataflow
change default for check if running

* Merge branch 'AIRFLOW-3791_Dataflow' of github.com:chaimt/airflow into AIRFLOW-3791_Dataflow
merge redundant code of _get_job_id_from_name

turbaszek and others added 6 commits July 19, 2019 16:23
Fixes `a bytes-like object is required, not 'str'` error in GKEPodOperator.
Fixes incorrect parameter order in GceHook._wait_for_operation_to_complete
method and adds additional tests.
Add option to add labels to Dataproc jobs.

fixup! [AIRFLOW-4074] Cannot put labels on Cloud Dataproc jobs
@turbaszek
Member Author

Merged in apache#5606

@turbaszek turbaszek closed this Jul 22, 2019
@turbaszek turbaszek deleted the dataproc-pyspark-labels branch September 19, 2019 11:54
8 participants