
[AIRFLOW-5168] Fix Dataproc Operators & add tests #5928

Merged
kaxil merged 4 commits into apache:v1-10-test from fix-dataproc on Aug 28, 2019

Conversation

kaxil
Member

@kaxil kaxil commented Aug 27, 2019

PR targeted at 1.10.4; a new PR will be created for master.

Make sure you have checked all steps below.

Jira

Description

  • Here are some details about my PR, including screenshots of any UI changes:

2 issues in 1.10.4

(1) Invalid template field. It was resolved by #5751
(2) An issue with the add_labels field, explained below:

There is a commit to master which adds label support, this has a change in the hook & base operator: 4d58f36#diff-bc55ad8b749b7a136f51ffabfdbaf13dL684

Then in the 1.10.4 branch there is a massive commit for GCP DLP: 222c6ac

This seems to have leaked only the operator part of the label change: 222c6ac#diff-bc55ad8b749b7a136f51ffabfdbaf13dR693

This then manifests as the following error when trying to execute a Dataproc job:

[2019-08-09 14:32:02,971] {taskinstance.py:1047} ERROR - _DataProcJobBuilder instance has no attribute 'add_labels'
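The failure mode can be sketched with stand-in classes (these are illustrative, not the real Airflow classes): the leaked operator-side code calls add_labels() on the job builder, but 1.10.4's _DataProcJobBuilder never gained that method.

```python
class JobBuilderWithoutLabels:
    """Stand-in for 1.10.4's _DataProcJobBuilder, which lacks add_labels()."""
    def __init__(self, job_name):
        self.job = {"job": {"reference": {"jobId": job_name}}}


def build_job(builder, labels=None):
    # Mirrors the leaked operator code path, which assumes add_labels() exists.
    if labels:
        builder.add_labels(labels)  # AttributeError on the 1.10.4 builder
    return builder.job


try:
    build_job(JobBuilderWithoutLabels("spark-job"), labels={"team": "data"})
except AttributeError as exc:
    print(exc)
```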

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Add tests to cover both issues

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards incompatible changes, please leave a note in Updating.md so we can assign it to an appropriate release

Code Quality

  • Passes flake8

@kaxil kaxil requested review from ashb and potiuk August 27, 2019 18:39
@potiuk potiuk force-pushed the v1-10-test branch 2 times, most recently from e3f39af to 78076bc on August 28, 2019 03:03
)

self.assertEqual(task.template_fields,
                 ['cluster_name', 'project_id', 'region'])
Member
I think it would be better to test if elements of template_fields are attributes of instance. Then changing attribute name without changing template field would result in test failure. WDYT?
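The suggested check could look roughly like this (FakeOperator and its fields are hypothetical stand-ins, not the real operator): assert that every name in template_fields is a real instance attribute, so renaming an attribute without updating template_fields fails the test.

```python
class FakeOperator:
    """Hypothetical operator stand-in for illustration."""
    template_fields = ('cluster_name', 'project_id', 'region')

    def __init__(self, cluster_name, project_id, region):
        self.cluster_name = cluster_name
        self.project_id = project_id
        self.region = region


task = FakeOperator('test-cluster', 'test-project', 'global')
for field in task.template_fields:
    # Fails if an attribute is renamed but template_fields is not updated.
    assert hasattr(task, field), "missing attribute for template field: " + field
```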

Member Author

ti.render_template() should take care of that case. It would fail if it doesn't find an attribute listed in template_fields.
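A toy illustration of this point (not Airflow's actual render_template implementation): rendering resolves each name in template_fields with getattr, so a stale field name fails loudly on its own.

```python
class BrokenTask:
    """Illustrative task whose template_fields list a missing attribute."""
    template_fields = ('cluster_name', 'renamed_field')
    cluster_name = 'airflow_{{ dag.dag_id }}_cluster'


def render_templates(task):
    # Simplified stand-in for the attribute lookup rendering performs.
    return [getattr(task, f) for f in task.template_fields]


failed = False
try:
    render_templates(BrokenTask())
except AttributeError:
    failed = True
assert failed  # 'renamed_field' is not an attribute, so rendering fails
```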

@mik-laj mik-laj added the provider:google Google (including GCP) related issues label Aug 28, 2019
@@ -660,15 +656,13 @@ def __init__(self,
dataproc_jars=None,
gcp_conn_id='google_cloud_default',
delegate_to=None,
labels=None,
Member
Now that this is here I don't think we can remove it without breaking anyone who got it working on 1.10.4

Member Author
I will cherry-pick the actual commit that added labels

Member Author

@kaxil kaxil Aug 28, 2019
Will cherry-pick 4d58f36 into v1-10-test after this is merged.

Member
Oh, right, labels never worked.

Contributor

@Fokko Fokko left a comment
Small comments, great job on the test coverage 👍

@@ -385,7 +393,7 @@ def test_create_cluster(self):
requestId=mock.ANY,
body={
'projectId': 'test-project-id',
Contributor
You can replace this one with GCP_PROJECT_ID

@@ -492,6 +522,27 @@ def test_update_cluster(self):
})
hook.wait.assert_called_once_with(self.operation)

def test_render_template(self):
Contributor
This test is a bit silly; maybe we can check whether we can actually template a field.

Member Author

@kaxil kaxil Aug 28, 2019
They are actually checking templating :D

DataprocClusterScaleOperator is passed GCP_PROJECT_TEMPLATED, CLUSTER_NAME_TEMPLATED, etc., which are defined at the top of the file and contain Jinja-templated strings.

And while asserting, we check GCP_PROJECT & CLUSTER_NAME, which contain the rendered fields.
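The pattern described above can be sketched with a toy renderer standing in for Jinja (the real tests go through ti.render_template and Jinja2; all names below are illustrative):

```python
import re

DAG_ID = 'test-dag-id'
CLUSTER_NAME_TEMPLATED = 'airflow_{{ dag.dag_id }}_cluster'


class FakeDag:
    """Illustrative stand-in for a DAG object exposed to the template."""
    dag_id = DAG_ID


def toy_render(template, context):
    # Substitute each {{ obj.attr }} with an attribute lookup on the context.
    def repl(match):
        obj_name, attr = match.group(1).split('.')
        return str(getattr(context[obj_name], attr))
    return re.sub(r'\{\{\s*(\w+\.\w+)\s*\}\}', repl, template)


rendered = toy_render(CLUSTER_NAME_TEMPLATED, {'dag': FakeDag()})
assert rendered == 'airflow_test-dag-id_cluster'
```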


class DataProcSparkOperatorTest(unittest.TestCase):
# Unit test for the DataProcSparkOperator
def setUp(self):
Contributor
Maybe we can replace these setUps with a fixture. In the end they are not really setting anything up, just setting a global variable. Creating a fixture is nicer, and we can reuse it in multiple tests:


@pytest.fixture
def dag():
    return DAG(
        DAG_ID,
        default_args={
            'owner': 'airflow',
            'start_date': DEFAULT_DATE,
        },
        schedule_interval='@daily'
    )

And then in the actual test:

def test_render_template(self, dag):
    task = DataProcSparkOperator(
        dag=dag
    )

WDYT?

Member Author
Interesting, I didn't know about @pytest.fixture. I will create a separate PR for that, as we already have some tests in master, and then cherry-pick it to v1-10-test. This PR is for v1-10-test.

TASK_ID = 'test-dataproc-operator'
CLUSTER_NAME = 'test-cluster-name'
CLUSTER_NAME = 'airflow_{}_cluster'.format(DAG_ID)
CLUSTER_NAME_TEMPLATED = 'airflow_{{ dag.dag_id }}_cluster'
Member Author
@Fokko here

@kaxil kaxil merged commit 73dda73 into apache:v1-10-test Aug 28, 2019
@kaxil kaxil deleted the fix-dataproc branch August 28, 2019 12:13