Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-4811] Implement GCP DLP' Hook and Operators #5539

Merged
merged 1 commit into from
Jul 30, 2019

Conversation

ryanyuan
Copy link
Contributor

@ryanyuan ryanyuan commented Jul 5, 2019

Implement GCP DLP' Hook and Operators

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow-4811 issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Implement GCP DLP' Hook and Operators using google.cloud.dlp_v2 from Google Cloud Client Libraries for Python.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    tests.contrib.operators.test_gcp_dlp_operator.py
    tests.contrib.operators.test_gcp_dlp_operator_system.py
    tests.contrib.hooks.test_gcp_dlp_hook.py

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

  • Passes flake8

@mik-laj
Copy link
Member

mik-laj commented Jul 6, 2019

Alternative implementation is available:
#5531

@ryanyuan ryanyuan force-pushed the gcp-dlp branch 3 times, most recently from 1c99ca2 to 24a2c4a Compare July 7, 2019 02:23
@ryanyuan
Copy link
Contributor Author

ryanyuan commented Jul 7, 2019

@mik-laj Any ideas on the pylint's failure from dataproc_operator? Can we just disable C0302 for that file?

@mik-laj
Copy link
Member

mik-laj commented Jul 8, 2019

@ryanyuan Yes. You can disable this warning for this file. We do not have a good way to deal with this problem.

@zzlbuaa
Copy link

zzlbuaa commented Jul 9, 2019

@ryanyuan It seems that there are syntax errors with ecb67a2#diff-a67b10df84accccf123dba4c948582fdR37 and ecb67a2#diff-3e59770742eafcf9f0447065d33017fcR64.
The f-string format is not valid syntax in Python3.5. Change it to str.format() might solve this failure.

@ryanyuan
Copy link
Contributor Author

ryanyuan commented Jul 9, 2019

@zzlbuaa Good catch. Cheers!

@ryanyuan
Copy link
Contributor Author

@ashb @mik-laj It seems Travis failed on tests.www.test_views.TestAirflowBaseViews.test_clear. Any ideas?

@zzlbuaa
Copy link

zzlbuaa commented Jul 10, 2019

@ryanyuan @mik-laj Since create_dlp_job is an asynchronous call(completes while job is still pending or running), and the job can take seconds, minutes, or hours to run depending on inspected data size, one feature we want to have is a RunDlpJobOperator that could create a dlpJob and keep polling its status via get_dlp_job until the job is done or canceled/failed. Do you have any suggestions on where the actual looping and polling status thing should be, in a hook function or in the operator?

I have that operator in my implementation, and it does the actual looping in a hook function:
The hook function: e0a4328#diff-86a3d285c23e0963485636db3bc28fe1R298
The RunDlpJobOperator: e0a4328#diff-bedc38985a98c3b915ead387ecc351e2
Does it look good to you?

For more info about dlpJob: https://cloud.google.com/dlp/docs/inspecting-storage

cc @criccomini @whynick1

@ryanyuan
Copy link
Contributor Author

@zzlbuaa Nice suggestion! I would put that logic in along with an optional parameter, which will be true by default for waiting, in the hook and the operator to let the user decide whether they want to wait or not.

@ryanyuan ryanyuan force-pushed the gcp-dlp branch 2 times, most recently from c51f875 to 369c295 Compare July 12, 2019 07:24
@ryanyuan
Copy link
Contributor Author

@mik-laj PTAL

@mik-laj
Copy link
Member

mik-laj commented Jul 12, 2019

Today i was escalating the question about library for this service to Google employee. If i do not get response in 3 working days, I will do a review.

@mik-laj
Copy link
Member

mik-laj commented Jul 12, 2019

I got a message. New operators should use google-cloud-python, if possible, so i do a review this PR.

airflow/contrib/hooks/gcp_dlp_hook.py Outdated Show resolved Hide resolved
airflow/contrib/hooks/gcp_dlp_hook.py Outdated Show resolved Hide resolved
airflow/contrib/hooks/gcp_dlp_hook.py Outdated Show resolved Hide resolved
airflow/contrib/hooks/gcp_dlp_hook.py Outdated Show resolved Hide resolved
airflow/contrib/hooks/gcp_dlp_hook.py Outdated Show resolved Hide resolved
airflow/contrib/hooks/gcp_dlp_hook.py Outdated Show resolved Hide resolved
airflow/contrib/hooks/gcp_dlp_hook.py Outdated Show resolved Hide resolved
airflow/contrib/hooks/gcp_dlp_hook.py Outdated Show resolved Hide resolved
airflow/contrib/hooks/gcp_dlp_hook.py Outdated Show resolved Hide resolved
airflow/contrib/hooks/gcp_dlp_hook.py Outdated Show resolved Hide resolved
@mik-laj
Copy link
Member

mik-laj commented Jul 12, 2019

Hello. I see you add a lot of operators and GCP integration.
I have two question:

  1. Do you have other GCP operators in plans? My team has plans for operators for this quarter and we can talk about my and your future.
  2. Do not you want to outsource this work?

I also available at: [email protected]

@ryanyuan
Copy link
Contributor Author

Hello. I see you add a lot of operators and GCP integration.
I have two question:

  1. Do you have other GCP operators in plans? My team has plans for operators for this quarter and we can talk about my and your future.
  2. Do not you want to outsource this work?

I also available at: [email protected]

@mik-laj I just sent you an email.

Creates a job trigger to run DLP actions such as scanning storage for sensitive
information on a set schedule.

:param project_id: (Optional) Google Cloud Platform project ID where the
Copy link
Member

@mik-laj mik-laj Jul 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code reads the project only from the parameters. Am I wrong? Is it worth to add support for reading the project ID from the configuration?


client = self.get_conn()

if not project_id:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like dead code, because decorator fallback_to_default_project_id prevent to execute this part of code

:param organization_id: (Optional) The organization ID. Required to set this
field if parent resource is an organzation.
:type organization_id: str
:param project_id: (Optional) Google Cloud Platform project ID where the
Copy link
Member

@mik-laj mik-laj Jul 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code reads the project only from the parameters. Am I wrong? Is it worth to add support for reading the project ID from the configuration?


client = self.get_conn()

if not project_id:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like dead code, because decorator fallback_to_default_project_id prevent to execute this part of code

)
def test_get_dlp_job_without_dlp_job_id(self, _):
with self.assertRaises(AirflowException):
self.hook.get_dlp_job(dlp_job_id=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test checks absence of project_id

Copy link
Member

@mik-laj mik-laj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a few very minor problems. I prepared the commit. Can you check if the fixes suit you?

curl https://termbin.com/rmts | git am

I apologize for the long review time, but recently I have a lot of work on Airflow. I introduced changes in the organization of work so that I would have more time to review in the community.

@ryanyuan
Copy link
Contributor Author

I found a few very minor problems. I prepared the commit. Can you check if the fixes suit you?

curl https://termbin.com/rmts | git am

I apologize for the long review time, but recently I have a lot of work on Airflow. I introduced changes in the organization of work so that I would have more time to review in the community.

@mik-laj It looks like your patch is the same as my commit. Did you export the wrong commit?

@mik-laj
Copy link
Member

mik-laj commented Jul 30, 2019

Yes. I exported wrong commit

curl https://termbin.com/f8ua | git am

@mik-laj
Copy link
Member

mik-laj commented Jul 30, 2019

If we get a quick consensus, I will try to add these changes to the 1.10.4 release

@ryanyuan
Copy link
Contributor Author

@mik-laj cheers, I will have a look now.

@ryanyuan
Copy link
Contributor Author

@mik-laj
If we have @GoogleCloudBaseHook.fallback_to_default_project_id for a function, we don't need project_id = project_id or self.project_id, do we?

@mik-laj
Copy link
Member

mik-laj commented Jul 30, 2019

@ryanyuan Yes. We only need it when there is no decorator.

Implement GCP DLP' Hook and Operators
@mik-laj
Copy link
Member

mik-laj commented Jul 30, 2019

we are waiting for Travis. :-D

@mik-laj mik-laj merged commit 6ef0e37 into apache:master Jul 30, 2019
@mik-laj
Copy link
Member

mik-laj commented Jul 30, 2019

@ryanyuan Can you add typehints in a separate PR? I have accepted the current version, because in 1.10.4 typehint is not supprted fully. In version 2.0 I would like all hooks and operators to have typehints.
I have already created a PR that changes all gcp_*_hook.py files. This PR introduces one hook that does not fit the rest.
1805436

@ryanyuan
Copy link
Contributor Author

ryanyuan commented Jul 30, 2019

@ryanyuan Can you add typehints in a separate PR? I have accepted the current version, because in 1.10.4 typehint is not supprted fully. In version 2.0 I would like all hooks and operators to have typehints.
I have already created a PR that changes all gcp_*_hook.py files. This PR introduces one hook that does not fit the rest.
1805436

@mik-laj No problem. I will work on that.
GCP DLP hook typehint PR: #5980

potiuk pushed a commit that referenced this pull request Jul 30, 2019
Implement GCP DLP' Hook and Operators

(cherry picked from commit 6ef0e37)
potiuk pushed a commit that referenced this pull request Jul 31, 2019
Implement GCP DLP' Hook and Operators

(cherry picked from commit 6ef0e37)
ashb pushed a commit to ashb/airflow that referenced this pull request Jul 31, 2019
Implement GCP DLP' Hook and Operators

(cherry picked from commit 6ef0e37)
dharamsk pushed a commit to postmates/airflow that referenced this pull request Aug 8, 2019
Implement GCP DLP' Hook and Operators

(cherry picked from commit 6ef0e37)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants