
[AIRFLOW-4161] BigQuery to Mysql Operator #5711

Merged: 3 commits merged into apache:master, Aug 7, 2019

Conversation

@adussarps (Contributor) commented Aug 2, 2019:

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-XXX
    • In case you are fixing a typo in the documentation, you can prepend your commit with [AIRFLOW-XXX]; code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    New operator to copy a BigQuery table into a MySQL table (usage sketched below).
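A hedged usage sketch: the module path matches the file added in this PR, the operator class name is assumed from that file name, the argument names follow the diff excerpts later in this thread (the merged signature may differ), and the DAG, dataset, and table names are purely illustrative.

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.bigquery_to_mysql_operator import BigQueryToMySqlOperator

    dag = DAG('example_bq_to_mysql', schedule_interval=None, start_date=datetime(2019, 8, 1))

    copy_table = BigQueryToMySqlOperator(
        task_id='bq_to_mysql',
        dataset_id='my_dataset',       # illustrative BigQuery dataset
        table_id='my_bq_table',        # illustrative BigQuery table to copy
        mysql_table='my_mysql_table',  # illustrative MySQL destination table
        mysql_conn_id='mysql_default',
        dag=dag,
    )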

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Tested base behaviour in the BigQuery test suite.

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards-incompatible changes, please leave a note in Updating.md so we can assign it to an appropriate release

Code Quality

  • Passes flake8

@mik-laj added the provider:google label (Google (including GCP) related issues) on Aug 3, 2019
# Max results is set to 1000 because the BQ job has a hardcoded limit of 1300.
response = cursor.get_tabledata(dataset_id=self.dataset_id,
                                table_id=self.table_id,
                                max_results=1000,
@potiuk (Member) commented Aug 5, 2019:

NIT: It would be nice to have it extracted as a constant, or even better, provided as a parameter with a default. I imagine you may want to run it in smaller batches, for various reasons.

@potiuk (Member) left a comment:

Very nice, but a small change is requested. Having magic constants like that in the code is not a good thing. Especially if BQ increases the limit in the future, we should be able to change the batch size without changing the operator's code, just by providing a parameter.
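A minimal sketch of the requested change, assuming the cursor usage shown in the diff above; the batch_size name and its default come from the follow-up commit below, while the start_index paging logic and the surrounding class skeleton are assumptions:

    from airflow.contrib.hooks.bigquery_hook import BigQueryHook
    from airflow.models import BaseOperator

    class BigQueryToMySqlOperator(BaseOperator):
        def __init__(self, dataset_id, table_id,
                     bigquery_conn_id='bigquery_default',
                     batch_size=1000,  # replaces the hardcoded max_results=1000
                     *args, **kwargs):
            super(BigQueryToMySqlOperator, self).__init__(*args, **kwargs)
            self.dataset_id = dataset_id
            self.table_id = table_id
            self.bigquery_conn_id = bigquery_conn_id
            self.batch_size = batch_size

        def execute(self, context):
            cursor = BigQueryHook(bigquery_conn_id=self.bigquery_conn_id).get_conn().cursor()
            i = 0
            while True:
                # Page through the table self.batch_size rows at a time instead of
                # relying on a single magic-number request.
                response = cursor.get_tabledata(dataset_id=self.dataset_id,
                                                table_id=self.table_id,
                                                max_results=self.batch_size,
                                                start_index=i * self.batch_size)
                rows = response.get('rows') if response else None
                if not rows:
                    break
                # ... load this batch into MySQL (omitted) ...
                i += 1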

Adding 'batch_size' as an entry parameter
@potiuk (Member) left a comment:

Nice!

                 table_id,
                 mysql_table,
                 selected_fields=None,
                 bigquery_conn_id='bigquery_default',
A contributor commented:

Would it be better to use 'google_cloud_default' as the conn_id here, since we've already unified all the GCP conn_ids?
Reference: #4818

A member replied:

It is better to use gcp_conn_id as the parameter name here, because that is what the integration recommendations say. Last night I started working on unifying this:
PolideaInternal#201

                 mysql_table,
                 selected_fields=None,
                 bigquery_conn_id='bigquery_default',
                 mysql_conn_id='mysql_default',
A contributor commented:

Would it be better to use 'google_cloud_default' as the conn_id here, since we've already unified all the GCP conn_ids?
Reference: #4818

A member replied:

It is better to use gcp_conn_id as the parameter name here, because that is what the integration recommendations say. Last night I started working on unifying this:
PolideaInternal#201

@adussarps (Contributor, Author) replied:

Thanks for the input, this is fixed!
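For reference, a sketch of the constructor after the rename; the parameter list follows the diff excerpts above, while the 'google_cloud_default' default is assumed from the unification in #4818:

    from airflow.models import BaseOperator

    class BigQueryToMySqlOperator(BaseOperator):
        def __init__(self,
                     table_id,
                     mysql_table,
                     selected_fields=None,
                     gcp_conn_id='google_cloud_default',  # renamed from bigquery_conn_id='bigquery_default'
                     mysql_conn_id='mysql_default',
                     *args, **kwargs):
            super(BigQueryToMySqlOperator, self).__init__(*args, **kwargs)
            self.table_id = table_id
            self.mysql_table = mysql_table
            self.selected_fields = selected_fields
            self.gcp_conn_id = gcp_conn_id
            self.mysql_conn_id = mysql_conn_id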

@eladkal (Contributor) commented Aug 6, 2019:

What is expected to happen if one of the selected columns is RECORD type?

Fixed pylint too many arguments error
Changed bq_conn_id to gcp_conn_id
@adussarps (Contributor, Author) replied:
> What is expected to happen if one of the selected columns is RECORD type?

I think the operator should be able to pass it to MySQL as a JSON value. The receiving MySQL column should be of JSON type for this to work.

I didn't try this scenario, though.
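A hedged illustration of that idea (it was not exercised in this PR): the helper name is hypothetical, and it assumes the receiving MySQL column is declared with the JSON type.

    import json

    def _bq_value_to_mysql(value):
        # BigQuery returns RECORD (STRUCT) values as Python dicts and REPEATED
        # fields as lists; MySQL's JSON column type accepts their serialized form.
        if isinstance(value, (dict, list)):
            return json.dumps(value)
        return value

    # Example: a RECORD value becomes a JSON string suitable for a JSON column.
    print(_bq_value_to_mysql({'city': 'Paris', 'zip': '75001'}))
    # -> {"city": "Paris", "zip": "75001"}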

@OmerJog (Contributor) commented Aug 6, 2019:

If you added the new file airflow/contrib/operators/bigquery_to_mysql_operator.py, I think the tests need to be in tests/contrib/operators/test_bigquery_to_mysql_operator.py, not in tests/contrib/operators/test_bigquery_operator.py.

@codecov-io commented:

Codecov Report

Merging #5711 into master will increase coverage by 0.25%.
The diff coverage is 72.91%.


@@            Coverage Diff             @@
##           master    #5711      +/-   ##
==========================================
+ Coverage   79.58%   79.84%   +0.25%     
==========================================
  Files         494      495       +1     
  Lines       31725    31781      +56     
==========================================
+ Hits        25248    25375     +127     
+ Misses       6477     6406      -71
Impacted Files Coverage Δ
...ow/contrib/operators/bigquery_to_mysql_operator.py 72.91% <72.91%> (ø)
airflow/jobs/scheduler_job.py 71.62% <0%> (-0.31%) ⬇️
airflow/utils/dag_processing.py 58.98% <0%> (+0.18%) ⬆️
airflow/hooks/dbapi_hook.py 88.59% <0%> (+0.87%) ⬆️
airflow/models/connection.py 65% <0%> (+1.11%) ⬆️
airflow/hooks/hive_hooks.py 77.6% <0%> (+1.78%) ⬆️
airflow/operators/latest_only_operator.py 92.85% <0%> (+2.85%) ⬆️
airflow/utils/sqlalchemy.py 79.06% <0%> (+4.65%) ⬆️
airflow/operators/mysql_operator.py 100% <0%> (+100%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 19238f6...4ac5e1e. Read the comment docs.

@potiuk (Member) commented Aug 7, 2019:

@OmerJog - it's OK to keep it as it is now. We are going to make a big move of the GCP operators soon, following AIP-21, which has just been accepted, and we are going to reshuffle the operators anyway to follow a single convention.

@potiuk merged commit 3724c2a into apache:master on Aug 7, 2019
@potiuk (Member) commented Aug 7, 2019:

Thanks @adussarps! Really nice addition to the suite of GCP operators.

@ashb (Member) commented Aug 14, 2019:

@potiuk This commit adds a top-level __init__.py file which breaks some workflows, but the tests didn't break because of the whitelisting approach ;) https://issues.apache.org/jira/browse/AIRFLOW-5179

@potiuk (Member) commented Aug 14, 2019:

Ouch :(. It is an edge case, but I wonder if we could do something to prevent it. Mounting the whole source directory / adding it to .dockerignore is definitely a bad idea: performance, especially on Mac, will suffer, and Docker caching will be impacted.

We could, however, add a simple test that validates there are no unexpected entries at the top level and fails the build if there are any. We can simply whitelist the entries that are expected and extend the whitelist whenever we intentionally add a file. I think that might solve the problem permanently. WDYT?
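A minimal sketch of such a check, assuming the test file lives one level below the repository root; the whitelist contents are illustrative and would need to match the real top-level layout:

    import os
    import unittest

    # Illustrative whitelist: extend it whenever a top-level entry is added on purpose.
    EXPECTED_TOP_LEVEL = {
        '.git', 'airflow', 'dags', 'docs', 'scripts', 'tests',
        'setup.py', 'setup.cfg', 'LICENSE', 'NOTICE', 'README.md',
    }

    class TestTopLevelEntries(unittest.TestCase):
        def test_no_unexpected_top_level_entries(self):
            root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
            unexpected = set(os.listdir(root)) - EXPECTED_TOP_LEVEL
            self.assertFalse(unexpected,
                             'Unexpected top-level entries: %s' % sorted(unexpected))

    if __name__ == '__main__':
        unittest.main()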

@ashb (Member) commented Aug 14, 2019:

Could do that. It's probably an exception that won't ever bite us again anyway.

kaxil pushed a commit that referenced this pull request Dec 14, 2019
kaxil pushed a commit that referenced this pull request Dec 14, 2019
kaxil pushed a commit that referenced this pull request Dec 17, 2019
Labels
provider:google Google (including GCP) related issues
8 participants