
Add support to GCP connections that define keyfile_dict instead of keyfile #352

Merged
merged 9 commits into astronomer:main on Jul 24, 2023

Conversation

JoeSham
Contributor

@JoeSham JoeSham commented Jun 28, 2023

Add support for Google Cloud Platform connections that define keyfile_dict (the actual keyfile contents) instead of keyfile (a path to the keyfile).

A design decision for this implementation was not to add keyfile_json to secret_fields, because this property is originally JSON. While storing it as an environment variable would be simple, we would need more significant changes to our profile parsing to correctly render it back into the profiles.yml generated by Cosmos.
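
For context on that decision, secret fields in Cosmos profile mappings are normally injected into the generated profiles.yml as environment-variable placeholders. A rough Python sketch of why that scheme does not fit a nested JSON value (the placeholder name in the comment is illustrative, not Cosmos's actual internals):

    import json

    # A flat secret such as a password maps cleanly onto a single env var
    # placeholder in the generated profiles.yml, along the lines of:
    #   password: "{{ env_var('COSMOS_CONN_GOOGLE_CLOUD_DEFAULT_PASSWORD') }}"
    #
    # keyfile_json, however, is a nested mapping. Serializing it into a
    # single env var yields a JSON *string*, while dbt's
    # service-account-json method expects a mapping, so the profile parser
    # would need extra logic to re-render it.
    keyfile_dict = {"type": "service_account", "project_id": "my-project"}
    flattened = json.dumps(keyfile_dict)  # what a single env var could carry
    assert isinstance(flattened, str) and isinstance(keyfile_dict, dict)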

This used to work in Cosmos 0.6.x and stopped working in 0.7.x as part of a previous profile refactor (#271).

Closes: #350

Co-authored-by: Tatiana Al-Chueyr [email protected]

How to validate this change

  1. Have a GCP BQ service account with the BigQuery Data Editor role
  2. Create a namespace (BigQuery dataset) that can be accessed by the service account from step 1
  3. Create an Airflow GCP connection that uses keyfile (the path to the service account credentials saved locally). Example (replace some-namespace with the dataset from step 2 and key_path with the path from step 1):
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT='{"conn_type": "google_cloud_platform", "extra": {"key_path": "/home/some-user/key.json", "scope": "https://www.googleapis.com/auth/cloud-platform", "project": "astronomer-dag-authoring", "dataset": "some-namespace" , "num_retries": 5}}'
  4. Update basic_cosmos_dag.py with the following lines, making sure it references the Airflow connection created in step 3 and the dataset created in step 2 (see the fuller sketch at the end of this list):
   conn_id="google_cloud_default",
   profile_args={
       "dataset": "some-namespace",
   },
  5. Run the DAG, for instance:
 PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 airflow dags test basic_cosmos_dag `date -Iseconds`
  6. Change the Airflow GCP connection to use keyfile_dict (hard-coding the keyfile contents in the connection, replacing <your keyfile content here>):
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT='{"conn_type": "google_cloud_platform", "extra": {"keyfile_dict": <your keyfile content here>, "scope": "https://www.googleapis.com/auth/cloud-platform", "project": "astronomer-dag-authoring", "dataset": "cosmos" , "num_retries": 5}}'
  7. Run the Cosmos-powered DAG again, as in step 5, to confirm the connection from step 6 works.
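
For reference, the edit in step 4 amounts to something like the following sketch of basic_cosmos_dag.py. Only conn_id and profile_args come from the steps above; every other argument, and the exact DbtDag signature, is illustrative and depends on the Cosmos version being validated:

    from datetime import datetime

    from cosmos import DbtDag

    # Hypothetical sketch of the edited example DAG; argument names other
    # than conn_id and profile_args may differ across Cosmos versions.
    basic_cosmos_dag = DbtDag(
        dbt_project_name="jaffle_shop",   # illustrative example project
        conn_id="google_cloud_default",   # the Airflow connection from step 3
        profile_args={
            "dataset": "some-namespace",  # the dataset from step 2
        },
        dag_id="basic_cosmos_dag",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
    )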

@netlify

netlify bot commented Jun 28, 2023

👷 Deploy Preview for amazing-pothos-a3bca0 processing.

🔨 Latest commit: 68308b5
🔍 Latest deploy log: https://app.netlify.com/sites/amazing-pothos-a3bca0/deploys/64be7c079b411c0008ff2a45

@tatiana
Collaborator

tatiana commented Jul 18, 2023

@JoeSham I wrote some tests for this feature in 615a256. Feel free to cherry-pick them! I left one last comment, and I believe we'll be good to merge.

@codecov

codecov bot commented Jul 18, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -10.63 ⚠️

Comparison is base (1058544) 91.18% compared to head (68308b5) 80.55%.

Additional details and impacted files
@@             Coverage Diff             @@
##             main     #352       +/-   ##
===========================================
- Coverage   91.18%   80.55%   -10.63%     
===========================================
  Files          45       46        +1     
  Lines        1565     1579       +14     
===========================================
- Hits         1427     1272      -155     
- Misses        138      307      +169     
Impacted Files Coverage Δ
cosmos/operators/local.py 56.06% <ø> (-25.11%) ⬇️
cosmos/profiles/__init__.py 92.85% <100.00%> (+0.26%) ⬆️
cosmos/profiles/base.py 95.18% <100.00%> (+0.11%) ⬆️
cosmos/profiles/bigquery/__init__.py 100.00% <100.00%> (ø)
.../profiles/bigquery/service_account_keyfile_dict.py 100.00% <100.00%> (ø)

... and 9 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@tatiana tatiana changed the title from "Add support for BQ SA keyfile_dict" to "Add support to GCP connections that define keyfile_dict instead of keyfile" on Jul 23, 2023
Collaborator

@tatiana tatiana left a comment


Thanks a lot, @JoeSham! This looks great.
I triggered Cosmos Airflow DAGs with both ways of creating a BQ profile and confirmed it works as expected.

@tatiana tatiana marked this pull request as ready for review July 24, 2023 13:15
@tatiana tatiana requested a review from a team as a code owner July 24, 2023 13:15
@tatiana tatiana requested a review from a team July 24, 2023 13:15
@tatiana tatiana merged commit f153c49 into astronomer:main Jul 24, 2023
38 of 39 checks passed
@joppedh

joppedh commented Aug 7, 2023

@JoeSham @tatiana I am trying to get this working.
I added the connection, and it gets rendered as expected:

bigquery:
    outputs:
        dev:
            dataset: my_bigquery_dataset
            keyfile_json: '{   "type": "service_account",   "project_id": "stg",   "universe_domain":
                "googleapis.com" }'
            method: service-account-json
            threads: 1
            type: bigquery

But that is not supported by dbt; it does not compile. Do we need to adjust the format of the service account JSON to make it work?

@jlaneve
Collaborator

jlaneve commented Aug 7, 2023

It looks like maybe this is supposed to write the keyfile_json value as YAML instead of the JSON string we have now?

https://docs.getdbt.com/docs/core/connect-data-platform/bigquery-setup#service-account-json
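
For concreteness, the dbt docs linked above show keyfile_json as a nested mapping under the profile rather than a single quoted JSON string. A small sketch of the expected shape (all field values are placeholders), generated here with PyYAML to mirror what Cosmos would need to emit:

    import yaml

    # Expected shape per dbt's service-account-json docs: keyfile_json is a
    # nested mapping, not a quoted JSON string. Values are placeholders.
    profile = {
        "bigquery": {
            "target": "dev",
            "outputs": {
                "dev": {
                    "type": "bigquery",
                    "method": "service-account-json",
                    "dataset": "my_bigquery_dataset",
                    "threads": 1,
                    "keyfile_json": {
                        "type": "service_account",
                        "project_id": "stg",
                        "universe_domain": "googleapis.com",
                    },
                }
            },
        }
    }
    print(yaml.safe_dump(profile, sort_keys=False))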

@joppedh

joppedh commented Aug 7, 2023

@jlaneve Exactly. I think we still need to unpack the JSON and create YAML fields from it.

@jlaneve
Collaborator

jlaneve commented Aug 7, 2023

> @jlaneve Exactly. I think we still need to unpack the JSON and create YAML fields from it.

I agree. Do you mind opening an issue for this? We're also very open to contributions in case you'd like to open a PR 🙂

@tatiana
Collaborator

tatiana commented Aug 7, 2023

@joppedh @jlaneve When this PR was done and the steps in the description were followed, it worked. Cosmos was deserializing the dictionary into YAML, and we could run the jaffle_shop example pointing to BQ.

@joppedh could you confirm whether you tested with this exact version of Cosmos or another one? It may be a bug in 1.0, which contains many other changes.

+1 for adding a bug ticket with the steps to reproduce the issue and the version of Cosmos where the issue occurs.

@joppedh

joppedh commented Aug 8, 2023

Hey @tatiana, you are right. It does work when I create the connection from the command line with airflow connections add ....
However, it does not work if you add the credentials manually from the webserver. This happens because of how Airflow stores the JSON: when we read it back, it is a string and not a dictionary.
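
A minimal sketch of that discrepancy and the obvious workaround (the helper name is hypothetical, not part of Cosmos):

    import json

    def coerce_keyfile_dict(value):
        """Return keyfile_dict as a dict, whether Airflow hands it back as
        a dict (CLI/env-var connections) or as a JSON string (UI-created
        connections)."""
        if isinstance(value, str):
            return json.loads(value)
        return value

    # Both forms normalize to the same dictionary:
    assert coerce_keyfile_dict('{"type": "service_account"}') == {"type": "service_account"}
    assert coerce_keyfile_dict({"type": "service_account"}) == {"type": "service_account"}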
