Databricks loader: Generate STS tokens for copying from S3 #954

Closed
istreeter opened this issue Jun 27, 2022 · 0 comments

Currently, the loader does not provide Databricks with the credentials needed to copy from S3. Instead, we require the user to pre-configure their Databricks setup in a fairly complicated way:

  • Create an IAM role in the Databricks AWS sub-account, with permission to read from the S3 bucket of transformed events.
  • Give the Databricks deployment role permission to assume the above role.

These complications all go away if we provide the credentials in the COPY INTO statement. Furthermore, this is the only way we can load data cross-cloud, from an AWS Snowplow pipeline into Azure/GCP Databricks. The new statement should look something like:

```sql
COPY INTO atomic.events
FROM (
  SELECT <....columns....> FROM 's3://mybucket/transformed/run=1'
)
WITH (
  CREDENTIAL (AWS_ACCESS_KEY = '...', AWS_SECRET_KEY = '...', AWS_SESSION_TOKEN = '...')
)
FILEFORMAT = PARQUET
COPY_OPTIONS('MERGESCHEMA' = 'TRUE')
```

Something similar is also needed for the folder monitoring feature.

The loader should generate the credentials using the AWS STS SDK.
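
A minimal sketch of how the loader could fetch temporary credentials with the AWS SDK for Java v2 STS client. The role ARN, session name and duration are illustrative placeholders, not the loader's real config; in practice they would come from the loader's configuration:

```scala
import software.amazon.awssdk.services.sts.StsClient
import software.amazon.awssdk.services.sts.model.{AssumeRoleRequest, Credentials}

object StsCredentials {

  /** Fetch short-lived credentials for a role that can read the transformed events bucket.
    * `roleArn` and `durationSeconds` are assumed config values; the session name is a placeholder.
    */
  def forLoad(roleArn: String, durationSeconds: Int): Credentials = {
    val sts = StsClient.create() // resolves the loader's own credentials chain
    try {
      val request = AssumeRoleRequest
        .builder()
        .roleArn(roleArn)
        .roleSessionName("rdb-loader-copy")
        .durationSeconds(durationSeconds)
        .build()
      sts.assumeRole(request).credentials()
    } finally sts.close()
  }
}
```

The returned `Credentials` object carries the access key, secret key and session token that the COPY INTO statement needs.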


Implementation ideas

To start with, it is probably OK to create a new set of credentials each time we need to do a load. This keeps the code quite clean, even though it might be more efficient to re-use credentials across multiple loads. The session duration can be set equal to the load timeout config setting.
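
For example (a sketch only, re-using the `StsCredentials.forLoad` helper above; the function names and the `loadTimeout` parameter are hypothetical, not the loader's real API), the loader could assume the role once per load and interpolate the resulting credentials into the generated statement:

```scala
import scala.concurrent.duration.FiniteDuration
import software.amazon.awssdk.services.sts.model.Credentials

/** Render the COPY INTO statement with a freshly generated CREDENTIAL clause.
  * The statement shape follows the example above; names here are illustrative.
  */
def copyIntoStatement(path: String, columns: String, creds: Credentials): String =
  s"""COPY INTO atomic.events
     |FROM (
     |  SELECT $columns FROM '$path'
     |)
     |WITH (
     |  CREDENTIAL (
     |    AWS_ACCESS_KEY = '${creds.accessKeyId}',
     |    AWS_SECRET_KEY = '${creds.secretAccessKey}',
     |    AWS_SESSION_TOKEN = '${creds.sessionToken}'
     |  )
     |)
     |FILEFORMAT = PARQUET
     |COPY_OPTIONS('MERGESCHEMA' = 'TRUE')""".stripMargin

// One new set of credentials per load; the session lives as long as the load timeout.
def statementForLoad(loadTimeout: FiniteDuration, roleArn: String): String = {
  val creds = StsCredentials.forLoad(roleArn, loadTimeout.toSeconds.toInt)
  copyIntoStatement("s3://mybucket/transformed/run=1", "<....columns....>", creds)
}
```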

This would be a breaking change for the Databricks loader, because it means the loader must run with a role that has access to the data; previously it was a principle that the loader should not have access to the data. We should consider making this STS feature configurable on/off (I'm 50:50 on whether we should do that).

One day, we might make a similar change for loading into Snowflake using temporary credentials, but there are complications, so we won't do it immediately. We might also do it for loading into Redshift, although there is less urgency there.
