Currently, the loader does not provide Databricks with the credentials needed to copy from S3. Instead, we require the user to have pre-configured their Databricks setup in quite a complicated way:
Create an IAM role in the Databricks AWS sub-account, with permission to read from the S3 bucket of transformed events.
Give the Databricks deployment role permission to assume the above role.
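For context, the pre-configured setup above amounts to something like the following cross-account trust policy on the bucket-reading role (account ID and role name here are hypothetical):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/databricks-deployment-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```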
These complications will all go away if we provide the credentials in the COPY INTO statement. Furthermore, this is the only way we can load data cross-cloud from an AWS Snowplow pipeline into Databricks on Azure/GCP. The new statement should look something like:
```sql
COPY INTO atomic.events
FROM (
  SELECT <....columns....> FROM 's3://mybucket/transformed/run=1'
)
WITH (
  CREDENTIAL (AWS_ACCESS_KEY = '...', AWS_SECRET_KEY = '...', AWS_SESSION_TOKEN = '...')
)
FILEFORMAT = PARQUET
COPY_OPTIONS ('MERGESCHEMA' = 'TRUE')
```
And something similar is also needed for the folder monitoring.
The loader should generate the credentials using the AWS STS SDK.
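As a rough sketch of how the pieces could fit together (function and parameter names below are my own, not the loader's): the loader would call STS — AssumeRole returns an `AccessKeyId`, `SecretAccessKey` and `SessionToken` — and interpolate the result into the statement. The statement-building half, in Python for illustration:

```python
# Illustrative sketch only: renders the COPY INTO statement from temporary
# credentials. In the real loader the credentials would come from an STS
# AssumeRole call; here they are plain parameters.

def build_copy_statement(folder: str, columns: str,
                         access_key: str, secret_key: str,
                         session_token: str) -> str:
    """Render a Databricks COPY INTO statement with inline temporary credentials."""
    return (
        "COPY INTO atomic.events\n"
        f"FROM (\n  SELECT {columns} FROM '{folder}'\n)\n"
        "WITH (\n"
        f"  CREDENTIAL (AWS_ACCESS_KEY = '{access_key}', "
        f"AWS_SECRET_KEY = '{secret_key}', "
        f"AWS_SESSION_TOKEN = '{session_token}')\n"
        ")\n"
        "FILEFORMAT = PARQUET\n"
        "COPY_OPTIONS ('MERGESCHEMA' = 'TRUE')"
    )

stmt = build_copy_statement("s3://mybucket/transformed/run=1", "*",
                            "AKIA...", "secret...", "token...")
print(stmt)
```

The session token is what makes the credentials temporary, so it must always be included alongside the key pair.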
Implementation ideas
To start with, it is probably OK to create a new set of credentials each time we need to do a load. Doing it this way will keep the code quite clean, even though it might be more efficient to re-use credentials for multiple loads. The session duration can be set equal to the load timeout config setting.
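One detail worth noting: STS bounds session durations (AssumeRole accepts a DurationSeconds from 900 seconds up to the role's configured maximum session duration, 3600 by default), so the load timeout can't always be passed through verbatim. A small helper, with constants chosen as assumptions:

```python
# Sketch: derive the STS DurationSeconds from the loader's load timeout.
# STS AssumeRole rejects durations below 900s or above the role's maximum
# session duration (3600s unless raised on the role), so clamp into range.

STS_MIN_SECONDS = 900
ROLE_MAX_SECONDS = 3600  # assumption: the role keeps the default maximum

def session_duration(load_timeout_seconds: int) -> int:
    return max(STS_MIN_SECONDS, min(load_timeout_seconds, ROLE_MAX_SECONDS))

print(session_duration(600))    # below the STS minimum
print(session_duration(1800))   # within range, passed through
print(session_duration(7200))   # above the assumed role maximum
```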
This would be a breaking change for the Databricks loader, because it means the loader must run with a role that has access to the data; previously it was a principle that the loader should not have access to the data. We should consider making this STS feature configurable on/off (I'm 50:50 on whether we should do that).
One day, we might make a similar change for loading into Snowflake using temporary credentials -- but there are complications so we won't do it immediately. And we might also do it for loading into Redshift, but there is less urgency for Redshift.