Currently, the loader does not provide Databricks with the credentials needed to copy from S3. Instead, we require the user to have pre-configured their Databricks setup in quite a complicated way:
Create an IAM role in the Databricks AWS sub-account, with permission to read from the S3 bucket of transformed events.
Give the Databricks deployment role permission to assume the above role.
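For context, the pre-configured setup above amounts to something like the following cross-account trust policy on the bucket-reading role (account ID and role name here are hypothetical):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/databricks-deployment-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```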
These complications will all go away if we provide the credentials in the COPY INTO statement. Furthermore, this is the only way we can load data cross-cloud from an AWS Snowplow pipeline into Databricks on Azure/GCP. The new statement should look something like:
```sql
COPY INTO atomic.events
FROM (
  SELECT <....columns....> FROM 's3://mybucket/transformed/run=1'
)
WITH (
  CREDENTIAL (AWS_ACCESS_KEY = '...', AWS_SECRET_KEY = '...', AWS_SESSION_TOKEN = '...')
)
FILEFORMAT = PARQUET
COPY_OPTIONS ('MERGESCHEMA' = 'TRUE')
```
And something similar is also needed for the folder monitoring.
The loader should generate the credentials using the AWS STS SDK.
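As a rough sketch of how the pieces could fit together (function and parameter names below are my own, not the loader's): the loader would call STS — AssumeRole returns an `AccessKeyId`, `SecretAccessKey` and `SessionToken` — and interpolate the result into the statement. The statement-building half, in Python for illustration:

```python
# Illustrative sketch only: renders the COPY INTO statement from temporary
# credentials. In the real loader the credentials would come from an STS
# AssumeRole call; here they are plain parameters.

def build_copy_statement(folder: str, columns: str,
                         access_key: str, secret_key: str,
                         session_token: str) -> str:
    """Render a Databricks COPY INTO statement with inline temporary credentials."""
    return (
        "COPY INTO atomic.events\n"
        f"FROM (\n  SELECT {columns} FROM '{folder}'\n)\n"
        "WITH (\n"
        f"  CREDENTIAL (AWS_ACCESS_KEY = '{access_key}', "
        f"AWS_SECRET_KEY = '{secret_key}', "
        f"AWS_SESSION_TOKEN = '{session_token}')\n"
        ")\n"
        "FILEFORMAT = PARQUET\n"
        "COPY_OPTIONS ('MERGESCHEMA' = 'TRUE')"
    )

stmt = build_copy_statement("s3://mybucket/transformed/run=1", "*",
                            "AKIA...", "secret...", "token...")
print(stmt)
```

The session token is what makes the credentials temporary, so it must always be included alongside the key pair.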
Implementation ideas
To start with, it is probably OK to create a new set of credentials each time we need to do a load. Doing it this way will keep the code quite clean, even though it might be more efficient to re-use credentials for multiple loads. The session duration can be set equal to the load timeout config setting.
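One detail worth noting: STS bounds session durations (AssumeRole accepts a DurationSeconds from 900 seconds up to the role's configured maximum session duration, 3600 by default), so the load timeout can't always be passed through verbatim. A small helper, with constants chosen as assumptions:

```python
# Sketch: derive the STS DurationSeconds from the loader's load timeout.
# STS AssumeRole rejects durations below 900s or above the role's maximum
# session duration (3600s unless raised on the role), so clamp into range.

STS_MIN_SECONDS = 900
ROLE_MAX_SECONDS = 3600  # assumption: the role keeps the default maximum

def session_duration(load_timeout_seconds: int) -> int:
    return max(STS_MIN_SECONDS, min(load_timeout_seconds, ROLE_MAX_SECONDS))

print(session_duration(600))    # below the STS minimum
print(session_duration(1800))   # within range, passed through
print(session_duration(7200))   # above the assumed role maximum
```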
This would be a breaking change for the Databricks loader, because it means the loader must run with a role that has access to the data; previously it was a principle that the loader should not have access to the data. We should consider making this STS feature configurable on/off (I'm 50:50 on whether we should do that).
One day, we might make a similar change for loading into Snowflake using temporary credentials -- but there are complications so we won't do it immediately. And we might also do it for loading into Redshift, but there is less urgency for Redshift.