-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT]: Daft support for Azure storage for Unity Catalog daft.read_deltalake
#3025
[FEAT]: Daft support for Azure storage for Unity Catalog daft.read_deltalake
#3025
Conversation
CodSpeed Performance ReportMerging #3025 will not alter performanceComparing Summary
|
daft.read_deltalake
daft.read_deltalake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking pretty good overall! One question about the azure temp credentials stuff.
I think this PR could be cleaned up by adding a little IOConfig factory class, perhaps in a new daft.io.config_factories.*
module, where we can have two factories FromEnv
and FromUnity
and they can handle the deduplication of the if schema == "..."
logic for us.
But I would think that's not in scope here, we can do that as a follow-on cleanup.
pass | ||
elif scheme == "az" or scheme == "abfs" or scheme == "abfss": | ||
io_config = IOConfig( | ||
azure=AzureConfig(sas_token=temp_table_credentials.azure_user_delegation_sas.get("sas_token")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, do we know what version of the unity catalog SDK this requires? I feel like this might have been added in a later version and might need us to regenerate the SDK. Have you tested this yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @jaychia , I think this should be fine for the existing version shipped out . I tested on Daft 0.3.4
and it includes unitycatalog==0.1.1
. On this SDK version, the sas_token
is being returned successfully.
I have tested this internally by walking through the function execution flow, using a virtual environment created with Daft 0.3.4
without any other upgrades and it has worked well for me. I could use the vended credentials in subsequent calls by instantiating an object of DeltaLakeScanOperator
from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing
@kevinzwang could you handle follow-up here WRT the SDK generation and getting this PR in? This looks really good already thanks @anilmenon14 !! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you @anilmenon14 for contributing this!
Interesting that the azure_user_delegation_sas
value is not actually present in the generated client library, but if still works I am happy to merge it in right now. Once we get the unitycatalog v0.2 client library published, I think we'll work on using that anyway
There are 2 changes in this PR:
UnityCatalog
class methodload_table
:Now
azure_user_delegation_sas.get('sas_token')
is passed down as 'sas_token' inAzureConfig
in theIOConfig
object, after retrieving it from Unity catalog credential vending API endpoint. This token can be used downstream byDeltaLake
.Code that handles existing working functionality for S3 will remain unchanged and is now wrapped within conditional logic based on the storage system involved (i.e. S3, ADLS, or GCS)
DeltaLakeScanOperator
:Conditional logic to ensure that S3 handling logic does not negatively impact
daft.read_deltalake()
calls from other storage systems (E.g. Azure or GCS).deltalake_sdk_io_config
is left unchanged and passed down toDeltaLake
without any modifications.Conditional blocks have been placed for Azure and GCS for future implementations in case any special handling of
deltalake_sdk_io_config
has to be done within those blocks.Note: GCS support has not been added yet. However, based on a better understanding of the credential vending for GCS, can do that in a future PR.