Please describe the feature you'd like to see
Currently, we do not support native transfers between different file locations and databases in the LoadFile operator, even though the external provider libraries support optimised approaches to transfer the data. For example, to transfer from S3 to Snowflake we could use the COPY INTO command to load data from a cloud storage bucket into tables, which would make the transfer much faster.
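As a rough, hedged illustration (not the SDK's current behaviour), such a native load could be issued through the existing Airflow Snowflake hook; the connection id, stage, and table names below are placeholders:

```python
# Minimal sketch, assuming an external stage already points at the S3 bucket.
# `my_snowflake_conn`, `my_s3_stage`, and `target_table` are placeholder names.
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


def native_s3_to_snowflake_load(s3_key_prefix: str) -> None:
    hook = SnowflakeHook(snowflake_conn_id="my_snowflake_conn")
    # COPY INTO makes Snowflake pull the files directly from the stage,
    # so the data never passes through the worker as a pandas dataframe.
    hook.run(
        f"""
        COPY INTO target_table
        FROM @my_s3_stage/{s3_key_prefix}
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        """
    )
```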
Describe the solution you'd like
@utkarsharma2 and I were discussing various approaches. One of the most optimised approaches looks like using existing Airflow hooks to achieve this. For example, S3ToSnowflakeOperator already has the whole logic implemented. I propose using existing transfer operators from Airflow providers wherever possible. If such a provider doesn't exist, we could implement that logic on our end in the respective location classes in astro-sdk-python.
For each database class, like SnowflakeDatabase or BigqueryDatabase, we could maintain a list of source storage types from which an optimised transfer is possible. In the LoadFile operator's execute() method we can then do the following (see the sketch after this list):
Check whether an optimised transfer exists from the source file storage (input_file) to the destination database (output_table). This mapping can be maintained in each database and location class.
If an optimised transfer exists, fetch the relevant hooks from the Airflow providers and do the transfer. Alternatively, use the transfer operators already implemented in Airflow directly where they exist (for example, S3ToSnowflakeOperator).
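A minimal sketch of that check, assuming a per-database allow-list; `NATIVE_LOAD_SOURCES`, `load_file_natively`, and `load_via_dataframe` are illustrative names, not existing astro-sdk-python APIs:

```python
# Hypothetical sketch of the dispatch inside LoadFile.execute();
# all class/method/attribute names below are placeholders.


class SnowflakeDatabase:
    # Source location types Snowflake can ingest natively (e.g. via COPY INTO).
    NATIVE_LOAD_SOURCES = {"s3", "gs"}


def load_file_to_table(input_file, output_table, database):
    source = input_file.location_type  # e.g. "s3", "gs", "local" (assumed attribute)
    if source in getattr(database, "NATIVE_LOAD_SOURCES", set()):
        # Optimised path: reuse the provider hook or transfer operator
        # (e.g. S3ToSnowflakeOperator) instead of going through pandas.
        database.load_file_natively(input_file, output_table)
    else:
        # Fallback: the current dataframe-based load.
        database.load_via_dataframe(input_file, output_table)
```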
Are there any alternatives to this feature?
Is there another way we could solve this problem or enable this use-case?
A next step would be to chunk or split the file and do parallel async uploads, but using optimised transfers via Airflow hooks would already improve performance drastically.
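A very rough sketch of that follow-up idea, where the chunk size is arbitrary and `upload_chunk` is a placeholder for whatever storage SDK would actually be used:

```python
import asyncio
from pathlib import Path

CHUNK_SIZE = 16 * 1024 * 1024  # arbitrary 16 MiB parts


async def upload_chunk(part_number: int, data: bytes) -> None:
    # Placeholder: call the real storage client here (e.g. one multipart upload part).
    await asyncio.sleep(0)


async def parallel_upload(path: Path) -> None:
    tasks = []
    with path.open("rb") as handle:
        part = 0
        while chunk := handle.read(CHUNK_SIZE):
            part += 1
            tasks.append(asyncio.create_task(upload_chunk(part, chunk)))
    # All parts upload concurrently; real code would also cap concurrency.
    await asyncio.gather(*tasks)


# asyncio.run(parallel_upload(Path("large_file.csv")))
```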
Refactor how tables are created in `BaseDatabase.load_file_to_table`
We should prioritise creating the table from `table.columns` when they are specified by the user, and use dataframe autodetection only as a fallback.
Most of the complexity of #487 was the creation of tables, and this step aims to simplify the Snowflake `load_file` optimisation.
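A minimal sketch of the proposed order, assuming helper names such as `create_table_using_columns` and `create_table_using_schema_autodetection` (illustrative, not the current API):

```python
def create_table(database, table, sample_dataframe):
    if table.columns:
        # The user declared the schema explicitly, so use it as-is.
        database.create_table_using_columns(table)
    else:
        # Fallback: infer column names and types from a sampled dataframe.
        database.create_table_using_schema_autodetection(table, sample_dataframe)
```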
Relates to: #430, #481, #493, #494