Please describe the feature you'd like to see
Currently, we do not support native transfers between different file locations and databases in the LoadFile operator, even though the external provider libraries support optimised approaches to transfer the data. For example, to transfer from S3 to Snowflake we could use the COPY INTO command to load data from a cloud storage bucket into tables, which would make the transfer much faster.
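As a rough, hedged illustration (not the SDK's current behaviour), such a native load could be issued through the existing Airflow Snowflake hook; the connection id, stage, and table names below are placeholders:

```python
# Minimal sketch, assuming an external stage already points at the S3 bucket.
# `my_snowflake_conn`, `my_s3_stage`, and `target_table` are placeholder names.
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


def native_s3_to_snowflake_load(s3_key_prefix: str) -> None:
    hook = SnowflakeHook(snowflake_conn_id="my_snowflake_conn")
    # COPY INTO makes Snowflake pull the files directly from the stage,
    # so the data never passes through the worker as a pandas dataframe.
    hook.run(
        f"""
        COPY INTO target_table
        FROM @my_s3_stage/{s3_key_prefix}
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        """
    )
```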
Describe the solution you'd like
@utkarsharma2 and I were discussing various approaches. One of the most optimised approaches looks like using existing Airflow hooks to achieve this. For example, S3ToSnowflakeOperator already has the whole logic implemented. I propose using existing transfer operators from Airflow providers wherever possible. If such a provider doesn't exist, we could implement that logic on our end in the respective location classes in astro-sdk-python.
For each database class, like SnowflakeDatabase or BigqueryDatabase, we could maintain a list of source storage types from which an optimised transfer is possible. In the LoadFile operator's execute() method we can then do the following (see the sketch after this list):
Check whether an optimised transfer exists from the source file storage (input_file) to the destination database (output_table). This mapping can be maintained in each database and location class.
If an optimised transfer exists, fetch the relevant hooks from the Airflow providers and do the transfer. Alternatively, use the transfer operators already implemented in Airflow directly where they exist (for example, S3ToSnowflakeOperator).
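A minimal sketch of that check, assuming a per-database allow-list; `NATIVE_LOAD_SOURCES`, `load_file_natively`, and `load_via_dataframe` are illustrative names, not existing astro-sdk-python APIs:

```python
# Hypothetical sketch of the dispatch inside LoadFile.execute();
# all class/method/attribute names below are placeholders.


class SnowflakeDatabase:
    # Source location types Snowflake can ingest natively (e.g. via COPY INTO).
    NATIVE_LOAD_SOURCES = {"s3", "gs"}


def load_file_to_table(input_file, output_table, database):
    source = input_file.location_type  # e.g. "s3", "gs", "local" (assumed attribute)
    if source in getattr(database, "NATIVE_LOAD_SOURCES", set()):
        # Optimised path: reuse the provider hook or transfer operator
        # (e.g. S3ToSnowflakeOperator) instead of going through pandas.
        database.load_file_natively(input_file, output_table)
    else:
        # Fallback: the current dataframe-based load.
        database.load_via_dataframe(input_file, output_table)
```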
Are there any alternatives to this feature?
Is there another way we could solve this problem or enable this use-case?
A next step would be to chunk or split the file and do parallel async uploads, but using optimised transfers via Airflow hooks would already improve performance drastically.
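A very rough sketch of that follow-up idea, where the chunk size is arbitrary and `upload_chunk` is a placeholder for whatever storage SDK would actually be used:

```python
import asyncio
from pathlib import Path

CHUNK_SIZE = 16 * 1024 * 1024  # arbitrary 16 MiB parts


async def upload_chunk(part_number: int, data: bytes) -> None:
    # Placeholder: call the real storage client here (e.g. one multipart upload part).
    await asyncio.sleep(0)


async def parallel_upload(path: Path) -> None:
    tasks = []
    with path.open("rb") as handle:
        part = 0
        while chunk := handle.read(CHUNK_SIZE):
            part += 1
            tasks.append(asyncio.create_task(upload_chunk(part, chunk)))
    # All parts upload concurrently; real code would also cap concurrency.
    await asyncio.gather(*tasks)


# asyncio.run(parallel_upload(Path("large_file.csv")))
```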
Refactor how tables are created in `BaseDatabase.load_file_to_table`
We should prioritise creating the table from `table.columns` when they are specified by the user, and use dataframe autodetection only as a fallback.
Most of the complexity of #487 was the creation of tables, and this step aims to simplify the Snowflake `load_file` optimisation.
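A minimal sketch of the proposed order, assuming helper names such as `create_table_using_columns` and `create_table_using_schema_autodetection` (illustrative, not the current API):

```python
def create_table(database, table, sample_dataframe):
    if table.columns:
        # The user declared the schema explicitly, so use it as-is.
        database.create_table_using_columns(table)
    else:
        # Fallback: infer column names and types from a sampled dataframe.
        database.create_table_using_schema_autodetection(table, sample_dataframe)
```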
Relates to: #430, #481, #493, #494