Optimize Snowflake load_file using native COPY INTO #544
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #544      +/-   ##
==========================================
+ Coverage   92.59%   92.68%   +0.08%
==========================================
  Files          40       40
  Lines        1594     1613      +19
  Branches      205      206       +1
==========================================
+ Hits         1476     1495      +19
  Misses         93       93
  Partials      25       25
Continue to review full report at Codecov.
Looks good to me with a minor comment. I think code coverage might need to be improved.
src/astro/databases/snowflake.py (outdated)
:param source_file: File from which we need to transfer data
:param target_table: Table to which the content of the file will be loaded to
:param if_exists: Strategy used to load (currently supported: "append" or "replace")
:param native_support_kwargs: kwargs to be used by method involved in native support flow
Would it be useful for users to know which native_support_kwargs are supported? Or maybe add a reference to the COPY INTO documentation?
@sunank200, please have a look now!
I had already added a reference to the official docs at the bottom of the docstring. Do you think I should add another, or place it somewhere else?
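For context, a hedged usage sketch of the native flow (import paths, the storage_integration key, and connection IDs are assumptions that vary across astro-sdk versions; see the COPY INTO reference in the docstring):

```python
from astro import sql as aql
from astro.files import File
from astro.table import Table

# native_support_kwargs are forwarded to the native COPY INTO flow, e.g. to
# name the Snowflake storage integration used to read from S3 (assumed key).
load_to_snowflake = aql.load_file(
    input_file=File(path="s3://my-bucket/my-data.csv"),
    output_table=Table(name="MY_TABLE", conn_id="snowflake_default"),
    use_native_support=True,
    native_support_kwargs={"storage_integration": "my_s3_integration"},
    if_exists="replace",
)
```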
The test is calling the wrong method, I assume. It should have called load_file_to_table_natively(). That's the reason coverage shows this test as not being run.
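A minimal sketch of the kind of check being suggested, assuming a mock-based unit test; the patch target, fixture names, and the load_file_to_table entry point are assumptions, not repository code:

```python
from unittest import mock

from astro.databases.snowflake import SnowflakeDatabase


def test_load_file_dispatches_to_native_path(sample_csv_file, snowflake_table):
    # Patch the native-path method so the test fails loudly if the generic
    # (non-COPY INTO) path is taken instead.
    with mock.patch.object(
        SnowflakeDatabase, "load_file_to_table_natively"
    ) as load_natively:
        database = SnowflakeDatabase(conn_id="snowflake_default")
        # Assumed entry point; the real method/kwargs may differ by version.
        database.load_file_to_table(
            input_file=sample_csv_file,
            output_table=snowflake_table,
            use_native_support=True,
        )
        load_natively.assert_called_once()
```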
Fix: #430 Co-authored-by: Ankit Chaurasia <[email protected]>
We added the benchmark results to the PR and fixed the broken tests. Two checks are failing, probably because of pypa/pip#11298, which affects all branches. I just re-ran the tests for a branch that was passing 4 hours ago and hadn't been changed, and the same tests failed. I've contacted TP to get his advice; in the meantime, we'll merge this PR.
""" | ||
|
||
|
||
@aql.dataframe(identifiers_as_lower=False) | ||
@aql.dataframe() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to handle capitalization more consistently across dataframes and file loading after implementing: #564
@@ -67,11 +67,11 @@ def example_amazon_s3_snowflake_transform():
    )

    temp_table_1 = aql.load_file(
-       input_file=File(path=f"{s3_bucket}/ADOPTION_CENTER_1.csv"),
+       input_file=File(path=f"{s3_bucket}/ADOPTION_CENTER_1_unquoted.csv"),
The CSV was very odd beforehand, with every field quoted:
"header1","header2"
"value1","value2"
The COPY INTO command didn't accept this format, and we decided that this file could be cleaned beforehand. Users may face similar issues in the future.
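For users who hit the same problem, one hedged workaround is to set FIELD_OPTIONALLY_ENCLOSED_BY = '"' in the COPY INTO file format; another is to rewrite the file without the blanket quoting first. A minimal pre-cleaning sketch (file names are illustrative):

```python
import csv

# Rewrite a fully quoted CSV so fields are only quoted when actually needed.
with open("ADOPTION_CENTER_1.csv", newline="") as src, open(
    "ADOPTION_CENTER_1_unquoted.csv", "w", newline=""
) as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)
    for row in csv.reader(src):
        writer.writerow(row)
```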
| database  | dataset    | total_time | memory_rss | cpu_time_user | cpu_time_system |
|-----------|------------|------------|------------|---------------|-----------------|
| snowflake | hundred_kb | 9.19s      | 45.4MB     | 2.55s         | 120.0ms         |
| snowflake | ten_mb     | 10.9s      | 47.51MB    | 2.58s         | 160.0ms         |
| snowflake | one_gb     | 1.07min    | 47.94MB    | 8.7s          | 5.67s           |
| snowflake | five_gb    | 5.49min    | 53.69MB    | 18.76s        | 1.6s            |
I'll log a ticket for us to further investigate why we weren't able to run the benchmark with 10 GB for this implementation.
# Snowflake doesn't handle mixed capitalisation of column names well;
# we are handling this more gracefully in a separate PR
if dataframe is not None:
    dataframe.columns = dataframe.columns.str.upper()
We should be able to handle capitalization more consistently across dataframes and file loading after implementing: #564
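As an aside on the snippet above: pandas' Index.str.upper() returns a new Index rather than mutating in place, so the result has to be assigned back. A standalone illustration (not repository code):

```python
import pandas as pd

df = pd.DataFrame({"Name": [1], "age": [2]})

df.columns.str.upper()                # returns Index(["NAME", "AGE"]); df is unchanged
df.columns = df.columns.str.upper()   # df.columns is now Index(["NAME", "AGE"])
```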
Fix: #430
closes: #481
Reduced the time to load 5 GB datasets to Snowflake from 24.46 min to 5.49 min, a ~4.5x speedup (the load now takes roughly a fifth of the previous time). Further details in the PR results file.
Co-authored-by: Ankit Chaurasia [email protected]