Partition as args in SparkHiveDataSet #725
Comments
Hello @jpoullet2000, welcome to kedro, and thank you very much for the suggestion. I don't know too much about Spark, so I can't make any particularly insightful comments, but it sounds like a very sensible idea. My main question would be about how to implement it. You suggest adding "save_args". Possibly a better approach would be to just add a new argument ... and these are exactly the arguments used in the SQL query in _insert_save. So if partitioning is a common requirement for this dataset, it would seem best to add it as an argument to the dataset itself.
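To make the design question concrete, here is a hypothetical sketch of the two API shapes under discussion; neither existed in kedro at the time, and all names and values below are illustrative:

```python
# Hypothetical API shapes only, for illustrating the discussion above.

# Option A: a dedicated constructor argument for partitioning.
option_a_kwargs = {
    "database": "analytics",      # illustrative database name
    "table": "weather",           # illustrative table name
    "write_mode": "insert",
    "partition_by": ["year", "month"],
}

# Option B: a generic save_args dict, mirroring SparkDataSet,
# with partitionBy as one of its entries.
option_b_kwargs = {
    "database": "analytics",
    "table": "weather",
    "write_mode": "insert",
    "save_args": {"partitionBy": ["year", "month"]},
}
```

Option B would keep SparkHiveDataSet consistent with SparkDataSet and leave room for other writer options beyond partitioning.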
It makes sense to me, @AntonyMilneQB.
@brendalf Go for it! You might like to have a quick read of our guide for contributors.
Thanks guys!
Hi @AntonyMilneQB, can you take a look at #745?
Hi @jpoullet2000, hope you're well! SparkHiveDataSet has now been fully rewritten and partitionBy support has been added, including access to other save_args. You can find the changes in the latest develop code. If you wish to wait for a proper release, you can expect these changes to materialise in version 0.18.0. Hope this helps!
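A hypothetical usage sketch of the rewritten dataset, based only on the comment above; the database/table names, write mode, and exact parameter names and defaults are assumptions and may differ in the released version:

```python
# Hypothetical usage sketch; exact parameter names and defaults may differ
# in the released SparkHiveDataSet.
from kedro.extras.datasets.spark import SparkHiveDataSet

weather_hive = SparkHiveDataSet(
    database="analytics",    # illustrative database name
    table="weather",         # illustrative table name
    write_mode="overwrite",
    save_args={"partitionBy": ["year", "month"]},  # partition columns
)
# weather_hive.save(spark_df)  # would write the table partitioned by year/month
```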
Hi,
Thanks a lot. I'm looking forward to testing it.
KR,
JB
Description
I can only partition my data with SparkDataSet, not with SparkHiveDataSet, whereas I want to save my data in a Hive table and use the _save_validate method to make sure the columns are OK.
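For comparison, a minimal sketch of what already works with SparkDataSet, assuming a kedro 0.17-style import path; the file path and column names are illustrative:

```python
# Minimal sketch: SparkDataSet's save_args are forwarded to Spark's
# DataFrameWriter.save(), which accepts a partitionBy parameter.
from kedro.extras.datasets.spark import SparkDataSet

weather = SparkDataSet(
    filepath="data/02_intermediate/weather",   # illustrative path
    file_format="parquet",
    save_args={"mode": "overwrite", "partitionBy": ["year", "month"]},
)
# weather.save(spark_df)  # writes .../weather/year=.../month=.../part-*.parquet
```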
Context
Saving my data in Hive without this extra validation on the schema might be risky.
Possible Alternatives
Why not use something similar to SparkDataSet: add a "save_args" parameter to the "__init__" method of SparkHiveDataSet and pass a "partitionBy" item, then use it in the _insert_save method by including the partition in the SQL statement, something like "INSERT INTO [TABLE] [db_name.]table_name [PARTITION (partition_spec)] select_statement".
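A rough sketch of how this could look, not the actual kedro implementation: self._save_args, self._database, self._table, and self._get_spark() are assumed attributes/helpers of the dataset, and the PARTITION clause follows Hive's dynamic-partition INSERT syntax.

```python
# Simplified sketch, not the actual kedro implementation: build the Hive INSERT
# statement with an optional PARTITION clause taken from save_args.
def _insert_save(self, data) -> None:
    partition_cols = self._save_args.get("partitionBy", [])
    partition_clause = (
        f"PARTITION ({', '.join(partition_cols)}) " if partition_cols else ""
    )
    # Note: dynamic partitioning in Hive may additionally require
    # hive.exec.dynamic.partition.mode=nonstrict on the session.
    data.createOrReplaceTempView("tmp")
    self._get_spark().sql(
        f"INSERT INTO {self._database}.{self._table} "
        f"{partition_clause}SELECT * FROM tmp"
    )
```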