-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark-3.5: Support CTAS and RTAS to preserve schema nullability. #10074
Spark-3.5: Support CTAS and RTAS to preserve schema nullability. #10074
Conversation
@amogh-jahagirdar @aokolnychyi @RussellSpitzer can you please review this? Thanks! |
* SELECT ... and creating the table. If false, fields' nullability will be preserved when | ||
* creating the table. | ||
*/ | ||
private static final String TABLE_CREATE_NULLABLE_QUERY_SCHEMA = "use-nullable-query-schema"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have mixed feelings about the name. On one hand, it is not very descriptive. On the other hand, it matches the Spark API. Let me think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be in favor of slight renaming of the variable name and removing the doc if the name is clear enough for better grouping. This is a private variable. We should add it to our docs, though.
private static final String USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS = "use-nullable-query-schema";
private static final boolean USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS_DEFAULT = true;
private boolean useNullableQuerySchema = USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS_DEFAULT;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, if there's a pointer to where use-nullable-query-schema
is in the Spark API that would be good to see (I couldn't find anything with searching). IMO something that mentions "preserve" would be a better verb rather than "use" as well as something that clarifies this applies for CTAS/RTAS, since it's a bit more clear that we are essentially preserving the nullability from the source query. So something like preserve-ctas-rtas-nullability
feels a bit more direct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not super opinionated though, if there's something in Spark that's already following this naming then I'd agree to just follow that since it's less of a burden on a user to be aware of these different namings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to this method that we have to overload.
/**
* If true, mark all the fields of the query schema as nullable when executing
* CREATE/REPLACE TABLE ... AS SELECT ... and creating the table.
*/
default boolean useNullableQuerySchema() {
return true;
}
I agree the name is not very clear but I also don't know if there is a lot of value in deviating from Spark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That method can be used in more use cases in the future so it is probably best to stick with what Spark calls it.
ImmutableMap.of( | ||
"type", "hive", | ||
"default-namespace", "default", | ||
"use-nullable-query-schema", "false") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why always false? Don't we want to test both values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks mostly good to me, a few minor comments.
@aokolnychyi @amogh-jahagirdar Thanks for reviewing, comments have been addresed, please take a look when you have time. |
Thanks, @zhongyujiang! Thanks for reviewing, @amogh-jahagirdar! |
This PR adds a new catalog parameter
use-nullable-query-schema
to control whether to set all fields to null in CTAS and RTAS operations.Currently, when using CTAS and RTAS to create tables, the fields of new tables are always marked as optional, even if thier source fileds are marked as required in the original table. By utilizing the parameter
use-nullable-query-schema
, we can control whether to preserve the nullability of fields when creating a new table using CTAS or RTAS.Set
use-nullable-query-schema
tofalse
to preserve the nullability of fields:Releated Spark PR: SPARK-43390
Close #7771