Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark-3.5: Support CTAS and RTAS to preserve schema nullability. #10074

Merged
merged 3 commits into from
Apr 12, 2024

Conversation

zhongyujiang
Copy link
Contributor

This PR adds a new catalog parameter use-nullable-query-schema to control whether to set all fields to null in CTAS and RTAS operations.
Currently, when using CTAS and RTAS to create tables, the fields of new tables are always marked as optional, even if thier source fileds are marked as required in the original table. By utilizing the parameter use-nullable-query-schema, we can control whether to preserve the nullability of fields when creating a new table using CTAS or RTAS.

Set use-nullable-query-schema to false to preserve the nullability of fields:

spark.sql.catalog.my_catalog= org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.use-nullable-query-schema=false
...

Releated Spark PR: SPARK-43390
Close #7771

@github-actions github-actions bot added the spark label Apr 1, 2024
@zhongyujiang
Copy link
Contributor Author

@amogh-jahagirdar @aokolnychyi @RussellSpitzer can you please review this? Thanks!

* SELECT ... and creating the table. If false, fields' nullability will be preserved when
* creating the table.
*/
private static final String TABLE_CREATE_NULLABLE_QUERY_SCHEMA = "use-nullable-query-schema";
Copy link
Contributor

@aokolnychyi aokolnychyi Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have mixed feelings about the name. On one hand, it is not very descriptive. On the other hand, it matches the Spark API. Let me think about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be in favor of slight renaming of the variable name and removing the doc if the name is clear enough for better grouping. This is a private variable. We should add it to our docs, though.

private static final String USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS = "use-nullable-query-schema";
private static final boolean USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS_DEFAULT = true;

private boolean useNullableQuerySchema = USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS_DEFAULT;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, if there's a pointer to where use-nullable-query-schema is in the Spark API that would be good to see (I couldn't find anything with searching). IMO something that mentions "preserve" would be a better verb rather than "use" as well as something that clarifies this applies for CTAS/RTAS, since it's a bit more clear that we are essentially preserving the nullability from the source query. So something like preserve-ctas-rtas-nullability feels a bit more direct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not super opinionated though, if there's something in Spark that's already following this naming then I'd agree to just follow that since it's less of a burden on a user to be aware of these different namings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to this method that we have to overload.

/**
 * If true, mark all the fields of the query schema as nullable when executing
 * CREATE/REPLACE TABLE ... AS SELECT ... and creating the table.
 */
default boolean useNullableQuerySchema() {
  return true;
}

I agree the name is not very clear but I also don't know if there is a lot of value in deviating from Spark.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That method can be used in more use cases in the future so it is probably best to stick with what Spark calls it.

ImmutableMap.of(
"type", "hive",
"default-namespace", "default",
"use-nullable-query-schema", "false")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why always false? Don't we want to test both values?

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks mostly good to me, a few minor comments.

@github-actions github-actions bot added the docs label Apr 12, 2024
@zhongyujiang
Copy link
Contributor Author

@aokolnychyi @amogh-jahagirdar Thanks for reviewing, comments have been addresed, please take a look when you have time.

@aokolnychyi aokolnychyi merged commit 81b3310 into apache:main Apr 12, 2024
32 checks passed
@aokolnychyi
Copy link
Contributor

Thanks, @zhongyujiang! Thanks for reviewing, @amogh-jahagirdar!

@zhongyujiang zhongyujiang deleted the CTAS-use-nullable-schema branch April 12, 2024 23:24
sasankpagolu pushed a commit to sasankpagolu/iceberg that referenced this pull request Oct 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non-nullable columns marked as nullable during table creation
3 participants