Spark-3.5: Support CTAS and RTAS to preserve schema nullability. #10074

zhongyujiang · 2024-04-01T12:16:07Z

This PR adds a new catalog parameter use-nullable-query-schema to control whether all fields are marked as nullable in CTAS and RTAS operations.
Currently, when using CTAS and RTAS to create tables, the fields of new tables are always marked as optional, even if their source fields are marked as required in the original table. With the parameter use-nullable-query-schema, we can control whether the nullability of fields is preserved when creating a new table using CTAS or RTAS.

Set use-nullable-query-schema to false to preserve the nullability of fields:

spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.use-nullable-query-schema=false
...
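To illustrate the intended effect of the flag, here is a minimal, self-contained sketch. It does not use Iceberg or Spark types: `QuerySchemaSketch`, `Field`, and `schemaForNewTable` are hypothetical names introduced only for this example, and the real conversion happens inside Spark and the Iceberg catalog.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the behavior described above; not Iceberg or Spark code.
final class QuerySchemaSketch {

  // Stand-in for a schema field: a name plus a required/optional marker.
  static final class Field {
    final String name;
    final boolean required;

    Field(String name, boolean required) {
      this.name = name;
      this.required = required;
    }
  }

  // If useNullableQuerySchema is true, every field of the new table is optional.
  // If it is false, each field keeps the nullability it had in the query schema.
  static List<Field> schemaForNewTable(List<Field> querySchema, boolean useNullableQuerySchema) {
    List<Field> result = new ArrayList<>();
    for (Field field : querySchema) {
      boolean required = useNullableQuerySchema ? false : field.required;
      result.add(new Field(field.name, required));
    }
    return result;
  }
}
```

With a source schema of a required `id` and an optional `data`, passing `true` yields two optional fields, while passing `false` keeps `id` required.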

Related Spark PR: SPARK-43390
Close #7771

zhongyujiang · 2024-04-01T12:18:00Z

@amogh-jahagirdar @aokolnychyi @RussellSpitzer can you please review this? Thanks!

aokolnychyi · 2024-04-11T01:09:05Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/BaseCatalog.java

+   * SELECT ... and creating the table. If false, fields' nullability will be preserved when
+   * creating the table.
+   */
+  private static final String TABLE_CREATE_NULLABLE_QUERY_SCHEMA = "use-nullable-query-schema";


I have mixed feelings about the name. On one hand, it is not very descriptive. On the other hand, it matches the Spark API. Let me think about it.

I'd be in favor of slight renaming of the variable name and removing the doc if the name is clear enough for better grouping. This is a private variable. We should add it to our docs, though.

private static final String USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS = "use-nullable-query-schema";
private static final boolean USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS_DEFAULT = true;
private boolean useNullableQuerySchema = USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS_DEFAULT;

Hm, if there's a pointer to where use-nullable-query-schema is in the Spark API, that would be good to see (I couldn't find anything by searching). IMO "preserve" would be a better verb than "use", and the name should also clarify that this applies to CTAS/RTAS, since that makes it clearer that we are essentially preserving the nullability from the source query. So something like preserve-ctas-rtas-nullability feels a bit more direct.

Not super opinionated though, if there's something in Spark that's already following this naming then I'd agree to just follow that since it's less of a burden on a user to be aware of these different namings.

I was referring to this method that we have to overload.

/**
 * If true, mark all the fields of the query schema as nullable when executing
 * CREATE/REPLACE TABLE ... AS SELECT ... and creating the table.
 */
default boolean useNullableQuerySchema() {
  return true;
}

I agree the name is not very clear but I also don't know if there is a lot of value in deviating from Spark.

That method can be used in more use cases in the future so it is probably best to stick with what Spark calls it.
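For readers following the naming discussion, the sketch below shows one way the catalog option could feed the overridden method. It is a standalone approximation, not the merged code: `NullableQuerySchemaCatalogSketch` and its `initialize(Map)` signature are hypothetical, and parsing with `Boolean.parseBoolean` is an assumption. Only the constant names, the option key, and the default of `true` come from the thread above.

```java
import java.util.Map;

// Hypothetical standalone class; the real catalog implements Spark's catalog interfaces.
final class NullableQuerySchemaCatalogSketch {
  private static final String USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS = "use-nullable-query-schema";
  private static final boolean USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS_DEFAULT = true;

  private boolean useNullableQuerySchema = USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS_DEFAULT;

  // Reads the option from the catalog properties; a missing key keeps the default.
  void initialize(Map<String, String> options) {
    String value = options.get(USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS);
    if (value != null) {
      // Assumption: any value other than "true" (case-insensitive) is treated as false.
      this.useNullableQuerySchema = Boolean.parseBoolean(value);
    }
  }

  // Mirrors the name of the Spark method quoted above.
  boolean useNullableQuerySchema() {
    return useNullableQuerySchema;
  }
}
```

Keeping the option key identical to the Spark method name means a user who finds one can guess the other, which is the trade-off argued for in this thread.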

aokolnychyi · 2024-04-11T01:31:55Z

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/TestSparkCatalogOperations.java

+        ImmutableMap.of(
+            "type", "hive",
+            "default-namespace", "default",
+            "use-nullable-query-schema", "false")


Why always false? Don't we want to test both values?
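One possible shape for covering both values is to build the catalog config from a boolean and derive the expected outcome from it. This is only a sketch: `BothValuesSketch`, `catalogConfig`, and `expectNullabilityPreserved` are hypothetical helpers, and a real test would create tables through Spark and inspect the resulting Iceberg schema instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helpers for parameterizing a test over both option values.
final class BothValuesSketch {

  // Builds the same catalog config as in the diff, with the option value supplied by the caller.
  static Map<String, String> catalogConfig(boolean useNullableQuerySchema) {
    Map<String, String> config = new LinkedHashMap<>();
    config.put("type", "hive");
    config.put("default-namespace", "default");
    config.put("use-nullable-query-schema", String.valueOf(useNullableQuerySchema));
    return config;
  }

  // Expected outcome: required source fields stay required only when the option is false.
  // A missing key falls back to "true", matching the default discussed in this PR.
  static boolean expectNullabilityPreserved(Map<String, String> config) {
    return !Boolean.parseBoolean(config.getOrDefault("use-nullable-query-schema", "true"));
  }
}
```

A test could then loop over `true` and `false`, create the catalog from `catalogConfig(value)`, run CTAS and RTAS, and compare the observed nullability with `expectNullabilityPreserved`.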

aokolnychyi

This looks mostly good to me, a few minor comments.

zhongyujiang · 2024-04-12T11:28:38Z

@aokolnychyi @amogh-jahagirdar Thanks for reviewing. The comments have been addressed; please take a look when you have time.

aokolnychyi · 2024-04-12T21:12:44Z

Thanks, @zhongyujiang! Thanks for reviewing, @amogh-jahagirdar!

Spark-3.5: Support CTAS and RTAS to preserve schema nullability.

852ee3c

github-actions bot added the spark label Apr 1, 2024

zhongyujiang mentioned this pull request Apr 1, 2024

Non-nullable columns marked as nullable during table creation #7771

Closed

amogh-jahagirdar requested review from aokolnychyi and amogh-jahagirdar April 9, 2024 15:24

aokolnychyi reviewed Apr 11, 2024

zhongyujiang added 2 commits April 12, 2024 19:26

Fix variable naming and improve tests.

e1b62cf

Add doc.

7e2dfe9

github-actions bot added the docs label Apr 12, 2024

aokolnychyi approved these changes Apr 12, 2024

aokolnychyi merged commit 81b3310 into apache:main Apr 12, 2024
32 checks passed

zhongyujiang deleted the CTAS-use-nullable-schema branch April 12, 2024 23:24

sasankpagolu pushed a commit to sasankpagolu/iceberg that referenced this pull request Oct 27, 2024

Spark 3.5: Support preserving schema nullability in CTAS and RTAS (ap…

7cc6925

…ache#10074)
