Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-43390][SQL] DSv2 allows CTAS/RTAS to reserve schema nullability #41070

Closed
wants to merge 3 commits into from

Conversation

pan3793
Copy link
Member

@pan3793 pan3793 commented May 5, 2023

What changes were proposed in this pull request?

Add a new method useNullableQuerySchema in Table, to allow the DataSource implementation to declare whether they need to reserve schema nullability on CTAS/RTAS.

Why are the changes needed?

SPARK-28837 forcibly uses the nullable schema on CTAS/RTAS, which seems too aggressive:

  1. The existing matured RDBMSs have different behaviors for reserving schema nullability on CTAS/RTAS, as mentioned in [SPARK-28837][SQL] CTAS/RTAS should use nullable schema #25536, PostgreSQL forcibly uses the nullable schema, but MySQL respects the query's output schema nullability.
  2. Some OLAP systems(e.g. ClickHouse) are perf-sensitive for nullable, and have strict restrictions on table schema, e.g. the primary keys are not allowed to be nullable.

Does this PR introduce any user-facing change?

Yes, this PR adds a new DSv2 API, but the default implementation reserves backward compatibility.

How was this patch tested?

UTs are updated.

@pan3793
Copy link
Member Author

pan3793 commented May 5, 2023

cc @cloud-fan @sunchao @beliefer

* Return whether to reserve schema nullability of query output or forcibly use nullable schema
* on creating table implicitly, e.g. CTAS/RTAS.
*/
default boolean createTableReserveSchemaNullability() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

/**
 * If true, mark all the fields of the query schema as nullable when executing
 * CREATE/REPLACE TABLE ... AS SELECT ... and creating the table.
 */
default boolean useNullableQuerySchema

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, changed

@@ -3261,3 +3278,7 @@ class FakeV2Provider extends SimpleTableProvider {
throw new UnsupportedOperationException("Unnecessary for DDL tests")
}
}

class TestCreateTableReserveSchemaNullabilityCatalog extends InMemoryCatalog {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about ForbidSchemaNullabilityCatalog ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"forbid" seems not accurate here, would prefer to use "reserve" to represent respecting the query schema nullability

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I means the name seems too long. Could you rename it with simple one ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "ReserveSchemaNullabilityCatalog"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to "ReserveSchemaNullabilityCatalog"

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in ee9e36d May 9, 2023
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023
### What changes were proposed in this pull request?

Add a new method `useNullableQuerySchema` in `Table`, to allow the DataSource implementation to declare whether they need to reserve schema nullability on CTAS/RTAS.

### Why are the changes needed?

SPARK-28837 forcibly uses the nullable schema on CTAS/RTAS, which seems too aggressive:

1. The existing matured RDBMSs have different behaviors for reserving schema nullability on CTAS/RTAS, as mentioned in apache#25536, PostgreSQL forcibly uses the nullable schema, but MySQL respects the query's output schema nullability.
2. Some OLAP systems(e.g. ClickHouse) are perf-sensitive for nullable, and have strict restrictions on table schema, e.g. the primary keys are not allowed to be nullable.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a new DSv2 API, but the default implementation reserves backward compatibility.

### How was this patch tested?

UTs are updated.

Closes apache#41070 from pan3793/SPARK-43390.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants