[SPARK-48824][SQL] Add Identity Column SQL syntax #47614

zhipengmao-db · 2024-08-05T16:26:39Z

What changes were proposed in this pull request?

Add SQL support for creating identity columns. Users can specify a column as GENERATED ALWAYS AS IDENTITY(identityColumnSpec), where identity values are always generated by the system, or as GENERATED BY DEFAULT AS IDENTITY(identityColumnSpec), where users can specify the identity values.

Users can optionally specify the starting value of the column (default = 1) and the increment/step of the column (default = 1). Both of the following orderings are accepted:
START WITH <start> INCREMENT BY <step>
and
INCREMENT BY <step> START WITH <start>

This flexible ordering of the increment and starting values is supported because both variants are used in the wild by other systems (e.g. PostgreSQL, Oracle).
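To illustrate the order-independent option list, here is a minimal sketch in plain Java. It is not the ANTLR grammar or AstBuilder code from this PR: the class name `IdentityOptionsSketch`, the regex, and the choice to reject a repeated option are all assumptions made for illustration. Only the defaults (start = 1, step = 1) come from the description above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not Spark's ANTLR-based parser.
final class IdentityOptionsSketch {
    final long start;
    final long step;

    private IdentityOptionsSketch(long start, long step) {
        this.start = start;
        this.step = step;
    }

    // \G anchors each match where the previous one ended, so options must be contiguous.
    private static final Pattern OPTION = Pattern.compile(
        "\\G\\s*(START\\s+WITH|INCREMENT\\s+BY)\\s+(-?\\d+)", Pattern.CASE_INSENSITIVE);

    // Parses the text between the parentheses of IDENTITY(...), accepting either order.
    static IdentityOptionsSketch parse(String options) {
        Long start = null;
        Long step = null;
        int pos = 0;
        Matcher m = OPTION.matcher(options);
        while (m.find()) {
            boolean isStart = m.group(1).regionMatches(true, 0, "START", 0, 5);
            long value = Long.parseLong(m.group(2));
            if (isStart) {
                // Assumption of this sketch: a repeated option is an error.
                if (start != null) throw new IllegalArgumentException("START WITH specified twice");
                start = value;
            } else {
                if (step != null) throw new IllegalArgumentException("INCREMENT BY specified twice");
                step = value;
            }
            pos = m.end();
        }
        String rest = options.substring(pos).trim();
        if (!rest.isEmpty()) {
            throw new IllegalArgumentException("Unrecognized option: " + rest);
        }
        // Defaults from the PR description: start = 1, step = 1.
        return new IdentityOptionsSketch(start == null ? 1L : start, step == null ? 1L : step);
    }
}
```

With this sketch, `parse("START WITH 0 INCREMENT BY -10")` and `parse("INCREMENT BY -10 START WITH 0")` yield the same start and step, and an empty option list falls back to the defaults.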

For example, we can define

CREATE TABLE default.example (
  id LONG GENERATED ALWAYS AS IDENTITY,
  id1 LONG GENERATED ALWAYS AS IDENTITY(),
  id2 LONG GENERATED BY DEFAULT AS IDENTITY(START WITH 0),
  id3 LONG GENERATED ALWAYS AS IDENTITY(INCREMENT BY 2),
  id4 LONG GENERATED BY DEFAULT AS IDENTITY(START WITH 0 INCREMENT BY -10),
  id5 LONG GENERATED ALWAYS AS IDENTITY(INCREMENT BY 2 START WITH -8),
  value LONG
)

This will enable defining identity columns in Spark SQL for data sources that support it.
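As a rough illustration of what the start and step values in the example mean, the sketch below computes the n-th value of a sequence as start + n * step. This is illustrative arithmetic only: this PR adds syntax, and how (or whether) values are assigned consecutively is left to the data source. The class and method names are hypothetical.

```java
// Illustrative arithmetic only; actual value assignment is up to the data source.
final class IdentityValuesSketch {
    // n-th value (0-based) of a sequence defined by START WITH start INCREMENT BY step.
    // The *Exact methods throw ArithmeticException on long overflow instead of wrapping.
    static long nth(long start, long step, long n) {
        return Math.addExact(start, Math.multiplyExact(step, n));
    }

    public static void main(String[] args) {
        // id4 in the example: START WITH 0 INCREMENT BY -10
        for (long n = 0; n < 3; n++) {
            System.out.println("id4[" + n + "] = " + nth(0, -10, n));
        }
        // id5 in the example: INCREMENT BY 2 START WITH -8
        for (long n = 0; n < 3; n++) {
            System.out.println("id5[" + n + "] = " + nth(-8, 2, n));
        }
    }
}
```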

To be more specific, this PR:

Adds parser support for GENERATED { BY DEFAULT | ALWAYS } AS IDENTITY in create/replace table statements. Identity column specifications are temporarily stored in the field's metadata, then parsed and verified in DataSourceV2Strategy and used to instantiate the v2 [Column].
Adds TableCatalog::capabilities() and TableCatalogCapability.SUPPORTS_CREATE_TABLE_WITH_IDENTITY_COLUMNS. This will be used to determine whether to allow specifying identity columns or to throw an exception.
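The capability gate described above can be sketched as follows. The types here are local stand-ins, not Spark's TableCatalog or TableCatalogCapability classes; only the capability name is taken from the description, and the method name, exception type, and message are assumptions.

```java
import java.util.EnumSet;
import java.util.Set;

// Local stand-in for illustration; not Spark's TableCatalogCapability.
enum CatalogCapabilitySketch {
    SUPPORTS_CREATE_TABLE_WITH_IDENTITY_COLUMNS
}

final class IdentityCapabilityCheckSketch {
    // Rejects a CREATE/REPLACE TABLE with identity columns unless the catalog opts in.
    static void checkCreateTable(Set<CatalogCapabilitySketch> capabilities,
                                 boolean hasIdentityColumn) {
        boolean supported = capabilities.contains(
            CatalogCapabilitySketch.SUPPORTS_CREATE_TABLE_WITH_IDENTITY_COLUMNS);
        if (hasIdentityColumn && !supported) {
            // Hypothetical error; the real error condition is defined by Spark.
            throw new UnsupportedOperationException(
                "Catalog does not support creating tables with identity columns");
        }
    }

    public static void main(String[] args) {
        Set<CatalogCapabilitySketch> none = EnumSet.noneOf(CatalogCapabilitySketch.class);
        checkCreateTable(none, false); // fine: no identity column in the table
        try {
            checkCreateTable(none, true);
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```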

Why are the changes needed?

A SQL API is needed to create Identity Columns.

Does this PR introduce any user-facing change?

It allows the aforementioned SQL syntax to create identity columns in a table.

How was this patch tested?

Positive and negative unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IdentityColumn.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/ColumnDefinition.scala

sql/api/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala

common/utils/src/main/resources/error/error-conditions.json

sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalogCapability.java

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala

sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4

zhipengmao-db · 2024-09-02T15:07:50Z

@dtenedor Hi Daniel, thanks for reviewing the PR!

We changed the PR a bit to allow both IntegerType and LongType in the SQL interface to define an identity column, and it is up to the underlying framework that extends Spark to throw its own error if it wants to support only LongType. Are you fine with this change?

srielau · 2024-09-09T14:49:14Z

common/utils/src/main/resources/error/error-conditions.json

+      "DataType <dataType> is not supported for IDENTITY columns."
+    ],
+    "sqlState" : "428H2"
+  },


+1 Always nice to see thoughtful pick for SQLSTATE :-)

cloud-fan · 2024-09-12T17:07:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IdentityColumn.scala

+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.types.{StructField, StructType}
+
+case class IdentityColumnSpec(start: Long, step: Long, allowExplicitInsert: Boolean)


This is a public API, and we shouldn't put it in org.apache.spark.sql.catalyst.util. How about making it a java class in org.apache.spark.sql.connector.catalog?
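A minimal sketch of what such a Java class could look like. The three fields mirror the Scala case class quoted above; the class name suffix, accessor names, and equals/hashCode/toString are assumptions for illustration and are not taken from the merged file.

```java
import java.util.Objects;

// Sketch only: fields follow the quoted Scala case class; everything else is assumed.
final class IdentityColumnSpecSketch {
    private final long start;
    private final long step;
    // Assumed mapping: true for GENERATED BY DEFAULT, false for GENERATED ALWAYS.
    private final boolean allowExplicitInsert;

    IdentityColumnSpecSketch(long start, long step, boolean allowExplicitInsert) {
        this.start = start;
        this.step = step;
        this.allowExplicitInsert = allowExplicitInsert;
    }

    long getStart() { return start; }
    long getStep() { return step; }
    boolean isAllowExplicitInsert() { return allowExplicitInsert; }

    // Value semantics, matching what the Scala case class provided for free.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof IdentityColumnSpecSketch)) return false;
        IdentityColumnSpecSketch that = (IdentityColumnSpecSketch) o;
        return start == that.start
            && step == that.step
            && allowExplicitInsert == that.allowExplicitInsert;
    }

    @Override
    public int hashCode() {
        return Objects.hash(start, step, allowExplicitInsert);
    }

    @Override
    public String toString() {
        return "IdentityColumnSpecSketch{start=" + start + ", step=" + step
            + ", allowExplicitInsert=" + allowExplicitInsert + "}";
    }
}
```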

cloud-fan · 2024-09-12T17:09:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/ColumnDefinition.scala

@@ -51,14 +56,22 @@ case class ColumnDefinition(
  }

  def toV2Column(statement: String): V2Column = {
+    val finalMetadata = if (identityColumnSpec.isDefined) {


why do we need this while v2 column already has an explicit field for identity column spec?

c27kwan · 2024-09-13T12:03:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+        }
+      } else {
+        throw SparkException
+            .internalError(s"Invalid identity column sequence generator option: ${option.getText}")


ParseException?

We also use internalError for unrecognized actions here:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

Line 761 in 5533c81

throw SparkException.internalError(

Otherwise we need to add a new error class for ParseException.

c27kwan · 2024-09-13T12:05:42Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala

-      parameters = Map("error" -> "'a'", "hint" -> ": missing '('")
+      parameters = Map("error" -> "'a'", "hint" -> "")


Hm, did you change the results for generated columns?

That is because GENERATED ALWAYS AS can now be followed by either ( or IDENTITY, so the hint for the parse exception changes.

cloud-fan · 2024-09-13T13:49:05Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Column.java

+   * Returns the identity column specification of this table column. Null means no identity column.
+   */
+  @Nullable
+  IdentityColumnSpec identityColumnSpec();


to double check, only one of the defaultValue, generationExpression and identityColumnSpec can be non-null, right?

Great catch! Yes only one of the defaultValue, generationExpression and identityColumnSpec can be non-null.
We blocked specifying a generation expression together with an identity column, but we did not block specifying a defaultValue together with an identity column. Will provide a fix in this PR.
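The invariant discussed in this thread (at most one of defaultValue, generationExpression, and identityColumnSpec may be non-null) can be sketched as a simple check. This is not the validation code added in the PR; the class name, method name, parameter types, and exception are hypothetical.

```java
// Hypothetical helper for illustration; not the check implemented in this PR.
final class ColumnOptionsCheckSketch {
    // Throws if more than one of the three mutually exclusive column options is set.
    static void requireAtMostOne(Object defaultValue,
                                 Object generationExpression,
                                 Object identityColumnSpec) {
        int count = 0;
        if (defaultValue != null) count++;
        if (generationExpression != null) count++;
        if (identityColumnSpec != null) count++;
        if (count > 1) {
            throw new IllegalArgumentException(
                "Only one of defaultValue, generationExpression and identityColumnSpec may be set");
        }
    }

    public static void main(String[] args) {
        requireAtMostOne(null, null, "IDENTITY(START WITH 0)"); // accepted
        try {
            requireAtMostOne("42", null, "IDENTITY(START WITH 0)");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```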

cloud-fan · 2024-09-13T13:56:08Z

sql/api/src/main/java/org/apache/spark/sql/connector/catalog/IdentityColumnSpec.java

+/**
+ * Identity column specification.
+ */
+public class IdentityColumnSpec {


let's add @Evolving

cloud-fan · 2024-09-15T05:34:31Z

thanks, merging to master!

Closes apache#47614 from zhipengmao-db/zhipengmao-db/SPARK-48824-id-syntax.

Authored-by: zhipeng.mao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

github-actions bot added SQL DOCS labels Aug 5, 2024

zhipengmao-db changed the title ~~[SPARK-48824] Add Identity Column sql syntax~~ [SPARK-48824] Add Identity Column SQL syntax Aug 6, 2024

zhipengmao-db force-pushed the zhipengmao-db/SPARK-48824-id-syntax branch from de11615 to cb7b13f Compare August 7, 2024 09:28

zhipengmao-db commented Aug 7, 2024

zhipengmao-db commented Aug 7, 2024

c27kwan reviewed Aug 7, 2024

srielau approved these changes Aug 13, 2024

zhipengmao-db force-pushed the zhipengmao-db/SPARK-48824-id-syntax branch from 45c5e9f to 9632ce0 Compare August 16, 2024 11:57

zhipengmao-db force-pushed the zhipengmao-db/SPARK-48824-id-syntax branch from 8b5d619 to 284caea Compare August 23, 2024 12:17

dtenedor approved these changes Aug 30, 2024

zhipengmao-db force-pushed the zhipengmao-db/SPARK-48824-id-syntax branch from 6c99fc2 to 02ce0ac Compare September 9, 2024 08:17

srielau reviewed Sep 9, 2024

srielau approved these changes Sep 9, 2024

cloud-fan changed the title ~~[SPARK-48824] Add Identity Column SQL syntax~~ [SPARK-48824][SQL] Add Identity Column SQL syntax Sep 10, 2024

zhipengmao-db force-pushed the zhipengmao-db/SPARK-48824-id-syntax branch 2 times, most recently from 9f2e916 to f3da61b Compare September 11, 2024 14:24

zhipengmao-db added 4 commits September 12, 2024 09:48

[SPARK-48824] Add Identity Column sql syntax

bde6e0b

Address comments update Address the comments Change test name

Allow integer type for identity column

c00bee8

Throw exception for unknown option & format

dd8b835

Fix merge error

77eae56

zhipengmao-db force-pushed the zhipengmao-db/SPARK-48824-id-syntax branch from f3da61b to 77eae56 Compare September 12, 2024 07:48

cloud-fan reviewed Sep 12, 2024

cloud-fan reviewed Sep 12, 2024

Address comments

a5f9ddf

c27kwan reviewed Sep 13, 2024

Address comments

54e9d3b

cloud-fan reviewed Sep 13, 2024

cloud-fan approved these changes Sep 13, 2024

Block identity column specified with default value

825d682

cloud-fan closed this in 931ab06 Sep 15, 2024
