Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48824][SQL] Add Identity Column SQL syntax #47614

Conversation

zhipengmao-db
Copy link
Contributor

@zhipengmao-db zhipengmao-db commented Aug 5, 2024

What changes were proposed in this pull request?

Add SQL support for creating identity columns. Users can specify a column GENERATED ALWAYS AS IDENTITY(identityColumnSpec) , where identity values are always generated by the system, or GENERATED BY DEFAULT AS IDENTITY(identityColumnSpec), where users can specify the identity values.

Users can optionally specify the starting value of the column (default = 1) and the increment/step of the column (default = 1). Also we allow both
START WITH <start> INCREMENT BY <step>
and
INCREMENT BY <step> START WITH <start>

It allows flexible ordering of the increment and starting values, as both variants are used in the wild by other systems (e.g. PostgreSQL Oracle).

For example, we can define

CREATE TABLE default.example (
  id LONG GENERATED ALWAYS AS IDENTITY,
  id1 LONG GENERATED ALWAYS AS IDENTITY(),
  id2 LONG GENERATED BY DEFAULT AS IDENTITY(START WITH 0),
  id3 LONG GENERATED ALWAYS AS IDENTITY(INCREMENT BY 2),
  id4 LONG GENERATED BY DEFAULT AS IDENTITY(START WITH 0 INCREMENT BY -10),
  id5 LONG GENERATED ALWAYS AS IDENTITY(INCREMENT BY 2 START WITH -8),
  value LONG
)

This will enable defining identity columns in Spark SQL for data sources that support it.

To be more specific this PR

  • Adds parser support for GENERATED { BY DEFAULT | ALWAYS } AS IDENTITY in create/replace table statements. Identity column specifications are temporarily stored in the field's metadata, and then are parsed/verified in DataSourceV2Strategy and used to instantiate v2 [Column]
  • Adds TableCatalog::capabilities() and TableCatalogCapability.SUPPORTS_CREATE_TABLE_WITH_IDENTITY_COLUMNS This will be used to determine whether to allow specifying identity columns or whether to throw an exception.

Why are the changes needed?

A SQL API is needed to create Identity Columns.

Does this PR introduce any user-facing change?

It allows the aforementioned SQL syntax to create identity columns in a table.

How was this patch tested?

Positive and negative unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@zhipengmao-db zhipengmao-db changed the title [SPARK-48824] Add Identity Column sql syntax [SPARK-48824] Add Identity Column SQL syntax Aug 6, 2024
@zhipengmao-db zhipengmao-db force-pushed the zhipengmao-db/SPARK-48824-id-syntax branch from de11615 to cb7b13f Compare August 7, 2024 09:28
@zhipengmao-db zhipengmao-db force-pushed the zhipengmao-db/SPARK-48824-id-syntax branch from 45c5e9f to 9632ce0 Compare August 16, 2024 11:57
@zhipengmao-db zhipengmao-db force-pushed the zhipengmao-db/SPARK-48824-id-syntax branch from 8b5d619 to 284caea Compare August 23, 2024 12:17
@zhipengmao-db
Copy link
Contributor Author

zhipengmao-db commented Sep 2, 2024

@dtenedor Hi Daniel, thanks for reviewing the PR!

We changed the PR a bit to allow both IntegerType and LongType in the SQL interface to define an identity column, and it is up to the underlying framework that extends Spark to throw its own error if it doesn't want to support more than LongType. Are you fine with this change?

@zhipengmao-db zhipengmao-db force-pushed the zhipengmao-db/SPARK-48824-id-syntax branch from 6c99fc2 to 02ce0ac Compare September 9, 2024 08:17
"DataType <dataType> is not supported for IDENTITY columns."
],
"sqlState" : "428H2"
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Always nice to see thoughtful pick for SQLSTATE :-)

@cloud-fan cloud-fan changed the title [SPARK-48824] Add Identity Column SQL syntax [SPARK-48824][SQL] Add Identity Column SQL syntax Sep 10, 2024
@zhipengmao-db zhipengmao-db force-pushed the zhipengmao-db/SPARK-48824-id-syntax branch 2 times, most recently from 9f2e916 to f3da61b Compare September 11, 2024 14:24
@zhipengmao-db zhipengmao-db force-pushed the zhipengmao-db/SPARK-48824-id-syntax branch from f3da61b to 77eae56 Compare September 12, 2024 07:48
import org.apache.spark.sql.errors.QueryCompilationErrors
import org.apache.spark.sql.types.{StructField, StructType}

case class IdentityColumnSpec(start: Long, step: Long, allowExplicitInsert: Boolean)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a public API, and we shouldn't put it in org.apache.spark.sql.catalyst.util. How about making it a java class in org.apache.spark.sql.connector.catalog?

@@ -51,14 +56,22 @@ case class ColumnDefinition(
}

def toV2Column(statement: String): V2Column = {
val finalMetadata = if (identityColumnSpec.isDefined) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this while v2 column already has an explicit field for identity column spec?

}
} else {
throw SparkException
.internalError(s"Invalid identity column sequence generator option: ${option.getText}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParseException?

Copy link
Contributor Author

@zhipengmao-db zhipengmao-db Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also use internalError for unrecognized actions here:

Otherwise we need to add a new error class for ParseException.

Comment on lines -2859 to +2861
parameters = Map("error" -> "'a'", "hint" -> ": missing '('")
parameters = Map("error" -> "'a'", "hint" -> "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, did you change the results for generated columns?

Copy link
Contributor Author

@zhipengmao-db zhipengmao-db Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because now after GENERATED ALWAYS AS there can be either ( or IDENTITY, so that the hint for the parse exception changes.

* Returns the identity column specification of this table column. Null means no identity column.
*/
@Nullable
IdentityColumnSpec identityColumnSpec();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to double check, only one of the defaultValue, generationExpression and identityColumnSpec can be non-null, right?

Copy link
Contributor Author

@zhipengmao-db zhipengmao-db Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! Yes only one of the defaultValue, generationExpression and identityColumnSpec can be non-null.
We blocked generation expression to be specified with identity column, but we did not block defaultValue to be specified with identity column. Will provide a fix in this PR.

/**
* Identity column specification.
*/
public class IdentityColumnSpec {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add @Evolving

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 931ab06 Sep 15, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?

Add SQL support for creating identity columns. Users can specify a column `GENERATED ALWAYS AS IDENTITY(identityColumnSpec)` , where identity values are **always** generated by the system, or `GENERATED BY DEFAULT AS IDENTITY(identityColumnSpec)`, where users can specify the identity values.

Users can optionally specify the starting value of the column (default = 1) and the increment/step of the column (default = 1). Also we allow both
`START WITH <start> INCREMENT BY <step>`
and
`INCREMENT BY <step> START WITH <start>`

It allows flexible ordering of the increment and starting values, as both variants are used in the wild by other systems (e.g. [PostgreSQL](https://www.postgresql.org/docs/current/sql-createsequence.html) [Oracle](https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/CREATE-SEQUENCE.html#GUID-E9C78A8C-615A-4757-B2A8-5E6EFB130571)).

For example, we can define

```
CREATE TABLE default.example (
  id LONG GENERATED ALWAYS AS IDENTITY,
  id1 LONG GENERATED ALWAYS AS IDENTITY(),
  id2 LONG GENERATED BY DEFAULT AS IDENTITY(START WITH 0),
  id3 LONG GENERATED ALWAYS AS IDENTITY(INCREMENT BY 2),
  id4 LONG GENERATED BY DEFAULT AS IDENTITY(START WITH 0 INCREMENT BY -10),
  id5 LONG GENERATED ALWAYS AS IDENTITY(INCREMENT BY 2 START WITH -8),
  value LONG
)
```
This will enable defining identity columns in Spark SQL for data sources that support it.

To be more specific this PR

- Adds parser support for GENERATED { BY DEFAULT | ALWAYS } AS IDENTITY in create/replace table statements. Identity column specifications are temporarily stored in the field's metadata, and then are parsed/verified in DataSourceV2Strategy and used to instantiate v2 [Column]
- Adds TableCatalog::capabilities() and TableCatalogCapability.SUPPORTS_CREATE_TABLE_WITH_IDENTITY_COLUMNS This will be used to determine whether to allow specifying identity columns or whether to throw an exception.

### Why are the changes needed?

A SQL API is needed to create Identity Columns.

### Does this PR introduce _any_ user-facing change?

It allows the aforementioned SQL syntax to create identity columns in a table.

### How was this patch tested?

Positive and negative unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47614 from zhipengmao-db/zhipengmao-db/SPARK-48824-id-syntax.

Authored-by: zhipeng.mao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants