Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aligning COPY TO and CREATE TABLE Syntax #9369

Closed
metesynnada opened this issue Feb 27, 2024 · 7 comments · Fixed by #9604
Closed

Aligning COPY TO and CREATE TABLE Syntax #9369

metesynnada opened this issue Feb 27, 2024 · 7 comments · Fixed by #9604
Labels
enhancement New feature or request

Comments

@metesynnada
Copy link
Contributor

Is your feature request related to a problem or challenge?

There is a noticeable inconsistency between the syntax of COPY TO and CREATE EXTERNAL TABLE commands, particularly in how they handle the specification of data format and partitioning. The COPY TO command allows for these details to be included as part of a flexible options list, while CREATE EXTERNAL TABLE parses these details directly from the SQL statement.

Detailed Description:

  1. COPY TO Command Syntax:

    COPY (VALUES (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z'))
    TO 'test_files/scratch/copy/partitioned_table2/'
    (FORMAT PARQUET, PARTITION_BY 'column2, column3');

    In this command, the format and partitioning are specified as options within the command itself.

  2. CREATE EXTERNAL TABLE Command Syntax:

    CREATE EXTERNAL TABLE validate_partitioned_parquet
    STORED AS PARQUET
    LOCATION 'test_files/scratch/copy/partitioned_table1/'
    PARTITIONED BY (col2);

    Here, the format and partitioning details are part of the SQL command's structure, without the flexibility offered by an options list.

This syntactical inconsistency creates a learning curve and potential for confusion, as users must adapt to two different methods for specifying critical information like data format and partitioning, depending on the command they are using.

Describe the solution you'd like

I propose that a unified syntax approach be adopted for both commands. This could mean revising COPY TO to align with the direct parsing method of CREATE EXTERNAL TABLE. Additionally, updating the parser to support this unified syntax and providing detailed documentation for the same would be beneficial.

Describe alternatives you've considered

No response

Additional context

No response

@metesynnada metesynnada added the enhancement New feature or request label Feb 27, 2024
@Jefffrey
Copy link
Contributor

Related: #4808

I was intending on trying to move some of the syntax into sqlparser-rs (before I got distracted by other work), but would be happy to get more thoughts on this

@devinjdangelo
Copy link
Contributor

devinjdangelo commented Feb 28, 2024

The current copy statement syntax is based on duckdb https://duckdb.org/docs/sql/statements/copy.html.

Postgres also has a similar copy to syntax, though their options list is not as flexible and based on fixed keywords.

I think there is a middle ground between these two approaches where dfparser could parse some special keywords in the option list and pass through anything else as a HashMap or Vec. #9274 is related.

I'm also not opposed to pulling keywords out of the option list entirely to make copy support similar syntax to create external table. The flexible options list would have to stay of course to support format specific options like parquet row group size, but partition_by could be pulled out.

Something like

COPY table
TO 'file.parquet'
PARTITION BY (col1, col2)
(row_group_size 123)

@metesynnada
Copy link
Contributor Author

@devinjdangelo After reviewing DuckDB, I agree with you. Our COPY To syntax closely resembles theirs. However, DuckDB seems to have a self-consistent design, underscoring the importance of maintaining consistency in our SQL syntax too.

This is how they handle the CSV imports

While borrowing designs from other engines often works well because it minimizes unexpected elements, our user behaviour isn't uniformly consistent. There are likely multiple approaches to address it. I've suggested one method; I'm keen to hear what the community thinks as well.

cc @alamb @Dandandan

@alamb
Copy link
Contributor

alamb commented Feb 28, 2024

Thank you for brining this up @metesynnada

I'm also not opposed to pulling keywords out of the option list entirely to make copy support similar syntax to create external table. The flexible options list would have to stay of course to support format specific options like parquet row group size, but partition_by could be pulled out.

I agree adding specific syntax to the COPY statement to better align with CREATE EXTERNAL TABLE is a good idea and I like the example in #9369 (comment)

cc @andygrove who may have additional context on the original CREATE EXTERNAL TABLE syntax

@metesynnada
Copy link
Contributor Author

Related: #4808

I was intending on trying to move some of the syntax into sqlparser-rs (before I got distracted by other work), but would be happy to get more thoughts on this

Can we align with CREATE EXTERNAL TABLE syntax, what do you think?

@alamb
Copy link
Contributor

alamb commented Feb 28, 2024

Can we align with CREATE EXTERNAL TABLE syntax, what do you think?

I think one challenge is that CREATE EXTERNAL TABLE is a DataFusion thing (not some other standard dialect) so putting it upstream in sqlparser-rs may not be appropriate (as it is only for DataFusion)...

If we are to align COPY TO to look like CREATE EXTERNAL TABLE I think it should be done in the DataFusion parser rather than sqlparser

@metesynnada
Copy link
Contributor Author

Can we align with CREATE EXTERNAL TABLE syntax, what do you think?

I think one challenge is that CREATE EXTERNAL TABLE is a DataFusion thing (not some other standard dialect) so putting it upstream in sqlparser-rs may not be appropriate (as it is only for DataFusion)...

If we are to align COPY TO to look like CREATE EXTERNAL TABLE I think it should be done in the DataFusion parser rather than sqlparser

Datafusion owns the COPY statement, so this change will be quite easy. I am preparing a PR for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants