Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for DataFusion specific statements #1080

Open
Jefffrey opened this issue Jan 2, 2024 · 4 comments
Open

Add support for DataFusion specific statements #1080

Jefffrey opened this issue Jan 2, 2024 · 4 comments

Comments

@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 2, 2024

Context

arrow-datafusion currently implements its own parser, DFParser which wraps the parser in this crate in order to parse some DataFusion specific statements.

DataFusion issue: apache/datafusion#4808

Aiming to upstream this functionality into this crate to remove the parsing code from DataFusion.

Statements

Currently two custom statements that DataFusion parses: COPY TO ... and CREATE EXTERNAL TABLE ...

  • There is EXPLAIN too, but this is only there to support doing EXPLAIN for those new extensions

COPY TO

Syntax:

COPY <table_name> | (<query>)
TO '<destination>'
[ ( key1 value1, key2 value2) ]

Examples:

COPY lineitem TO '/path/to/lineitem.parquet' (format parquet, partitions 16);
COPY (SELECT * FROM lineitem) TO '/path/to/lineitem.parquet';

This is extremely similar to the COPY statement from PostgreSQL that sqlparser-rs already supports, with a few key differences:

  1. Doesn't support optional WITH keyword
  2. Only supports COPY TO and not COPY FROM
  3. Doesn't support column list when source is table
  4. Only supports target as string literal, not PROGRAM or STDOUT
  5. Options list doesn't constrain keys to a defined set

Points 2, 3 & 4 are non-issues since can inspect the Statement AST to check if want to support this:

https://github.com/sqlparser-rs/sqlparser-rs/blob/a430d1a5a7bb04bbefd0f2fea07bf25c7fbce8b2/src/ast/mod.rs#L1463-L1479

Point 1 is a minor issue as the Statement AST above doesn't specify if there was a WITH keyword found when parsing, but at the same time DataFusion could just accept this new syntax since this optional keyword has minimal impact.

Point 5 is the major issue, as would either need to modify the fields of the existing Copy enum or add a new one specific for DataFusion, since it is a requirement that the keys cannot be constrained (would be parsed as String).

CREATE EXTERNAL TABLE

Syntax:

CREATE [ UNBOUNDED ] EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>[ (<column_definition>) ]
STORED AS <file_type>
[ WITH HEADER ROW ]
[ DELIMITER <char> ]
[ COMPRESSION TYPE <GZIP | BZIP2 | XZ | ZSTD> ]
[ PARTITIONED BY (<column list>) ]
[ WITH ORDER (<ordered column list>)
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>

<column_definition> := (<column_name> <data_type>, ...)

<column_list> := (<column_name>, ...)

<ordered_column_list> := (<column_name> <sort_clause>, ...)

<key_value_list> := (<literal> <literal, <literal> <literal>, ...)

Example:

CREATE UNBOUNDED EXTERNAL TABLE IF NOT EXISTS
kumachan (c1 int)
STORED AS CSV
WITH HEADER ROW
DELIMITER ','
COMPRESSION TYPE zstd
PARTITIONED BY (c1)
WITH ORDER (c1 asc)
OPTIONS (
  'k1' 'v1',
  'k2' 'v2'
)
LOCATION '/file'

This seems to vary heavily from the existing support for CreateTable:

https://github.com/sqlparser-rs/sqlparser-rs/blob/a430d1a5a7bb04bbefd0f2fea07bf25c7fbce8b2/src/ast/mod.rs#L1561-L1604

  • Varies in the keyword parsing, such as WITH HEADER ROW and WITH ORDER

So might need a new statement for this? Or could try to retrofit onto the existing CreateTable statement.

Dialect

Also will need a new DataFusion dialect to support the above customization (e.g. to be able to toggle between previous/default behaviour for COPY TO to parse options as predefined keys, or as generic).

Alternative

Instead of adding/modifying as described, could investigate ways to make it easier for downstream consumers to parse their own custom statements.

I see there is this dialect function:

https://github.com/sqlparser-rs/sqlparser-rs/blob/a430d1a5a7bb04bbefd0f2fea07bf25c7fbce8b2/src/dialect/mod.rs#L171-L175

But this doesn't allow for custom statements.

I'm unsure what this could look like, but worth a thought.

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 2, 2024

I plan to take this on, probably starting first with COPY TO

@alamb
Copy link
Contributor

alamb commented Jan 2, 2024

Instead of adding/modifying as described, could investigate ways to make it easier for downstream consumers to parse their own custom statements.

I think this would be the best outcome -- if possible. I am not quite sure if we should be adding DataFusion specific things into this parser 🤔

Making it easier to extend this parser with custom hooks however sounds like a great idea to me

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 2, 2024

Instead of adding/modifying as described, could investigate ways to make it easier for downstream consumers to parse their own custom statements.

I think this would be the best outcome -- if possible. I am not quite sure if we should be adding DataFusion specific things into this parser 🤔

Making it easier to extend this parser with custom hooks however sounds like a great idea to me

Would that suggest that apache/datafusion#4808 would instead be focused on reducing the code of DFParser in DataFusion and not eliminate it entirely?

@alamb
Copy link
Contributor

alamb commented Jan 2, 2024

Would that suggest that apache/arrow-datafusion#4808 would instead be focused on reducing the code of DFParser in DataFusion and not eliminate it entirely?

I agree reducing the code is a better goal -- I also left some additional comments here: apache/datafusion#4808 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants