execution::context::sql should support creating partitioned dataset #1220

jimexist · 2021-11-02T08:42:44Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

`execution::context::sql` currently cannot create a partitioned dataset.

For now, in:
https://github.com/apache/arrow-datafusion/blob/75b8112ee33af81d6085be4a83a096bf965dbc89/datafusion/src/execution/context.rs#L187-L225

the line for `table_partition_cols` is left empty.

Describe the solution you'd like

Ideally we should keep the same syntax, but auto-detect a partitioned dataset based on whether the location is a file or a directory.
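As a rough illustration of the file-versus-directory check, here is a minimal sketch that assumes the location is a path on the local filesystem. `LocationKind` and `classify_location` are hypothetical names used only for this example and are not part of DataFusion; a real implementation would likely go through the object store abstraction instead of `std::fs`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical classification of a table location (not a DataFusion type).
#[derive(Debug, PartialEq)]
enum LocationKind {
    /// The location is a single file: treat it as an unpartitioned table.
    SingleFile,
    /// The location is a directory: a candidate for partition discovery.
    Directory,
}

/// Decides whether `location` points at a file or a directory.
/// Assumption: the location lives on the local filesystem.
fn classify_location(location: &Path) -> io::Result<LocationKind> {
    let metadata = fs::metadata(location)?;
    if metadata.is_dir() {
        Ok(LocationKind::Directory)
    } else {
        Ok(LocationKind::SingleFile)
    }
}
```

Only the `Directory` case would need to look for partition columns; a single file would keep today's behaviour.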



jimexist · 2021-11-02T08:43:10Z

related #1139

jimexist · 2021-11-02T08:43:34Z

cc @rdettai and @alamb

alamb · 2021-11-02T13:05:31Z

Auto detecting common partitioning schemes seems like a good idea to me @jimexist

Something else I have been wondering is "how general do we want our partitioning to be" -- what @rdettai has implemented is the classic hive partitioning, where partitioning by date looks like /foo/date=2021-10-01

But there are other ways to partition data (e.g. IOx has its own way to partition data into individual files but the partition metadata is stored in some in-memory catalog) -- it would be pretty cool to re-use all the partitioning infrastructure (as well as, for example, add more sophisticated partition pruning)

rdettai · 2021-11-02T13:37:30Z

#1185 is also a follow-up to #1139 that is closely related to this. Maybe we can merge the two issues and create subtasks?

@alamb my idea was that each standard/technique for getting the list of files (table catalog) should be a different provider. The ListingTable provider might handle folder structures that are slightly different from the hive one (e.g. mytable/2021/11/02 instead of mytable/year=2021/month=11/day=02), but it focuses on setups where the partitions are encoded in the folder structure itself and are discovered by "listing" the file system. Most of the code inside the datasource/listing module should be specialized to do precisely that (e.g. choose a listing strategy, parse the paths...). Everything else can (should 😉) be factored out into a common module for reuse in other table providers 😊.
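To make the "parse the paths" part concrete, here is a minimal sketch of extracting hive-style partition values from a listed file path. `parse_hive_partitions` is a hypothetical helper written for this example and is not the code in the datasource/listing module. It assumes `/`-separated paths and treats every directory segment below the table root as `column=value`.

```rust
/// Extracts `(column, value)` pairs from the hive-style directory segments
/// of `file_path` that sit below `table_root`.
///
/// Returns `None` when the file is not under `table_root`, or when a
/// directory segment is not of the form `column=value` (for example the
/// `mytable/2021/11/02` layout, which would need a different strategy).
fn parse_hive_partitions<'a>(
    table_root: &str,
    file_path: &'a str,
) -> Option<Vec<(&'a str, &'a str)>> {
    // Normalise the root, then require a `/` right after it so that
    // `mytable` does not match `mytable2/...`.
    let root = table_root.trim_end_matches('/');
    let relative = file_path.strip_prefix(root)?.strip_prefix('/')?;

    // Every segment except the last one (the file name) is a directory.
    let mut segments: Vec<&str> = relative.split('/').collect();
    segments.pop();

    let mut partitions = Vec::with_capacity(segments.len());
    for segment in segments {
        let (column, value) = segment.split_once('=')?;
        partitions.push((column, value));
    }
    Some(partitions)
}
```

For `mytable/year=2021/month=11/day=02/part-0.parquet` with root `mytable`, this yields the pairs `year=2021`, `month=11` and `day=02`. The non-hive layout returns `None`, which is where a different listing strategy would take over.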

jimexist · 2021-11-10T05:11:44Z

Duplicate of #1185

jimexist added the enhancement New feature or request label Nov 2, 2021

jimexist marked this as a duplicate of #1185 Nov 10, 2021

jimexist closed this as completed Nov 10, 2021

jimexist mentioned this issue Nov 10, 2021

Make table partitioning accessible in register/read APIS #1185

