execution::context::sql should support creating partitioned dataset #1220

Closed
jimexist opened this issue Nov 2, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@jimexist
Member

jimexist commented Nov 2, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

execution::context::sql currently cannot create a partitioned dataset; it should support this.

For now:
https://github.com/apache/arrow-datafusion/blob/75b8112ee33af81d6085be4a83a096bf965dbc89/datafusion/src/execution/context.rs#L187-L225

In the code linked above, table_partition_cols is always left empty.

Describe the solution you'd like

Ideally we should keep the same syntax, but auto-detect a partitioned dataset based on whether the location is a file or a directory.
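
As a rough illustration of the file-versus-directory check proposed above, here is a minimal sketch. It assumes the location is a path on the local filesystem; `LocationKind` and `classify_location` are hypothetical names made up for this example and are not part of DataFusion.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical classification of a table location (not a DataFusion type).
#[derive(Debug, PartialEq)]
pub enum LocationKind {
    /// The location points at one file: treat it as an unpartitioned table.
    SingleFile,
    /// The location points at a directory: a candidate partitioned dataset.
    Directory,
}

/// Decide how to treat `location` by asking the filesystem what it is.
/// Assumption: a local path; remote stores would need a different check.
pub fn classify_location(location: &Path) -> io::Result<LocationKind> {
    let meta = fs::metadata(location)?;
    if meta.is_dir() {
        Ok(LocationKind::Directory)
    } else {
        Ok(LocationKind::SingleFile)
    }
}
```

With something like this, the SQL syntax could stay unchanged: only a `Directory` result would trigger the search for partition columns, and a missing path would surface as an error.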

@jimexist jimexist added the enhancement New feature or request label Nov 2, 2021
@jimexist
Member Author

jimexist commented Nov 2, 2021

related #1139

@jimexist
Member Author

jimexist commented Nov 2, 2021

cc @rdettai and @alamb

@alamb
Contributor

alamb commented Nov 2, 2021

Auto detecting common partitioning schemes seems like a good idea to me @jimexist

Something else I have been wondering is "how general do we want our partitioning to be" -- what @rdettai has implemented is classic Hive partitioning, where data partitioned by date is laid out as /foo/date=2021-10-01
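
To make the Hive layout concrete, here is a small sketch of recovering partition values from a path such as /foo/date=2021-10-01. The function name `parse_hive_partitions` is hypothetical and this is not how the listing code in DataFusion is necessarily written; it also assumes that every path segment containing `=` is a partition directory.

```rust
/// Extract `key=value` partition pairs from the segments of a path.
/// Assumption: any segment containing '=' is a partition directory.
pub fn parse_hive_partitions(path: &str) -> Vec<(String, String)> {
    path.split('/')
        // Keep only segments of the form `key=value`.
        .filter_map(|segment| segment.split_once('='))
        // Ignore malformed segments such as `=value`.
        .filter(|(key, _)| !key.is_empty())
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}
```

Under these assumptions, /foo/date=2021-10-01 yields a single pair with key `date` and value `2021-10-01`, while a path with no `key=value` segments yields an empty list.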

But there are other ways to partition data (e.g. IOx has its own way to partition data into individual files but the partition metadata is stored in some in-memory catalog) -- it would be pretty cool to re-use all the partitioning infrastructure (as well as, for example, add more sophisticated partition pruning)

@rdettai
Contributor

rdettai commented Nov 2, 2021

#1185 is also a follow-up to #1139 and is closely related to this issue. Maybe we can merge the two issues and create subtasks?

@alamb my idea was that each standard/technique for getting the list of files (table catalog) should be a different provider. The ListingTable provider might handle folder structures that are slightly different from the Hive one (e.g. mytable/2021/11/02 instead of mytable/year=2021/month=11/day=02), but it focuses on setups where the partitions are encoded in the folder structure itself and are discovered by "listing" the file system. Most of the code inside the datasource/listing module should be specialized to do precisely that (e.g. choose a listing strategy, parse the paths...). Everything else can (should 😉) be taken out and shared in a common module for reuse in other table providers 😊.
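
For the non-Hive layout mentioned above (mytable/2021/11/02), the directory names carry only values, so the column names have to come from somewhere else. The sketch below assumes they are supplied by the caller in directory order; `parse_positional_partitions` and its parameters are hypothetical and only illustrate the idea.

```rust
/// Map the directories between `table_root` and the file name onto the
/// caller-supplied `partition_cols`, in order. Returns `None` when the file
/// is not under `table_root` or the directory depth does not match.
pub fn parse_positional_partitions(
    table_root: &str,
    file_path: &str,
    partition_cols: &[&str],
) -> Option<Vec<(String, String)>> {
    let root = table_root.trim_end_matches('/');
    // Require a '/' right after the root so "mytable2/..." does not match "mytable".
    let relative = file_path.strip_prefix(root)?.strip_prefix('/')?;
    let segments: Vec<&str> = relative.split('/').collect();
    // One directory per partition column, plus the file name itself.
    if segments.len() != partition_cols.len() + 1 {
        return None;
    }
    Some(
        partition_cols
            .iter()
            .zip(&segments)
            .map(|(col, value)| (col.to_string(), value.to_string()))
            .collect(),
    )
}
```

Called with root `mytable`, path `mytable/2021/11/02/data.parquet` and columns `year`, `month`, `day`, this pairs each column with its directory value. Only the path-parsing step differs from the Hive case, which fits the suggestion that the rest could live in a shared module.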

@jimexist
Member Author

Duplicate of #1185

3 participants