Add support for reading partitioned Parquet files #133

Closed
alamb opened this issue Apr 26, 2021 · 15 comments
Labels
datafusion Changes in the datafusion crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11019

Add support for reading Parquet files that are partitioned by key, where the files live under a directory structure derived from partition keys and values:

/path/to/files/KEY1=value/KEY2=value/files
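For illustration, a minimal sketch of recovering the key/value pairs from such a path; the helper is hypothetical, not existing DataFusion code:

// Extract hive-style KEY=value partition segments from a file path.
fn parse_partition_values(path: &str) -> Vec<(String, String)> {
    path.split('/')
        .filter_map(|segment| {
            // A partition segment has the form KEY=value.
            let mut parts = segment.splitn(2, '=');
            match (parts.next(), parts.next()) {
                (Some(key), Some(value)) if !key.is_empty() => {
                    Some((key.to_string(), value.to_string()))
                }
                _ => None,
            }
        })
        .collect()
}

fn main() {
    let pairs = parse_partition_values("/path/to/files/KEY1=a/KEY2=b/part-0.parquet");
    assert_eq!(pairs, vec![
        ("KEY1".to_string(), "a".to_string()),
        ("KEY2".to_string(), "b".to_string()),
    ]);
}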

alamb added the datafusion (Changes in the datafusion crate) label on Apr 26, 2021
@dispanser

dispanser commented Apr 27, 2021

Is there any reason to limit this to Parquet files? In Spark, this functionality is shared between CSV, JSON, ORC, and Parquet.

Maybe the implementation could target the shared file listing in physical_plan::common::build_file_list(), which seems to be shared between the Parquet and CSV readers.

Considering #204 (adding partition pruning), it may be sensible to implement the partition pruning logic in the file listing procedure itself, as that could save on file listing operations, which tend to be expensive, in particular on cloud storage (EBS). A rough sketch of this idea follows below.
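A minimal sketch of pruning during listing, assuming a plain local-filesystem walk; the function and predicate names are hypothetical illustrations, not existing DataFusion APIs:

use std::fs;
use std::path::{Path, PathBuf};

// List files recursively, but skip any directory whose KEY=value segment
// fails the pruning predicate, so pruned subtrees are never listed at all.
fn list_with_pruning(
    dir: &Path,
    keep: &dyn Fn(&str, &str) -> bool, // (partition key, value) -> keep this subtree?
    out: &mut Vec<PathBuf>,
) -> std::io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            let name = path.file_name().unwrap().to_string_lossy();
            if let Some((key, value)) = name.split_once('=') {
                if !keep(key, value) {
                    continue; // prune: never descend, saving listing calls
                }
            }
            list_with_pruning(&path, keep, out)?;
        } else {
            out.push(path);
        }
    }
    Ok(())
}

// e.g. keep only year=2021:
// list_with_pruning(Path::new("/data"), &|k, v| k != "year" || v == "2021", &mut files)?;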

I'd love to work on this, but I'd need a bit of guidance on the preferred approach.

@alamb
Contributor Author

alamb commented Apr 28, 2021

Is there any reason to limit this to Parquet files?

I do not think there is any reason to limit this to Parquet files. Parquet is probably the most important use case initially, but the functionality would be useful for everyone.

I think the first thing to do might be to write up a high-level proposal (we have used Google Docs to good effect in the past). The first work needed (for this ticket) is probably a recursive directory traversal that finds all Parquet (or other format) files in subdirectories.

Then there is probably work to interpret paths as their relevant partition keys, and then to implement partition pruning (based on the existing row group pruning code, I would think).

@nugend

nugend commented May 11, 2021

Is there a name for this sort of thing? I've seen it called Hive partitioning somewhere, but I couldn't find any kind of standard, particularly regarding the way that values should be parsed into types.

@alamb
Contributor Author

alamb commented May 12, 2021

I do not know of any standard -- the systems I have heard of basically "follow what Hive did" -- though if someone else has a reference, that would be great.

@jorgecarleitao
Member

Just to check: what Hive did in this context is the column=X/, column=Y/ layout, right?

@Dandandan
Contributor

@jorgecarleitao yes

I am also not aware of any standard, and implementations do differ in subtle ways. I think we have to compare against Hive, Spark, etc.

On the types: it depends on whether the type is already set in the schema or whether some inference is applied to the paths. I think we can start by adding partition columns to the table schema, so we can parse the path values based on the declared type, and add automatic type inference (as with CSV) later.
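As a sketch of that first step, assuming hypothetical stand-in types (a real implementation would presumably use Arrow's DataType and a scalar value representation instead):

// Parse a raw partition value from a path using the type declared in the
// table schema, rather than inferring it. Types here are illustrative only.
#[derive(Debug, PartialEq)]
enum PartitionType { Utf8, Int64 }

#[derive(Debug, PartialEq)]
enum PartitionValue { Utf8(String), Int64(i64) }

fn parse_partition_value(raw: &str, ty: &PartitionType) -> Result<PartitionValue, String> {
    match ty {
        PartitionType::Utf8 => Ok(PartitionValue::Utf8(raw.to_owned())),
        PartitionType::Int64 => raw
            .parse::<i64>()
            .map(PartitionValue::Int64)
            .map_err(|e| format!("invalid Int64 partition value '{}': {}", raw, e)),
    }
}

fn main() {
    // "year=2021" in a path parses against the declared Int64 column type.
    let v = parse_partition_value("2021", &PartitionType::Int64);
    assert_eq!(v, Ok(PartitionValue::Int64(2021)));
}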

@houqp
Member

houqp commented May 12, 2021

Hive partitioning is the most commonly used scheme, but there are other schemes as well; for example, the Python Arrow package (pyarrow) supports both directory partitioning and Hive partitioning: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.partitioning.html?highlight=partition.

I agree with @Dandandan that we should add the concept of a partition column first, then tackle how we serialize/deserialize partition values from file paths. I can see us going the pyarrow route as well, i.e. supporting multiple partitioning schemes.
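To illustrate the difference between the two schemes, here is a sketch of the directory scheme, where field names come from a declared ordering rather than from the path itself; the helper is hypothetical, not a pyarrow or DataFusion API:

// Directory partitioning: /2021/04/data.parquet with declared fields
// ["year", "month"], versus hive partitioning: /year=2021/month=04/...
fn parse_directory_scheme(path: &str, fields: &[&str]) -> Vec<(String, String)> {
    path.split('/')
        .filter(|s| !s.is_empty() && !s.contains('.')) // crudely skip the file name
        .zip(fields)
        .map(|(value, field)| (field.to_string(), value.to_string()))
        .collect()
}

fn main() {
    let pairs = parse_directory_scheme("2021/04/data.parquet", &["year", "month"]);
    assert_eq!(pairs, vec![
        ("year".to_string(), "2021".to_string()),
        ("month".to_string(), "04".to_string()),
    ]);
}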

@snoe925

snoe925 commented Jul 1, 2021

The Presto/Athena syntax is nice for declaring partitions without dynamic discovery on the filesystem.
I would like dynamic discovery to be the default, but Athena/Presto SQL also provides a means to declare explicit mappings.
This is perhaps a companion to the feature requested in this issue. One benefit is faster operation, since you don't have to scan the filesystem to discover partitions. A secondary benefit is that the same scheme can support versioned snapshots; this is how delta-io works with Athena/Presto/Trino.

Here is an example of the syntax; it definitely needs a Google Doc treatment to work out the details.

I just wanted to comment to show how one can separate filesystem/storage discovery from the notion of partitions. This syntax is also convenient for test cases, since the interaction is 100% SQL-based.

CREATE EXTERNAL TABLE users (
  first string,
  last string,
  username string
)
PARTITIONED BY (id string, id2 string) -- same as the create table column syntax
STORED AS PARQUET;
-- omit LOCATION because we are going to explicitly partition with ALTER TABLE

ALTER TABLE users ADD
  PARTITION (id='a', id2='02') LOCATION '/id=a/id2=02/data.parquet'
  PARTITION (id='a', id2='03') LOCATION '/id=a/id2=03/data.parquet';

This is perhaps a UNION ALL of hidden tables for each partition.

@alamb
Contributor Author

alamb commented Jul 2, 2021

The Presto/Athena syntax is nice for declaring partitions without dynamic discovery on the filesystem.

I agree

@rdettai
Contributor

rdettai commented Aug 30, 2021

I have tried to come up with a design document regarding table formats and partitioning:

Sorry for its length. Inputs are very welcome!

@houqp
Member

houqp commented Sep 3, 2021

Thank you @rdettai for the detailed write-up. I recommend sending it to the Arrow dev mailing list too, since it's a pretty major design change.

@houqp
Member

houqp commented Oct 18, 2021

I think this can be closed now with @rdettai's awesome new listing table provider.

@rdettai
Contributor

rdettai commented Oct 18, 2021

ListingTable does not implement it yet, but I will open a PR, probably this week, to get started on it 😉

@houqp
Member

houqp commented Oct 18, 2021

oh right, but at least we now have a single implementation to cover all file formats :D

@rdettai
Contributor

rdettai commented Oct 18, 2021

@houqp I opened #1139 to add the feature to the listing provider, so we can close this one!
