Add support for reading partitioned Parquet files #133

Closed
alamb opened this issue Apr 26, 2021 · 15 comments
Labels
datafusion Changes in the datafusion crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11019

Add support for reading Parquet files that are partitioned by key, where the files live under a directory structure derived from partition keys and values:

/path/to/files/KEY1=value/KEY2=value/files
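For illustration, a minimal sketch of recovering the key/value pairs from such a path; the helper is hypothetical, not existing DataFusion code:

// Extract hive-style KEY=value partition segments from a file path.
fn parse_partition_values(path: &str) -> Vec<(String, String)> {
    path.split('/')
        .filter_map(|segment| {
            // A partition segment has the form KEY=value.
            let mut parts = segment.splitn(2, '=');
            match (parts.next(), parts.next()) {
                (Some(key), Some(value)) if !key.is_empty() => {
                    Some((key.to_string(), value.to_string()))
                }
                _ => None,
            }
        })
        .collect()
}

fn main() {
    let pairs = parse_partition_values("/path/to/files/KEY1=a/KEY2=b/part-0.parquet");
    assert_eq!(pairs, vec![
        ("KEY1".to_string(), "a".to_string()),
        ("KEY2".to_string(), "b".to_string()),
    ]);
}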

alamb added the datafusion (Changes in the datafusion crate) label on Apr 26, 2021
@dispanser

dispanser commented Apr 27, 2021

Is there any reason to limit this to Parquet files? In Spark, this functionality is shared between CSV, JSON, ORC, and Parquet.

Maybe the implementation could target the shared file listing in physical_plan::common::build_file_list(), which seems to be shared between the Parquet and CSV readers.

Considering #204 (adding partition pruning), it may be sensible to implement the partition pruning logic in the file listing procedure itself, as that could save on file listing operations, which tend to be expensive, in particular on cloud storage (EBS). A rough sketch of this idea follows below.
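A minimal sketch of pruning during listing, assuming a plain local-filesystem walk; the function and predicate names are hypothetical illustrations, not existing DataFusion APIs:

use std::fs;
use std::path::{Path, PathBuf};

// List files recursively, but skip any directory whose KEY=value segment
// fails the pruning predicate, so pruned subtrees are never listed at all.
fn list_with_pruning(
    dir: &Path,
    keep: &dyn Fn(&str, &str) -> bool, // (partition key, value) -> keep this subtree?
    out: &mut Vec<PathBuf>,
) -> std::io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            let name = path.file_name().unwrap().to_string_lossy();
            if let Some((key, value)) = name.split_once('=') {
                if !keep(key, value) {
                    continue; // prune: never descend, saving listing calls
                }
            }
            list_with_pruning(&path, keep, out)?;
        } else {
            out.push(path);
        }
    }
    Ok(())
}

// e.g. keep only year=2021:
// list_with_pruning(Path::new("/data"), &|k, v| k != "year" || v == "2021", &mut files)?;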

I'd love to work on this, but I'd need a bit of guidance on the preferred approach.

@alamb
Contributor Author

alamb commented Apr 28, 2021

Is there any reason to limit this to Parquet files?

I do not think there is any reason to limit this to Parquet files. Parquet is probably the most important use case initially, but the functionality would be useful for everyone.

I think the first thing to do might be to write up a high-level proposal (we have used Google Docs to good effect in the past). The first work needed (for this ticket) is probably a recursive directory traversal that finds all Parquet (or other format) files in subdirectories.

Then there is probably work to interpret paths as their relevant partition keys, and then to implement partition pruning (based on the existing row group pruning code, I would think).

@nugend

nugend commented May 11, 2021

Is there a name for this sort of thing? I've seen it called Hive partitioning somewhere, but I couldn't find any kind of standard, particularly regarding the way that values should be parsed into types.

@alamb
Contributor Author

alamb commented May 12, 2021

I do not know of any standard -- the systems I have heard of basically "follow what Hive did" -- though if someone else has a reference, that would be great.

@jorgecarleitao
Member

Just to check: what Hive did in this context is the column=X/, column=Y/ layout, right?

@Dandandan
Contributor

@jorgecarleitao yes

I am also not aware of any standard, and implementations do differ in subtle ways. I think we have to compare against Hive, Spark, etc.

On the types: it depends on whether the type is already set in the schema or whether some inference is applied to the paths. I think we can start by adding partition columns to the table schema, so we can parse the path values based on the declared type, and add automatic type inference (as with CSV) later.
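As a sketch of that first step, assuming hypothetical stand-in types (a real implementation would presumably use Arrow's DataType and a scalar value representation instead):

// Parse a raw partition value from a path using the type declared in the
// table schema, rather than inferring it. Types here are illustrative only.
#[derive(Debug, PartialEq)]
enum PartitionType { Utf8, Int64 }

#[derive(Debug, PartialEq)]
enum PartitionValue { Utf8(String), Int64(i64) }

fn parse_partition_value(raw: &str, ty: &PartitionType) -> Result<PartitionValue, String> {
    match ty {
        PartitionType::Utf8 => Ok(PartitionValue::Utf8(raw.to_owned())),
        PartitionType::Int64 => raw
            .parse::<i64>()
            .map(PartitionValue::Int64)
            .map_err(|e| format!("invalid Int64 partition value '{}': {}", raw, e)),
    }
}

fn main() {
    // "year=2021" in a path parses against the declared Int64 column type.
    let v = parse_partition_value("2021", &PartitionType::Int64);
    assert_eq!(v, Ok(PartitionValue::Int64(2021)));
}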

@houqp
Member

houqp commented May 12, 2021

Hive partitioning is the most commonly used scheme, but there are other schemes as well; for example, the Python Arrow package (pyarrow) supports both directory partitioning and Hive partitioning: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.partitioning.html?highlight=partition.

I agree with @Dandandan that we should add the concept of a partition column first, then tackle how we serialize/deserialize partition values from file paths. I can see us going the pyarrow route as well, i.e. supporting multiple partitioning schemes.
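To illustrate the difference between the two schemes, here is a sketch of the directory scheme, where field names come from a declared ordering rather than from the path itself; the helper is hypothetical, not a pyarrow or DataFusion API:

// Directory partitioning: /2021/04/data.parquet with declared fields
// ["year", "month"], versus hive partitioning: /year=2021/month=04/...
fn parse_directory_scheme(path: &str, fields: &[&str]) -> Vec<(String, String)> {
    path.split('/')
        .filter(|s| !s.is_empty() && !s.contains('.')) // crudely skip the file name
        .zip(fields)
        .map(|(value, field)| (field.to_string(), value.to_string()))
        .collect()
}

fn main() {
    let pairs = parse_directory_scheme("2021/04/data.parquet", &["year", "month"]);
    assert_eq!(pairs, vec![
        ("year".to_string(), "2021".to_string()),
        ("month".to_string(), "04".to_string()),
    ]);
}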

@snoe925

snoe925 commented Jul 1, 2021

The Presto/Athena syntax is nice for declaring partitions without dynamic discovery on the filesystem.
I would like dynamic discovery to be the default, but Athena/Presto SQL also provides a means to declare explicit mappings.
This is perhaps a companion to the feature requested in this issue. One benefit is faster operation, since you don't have to scan the filesystem to discover partitions. A secondary benefit is that the same scheme can support versioned snapshots; this is how delta-io works with Athena/Presto/Trino.

Here is an example of the syntax; it definitely needs a Google Doc treatment to work out the details.

I just wanted to comment to show how one can separate filesystem/storage discovery from the notion of partitions. This syntax is also convenient for test cases, since the interaction is 100% SQL-based.

CREATE EXTERNAL TABLE users (
  first string,
  last string,
  username string
)
PARTITIONED BY (id string, id2 string) -- same as the create table column syntax
STORED AS PARQUET;
-- omit LOCATION because we are going to explicitly partition with ALTER TABLE

ALTER TABLE users ADD
  PARTITION (id='a', id2='02') LOCATION '/id=a/id2=02/data.parquet'
  PARTITION (id='a', id2='03') LOCATION '/id=a/id2=03/data.parquet';

This is perhaps a UNION ALL of hidden tables for each partition.

@alamb
Contributor Author

alamb commented Jul 2, 2021

The Presto/Athena syntax is nice for declaring partitions without dynamic discovery on the filesystem.

I agree

@rdettai
Contributor

rdettai commented Aug 30, 2021

I have tried to come up with a design document regarding table formats and partitioning:

Sorry for its length. Inputs are very welcome!

@houqp
Member

houqp commented Sep 3, 2021

Thank you @rdettai for the detailed write-up. I recommend sending it to the Arrow dev mailing list too, since it's a pretty major design change.

@houqp
Member

houqp commented Oct 18, 2021

I think this can be closed now with @rdettai's awesome new listing table provider.

@rdettai
Contributor

rdettai commented Oct 18, 2021

ListingTable does not implement it yet, but I will open a PR, probably this week, to get started on it 😉

@houqp
Member

houqp commented Oct 18, 2021

oh right, but at least we now have a single implementation to cover all file formats :D

@rdettai
Contributor

rdettai commented Oct 18, 2021

@houqp I opened #1139 to add the feature to the listing provider, so we can close this one!
