Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support types other than String for partition columns on ListingTables #4221

Merged
merged 15 commits into from
Nov 23, 2022
Merged

Support types other than String for partition columns on ListingTables #4221

merged 15 commits into from
Nov 23, 2022

Conversation

doki23
Copy link
Contributor

@doki23 doki23 commented Nov 15, 2022

Which issue does this PR close?

Closes #4218.

Rationale for this change

For supporting more types of partitioned columns.

What changes are included in this PR?

This pr makes datafusion to extract data type of partitioned columns from file groups instead of setting a default Utf8 type.

Are these changes tested?

Already passed all tests including every *_with_partitions test.

Are there any user-facing changes?

Yes. Users can query partitioned columns by delta table definition.

@github-actions github-actions bot added the core Core DataFusion crate label Nov 15, 2022
@mingmwang
Copy link
Contributor

I think not all the types of columns can be used as partition columns, should there a white list for supported types?

@mingmwang
Copy link
Contributor

And can you explain a little how you will infer the partition column types or partition spec from the file paths ?
And what happened if there are conflicts?

@alamb
Copy link
Contributor

alamb commented Nov 15, 2022

Thank you @doki23 -- I think the functionality needs some test of a non-string partition column before we would consider merging it.

Otherwise how would we know if we caused a regression (aka broke) this feature?

@alamb alamb marked this pull request as draft November 15, 2022 19:42
@doki23
Copy link
Contributor Author

doki23 commented Nov 16, 2022

I think not all the types of columns can be used as partition columns, should there a white list for supported types?

And can you explain a little how you will infer the partition column types or partition spec from the file paths ? And what happened if there are conflicts?

Thanks for your reply. I believe the answers are same -- these types are provided by every datasource itself if I am not mistaken. So what I do is to make schema of RecordBatch matches that of table definition @mingmwang .

@doki23
Copy link
Contributor Author

doki23 commented Nov 16, 2022

Thank you @doki23 -- I think the functionality needs some test of a non-string partition column before we would consider merging it.

Otherwise how would we know if we caused a regression (aka broke) this feature?

Thank you. I did test them in delta lake but I forgot to add some unit tests for datafusion, I'll do it.

@doki23
Copy link
Contributor Author

doki23 commented Nov 16, 2022

It's because ListingTable treats type of partitioned columns as Utf8. I plan to change the function FileScanConfig.project() to provide a correct schema used by the physical plan created by ListingTable.

@doki23 doki23 marked this pull request as ready for review November 21, 2022 09:38
@doki23
Copy link
Contributor Author

doki23 commented Nov 21, 2022

@mingmwang @alamb Would you please help review my pr? I think it's ready now.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @doki23 -- this looks good to me. I had some style suggestions but nothing critical.

I'll plan to merge this PR tomorrow -- let me know if you want to delay so you can make any more changes

datafusion/core/src/datasource/listing_table_factory.rs Outdated Show resolved Hide resolved
datafusion/core/src/datasource/listing_table_factory.rs Outdated Show resolved Hide resolved
@@ -56,7 +57,11 @@ async fn parquet_distinct_partition_col() -> Result<()> {
"year=2021/month=10/day=09/file.parquet",
"year=2021/month=10/day=28/file.parquet",
],
&["year", "month", "day"],
&[
("year", DataType::Int32),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

/// ```
pub fn with_table_partition_cols(
mut self,
table_partition_cols: Vec<String>,
table_partition_cols: Vec<(String, DataType)>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help me read the code if this was a named struct so the code could refer to .name and .datatype rather than .0 and .1. But I don't think it is necessary to merge this PR

#[derive(Clone)]
struct PartitionColumn {
  name: String,
  data_type: DataType
}

@@ -876,7 +890,7 @@ impl AsLogicalPlan for LogicalPlanNode {
FileFormatType::Avro(protobuf::AvroFormat {})
} else {
return Err(proto_error(format!(
"Error converting file format, {:?} is invalid as a datafusion foramt.",
"Error converting file format, {:?} is invalid as a datafusion format.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@doki23
Copy link
Contributor Author

doki23 commented Nov 22, 2022

@alamb Thank you. I'll make more changes according to your suggestions tomorrow to make it better.

@doki23 doki23 requested a review from alamb November 23, 2022 01:19
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good -- thank you @doki23

col.0.to_owned(),
self.table_schema
.field_with_name(&col.0)
.unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is going to panic if the user specifies a partition column that is not present.

It would be nice to make it an error -- can you either do so as a follow on PR or file a ticket?

@alamb alamb changed the title Nonstring partitioned cols Support types other than String for partition columns on ListingTables Nov 23, 2022
@alamb alamb merged commit 55bf8e9 into apache:master Nov 23, 2022
@ursabot
Copy link

ursabot commented Nov 23, 2022

Benchmark runs are scheduled for baseline = 502b7e3 and contender = 55bf8e9. 55bf8e9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support scan non-string columns partitioned parquet files
4 participants