Support types other than String for partition columns on ListingTables #4221
I don't think every column type can be used as a partition column. Should there be a whitelist of supported types?
And can you explain a little how you will infer the partition column types or the partition spec from the file paths?
Thank you @doki23 -- I think the functionality needs some test of a non-string partition column before we would consider merging it. Otherwise how would we know if we caused a regression (aka broke) this feature?
Thanks for your reply. I believe the answer is the same for both questions -- each data source provides these types itself, if I am not mistaken. So what I do is to make schema of
Thank you. I did test them in Delta Lake, but I forgot to add unit tests for DataFusion. I'll add them.
It's because ListingTable treats type of partitioned columns as
@mingmwang @alamb Would you please help review my PR? I think it's ready now.
…nto nonstring-partitioned-cols
Thanks @doki23 -- this looks good to me. I had some style suggestions but nothing critical.
I'll plan to merge this PR tomorrow -- let me know if you want to delay so you can make any more changes
@@ -56,7 +57,11 @@ async fn parquet_distinct_partition_col() -> Result<()> {
 "year=2021/month=10/day=09/file.parquet",
 "year=2021/month=10/day=28/file.parquet",
 ],
-&["year", "month", "day"],
+&[
+("year", DataType::Int32),
Nice
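To illustrate what declaring a type per partition column means for paths like the ones in this test, here is a minimal sketch that parses Hive-style `key=value` path segments into typed values. It is not DataFusion's implementation: `ColType`, `PartitionValue`, and `parse_partitions` are hypothetical names, and `ColType` is a stand-in for arrow's `DataType` so the example compiles with the standard library alone.

```rust
/// Hypothetical stand-in for arrow's `DataType`; only the variants this sketch needs.
#[derive(Debug, Clone, PartialEq)]
enum ColType {
    Int32,
    Utf8,
}

/// Hypothetical typed value extracted from one `key=value` path segment.
#[derive(Debug, PartialEq)]
enum PartitionValue {
    Int32(i32),
    Utf8(String),
}

/// Walk the leading path segments in the order the partition columns are
/// declared and parse each value according to its declared type.
fn parse_partitions(
    path: &str,
    cols: &[(&str, ColType)],
) -> Result<Vec<PartitionValue>, String> {
    let mut segments = path.split('/');
    let mut values = Vec::with_capacity(cols.len());
    for (name, col_type) in cols {
        let segment = segments
            .next()
            .ok_or_else(|| format!("missing segment for partition column '{name}'"))?;
        let (key, raw) = segment
            .split_once('=')
            .ok_or_else(|| format!("segment '{segment}' is not of the form key=value"))?;
        if key != *name {
            return Err(format!("expected partition column '{name}', found '{key}'"));
        }
        let value = match col_type {
            ColType::Int32 => PartitionValue::Int32(
                raw.parse::<i32>()
                    .map_err(|e| format!("cannot parse '{raw}' as Int32: {e}"))?,
            ),
            ColType::Utf8 => PartitionValue::Utf8(raw.to_string()),
        };
        values.push(value);
    }
    Ok(values)
}
```

With `year` and `month` declared as `Int32`, the path `year=2021/month=10/day=09/file.parquet` yields integer values for those two columns, while a column declared as `Utf8` keeps the raw text (so `day=09` stays `"09"`).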
 /// ```
 pub fn with_table_partition_cols(
 mut self,
-table_partition_cols: Vec<String>,
+table_partition_cols: Vec<(String, DataType)>,
It would help me read the code if this was a named struct so the code could refer to .name and .datatype rather than .0 and .1. But I don't think it is necessary to merge this PR
#[derive(Clone)]
struct PartitionColumn {
name: String,
data_type: DataType
}
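As a rough sketch of how such a struct could sit alongside the tuple-based API, the following adds a `From` conversion so existing `(String, DataType)` pairs can be turned into named fields. This is an illustration only, not code from the PR: `ColType` is a hypothetical stand-in for arrow's `DataType`, and `describe` is a made-up helper that shows the field access.

```rust
/// Hypothetical stand-in for arrow's `DataType` so the sketch is self-contained.
#[derive(Debug, Clone, PartialEq)]
enum ColType {
    Int32,
    Utf8,
}

/// Named alternative to the `(String, DataType)` tuple.
#[derive(Debug, Clone, PartialEq)]
struct PartitionColumn {
    name: String,
    data_type: ColType,
}

/// Lets callers that still hold tuples convert them without touching `.0` / `.1`.
impl From<(String, ColType)> for PartitionColumn {
    fn from((name, data_type): (String, ColType)) -> Self {
        Self { name, data_type }
    }
}

/// Hypothetical helper: call sites read `c.name` and `c.data_type` by name.
fn describe(cols: &[PartitionColumn]) -> Vec<String> {
    cols.iter()
        .map(|c| format!("{}: {:?}", c.name, c.data_type))
        .collect()
}
```

The conversion keeps a tuple-taking signature usable while the internals read fields by name.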
@@ -876,7 +890,7 @@ impl AsLogicalPlan for LogicalPlanNode {
 FileFormatType::Avro(protobuf::AvroFormat {})
 } else {
 return Err(proto_error(format!(
-"Error converting file format, {:?} is invalid as a datafusion foramt.",
+"Error converting file format, {:?} is invalid as a datafusion format.",
👍
@alamb Thank you. I'll make more changes according to your suggestions tomorrow to make it better.
Co-authored-by: Andrew Lamb <[email protected]>
Looks good -- thank you @doki23
col.0.to_owned(),
self.table_schema
.field_with_name(&col.0)
.unwrap()
I think this is going to panic if the user specifies a partition column that is not present.
It would be nice to make it an error -- can you either do so as a follow on PR or file a ticket?
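One way to turn that panic into an error is to propagate the failed lookup instead of unwrapping it. The sketch below is a standalone illustration under stated assumptions, not the follow-on change itself: `Field`, `MissingPartitionColumn`, and `partition_fields` are hypothetical names, and the schema is modelled as a plain slice rather than arrow's `Schema`.

```rust
use std::fmt;

/// Hypothetical, simplified field; the real code would use arrow's `Field`.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

/// Hypothetical error returned when a partition column is not in the schema.
#[derive(Debug, PartialEq)]
struct MissingPartitionColumn(String);

impl fmt::Display for MissingPartitionColumn {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "partition column '{}' not found in table schema", self.0)
    }
}

impl std::error::Error for MissingPartitionColumn {}

/// Look up every partition column in the schema. A missing column becomes
/// an `Err` for the caller to handle instead of a panic from `unwrap()`.
fn partition_fields(
    schema: &[Field],
    partition_cols: &[&str],
) -> Result<Vec<Field>, MissingPartitionColumn> {
    partition_cols
        .iter()
        .map(|name| {
            schema
                .iter()
                .find(|f| f.name == *name)
                .cloned()
                .ok_or_else(|| MissingPartitionColumn(name.to_string()))
        })
        .collect()
}
```

Collecting an iterator of `Result` values into `Result<Vec<_>, _>` stops at the first missing column, so the caller sees which name was wrong.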
Benchmark runs are scheduled for baseline = 502b7e3 and contender = 55bf8e9. 55bf8e9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #4218.
Rationale for this change
To support partition columns of types other than String.
What changes are included in this PR?
This PR makes DataFusion extract the data type of each partition column from the file groups instead of defaulting to the Utf8 type.
Are these changes tested?
Already passed all tests including every *_with_partitions test.
Are there any user-facing changes?
Yes. Users can query partition columns using the types given in the Delta table definition.