Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: omit min/max for interval columns when writing stats #5147

Merged
merged 2 commits into from
Nov 30, 2023

Conversation

Jefffrey
Copy link
Contributor

Which issue does this PR close?

Closes #5145

Rationale for this change

What changes are included in this PR?

Add extra checks before calculating min/max for chunks/pages, to ignore Interval columns

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 29, 2023
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me

@tustvold
Copy link
Contributor

What ColumnOrder are we currently writing for these columns?

@Jefffrey
Copy link
Contributor Author

What ColumnOrder are we currently writing for these columns?

I'm not sure, actually. I tried running this test in arrow_writer/mod.rs on master branch:

    #[test]
    fn test_123() {
        let a = Int32Array::from(vec![1, 2, 3, 4, 5]);
        let b = IntervalDayTimeArray::from(vec![0; 5]);
        let batch = RecordBatch::try_from_iter(vec![
            ("a", Arc::new(a) as ArrayRef),
            ("b", Arc::new(b) as ArrayRef),
        ])
        .unwrap();

        let mut buf = Vec::with_capacity(1024);
        let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None).unwrap();
        writer.write(&batch).unwrap();
        writer.close().unwrap();

        let bytes = Bytes::from(buf);
        let options = ReadOptionsBuilder::new().with_page_index().build();
        let reader = SerializedFileReader::new_with_options(bytes, options).unwrap();
        dbg!(reader.metadata().file_metadata().column_orders());
    }

Running:

arrow-rs$ cargo test -p parquet --lib arrow::arrow_writer::tests::test_123 -- --nocapture --exact
    Blocking waiting for file lock on build directory
   Compiling parquet v49.0.0 (/home/jeffrey/Code/arrow-rs/parquet)
    Finished test [unoptimized + debuginfo] target(s) in 11.49s
     Running unittests src/lib.rs (/media/jeffrey/1tb_860evo_ssd/.cargo_target_cache/debug/deps/parquet-a4f7a499e85a325c)

running 1 test
[parquet/src/arrow/arrow_writer/mod.rs:2760] reader.metadata().file_metadata().column_orders() = None
test arrow::arrow_writer::tests::test_123 ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 667 filtered out; finished in 0.00s

Even when I change it to only write the Int32Array, it is still none.

Not sure if I'm doing something wrong here?

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Nov 30, 2023

I noticed this:

let file_metadata = parquet::FileMetaData {
num_rows,
row_groups,
key_value_metadata,
version: self.props.writer_version().as_num(),
schema: types::to_thrift(self.schema.as_ref())?,
created_by: Some(self.props.created_by().to_owned()),
column_orders: None,
encryption_algorithm: None,
footer_signing_key_metadata: None,
};

  • Unsure if there are other places to consider

Looks like might be a separate issue, to implement writing ColumnOrder

@tustvold tustvold merged commit f621d28 into apache:master Nov 30, 2023
16 checks passed
@Jefffrey Jefffrey deleted the omit_interval_min_max_stats branch November 30, 2023 19:10
@Jefffrey
Copy link
Contributor Author

Raised #5152 for the column order issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet: Interval columns shouldn't write min/max stats
2 participants