Add ThriftMetadataWriter for writing Parquet metadata #6197

Merged
merged 27 commits into from
Aug 6, 2024
Changes from 26 commits
Commits
741bbf6
bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041)
BugenZhao Jul 16, 2024
8f76248
Remove `impl<T: AsRef<[u8]>> From<T> for Buffer` that easily acciden…
XiangpengHao Jul 16, 2024
bb5f12b
Make display of interval types more pretty (#6006)
Rachelint Jul 16, 2024
756b1fb
Update snafu (#5930)
Jesse-Bakker Jul 16, 2024
fe04e09
Update Parquet thrift generated structures (#6045)
etseidl Jul 16, 2024
2e7f7ef
Revert "Revert "Write Bloom filters between row groups instead of the…
alamb Jul 16, 2024
effccc1
Revert "Update snafu (#5930)" (#6069)
alamb Jul 16, 2024
649d09d
Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075)
crepererum Jul 17, 2024
05e681d
remove repeated codes to make the codes more concise. (#6080)
Rachelint Jul 18, 2024
e40b311
Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068)
etseidl Jul 19, 2024
81c34ac
Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085)
dependabot[bot] Jul 23, 2024
3bc9987
Deprecate read_page_locations() and simplify offset index in `Parquet…
etseidl Jul 23, 2024
095130f
Merge remote-tracking branch 'apache/master' into 53.0.0-dev
alamb Jul 25, 2024
a6353d1
Update parquet/src/column/writer/mod.rs
alamb Jul 25, 2024
eeccaca
Upgrade protobuf definitions to flightsql 17.0 (#6133)
djanderson Jul 27, 2024
b07d057
Add `ParquetMetadataWriter` allow ad-hoc encoding of `ParquetMetadata`
adriangb Jul 24, 2024
e2be8d3
fix loading in test by etseidl
adriangb Jul 31, 2024
0175d53
add rough equivalence test
etseidl Jul 31, 2024
f188bf8
one more check
etseidl Jul 31, 2024
57b85d7
make clippy happy
etseidl Jul 31, 2024
1f3eb0b
Merge pull request #1 from etseidl/pr_6000_ets
adriangb Jul 31, 2024
4d1651c
separate tests that require arrow into a separate module
etseidl Jul 31, 2024
8691903
Merge remote-tracking branch 'origin/master' into test_merge5
etseidl Aug 1, 2024
241ee02
add histograms to to_thrift()
etseidl Aug 1, 2024
0b53d55
Merge pull request #2 from etseidl/fix_compile_check
adriangb Aug 5, 2024
4d7158f
Merge pull request #3 from etseidl/fix_checks_and_merge
adriangb Aug 5, 2024
590c4ed
Merge remote-tracking branch 'apache/master' into add-encode_metadata
alamb Aug 6, 2024
47 changes: 47 additions & 0 deletions parquet/src/file/page_index/index.rs
@@ -225,6 +225,53 @@ impl<T: ParquetValueType> NativeIndex<T> {
            boundary_order: index.boundary_order,
        })
    }

    pub(crate) fn to_thrift(&self) -> ColumnIndex {
        let min_values = self
            .indexes
            .iter()
            .map(|x| x.min_bytes().map(|x| x.to_vec()))
            .collect::<Option<Vec<_>>>()
            .unwrap_or_else(|| vec![vec![]; self.indexes.len()]);

        let max_values = self
            .indexes
            .iter()
            .map(|x| x.max_bytes().map(|x| x.to_vec()))
            .collect::<Option<Vec<_>>>()
            .unwrap_or_else(|| vec![vec![]; self.indexes.len()]);

        let null_counts = self
            .indexes
            .iter()
            .map(|x| x.null_count())
            .collect::<Option<Vec<_>>>();

        // Concatenate page histograms into a single Option<Vec>
        let repetition_level_histograms = self
            .indexes
            .iter()
            .map(|x| x.repetition_level_histogram().map(|v| v.values()))
            .collect::<Option<Vec<&[i64]>>>()
            .map(|hists| hists.concat());

        let definition_level_histograms = self
            .indexes
            .iter()
            .map(|x| x.definition_level_histogram().map(|v| v.values()))
            .collect::<Option<Vec<&[i64]>>>()
            .map(|hists| hists.concat());

        ColumnIndex::new(
            self.indexes.iter().map(|x| x.min().is_none()).collect(),
            min_values,
            max_values,
            self.boundary_order,
            null_counts,
            repetition_level_histograms,
            definition_level_histograms,
        )
    }
}

#[cfg(test)]
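
Note on the hunk above: `to_thrift()` relies on collecting an iterator of `Option<T>` into `Option<Vec<T>>`, which yields `None` as soon as any page is missing a value. The sketch below isolates that pattern using only the standard library. `PageStats` and the three helper functions are hypothetical stand-ins invented for illustration, not types or functions from the parquet crate, and the sketch is not the PR's implementation.

```rust
/// Hypothetical stand-in for one page's index entry (not a parquet crate type).
#[derive(Clone, Debug, PartialEq)]
struct PageStats {
    min: Option<Vec<u8>>,
    null_count: Option<i64>,
    rep_level_hist: Option<Vec<i64>>,
}

/// All-or-nothing: `Some(counts)` only if every page reports a null count.
fn null_counts(pages: &[PageStats]) -> Option<Vec<i64>> {
    pages.iter().map(|p| p.null_count).collect::<Option<Vec<_>>>()
}

/// If any page lacks a minimum, fall back to one empty byte vector per page,
/// mirroring the `unwrap_or_else` fallback used for `min_values` in the hunk.
fn min_values(pages: &[PageStats]) -> Vec<Vec<u8>> {
    pages
        .iter()
        .map(|p| p.min.as_deref().map(|m| m.to_vec()))
        .collect::<Option<Vec<_>>>()
        .unwrap_or_else(|| vec![vec![]; pages.len()])
}

/// Flatten the per-page histograms into a single vector in page order,
/// or return `None` if any page has no histogram.
fn concat_histograms(pages: &[PageStats]) -> Option<Vec<i64>> {
    pages
        .iter()
        .map(|p| p.rep_level_hist.as_deref())
        .collect::<Option<Vec<&[i64]>>>()
        .map(|hists| hists.concat())
}
```

With two pages whose histograms are `[1, 2]` and `[3]`, `concat_histograms` returns `Some(vec![1, 2, 3])`; if either page has no histogram, it returns `None`. One consequence of the all-or-nothing collect is that a single page without a minimum causes every entry in `min_values` to become empty, not just that page's entry.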