Parquet uses row group row count if missing from header (#13712)
While investigating [this issue](#13664), I noticed that the provided file reports 0 rows in its header. This caused cudf's parquet reader to fail on the file, even though other tools such as `parq` and `parquet-tools` read it without issue. This change sums the row counts of the file's row groups and raises an error if that total differs from the header's row count, unless the header reports 0, in which case the row-group total is used instead. This allows us to read the data inside this file. Note that the data is not yet parsed correctly as a list of structs; that will be fixed in another PR. I didn't add a test, since this is the only file I have seen with this issue and cudf can't read it yet. The PR for that issue will add a test that reads this file, which will cover this change as well.

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13712
hyperbolic2346 authored Jul 18, 2023
1 parent 494535e commit 9fe1270
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion cpp/src/io/parquet/reader_impl_helpers.cpp
@@ -233,7 +233,13 @@ int64_t aggregate_reader_metadata::calc_num_rows() const
 {
   return std::accumulate(
     per_file_metadata.begin(), per_file_metadata.end(), 0l, [](auto& sum, auto& pfm) {
-      return sum + pfm.num_rows;
+      auto const rowgroup_rows = std::accumulate(
+        pfm.row_groups.begin(), pfm.row_groups.end(), 0l, [](auto& rg_sum, auto& rg) {
+          return rg_sum + rg.num_rows;
+        });
+      CUDF_EXPECTS(pfm.num_rows == 0 || pfm.num_rows == rowgroup_rows,
+                   "Header and row groups disagree about number of rows in file!");
+      return sum + (pfm.num_rows == 0 && rowgroup_rows > 0 ? rowgroup_rows : pfm.num_rows);
     });
 }
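
The fallback in the hunk above can be tried outside of cudf with a minimal standalone sketch. The structs `row_group_meta` and `file_meta` and the free function `calc_num_rows` below are hypothetical stand-ins, not cudf types, and `std::runtime_error` is assumed here in place of `CUDF_EXPECTS`:

```cpp
#include <cstdint>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for the parquet metadata; only the fields used by
// the row-count logic are modeled.
struct row_group_meta {
  int64_t num_rows;
};

struct file_meta {
  int64_t num_rows;                        // row count reported by the file header
  std::vector<row_group_meta> row_groups;  // per-row-group metadata
};

// Sum rows across files, using the row-group total when a header reports 0.
int64_t calc_num_rows(std::vector<file_meta> const& files)
{
  int64_t total = 0;
  for (auto const& pfm : files) {
    auto const rowgroup_rows = std::accumulate(
      pfm.row_groups.begin(),
      pfm.row_groups.end(),
      int64_t{0},
      [](int64_t rg_sum, row_group_meta const& rg) { return rg_sum + rg.num_rows; });

    // A non-zero header count must agree with the row groups.
    // (std::runtime_error is an assumption standing in for CUDF_EXPECTS.)
    if (pfm.num_rows != 0 && pfm.num_rows != rowgroup_rows) {
      throw std::runtime_error("Header and row groups disagree about number of rows in file!");
    }

    // A header count of 0 is treated as "missing", so fall back to the row groups.
    total += (pfm.num_rows == 0) ? rowgroup_rows : pfm.num_rows;
  }
  return total;
}
```

With this sketch, a file whose header reports 0 rows but whose row groups hold 3 and 4 rows contributes 7 to the total, while a header of 5 over the same row groups throws.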
