Regression: Incorrect results when reading parquet files with different schemas and statistics #8532

alamb · 2023-12-13T19:48:29Z

Describe the bug

There is a problem where the statistics for the wrong columns can be used to prune parquet row groups, leading to potentially incorrect answers (missing data)

It affects cases where there are predicates that can be used to prune row groups. This used to work

To Reproduce

I am working on a self contained reproducer

Here is what I found in IOx: for the file that is being pruned incorrectly, the state column is a string with the value MA but the max value from the statistics that is returned is as follows (it turns out it is the statistics for the time column):

+-------+
| state |
+-------+
| 50    |
+-------+

I dug into the issue some more, and I found that It appears to be a bug in how the column index names are looked up by the parquet_column function (that we sadly added to handle another bug 🤦 )

Here is the arrow schema:

Schema {
  fields: [
    Field { name: "area", data_type: UInt64, nullable: true },
    Field { name: "city", data_type: Dictionary(Int32, Utf8) },
    Field { name: "max_temp", data_type: Float64, nullable: true },
    Field { name: "min_temp", data_type: Float64, nullable: true },
    Field { name: "no_overlap", data_type: Float64, nullable: true },
    Field { name: "state", data_type: Dictionary(Int32, Utf8), nullable: true },  <--- index 5
    Field { name: "time", data_type: Timestamp(Nanosecond, None), nullable: false },
  ]
}

Note the state column is index 5

The code in parquet_column resolves column 5 to index 5 in the parquet schema

The paruqet_file in question has this schema

GroupType { basic_info: BasicTypeInfo { name: "arrow_schema", repetition: None, converted_type: NONE, logical_type: None, id: None },
  fields: [
    PrimitiveType { basic_info: BasicTypeInfo { name: "area", repetition: Some(OPTIONAL), converted_type: UINT_64, logical_type: Some(Integer { bit_width: 64, is_signed: false }), id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 },
    PrimitiveType { basic_info: BasicTypeInfo { name: "city", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 },
    PrimitiveType { basic_info: BasicTypeInfo { name: "max_temp", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: DOUBLE, type_length: -1, scale: -1, precision: -1 },
    PrimitiveType { basic_info: BasicTypeInfo { name: "min_temp", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: DOUBLE, type_length: -1, scale: -1, precision: -1 },
    PrimitiveType { basic_info: BasicTypeInfo { name: "no_overlap", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: DOUBLE, type_length: -1, scale: -1, precision: -1 },
    PrimitiveType { basic_info: BasicTypeInfo { name: "state", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 },
    PrimitiveType { basic_info: BasicTypeInfo { name: "time", repetition: Some(REQUIRED), converted_type: NONE, logical_type: Some(Timestamp { is_adjusted_to_u_t_c: false, unit: NANOS(NanoSeconds) }), id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }
 ]
}

Note that state is actually at offset 6 in this file, which leads to the statistics being incorrectly read and therefore pruned.

Expected behavior

The correct statistics should be used

Additional context

Kudos to @appletreeisyellow for helping track down this issue.

This was introduced in #8294 / 06bbe12 by myself and @tustvold

At the core, the problem is that previously parquet statistics were looked up by name using the parquet schema. We added a lookup on arrow schema first to make sure the right parquet field was found, but we used the wrong arrow scheam.

@tustvold pointed out that this is incorrect for nested types (#8335), and thus we fixed the problem by doing the lookup by name in the arrow schema, and then mapping to the parquet schema by column index and checking that it was not part of a nested type.

However, this mapping failed to account for the fact that each individual parquet file may have a different schema than the overall table (and ParquetExec handles this mapping)

The text was updated successfully, but these errors were encountered:

alamb · 2023-12-13T21:58:29Z

I have a PR with a proposed fix ready for review: #8533

alamb added the bug Something isn't working label Dec 13, 2023

alamb self-assigned this Dec 13, 2023

alamb changed the title ~~Incorrect results when reading parquet files with different schemas and statistics~~ Regression: Incorrect results when reading parquet files with different schemas and statistics Dec 13, 2023

alamb mentioned this issue Dec 13, 2023

Fix regression with Incorrect results when reading parquet files with different schemas and statistics #8533

Merged

andygrove closed this as completed in #8533 Dec 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression: Incorrect results when reading parquet files with different schemas and statistics #8532

Regression: Incorrect results when reading parquet files with different schemas and statistics #8532

alamb commented Dec 13, 2023 •

edited

Loading

alamb commented Dec 13, 2023

Regression: Incorrect results when reading parquet files with different schemas and statistics #8532

Regression: Incorrect results when reading parquet files with different schemas and statistics #8532

Comments

alamb commented Dec 13, 2023 • edited Loading

Describe the bug

To Reproduce

Expected behavior

Additional context

alamb commented Dec 13, 2023

alamb commented Dec 13, 2023 •

edited

Loading