This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Reading Parquet int96 timestamps into arrow timestamps overflows #1527

Open
jaychia opened this issue Aug 4, 2023 · 0 comments
jaychia commented Aug 4, 2023

I am running into an issue where reading Parquet int96 timestamps into arrow2 timestamp[ns] arrays can silently overflow, producing wrong results.

This issue was also noted in pyarrow/arrow-cpp as ARROW-12096.
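For context, Arrow's timestamp[ns] type stores nanoseconds since the Unix epoch in an i64, which only spans roughly 1677-09-21 through 2262-04-11. A quick sketch (epoch-second constants computed for the three example dates used below, proleptic Gregorian calendar) shows that two of the three timestamps cannot be represented:

```rust
// timestamp[ns] is an i64 count of nanoseconds since the Unix epoch,
// which only covers roughly 1677-09-21 through 2262-04-11.
fn seconds_to_ns(seconds: i64) -> Option<i64> {
    // `checked_mul` returns None instead of overflowing.
    seconds.checked_mul(1_000_000_000)
}

fn main() {
    // Seconds since the Unix epoch for the three example timestamps.
    let y1000: i64 = -30_610_224_000; // 1000-01-01T00:00:00Z
    let y2000: i64 = 946_684_800; //      2000-01-01T00:00:00Z
    let y3000: i64 = 32_503_680_000; //   3000-01-01T00:00:00Z

    assert!(seconds_to_ns(y1000).is_none()); // overflows i64 nanoseconds
    assert!(seconds_to_ns(y2000).is_some()); // fits
    assert!(seconds_to_ns(y3000).is_none()); // overflows i64 nanoseconds
}
```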


Here is a quick example:

First, write a Parquet file with int96 timestamps, where some timestamps are out of range for the timestamp[ns] type:

import pyarrow as pa
import pyarrow.parquet as papq
import datetime

# Use PyArrow to write Parquet files with int96 timestamps
table = pa.Table.from_pydict({
    "timestamps": pa.array([
        datetime.datetime(1000, 1, 1),
        datetime.datetime(2000, 1, 1),
        datetime.datetime(3000, 1, 1),
    ], pa.timestamp("ms"))
})
# store_schema=False omits the serialized Arrow schema from the file metadata,
# so readers must infer the Arrow type from the Parquet int96 physical type
papq.write_table(table, "timestamps.parquet", use_deprecated_int96_timestamps=True, store_schema=False)

Reading this file in a unit test results in an overflow panic:

use std::io::Cursor;

use arrow2::error::Result;
use arrow2::io::parquet::read::{infer_schema, read_metadata, FileReader};

#[test]
fn read_int96_timestamps() -> Result<()> {
    let timestamp_data = &[
        0x50, 0x41, 0x52, 0x31, 0x15, 0x04, 0x15, 0x48, 0x15, 0x3c, 0x4c, 0x15, 0x06, 0x15, 0x00,
        0x12, 0x00, 0x00, 0x24, 0x00, 0x00, 0x0d, 0x01, 0x08, 0x9f, 0xd5, 0x1f, 0x0d, 0x0a, 0x44,
        0x00, 0x00, 0x59, 0x68, 0x25, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x14,
        0xfb, 0x2a, 0x00, 0x15, 0x00, 0x15, 0x14, 0x15, 0x18, 0x2c, 0x15, 0x06, 0x15, 0x10, 0x15,
        0x06, 0x15, 0x06, 0x1c, 0x00, 0x00, 0x00, 0x0a, 0x24, 0x02, 0x00, 0x00, 0x00, 0x06, 0x01,
        0x02, 0x03, 0x24, 0x00, 0x26, 0x9e, 0x01, 0x1c, 0x15, 0x06, 0x19, 0x35, 0x10, 0x00, 0x06,
        0x19, 0x18, 0x0a, 0x74, 0x69, 0x6d, 0x65, 0x73, 0x74, 0x61, 0x6d, 0x70, 0x73, 0x15, 0x02,
        0x16, 0x06, 0x16, 0x9e, 0x01, 0x16, 0x96, 0x01, 0x26, 0x60, 0x26, 0x08, 0x29, 0x2c, 0x15,
        0x04, 0x15, 0x00, 0x15, 0x02, 0x00, 0x15, 0x00, 0x15, 0x10, 0x15, 0x02, 0x00, 0x00, 0x00,
        0x15, 0x04, 0x19, 0x2c, 0x35, 0x00, 0x18, 0x06, 0x73, 0x63, 0x68, 0x65, 0x6d, 0x61, 0x15,
        0x02, 0x00, 0x15, 0x06, 0x25, 0x02, 0x18, 0x0a, 0x74, 0x69, 0x6d, 0x65, 0x73, 0x74, 0x61,
        0x6d, 0x70, 0x73, 0x00, 0x16, 0x06, 0x19, 0x1c, 0x19, 0x1c, 0x26, 0x9e, 0x01, 0x1c, 0x15,
        0x06, 0x19, 0x35, 0x10, 0x00, 0x06, 0x19, 0x18, 0x0a, 0x74, 0x69, 0x6d, 0x65, 0x73, 0x74,
        0x61, 0x6d, 0x70, 0x73, 0x15, 0x02, 0x16, 0x06, 0x16, 0x9e, 0x01, 0x16, 0x96, 0x01, 0x26,
        0x60, 0x26, 0x08, 0x29, 0x2c, 0x15, 0x04, 0x15, 0x00, 0x15, 0x02, 0x00, 0x15, 0x00, 0x15,
        0x10, 0x15, 0x02, 0x00, 0x00, 0x00, 0x16, 0x9e, 0x01, 0x16, 0x06, 0x26, 0x08, 0x16, 0x96, 
        0x01, 0x14, 0x00, 0x00, 0x28, 0x20, 0x70, 0x61, 0x72, 0x71, 0x75, 0x65, 0x74, 0x2d, 0x63,
        0x70, 0x70, 0x2d, 0x61, 0x72, 0x72, 0x6f, 0x77, 0x20, 0x76, 0x65, 0x72, 0x73, 0x69, 0x6f,
        0x6e, 0x20, 0x31, 0x32, 0x2e, 0x30, 0x2e, 0x30, 0x19, 0x1c, 0x1c, 0x00, 0x00, 0x00, 0x95,
        0x00, 0x00, 0x00, 0x50, 0x41, 0x52, 0x31
    ];
    let mut reader = Cursor::new(timestamp_data);

    let metadata = read_metadata(&mut reader)?;
    let schema = infer_schema(&metadata)?;
    let reader = FileReader::new(reader, metadata.row_groups, schema, Some(5), None, None);

    let x = reader.collect::<Result<Vec<_>>>().unwrap();
    println!("{:?}", x);
    Ok(())
}
---- io::parquet::read::read_int96_timestamps stdout ----
thread 'io::parquet::read::read_int96_timestamps' panicked at 'attempt to multiply with overflow', /Users/jaychia/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.17.2/src/types.rs:112:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
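The panic comes from converting the int96 pair (nanoseconds within the day, Julian day number) into i64 nanoseconds since the epoch. Below is a hedged sketch of that conversion (an assumption about what parquet2's types.rs does; the constant 2_440_588 is the Julian day number of the Unix epoch), written with checked arithmetic so the overflow surfaces as None instead of a panic:

```rust
// Julian day number of 1970-01-01 (the Unix epoch).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const SECONDS_PER_DAY: i64 = 86_400;

// Sketch of int96 -> i64 nanoseconds-since-epoch. The released parquet2
// code multiplies unchecked, which panics in debug builds for dates whose
// nanosecond count does not fit in an i64.
fn int96_to_ns(nanos_in_day: i64, julian_day: i64) -> Option<i64> {
    let days = julian_day - JULIAN_DAY_OF_EPOCH;
    days.checked_mul(SECONDS_PER_DAY * 1_000_000_000)?
        .checked_add(nanos_in_day)
}

fn main() {
    // Julian day 2_451_545 is 2000-01-01: in range.
    assert_eq!(int96_to_ns(0, 2_451_545), Some(946_684_800_000_000_000));
    // Julian day 2_816_788 is 3000-01-01: overflows i64 nanoseconds.
    assert_eq!(int96_to_ns(0, 2_816_788), None);
}
```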

Solution

I would like to propose a two part solution:

  1. When reading int96 data, take the requested TimeUnit into account instead of always defaulting to Nanoseconds (see PR: Correctly coerce Parquet Int96 timestamps into requested TimeUnits #1532).
  2. When inferring a schema from Parquet, allow clients to pass in ParquetSchemaInferenceOptions, which lets users specify how arrow2 should infer Arrow types for Parquet int96 columns (see PR: Add SchemaInferenceOptions options to infer_schema and option to configure int96 inference #1533):

     arrow2::io::parquet::read::schema::infer_schema_with_options(..., options: ParquetSchemaInferenceOptions)