This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Reading Parquet int96 timestamps into arrow timestamps overflows #1527

Open
jaychia opened this issue Aug 4, 2023 · 0 comments
jaychia commented Aug 4, 2023

I am running into an issue where reading Parquet int96 timestamps into arrow2 timestamp[ns] arrays can silently overflow, producing wrong results.

This issue was also noted in pyarrow/arrow-cpp as ARROW-12096.
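For context, Arrow's timestamp[ns] type stores nanoseconds since the Unix epoch in an i64, which only spans roughly 1677-09-21 through 2262-04-11. A quick sketch (epoch-second constants computed for the three example dates used below, proleptic Gregorian calendar) shows that two of the three timestamps cannot be represented:

```rust
// timestamp[ns] is an i64 count of nanoseconds since the Unix epoch,
// which only covers roughly 1677-09-21 through 2262-04-11.
fn seconds_to_ns(seconds: i64) -> Option<i64> {
    // `checked_mul` returns None instead of overflowing.
    seconds.checked_mul(1_000_000_000)
}

fn main() {
    // Seconds since the Unix epoch for the three example timestamps.
    let y1000: i64 = -30_610_224_000; // 1000-01-01T00:00:00Z
    let y2000: i64 = 946_684_800; //      2000-01-01T00:00:00Z
    let y3000: i64 = 32_503_680_000; //   3000-01-01T00:00:00Z

    assert!(seconds_to_ns(y1000).is_none()); // overflows i64 nanoseconds
    assert!(seconds_to_ns(y2000).is_some()); // fits
    assert!(seconds_to_ns(y3000).is_none()); // overflows i64 nanoseconds
}
```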


Here is a quick example:

First, write a Parquet file with int96 timestamps, where some timestamps are out of range for the timestamp[ns] type:

import pyarrow as pa
import pyarrow.parquet as papq
import datetime

# Use PyArrow to write Parquet files with int96 timestamps
table = pa.Table.from_pydict({
    "timestamps": pa.array([
        datetime.datetime(1000, 1, 1),
        datetime.datetime(2000, 1, 1),
        datetime.datetime(3000, 1, 1),
    ], pa.timestamp("ms"))
})
# store_schema=False omits the serialized Arrow schema from the file metadata,
# so readers must infer the Arrow type from the Parquet int96 physical type
papq.write_table(table, "timestamps.parquet", use_deprecated_int96_timestamps=True, store_schema=False)

Reading this file in a unit test results in an overflow panic:

use std::io::Cursor;

use arrow2::error::Result;
use arrow2::io::parquet::read::{infer_schema, read_metadata, FileReader};

#[test]
fn read_int96_timestamps() -> Result<()> {
    let timestamp_data = &[
        0x50, 0x41, 0x52, 0x31, 0x15, 0x04, 0x15, 0x48, 0x15, 0x3c, 0x4c, 0x15, 0x06, 0x15, 0x00,
        0x12, 0x00, 0x00, 0x24, 0x00, 0x00, 0x0d, 0x01, 0x08, 0x9f, 0xd5, 0x1f, 0x0d, 0x0a, 0x44,
        0x00, 0x00, 0x59, 0x68, 0x25, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x14,
        0xfb, 0x2a, 0x00, 0x15, 0x00, 0x15, 0x14, 0x15, 0x18, 0x2c, 0x15, 0x06, 0x15, 0x10, 0x15,
        0x06, 0x15, 0x06, 0x1c, 0x00, 0x00, 0x00, 0x0a, 0x24, 0x02, 0x00, 0x00, 0x00, 0x06, 0x01,
        0x02, 0x03, 0x24, 0x00, 0x26, 0x9e, 0x01, 0x1c, 0x15, 0x06, 0x19, 0x35, 0x10, 0x00, 0x06,
        0x19, 0x18, 0x0a, 0x74, 0x69, 0x6d, 0x65, 0x73, 0x74, 0x61, 0x6d, 0x70, 0x73, 0x15, 0x02,
        0x16, 0x06, 0x16, 0x9e, 0x01, 0x16, 0x96, 0x01, 0x26, 0x60, 0x26, 0x08, 0x29, 0x2c, 0x15,
        0x04, 0x15, 0x00, 0x15, 0x02, 0x00, 0x15, 0x00, 0x15, 0x10, 0x15, 0x02, 0x00, 0x00, 0x00,
        0x15, 0x04, 0x19, 0x2c, 0x35, 0x00, 0x18, 0x06, 0x73, 0x63, 0x68, 0x65, 0x6d, 0x61, 0x15,
        0x02, 0x00, 0x15, 0x06, 0x25, 0x02, 0x18, 0x0a, 0x74, 0x69, 0x6d, 0x65, 0x73, 0x74, 0x61,
        0x6d, 0x70, 0x73, 0x00, 0x16, 0x06, 0x19, 0x1c, 0x19, 0x1c, 0x26, 0x9e, 0x01, 0x1c, 0x15,
        0x06, 0x19, 0x35, 0x10, 0x00, 0x06, 0x19, 0x18, 0x0a, 0x74, 0x69, 0x6d, 0x65, 0x73, 0x74,
        0x61, 0x6d, 0x70, 0x73, 0x15, 0x02, 0x16, 0x06, 0x16, 0x9e, 0x01, 0x16, 0x96, 0x01, 0x26,
        0x60, 0x26, 0x08, 0x29, 0x2c, 0x15, 0x04, 0x15, 0x00, 0x15, 0x02, 0x00, 0x15, 0x00, 0x15,
        0x10, 0x15, 0x02, 0x00, 0x00, 0x00, 0x16, 0x9e, 0x01, 0x16, 0x06, 0x26, 0x08, 0x16, 0x96, 
        0x01, 0x14, 0x00, 0x00, 0x28, 0x20, 0x70, 0x61, 0x72, 0x71, 0x75, 0x65, 0x74, 0x2d, 0x63,
        0x70, 0x70, 0x2d, 0x61, 0x72, 0x72, 0x6f, 0x77, 0x20, 0x76, 0x65, 0x72, 0x73, 0x69, 0x6f,
        0x6e, 0x20, 0x31, 0x32, 0x2e, 0x30, 0x2e, 0x30, 0x19, 0x1c, 0x1c, 0x00, 0x00, 0x00, 0x95,
        0x00, 0x00, 0x00, 0x50, 0x41, 0x52, 0x31
    ];
    let mut reader = Cursor::new(timestamp_data);

    let metadata = read_metadata(&mut reader)?;
    let schema = infer_schema(&metadata)?;
    let reader = FileReader::new(reader, metadata.row_groups, schema, Some(5), None, None);

    let x = reader.collect::<Result<Vec<_>>>().unwrap();
    println!("{:?}", x);
    Ok(())
}
---- io::parquet::read::read_int96_timestamps stdout ----
thread 'io::parquet::read::read_int96_timestamps' panicked at 'attempt to multiply with overflow', /Users/jaychia/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.17.2/src/types.rs:112:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
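The panic comes from converting the int96 pair (nanoseconds within the day, Julian day number) into i64 nanoseconds since the epoch. Below is a hedged sketch of that conversion (an assumption about what parquet2's types.rs does; the constant 2_440_588 is the Julian day number of the Unix epoch), written with checked arithmetic so the overflow surfaces as None instead of a panic:

```rust
// Julian day number of 1970-01-01 (the Unix epoch).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const SECONDS_PER_DAY: i64 = 86_400;

// Sketch of int96 -> i64 nanoseconds-since-epoch. The released parquet2
// code multiplies unchecked, which panics in debug builds for dates whose
// nanosecond count does not fit in an i64.
fn int96_to_ns(nanos_in_day: i64, julian_day: i64) -> Option<i64> {
    let days = julian_day - JULIAN_DAY_OF_EPOCH;
    days.checked_mul(SECONDS_PER_DAY * 1_000_000_000)?
        .checked_add(nanos_in_day)
}

fn main() {
    // Julian day 2_451_545 is 2000-01-01: in range.
    assert_eq!(int96_to_ns(0, 2_451_545), Some(946_684_800_000_000_000));
    // Julian day 2_816_788 is 3000-01-01: overflows i64 nanoseconds.
    assert_eq!(int96_to_ns(0, 2_816_788), None);
}
```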

Solution

I would like to propose a two part solution:

  1. When reading int96 data, take the requested TimeUnit into account instead of always defaulting to Nanoseconds (see PR: Correctly coerce Parquet Int96 timestamps into requested TimeUnits #1532).
  2. When inferring a schema from Parquet, allow clients to pass in ParquetSchemaInferenceOptions, which lets users specify how arrow2 should infer Arrow types for Parquet int96 columns (see PR: Add SchemaInferenceOptions options to infer_schema and option to configure int96 inference #1533):

     arrow2::io::parquet::read::schema::infer_schema_with_options(..., options: ParquetSchemaInferenceOptions)