This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Incorrect nullability inferred for nested parquet schema #1556

Closed
jhorstmann opened this issue Aug 30, 2023 · 2 comments

@jhorstmann
Contributor

I'm trying to read a parquet file that contains a struct inside a list using pola-rs, and I am getting null values for each element. I think I have tracked the issue down to the schema conversion from parquet to arrow.

The parquet_to_arrow_schema function tries to set the nullable flag of Field according to the parquet repetition levels. That flag is then used via the InitNested enum to calculate the level at which data is valid.
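To illustrate why a wrong nullable flag can turn every element into a null, here is a minimal sketch of how definition levels relate to repetition. The `Repetition` enum and both helper functions are hypothetical and are not the arrow2 or parquet2 types. The assumption that a field treated as nullable raises the expected level by one is a simplification of what a reader does, not a description of the actual `InitNested` code.

```rust
/// Parquet repetition of a schema node (simplified, hypothetical model).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Maximum definition level of a leaf, given the repetitions on the path
/// from the root to that leaf. Every OPTIONAL or REPEATED node adds one
/// level; REQUIRED nodes add none.
fn max_definition_level(path: &[Repetition]) -> u16 {
    path.iter()
        .filter(|r| **r != Repetition::Required)
        .count() as u16
}

/// A value counts as present only if its definition level reaches the
/// maximum level the reader expects for it.
fn is_valid(definition_level: u16, expected_max: u16) -> bool {
    definition_level >= expected_max
}

fn main() {
    use Repetition::*;

    // events (REQUIRED) -> array (REPEATED) -> event_name (REQUIRED)
    let written = max_definition_level(&[Required, Repeated, Required]);

    // For comparison: an OPTIONAL list with a REPEATED group and a REQUIRED leaf.
    let optional_list = max_definition_level(&[Optional, Repeated, Required]);

    // Assumption: treating the struct element as nullable makes the reader
    // expect one more level than the writer ever produced.
    let misread = written + 1;

    println!("written: {written}, optional list: {optional_list}, misread: {misread}");
    println!("valid when read correctly: {}", is_valid(written, written));
    println!("valid when misread: {}", is_valid(written, misread));
}
```

Under these assumptions the writer never emits a level high enough to satisfy the misread schema, so every struct would be decoded as null even though the data is present.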

My message schema looks like the following:

message eventlog {
  REQUIRED group events (LIST) {
    REPEATED group array {
      REQUIRED BYTE_ARRAY event_name (STRING);
      REQUIRED INT64 event_time (TIMESTAMP(MILLIS,true));
    }
  }
}

I would expect all fields to have the is_nullable flag set to false. Instead, the array field is marked as nullable. I think the issue can also be shown with the example schemas from parquet-format/LogicalTypes.md, which are tested in test_parquet_lists. The comments there do not match the assertions. For example:

        // // List<String> (list nullable, elements non-null)
        // optional group my_list (LIST) {
        //   repeated group element {
        //     required binary str (UTF8);
        //   };
        // }
        {
            arrow_fields.push(Field::new(
                "my_list",
                DataType::List(Box::new(Field::new("element", DataType::Utf8, true))),
                true,
            ));
        }

        // // List<Integer> (nullable list, non-null elements)
        // optional group my_list (LIST) {
        //   repeated int32 element;
        // }
        {
            arrow_fields.push(Field::new(
                "my_list",
                DataType::List(Box::new(Field::new("element", DataType::Int32, true))),
                true,
            ));
        }

According to the comments and the documentation, element should not be nullable in either example.
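The nullability I would expect can be sketched as follows. This is my simplified reading of the list rules in LogicalTypes.md, not the arrow2 implementation; `Repetition`, `ListLayout` and `nullability` are hypothetical names introduced only for this illustration.

```rust
/// Repetition of a non-repeated schema node (simplified, hypothetical model).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Repetition {
    Required,
    Optional,
}

/// How the repeated node inside a LIST-annotated group relates to the element.
enum ListLayout {
    /// The repeated node is itself the element, for example
    /// `repeated int32 element;` or a repeated group with several fields.
    RepeatedIsElement,
    /// The repeated group wraps a single child, and that child is the element.
    Wrapped { child: Repetition },
}

/// Returns `(list_nullable, element_nullable)`.
fn nullability(outer: Repetition, layout: ListLayout) -> (bool, bool) {
    // Only the repetition of the outer LIST group decides whether the list is nullable.
    let list_nullable = outer == Repetition::Optional;
    let element_nullable = match layout {
        // A repeated node cannot encode a null element on its own.
        ListLayout::RepeatedIsElement => false,
        // Otherwise the wrapped child's repetition decides.
        ListLayout::Wrapped { child } => child == Repetition::Optional,
    };
    (list_nullable, element_nullable)
}

fn main() {
    // optional group my_list (LIST) { repeated group element { required binary str (UTF8); } }
    let strings = nullability(
        Repetition::Optional,
        ListLayout::Wrapped { child: Repetition::Required },
    );
    // optional group my_list (LIST) { repeated int32 element; }
    let integers = nullability(Repetition::Optional, ListLayout::RepeatedIsElement);
    // REQUIRED group events (LIST) { REPEATED group array { two REQUIRED fields } }
    let events = nullability(Repetition::Required, ListLayout::RepeatedIsElement);

    println!("{strings:?} {integers:?} {events:?}");
}
```

In all three cases the element comes out as non-nullable, which matches the comments in the test rather than the `true` passed to the inner `Field::new` calls.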

I do not yet have a standalone test case and example file, but will try to provide one later.

@jhorstmann
Contributor Author

Sample file: eventlog.zip, generated with the Java implementation from parquet-mr.

The parquet-read tool from arrow-rs reads this without problems:

{case_id: "12345678", events: [[{event_name: "A", event_time: 1970-01-01 00:00:00 +00:00}, {event_name: "B", event_time: 1970-01-01 00:00:00 +00:00}, {event_name: "C", event_time: 1970-01-01 00:00:00 +00:00}]]}

parquet_read from arrow2 with an added dbg!(&chunk); gives this output:

Statistics {
    null_count: UInt64[0],
    distinct_count: UInt64[None],
    min_value: Utf8Array[12345678],
    max_value: Utf8Array[12345678],
}
Statistics {
    null_count: ListArray[[{event_name: 0, event_time: 0}]],
    distinct_count: ListArray[[{event_name: None, event_time: None}]],
    min_value: ListArray[[{event_name: A, event_time: 1970-01-01 00:00:00.001 +00:00}]],
    max_value: ListArray[[{event_name: C, event_time: 1970-01-01 00:00:00.003 +00:00}]],
}
[examples/parquet_read.rs:45] &chunk = Chunk {
    arrays: [
        Utf8Array[12345678],
        ListArray[[None, None, None]],
    ],
}

jhorstmann added a commit to jhorstmann/arrow2 that referenced this issue Sep 7, 2023
…to arrow

This allows the `parquet_read` example to correctly read the nested data
attached to issue jorgecarleitao#1556 and also makes several test assertions match the
comments above.
@jhorstmann
Contributor Author

Fixed by #1565
