Extension type support #2444

Renkai · 2022-08-14T13:48:55Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

My friends wrote a data format lance that has interconvertibility with parquet, and I want to make another implementation with Rust.
However, they used EXTENTION type,
it seems has not been implemented in arrow-rs.

Describe the solution you'd like

Let the reader can convert the parquet file with EXTENTION type to Arrow.
Describe alternatives you've considered

Additional context

Python code I used to generate such a file.

# with pyarrow-9.0.0
import pyarrow as pa


class UuidType(pa.PyExtensionType):

    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.binary(16))

    def __reduce__(self):
        return UuidType, ()


if __name__ == '__main__':
    uuid_type = UuidType()
    print(uuid_type.extension_name)
    print(uuid_type.storage_type)
    import uuid

    storage_array = pa.array([uuid.uuid4().bytes for _ in range(4)], pa.binary(16))
    arr = pa.ExtensionArray.from_storage(uuid_type, storage_array)
    print(arr)
    table = pa.Table.from_arrays([arr], names=["uuid"])
    import pyarrow.parquet as pq

    pq.write_table(table, "extension_example.parquet")
   
    # successfully read and print
    parquet_table = pq.read_table('extension_example.parquet')
    print("schema", parquet_table.schema)
    print("table", parquet_table)

Rust code that failed reading

        let input_file_name = "extension_example.parquet";
        //https://docs.rs/parquet/19.0.0/parquet/arrow/index.html
        use arrow::record_batch::RecordBatchReader;
        use parquet::arrow::{ParquetFileArrowReader, ArrowReader, ProjectionMask};
        use std::fs::File;

        let file = File::open(input_file_name).unwrap();

        let mut arrow_reader = ParquetFileArrowReader::try_new(file).unwrap();
        let mask = ProjectionMask::leaves(arrow_reader.parquet_schema(), [0]);
        println!("parquet schema is: {:?}", arrow_reader.parquet_schema());
        println!("Converted arrow schema is: {}", arrow_reader.get_schema().unwrap());

error log

thread 'tests::test_convert' panicked at 'called `Result::unwrap()` on an `Err` value: ArrowError("Unable to get root as message stored in ARROW:schema: Utf8Error { error: Utf8Error { valid_up_to: 0, error_len: Some(1) }, range: 216..255, error_trace: ErrorTrace([TableField { field_name: \"value\", position: 208 }, VectorElement { index: 0, position: 116 }, TableField { field_name: \"custom_metadata\", position: 92 }, VectorElement { index: 0, position: 48 }, TableField { field_name: \"fields\", position: 40 }, UnionVariant { variant: \"MessageHeader::Schema\", position: 24 }, TableField { field_name: \"header\", position: 24 }]) }")', src/lib.rs:39:77

The text was updated successfully, but these errors were encountered:

tustvold · 2022-08-14T15:58:22Z

Thank you for the example, I can reproduce this. The parquet file is produced with the following base64 encoded arrow schema

"/////zABAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAYAAAAAAASABgACAAGAAcADAAAABAAFAASAAAAAAABDxQAAADUAAAACAAAABQAAAAAAAAABAAAAHV1aWQAAAAAAgAAAFQAAAAEAAAAvP///wgAAAAgAAAAFAAAAEFSUk9XOmV4dGVuc2lvbjpuYW1lAAAAABcAAABhcnJvdy5weV9leHRlbnNpb25fdHlwZQAIAAwABAAIAAgAAAAIAAAAJAAAABgAAABBUlJPVzpleHRlbnNpb246bWV0YWRhdGEAAAAAJwAAAIAElRwAAAAAAAAAjAhfX21haW5fX5SMCFV1aWRUeXBllJOUKVKULgAAAAYACAAEAAYAAAAQAAAA"

I took this and converted it to its raw data with

echo "..." | base64 --decode > /tmp/output.bin

I trimmed the first 8 bytes (as they're arrow specific delimiter information)

tail -c +9 /tmp/output.bin > /tmp/output-trimmed.bin

I then decoded the flatbuffer

flatc --json --raw-binary format/Message.fbs -- /tmp/output-trimmed.bin

And got the error

  Unable to generate text for output-trimmed

This fits with the error that the rust implementation is returning, the schema has non-UTF-8 data encoded in a string field, which is technically illegal.

If I tell flatc to ignore this

flatc  --allow-non-utf8 --json --raw-binary format/Message.fbs -- /tmp/output-trimmed.bin

I get the decoded data

$ cat output-trimmed.json 
{
  version: "V5",
  header_type: "Schema",
  header: {
    fields: [
      {
        name: "uuid",
        nullable: true,
        type_type: "FixedSizeBinary",
        type: {
          byteWidth: 16
        },
        children: [

        ],
        custom_metadata: [
          {
            key: "ARROW:extension:metadata",
            value: "\x80\u0004\x95\u001C\u0000\u0000\u0000\u0000\u0000\u0000\u0000\x8C\b__main__\x94\x8C\bUuidType\x94\x93\x94)R\x94."
          },
          {
            key: "ARROW:extension:name",
            value: "arrow.py_extension_type"
          }
        ]
      }
    ]
  }
}

I'm not really sure what to make of this, KeyValue is defined as

table KeyValue {
  key: string;
  value: string;
}

Which as per the flatbuffer spec must be valid UTF-8. I will try to get some clarity on what is going on here - my understanding of the specification is the Rust implementation is correct to refuse this schema...

tustvold · 2022-08-15T08:51:49Z

Ok so it would appear that this is a known issue where pyarrow is writing ill-formed flatbuffers (here) for extension types. There isn't really much we can do here, a flatbuffer string field should not contain non-UTF-8 data, and in the case of Rust permitting this would not be sound (it could lead to UB). Having spoken with @jorgecarleitao I'm led to believe arrow2 also takes the approach of rejecting this.

The proper solution to the problem is for pyarrow to either base64 encode the payloads, or for the arrow specification to change KeyValue.value to be bytes not string. Both are probably going to be difficult to sell...

That being said, the embedded metadata is a pickled python class, which likely isn't hugely useful to a rust client anyway. Perhaps you could use skip_arrow_metadata to tell the parquet reader to just ignore the malformed embedded arrow schema, and just infer the data from the underlying parquet schema?

Renkai · 2022-08-16T15:01:24Z

@tustvold Thanks a lot!

I replaced the generator with this one, it basically changed the pa.PyExtensionType to pa.ExtensionType. The rust parquet parser works well now except it would read the data type as FixedSizeBinary(16). I think it's a slight difference in behavior from the C++ parser. For practice, I can continue my adventure, but would you consider making the community less divergent?

import pyarrow as pa


class UuidType(pa.ExtensionType):

    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(16),"lance.uuid")

    def __arrow_ext_serialize__(self):
        # since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(self, storage_type, serialized):
        # return an instance of this subclass given the serialized
        # metadata.
        return UuidType()


if __name__ == '__main__':
    uuid_type = UuidType()
    print(uuid_type.extension_name)
    print(uuid_type.storage_type)
    import uuid

    storage_array = pa.array([uuid.uuid4().bytes for _ in range(4)], pa.binary(16))
    arr = pa.ExtensionArray.from_storage(uuid_type, storage_array)
    print(arr)
    table = pa.Table.from_arrays([arr], names=["uuid"])
    import pyarrow.parquet as pq

    pq.write_table(table, "extension_example.parquet")

tustvold · 2022-08-16T17:02:08Z

I think it's a slight difference in behavior from the C++ parser

What is the behavior of the C++ parser?

Would you consider making the community less divergent?

Happy to be corrected on this, but I don't believe there is a standard for extension types, one could conceivably make the case that such a concept would be an oxymoron... If there is a general purpose data type that is missing from the standard set, I'm sure the community would be willing to consider additions to the arrow specification, and this would be the path to portability for that data type.

There is always going to be a trade-off between expressiveness and portability, with extension types sacrificing the latter in favour of the former. I'm not sure there is a way around this... FWIW just supporting the standard arrow types is ripe with excitement #1666

tustvold · 2023-06-01T15:00:38Z

Closing this as I don't believe it is tracking any missing functionality, feel free to reopen if I am mistaken

Renkai added the enhancement Any new improvement worthy of a entry in the changelog label Aug 14, 2022

jorgecarleitao mentioned this issue Feb 20, 2023

Discussion: relationship / unification of arrow-rs and arrow2 going forward #1176

Closed

alamb added the parquet Changes to the parquet crate label Apr 28, 2023

jiacai2050 mentioned this issue May 31, 2023

pass bytes directly to parquet's KeyValue #4317

Closed

tustvold closed this as completed Jun 1, 2023

wjones127 mentioned this issue Jun 30, 2023

Add an ExtensionType to DataType enum #4472

Open

tustvold mentioned this issue Mar 24, 2024

Flatbuffers Utf8 error when reading PyArrow written feather file #5547

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extension type support #2444

Extension type support #2444

Renkai commented Aug 14, 2022

tustvold commented Aug 14, 2022 •

edited

Loading

tustvold commented Aug 15, 2022 •

edited

Loading

Renkai commented Aug 16, 2022

tustvold commented Aug 16, 2022

tustvold commented Jun 1, 2023

Extension type support #2444

Extension type support #2444

Comments

Renkai commented Aug 14, 2022

tustvold commented Aug 14, 2022 • edited Loading

tustvold commented Aug 15, 2022 • edited Loading

Renkai commented Aug 16, 2022

tustvold commented Aug 16, 2022

tustvold commented Jun 1, 2023

tustvold commented Aug 14, 2022 •

edited

Loading

tustvold commented Aug 15, 2022 •

edited

Loading