Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RecordBatch conversion from pyarrow loses Schema's metadata #5354

Closed
atwam opened this issue Jan 31, 2024 · 13 comments · Fixed by #5355
Closed

RecordBatch conversion from pyarrow loses Schema's metadata #5354

atwam opened this issue Jan 31, 2024 · 13 comments · Fixed by #5355
Labels
arrow Changes to the arrow crate bug

Comments

@atwam
Copy link
Contributor

atwam commented Jan 31, 2024

Describe the bug

When importing a pyarrow RecordBatch, we instantiate a StructArray, then convert it to a RecordBatch.
This loses the metadata from the original schema. This is not seen in the current tests, because pyarrow's
tests for equality on schemas ignores metadata.

let array = StructArray::from(array_data);
return Ok(array.into());

To Reproduce

# In arrow-pyarrow-integration-testing/tests/test_sql.py:470
batch = pa.record_batch([f32_array], ["tensor"], metadata={b'key1': b'value1'})
    b = rust.round_trip_record_batch(batch)
    assert b == batch  # OK
    assert b.schema == batch.schema  # OK
    assert b.schema.metadata == batch.schema.metadata  # Fails

Expected behavior
Imported RecordBatch's schema should keep its metadata.

@tustvold
Copy link
Contributor

This looks to have been introduced by #5070 perhaps @kylebarron you might be able to take a look. Also tagging @pitrou for some pyarrow knowledge

@atwam
Copy link
Contributor Author

atwam commented Jan 31, 2024

That's correct, I think #5070 introduced it, by transforming from a StructArray in the capsule case. The code for non-capsule conversion will work fine, because it keeps the schema around.
On the pyarrow side, I was surprised that a.schema == b.schema would work when fields match, but metadata is different (which is how tests were green).

@pitrou
Copy link
Member

pitrou commented Jan 31, 2024

On the pyarrow side, I was surprised that a.schema == b.schema would work when fields match, but metadata is different (which is how tests were green).

Sorry for that. Indeed, by default Schema equality doesn't check the metadata. You would have to change the code to a.equals(b, check_metadata=True).

(see https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.equals)

@atwam
Copy link
Contributor Author

atwam commented Jan 31, 2024

Sorry for that. Indeed, by default Schema equality doesn't check the metadata. You would have to change the code to a.equals(b, check_metadata=True).

Happy to change the unit tests to that syntax if you prefer

@pitrou
Copy link
Member

pitrou commented Jan 31, 2024

IMHO that would be a good idea, since it would make the tests stricter :-)

@atwam
Copy link
Contributor Author

atwam commented Jan 31, 2024

Interestingly, changing the tests (which I've done locally) raise another error:
in test_sql.py:test_tensor_array, schemas are not equal even after my fixes:

> print("Orig schema:", batch.schema)
Orig schema: tensor: extension<arrow.fixed_shape_tensor[value_type=float, shape=[2,3]]>
-- schema metadata --
key1: 'value1'

> print("New schema:", b.schema)
New schema: tensor: extension<arrow.fixed_shape_tensor[value_type=float, shape=[2,3]]>
  -- field metadata --
  ARROW:extension:name: 'arrow.fixed_shape_tensor'
  ARROW:extension:metadata: '{"shape":[2,3]}'
-- schema metadata --
key1: 'value1'

It looks like the round-trip adds two fields metadata on the tensor field, that were not there in the original pa.record_batch

@atwam
Copy link
Contributor Author

atwam commented Jan 31, 2024

The source of that additional metadata seems to come from arrow's code, which is called by pyarrow to export the array and schema. pyarrow calls arrow's ExportArray, which itself exports the schema and adds metadata to the field to handle its extension type.

https://github.com/apache/arrow/blob/787afa1594586d2d556d21471647f9cd2c55b18f/cpp/src/arrow/c/bridge.cc#L295-L304

@pitrou
Copy link
Member

pitrou commented Jan 31, 2024

That is similar to how things are done for Arrow IPC. Knowledge of the extension type is serialized as metadata fields, and then extracted from the metadata on deserialization.

@atwam
Copy link
Contributor Author

atwam commented Jan 31, 2024

... it still means that schema.equals fails when there are extension fields, because the original pyarrow schema does not have it, yet exports it to the rust object, and it is still present when it comes back from its round-trip.

A few solutions come to mind:

  • Either we just test that the schema metadata is the same, and ignore fields metadata.
  • Or we allow the returned schema to be different, but ensure that any metadata in the original schema/fields comes back unharmed (and we allow for new metadata keys)

@pitrou
Copy link
Member

pitrou commented Jan 31, 2024

Hmm, actually, perhaps it should fixed on the Arrow C++ side. I'll take a look.

@pitrou
Copy link
Member

pitrou commented Jan 31, 2024

Ok, I opened apache/arrow#39865 for Arrow C++.

@kylebarron
Copy link
Contributor

I believe I also hit this recently but hadn't debugged where the conversion was failing. Thanks for the issue and PR!

I've also hit related pyarrow issues around extension types and schema metadata. Not sure what the right solution is to those (personally I wish the schema metadata were always accessible on the Field even when an extension type is registered, because it's possible that a third party has registered an extension and you don't know what attributes are guaranteed to exist on that class).

@tustvold tustvold added the arrow Changes to the arrow crate label Mar 1, 2024
@tustvold
Copy link
Contributor

tustvold commented Mar 1, 2024

label_issue.py automatically added labels {'arrow'} from #5355

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants