RecordBatch conversion from pyarrow loses Schema's metadata #5354
Comments
This looks to have been introduced by #5070. Perhaps @kylebarron you might be able to take a look. Also tagging @pitrou for some pyarrow knowledge.
That's correct, I think #5070 introduced it, by transforming from a |
Sorry for that. Indeed, by default Schema equality doesn't check the metadata. You would have to change the code to (see https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.equals)
Happy to change the unit tests to that syntax if you prefer.
IMHO that would be a good idea, since it would make the tests stricter :-)
Interestingly, changing the tests (which I've done locally) raises another error:
It looks like the round-trip adds two fields metadata on the |
The source of that additional metadata seems to come from arrow's code, which is called by pyarrow to export the array and schema. pyarrow calls arrow's |
That is similar to how things are done for Arrow IPC. Knowledge of the extension type is serialized as metadata fields, and then extracted from the metadata on deserialization.
... it still means that A few solutions come to mind:
Hmm, actually, perhaps it should be fixed on the Arrow C++ side. I'll take a look.
Ok, I opened apache/arrow#39865 for Arrow C++.
I believe I also hit this recently but hadn't debugged where the conversion was failing. Thanks for the issue and PR! I've also hit related pyarrow issues around extension types and schema metadata. Not sure what the right solution is to those (personally I wish the schema metadata were always accessible on the |
Describe the bug
When importing a pyarrow RecordBatch, we instantiate a StructArray, then convert it to a RecordBatch. This loses the metadata from the original schema. This is not seen in the current tests, because pyarrow's default equality test on schemas ignores metadata.
arrow-rs/arrow/src/pyarrow.rs
Lines 357 to 358 in 31cf5ce
To Reproduce
Expected behavior
Imported RecordBatch's schema should keep its metadata.