[BUG] Register super extension on to_arrow (#3030)
This is an issue where Daft Extension types were not getting converted to PyArrow properly. @jaychia discovered this while trying to write Parquet with a tensor column, where the Extension metadata for the tensor was getting dropped.

A simple test to reproduce the error:

```
import daft
import numpy as np
from daft import Series

# Create sample tensor data with some null values
tensor_data = [np.array([[1, 2], [3, 4]]), None, None]

# Uncomment this and it will work
# from daft.datatype import _ensure_registered_super_ext_type
# _ensure_registered_super_ext_type()

df_original = daft.from_pydict({"tensor_col": Series.from_pylist(tensor_data)})
print(df_original.to_arrow().schema)
```

Output:

```
tensor_col: struct<data: large_list<item: int64>, shape: large_list<item: uint64>>
  child 0, data: large_list<item: int64>
      child 0, item: int64
  child 1, shape: large_list<item: uint64>
      child 0, item: uint64
```

It's not a tensor type! However, if you uncomment the `_ensure_registered_super_ext_type()` call, you will now see:

```
tensor_col: extension<daft.super_extension<DaftExtension>>
```

The issue here is that `class DaftExtension(pa.ExtensionType):` is never registered during the FFI, as it is now behind a lazy import that must be triggered via `_ensure_registered_super_ext_type()`. This PR adds calls to this import in `to_arrow` for Series and Schema. However, I do not know if this is exhaustive, and I will give this more thought.

@desmondcheongzx @samster25

---------

Co-authored-by: Colin Ho <[email protected]>
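For context, the fix amounts to invoking the lazy registration hook at the top of each `to_arrow` path before crossing the FFI boundary. A minimal sketch of the idea follows; the `_inner` attribute and the method body are hypothetical stand-ins, not Daft's actual internals, and only `_ensure_registered_super_ext_type` comes from the PR itself:

```
import pyarrow as pa

from daft.datatype import _ensure_registered_super_ext_type


class Series:
    def to_arrow(self) -> pa.Array:
        # Force the lazy import that registers DaftExtension with PyArrow.
        # Without this, PyArrow does not know the extension type and the
        # extension metadata is silently dropped during conversion.
        _ensure_registered_super_ext_type()
        return self._inner.to_arrow()  # hypothetical internal Rust-side handle
```

The same guard would go at the top of `Schema.to_arrow` (per the PR description, both Series and Schema conversion paths get the call), since either path can be the first to hand a Daft extension type to PyArrow.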