Support for nullable bool, int in dataframes #504

ivirshup · 2021-02-09T02:17:12Z

What needs to happen

Support for nullable dtypes during IO. Allow for writing pandas string, integer, and boolean arrays (which can have null values) by saving a "null" mask along with them.

Example

import anndata as ad, pandas as pd, numpy as np

a = ad.AnnData(np.ones((3, 3)))

# Works fine
a.obs["np_bool"] = np.zeros(3, dtype=bool)
a.write("tmp.h5ad")

# Errors at write
a.obs["pd_bool"] = a.obs["np_bool"].astype(pd.BooleanDtype())
a.write("tmp.h5ad")

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Above error raised while writing key 'pd_bool' of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'obs' of <class 'h5py._hl.files.File'> from /.

Full traceback

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    208         try:
--> 209             return func(elem, key, val, *args, **kwargs)
    210         except Exception as e:

~/github/anndata/anndata/_io/h5ad.py in write_series(group, key, series, dataset_kwargs)
    290     else:
--> 291         group[key] = series.values
    292 

/usr/local/lib/python3.8/site-packages/h5py/_hl/group.py in __setitem__(self, name, obj)
    410             else:
--> 411                 ds = self.create_dataset(None, data=obj)
    412                 h5o.link(ds.id, self.id, name, lcpl=lcpl)

/usr/local/lib/python3.8/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    147 
--> 148             dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    149             dset = dataset.Dataset(dsid)

/usr/local/lib/python3.8/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
     88             dtype = numpy.dtype(dtype)
---> 89         tid = h5t.py_create(dtype, logical=1)
     90 

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    208         try:
--> 209             return func(elem, key, val, *args, **kwargs)
    210         except Exception as e:

~/github/anndata/anndata/_io/h5ad.py in write_dataframe(f, key, df, dataset_kwargs)
    264     for col_name, (_, series) in zip(col_names, df.items()):
--> 265         write_series(group, col_name, series, dataset_kwargs=dataset_kwargs)
    266 

~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    211             parent = _get_parent(elem)
--> 212             raise type(e)(
    213                 f"{e}\n\n"

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Above error raised while writing key 'pd_bool' of <class 'h5py._hl.group.Group'> from /.

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-32812d0f937a> in <module>
      1 a.obs["pd_bool"] = a.obs["np_bool"].astype(pd.BooleanDtype())
----> 2 a.write("tmp.h5ad")

~/github/anndata/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
   1877             filename = self.filename
   1878 
-> 1879         _write_h5ad(
   1880             Path(filename),
   1881             self,

~/github/anndata/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
    109         else:
    110             write_attribute(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
--> 111         write_attribute(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
    112         write_attribute(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
    113         write_attribute(f, "obsm", adata.obsm, dataset_kwargs=dataset_kwargs)

/usr/local/Cellar/[email protected]/3.8.6_2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/functools.py in wrapper(*args, **kw)
    873                             '1 positional argument')
    874 
--> 875         return dispatch(args[0].__class__)(*args, **kw)
    876 
    877     funcname = getattr(func, '__name__', 'singledispatch function')

~/github/anndata/anndata/_io/h5ad.py in write_attribute_h5ad(f, key, value, *args, **kwargs)
    130     if key in f:
    131         del f[key]
--> 132     _write_method(type(value))(f, key, value, *args, **kwargs)
    133 
    134 

~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    210         except Exception as e:
    211             parent = _get_parent(elem)
--> 212             raise type(e)(
    213                 f"{e}\n\n"
    214                 f"Above error raised while writing key {key!r} of {type(elem)}"

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Above error raised while writing key 'pd_bool' of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'obs' of <class 'h5py._hl.files.File'> from /.

I have a report from the wild of writing working here, but reading (by cellxgene) failing.

ivirshup · 2021-02-09T02:29:31Z

This is mildly complicated by h5py using a special enumerated dtype for numpy bool (possibly because numpy does not pack bool arrays into bit arrays), while BooleanDtype supports nullable values. Possible solutions:

1. Don't support nullable boolean values; just try to convert BooleanDtype to np.bool.
2. Define our own hdf5 enum for pandas booleans (we might need to do this for zarr as well).
3. Do the Julia thing: store an indicator bit array on the side for dtypes without canonical "missing" values.
   - This may have to be a boolean array, since numpy doesn't do bit arrays.

I like the third option since it's backend neutral and doesn't require doing anything fancy with hdf5 or zarr.

I suspect this issue will come up for nullable integer types as well. Maybe strings?
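A minimal sketch of the values-plus-mask idea from the third option, using only pandas and numpy. The helper names `split_nullable` and `join_nullable` are hypothetical and are not anndata's API; how the two arrays are laid out on disk is left out, and the fill value for missing slots is an arbitrary placeholder.

```python
import numpy as np
import pandas as pd


def split_nullable(arr, fill):
    """Split a pandas masked array into plain numpy (values, mask) arrays."""
    # True wherever the original entry is missing
    mask = np.asarray(arr.isna(), dtype=bool)
    # Missing slots get a placeholder so the values keep a native numpy dtype
    values = arr.to_numpy(dtype=arr.dtype.numpy_dtype, na_value=fill)
    return values, mask


def join_nullable(values, mask):
    """Rebuild the pandas masked array from the two stored arrays."""
    if values.dtype == bool:
        return pd.arrays.BooleanArray(values, mask)
    return pd.arrays.IntegerArray(values, mask)


flags = pd.array([True, None, False], dtype="boolean")
counts = pd.array([1, None, 3], dtype="Int64")

flag_values, flag_mask = split_nullable(flags, fill=False)
count_values, count_mask = split_nullable(counts, fill=0)

restored_flags = join_nullable(flag_values, flag_mask)
restored_counts = join_nullable(count_values, count_mask)
```

Both pieces are ordinary numpy arrays with native dtypes, so any backend that can store a bool or integer array could store them side by side.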

grst · 2021-03-31T12:23:57Z

Option 3 would basically be a masked array?

ivirshup · 2021-03-31T12:37:48Z

Structurally, I think so, but concepts like assignment differ. I don't think we'd actually use that module.

vitkl · 2021-11-17T10:18:00Z

I am wondering what's the progress on this issue. It is very annoying when analysis results don't get saved after several hours of work on HPC because a new column popped up with unsave-able object type (in a script that worked just fine the other day, e.g. no need to test for save-ability). So I would really appreciate if this is addressed.

Maybe you can do a temporary workaround that converts such objects to strings with a warning?
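A rough sketch of what such a workaround could look like on the user side, assuming the unwritable columns are the ones with object dtype. The helper name `stringify_object_columns` is hypothetical and not part of anndata. Note that casting to str may turn missing values into literal strings such as "None" or "nan", depending on the pandas version.

```python
import warnings

import pandas as pd


def stringify_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with object-dtype columns cast to str, warning for each."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_object_dtype(out[col]):
            warnings.warn(
                f"Column {col!r} has object dtype; converting to str before writing."
            )
            out[col] = out[col].astype(str)
    return out


# Toy stand-in for adata.obs with one mixed-type column
obs = pd.DataFrame({"n_genes": [10, 20], "mixed": [1, "a"]})
safe_obs = stringify_object_columns(obs)
```

The same helper could be applied to both `obs` and `var` before calling `write`.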

julie-jch · 2021-12-14T14:56:28Z

Hello,
I have 2 adata files that I'm trying to concatenate using the join='outer' parameter. When I try to write/save my adata I get the following error:

`TypeError: Can't implicitly convert non-string objects to strings

Above error raised while writing key 'mt-0' of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'var' of <class 'h5py._hl.files.File'> from /.`

When I concatenate without this join parameter I can write/save with no problems.
Is there any fix or workaround for this?
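One plausible cause, illustrated with plain pandas rather than anndata's own concatenation (so this is an assumption about the mechanism, not a confirmed diagnosis): an outer join fills rows that lack a column with NaN, and a boolean column containing NaN falls back to object dtype. Casting to the nullable "boolean" dtype keeps the missing entries as missing; writing that dtype needs an anndata version that includes the nullable support discussed in this thread.

```python
import pandas as pd

# Two toy "var" tables; only the first one has the boolean column
left = pd.DataFrame({"mt": [True, False]}, index=["gene_a", "gene_b"])
right = pd.DataFrame({"n_cells": [5]}, index=["gene_c"])

# Outer join keeps all columns and pads the gaps with NaN
merged = pd.concat([left, right], join="outer")
print(merged["mt"].dtype)

# Nullable boolean dtype: True/False stay as they are, NaN becomes <NA>
fixed = merged["mt"].astype("boolean")
print(fixed)
```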

vitkl · 2021-12-14T16:13:44Z

I generally convert all problematic variables to strings: obs['x'].astype(str)

julie-jch · 2021-12-14T16:35:39Z

Thank you, I'll try this.

ivirshup · 2021-12-14T19:15:15Z

> I am wondering what's the progress on this issue.

Specifically for nullable values, there should be a release candidate out before the holidays.

> It is very annoying when analysis results don't get saved after several hours of work on HPC because a new column popped up with unsave-able object type

I think this will be a possibility for the foreseeable future. It's just the nature of arrays interacting with Python's object system.

ivirshup · 2022-01-11T20:55:39Z

I'm going to make strings its own issue, and close this since the ints and bools are now supported.

brianpenghe · 2022-07-12T20:13:03Z

> I generally convert all problematic variables to strings: obs['x'].astype(str)

Thanks for the suggestion. In fact, var also needs to be converted.

ivirshup added the bug label Feb 9, 2021

ivirshup mentioned this issue Mar 31, 2021

Conversion to categorical makes None "None" #141

Closed

grst mentioned this issue Apr 27, 2021

Upgrading from 0.7.5 to 0.7.6 throws TypeError when saving h5ad #558

Closed

ivirshup changed the title ~~Error writing pandas BooleanDtype~~ Error writing pandas BooleanDtype (support for nullable bool, int, str) Nov 8, 2021

ivirshup added enhancement topic: io and removed bug labels Nov 9, 2021

ivirshup changed the title ~~Error writing pandas BooleanDtype (support for nullable bool, int, str)~~ Support for nullable bool, int, str columns in dataframes Nov 9, 2021

ivirshup modified the milestones: 0.8, 0.7.7 Nov 9, 2021

This was referenced Nov 9, 2021

Add panel metadata to AnnData export BodenmillerGroup/steinbock#66

Closed

TypeError: Object dtype dtype('O') has no native HDF5 equivalent #636

Open

ivirshup mentioned this issue Dec 10, 2021

Meta issue: new data types #662

Open

8 tasks

ivirshup mentioned this issue Dec 25, 2021

Support for nullable ints & bools #669

Merged

2 tasks

ivirshup mentioned this issue Jan 11, 2022

.write does not save None values #673

Open

ivirshup changed the title ~~Support for nullable bool, int, str columns in dataframes~~ Support for nullable bool, int in dataframes Jan 11, 2022

ivirshup closed this as completed Jan 11, 2022

ivirshup mentioned this issue Jan 11, 2022

Nullable string columns #679

Closed

grst mentioned this issue Jan 12, 2022

Consistent typing in adata.obs scverse/scirpy#190

Closed

This was referenced Jul 23, 2023

Can't save h5mu from Scirpy processed gex+bcr+tcr data if I copy airr into obs scverse/scirpy#434

Open

(Semi-)automatic conversion of nullable columns to the appropriate pandas arrays #1068

Open
