Support for nullable bool, int in dataframes #504

Closed
ivirshup opened this issue Feb 9, 2021 · 10 comments

Comments

@ivirshup
Member

ivirshup commented Feb 9, 2021

What needs to happen

Support for nullable dtypes during IO. Allow for writing pandas string, integer, and boolean arrays (which can have null values) by saving a "null" mask along with them.

Example

import anndata as ad, pandas as pd, numpy as np

a = ad.AnnData(np.ones((3, 3)))

# Works fine
a.obs["np_bool"] = np.zeros(3, dtype=bool)
a.write("tmp.h5ad")

# Errors at write
a.obs["pd_bool"] = a.obs["np_bool"].astype(pd.BooleanDtype())
a.write("tmp.h5ad")
TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Above error raised while writing key 'pd_bool' of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'obs' of <class 'h5py._hl.files.File'> from /.
Full traceback
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    208         try:
--> 209             return func(elem, key, val, *args, **kwargs)
    210         except Exception as e:

~/github/anndata/anndata/_io/h5ad.py in write_series(group, key, series, dataset_kwargs)
    290     else:
--> 291         group[key] = series.values
    292 

/usr/local/lib/python3.8/site-packages/h5py/_hl/group.py in __setitem__(self, name, obj)
    410             else:
--> 411                 ds = self.create_dataset(None, data=obj)
    412                 h5o.link(ds.id, self.id, name, lcpl=lcpl)

/usr/local/lib/python3.8/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    147 
--> 148             dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    149             dset = dataset.Dataset(dsid)

/usr/local/lib/python3.8/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
     88             dtype = numpy.dtype(dtype)
---> 89         tid = h5t.py_create(dtype, logical=1)
     90 

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    208         try:
--> 209             return func(elem, key, val, *args, **kwargs)
    210         except Exception as e:

~/github/anndata/anndata/_io/h5ad.py in write_dataframe(f, key, df, dataset_kwargs)
    264     for col_name, (_, series) in zip(col_names, df.items()):
--> 265         write_series(group, col_name, series, dataset_kwargs=dataset_kwargs)
    266 

~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    211             parent = _get_parent(elem)
--> 212             raise type(e)(
    213                 f"{e}\n\n"

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Above error raised while writing key 'pd_bool' of <class 'h5py._hl.group.Group'> from /.

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-32812d0f937a> in <module>
      1 a.obs["pd_bool"] = a.obs["np_bool"].astype(pd.BooleanDtype())
----> 2 a.write("tmp.h5ad")

~/github/anndata/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
   1877             filename = self.filename
   1878 
-> 1879         _write_h5ad(
   1880             Path(filename),
   1881             self,

~/github/anndata/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
    109         else:
    110             write_attribute(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
--> 111         write_attribute(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
    112         write_attribute(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
    113         write_attribute(f, "obsm", adata.obsm, dataset_kwargs=dataset_kwargs)

/usr/local/Cellar/python@3.8/3.8.6_2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/functools.py in wrapper(*args, **kw)
    873                             '1 positional argument')
    874 
--> 875         return dispatch(args[0].__class__)(*args, **kw)
    876 
    877     funcname = getattr(func, '__name__', 'singledispatch function')

~/github/anndata/anndata/_io/h5ad.py in write_attribute_h5ad(f, key, value, *args, **kwargs)
    130     if key in f:
    131         del f[key]
--> 132     _write_method(type(value))(f, key, value, *args, **kwargs)
    133 
    134 

~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    210         except Exception as e:
    211             parent = _get_parent(elem)
--> 212             raise type(e)(
    213                 f"{e}\n\n"
    214                 f"Above error raised while writing key {key!r} of {type(elem)}"

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Above error raised while writing key 'pd_bool' of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'obs' of <class 'h5py._hl.files.File'> from /.
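
A minimal sketch of why the traceback ends in an object-dtype error, assuming pandas and numpy only (no anndata or h5py). When a nullable boolean array holding a missing value is converted to a numpy array, which is roughly what happens to `series.values` on its way into h5py, the result is an object array, and HDF5 has no native type for that.

```python
import numpy as np
import pandas as pd

# A plain numpy bool array keeps a native dtype that HDF5 can store.
plain = np.zeros(3, dtype=bool)

# A pandas nullable boolean array with one missing entry.
nullable = pd.array([True, None, False], dtype="boolean")

# Converting it to numpy without naming a target dtype falls back to
# object dtype, because numpy's bool has no slot for pd.NA.
as_numpy = np.asarray(nullable)

print(plain.dtype)
print(as_numpy.dtype)
```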

I have a report from the wild where writing worked here, but reading (by cellxgene) failed.

@ivirshup ivirshup added the bug label Feb 9, 2021
@ivirshup
Member Author

ivirshup commented Feb 9, 2021

This is mildly complicated by h5py using a special enumerated dtype for numpy bool (possibly since numpy does not pack bool arrays into bit arrays). BooleanDtype supports nullable values. Possible solutions:

  • We don't support nullable boolean values, just try to convert BooleanDtype to np.bool.
  • Define our own hdf5 enum for pandas booleans (might need to do this for zarr as well)
  • Do the Julia thing: store an indicator bit array on the side for dtypes without canonical "missing" values.
    • This may have to be a boolean array since numpy doesn't do bit arrays.

I like the third option since it's backend neutral, and doesn't require doing anything fancy with hdf5 or zarr.
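
The third option could be sketched roughly as follows, using pandas and numpy only. The helper names `split_nullable` and `join_nullable` are hypothetical, and this is not anndata's on-disk format. It only illustrates storing plain values next to a boolean mask, each of which is an ordinary array that a backend like hdf5 or zarr can hold.

```python
import numpy as np
import pandas as pd


def split_nullable(arr):
    """Split a pandas nullable bool/int array into (values, mask).

    Hypothetical helper: `mask` is True where the entry is missing, and
    missing slots in `values` are filled with a placeholder (False or 0).
    """
    mask = np.asarray(arr.isna(), dtype=bool)
    np_dtype = arr.dtype.numpy_dtype  # e.g. bool for "boolean", int64 for "Int64"
    fill = np_dtype.type(0)           # placeholder stored under masked entries
    values = arr.to_numpy(dtype=np_dtype, na_value=fill)
    return values, mask


def join_nullable(values, mask):
    """Rebuild the pandas nullable array from plain values and a mask."""
    if values.dtype == bool:
        return pd.arrays.BooleanArray(values.copy(), mask.copy())
    return pd.arrays.IntegerArray(values.copy(), mask.copy())


flags = pd.array([True, None, False], dtype="boolean")
flag_values, flag_mask = split_nullable(flags)
flags_back = join_nullable(flag_values, flag_mask)

counts = pd.array([1, None, 3], dtype="Int64")
count_values, count_mask = split_nullable(counts)
counts_back = join_nullable(count_values, count_mask)
```

Both `values` and `mask` are plain numpy arrays, so nothing backend-specific is needed to write them side by side.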

I suspect this issue will come up for nullable integer types as well. Maybe strings?

@grst
Contributor

grst commented Mar 31, 2021

Option 3 would basically be a masked array?

@ivirshup
Member Author

Structurally, I think so, but concepts like assignment differ. I don't think we'd actually use that module.

@ivirshup ivirshup changed the title Error writing pandas BooleanDtype Error writing pandas BooleanDtype (support for nullable bool, int, str) Nov 8, 2021
@ivirshup ivirshup changed the title Error writing pandas BooleanDtype (support for nullable bool, int, str) Support for nullable bool, int, str columns in dataframes Nov 9, 2021
@ivirshup ivirshup modified the milestones: 0.8, 0.7.7 Nov 9, 2021
@vitkl

vitkl commented Nov 17, 2021

I am wondering what's the progress on this issue. It is very annoying when analysis results don't get saved after several hours of work on HPC because a new column popped up with unsave-able object type (in a script that worked just fine the other day, e.g. no need to test for save-ability). So I would really appreciate if this is addressed.

Maybe you can do a temporary workaround that converts such objects to strings with a warning?
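
A rough sketch of that workaround, assuming only pandas. The helper name `stringify_problem_columns` and its rule for what counts as a problem column are assumptions made for illustration, not anndata behavior. Converting to strings also means missing values may end up as literal text, so the null information can be lost.

```python
import warnings

import pandas as pd
from pandas.api.types import is_extension_array_dtype, is_object_dtype


def stringify_problem_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with hard-to-write columns converted to strings.

    Hypothetical helper. A column counts as a problem if it is object dtype
    with non-string values, or an extension dtype such as "boolean"/"Int64".
    """
    out = df.copy()
    for name in df.columns:
        col = df[name]
        # Object columns that hold anything other than str values.
        mixed_object = is_object_dtype(col.dtype) and not all(
            isinstance(v, str) for v in col
        )
        # Extension dtypes; categoricals and pandas string columns are left alone.
        extension = is_extension_array_dtype(col.dtype) and not isinstance(
            col.dtype, (pd.CategoricalDtype, pd.StringDtype)
        )
        if mixed_object or extension:
            warnings.warn(
                f"Column {name!r} ({col.dtype}) converted to strings.",
                UserWarning,
            )
            out[name] = col.astype(str)
    return out


example = pd.DataFrame(
    {
        "score": [1.0, 2.0],
        "flag": pd.array([True, False], dtype="boolean"),
        "mixed": [1, "a"],
    }
)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    converted = stringify_problem_columns(example)
```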

@julie-jch

Hello,
I have 2 adata files that I'm trying to concatenate using the join='outer' parameter. When I try to write/save my adata I get the following error:

`TypeError: Can't implicitly convert non-string objects to strings

Above error raised while writing key 'mt-0' of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'var' of <class 'h5py._hl.files.File'> from /.`

When I concatenate without this join parameter I can write/save with no problems.
Is there any fix or workaround for this?
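
One plausible cause, sketched with pandas only (the column name `mt` and both frames are made up for illustration): an outer join fills rows that lack a boolean column with NaN, which turns the column into object dtype mixing booleans and floats. Whether this is what happened in the report above is an assumption.

```python
import pandas as pd

# Only the first frame has the boolean "mt" column.
left = pd.DataFrame({"mt": [True, False]}, index=["g1", "g2"])
right = pd.DataFrame({"n_cells": [5]}, index=["g3"])

# The outer join keeps both columns and fills the gaps with NaN,
# so "mt" can no longer be a plain bool column.
merged = pd.concat([left, right], join="outer")
print(merged["mt"].dtype)

# Casting to the nullable boolean dtype keeps the missing entry explicit.
as_nullable = merged["mt"].astype("boolean")
print(as_nullable)
```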

@vitkl

vitkl commented Dec 14, 2021

I generally convert all problematic variables to strings: obs['x'].astype(str)

@julie-jch

Thank you, I'll try this.

@ivirshup
Member Author

> I am wondering what's the progress on this issue.

Specifically for nullable values, there should be a release candidate out before the holidays.

> It is very annoying when analysis results don't get saved after several hours of work on HPC because a new column popped up with unsave-able object type

I think this will be a possibility for the foreseeable future. It's just the nature of arrays interacting with Python's object system.

@ivirshup
Member Author

I'm going to make strings their own issue, and close this since the ints and bools are now supported.

@ivirshup ivirshup changed the title Support for nullable bool, int, str columns in dataframes Support for nullable bool, int in dataframes Jan 11, 2022
@brianpenghe

> I generally convert all problematic variables to strings obs['x'].astype(str)

Thanks for the suggestion. In fact, var also needs to be converted.


5 participants