Idea: `__dataframe__` interchange protocol for anndata #1111

ivirshup · 2023-08-29T12:11:35Z

Please describe your wishes and possible alternatives to achieve the desired result.

https://data-apis.org/dataframe-protocol/latest/index.html

It would be nice if AnnData supported the __dataframe__ interchange protocol, especially for consumer libraries that go through the select_columns_by_name and get_column_by_name interfaces.

Use-case: plotting

The biggest use case is plotting. Both seaborn (mwaskom/seaborn#3369) and altair (vega/altair#2888) now support inputs in the dataframe protocol.

In scanpy we typically use the sc.get.obs_df method to create a dataframe for plotting. A major pain point in analysis code is that the user has to provide the keys they want to plot twice: once to create the dataframe, and again to the plotting interface. Instead of having to do:

sns.jointplot(
    data=sc.get.obs_df(adata, ["log1p_total_counts", "pct_counts_mito", "batch"]),
    x="log1p_total_counts",
    y="pct_counts_mito",
    hue="batch",
)

It could eventually be:

sns.jointplot(
    data=adata,  # Likely something more like `DFInterface(adata, dim="obs", layer=...)` for now
    x="log1p_total_counts",
    y="pct_counts_mito",
    hue="batch",
)

This should also work for plots of gene expression values, especially if the underlying plotting library selects columns through the dataframe interface and the matrix was stored as CSC or dense.

This could even be a nice interface to on-disk data, especially when X or the selected layer is stored as CSC.
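To illustrate why CSC matters here: fetching one variable (one column) from a CSC matrix only touches that column's slice of the underlying buffers. A minimal sketch, assuming scipy is available; `column_from_csc` is a hypothetical helper written for this example, not an anndata or scanpy function, and the data is random.

```python
import numpy as np
from scipy import sparse

# Random stand-in for an expression matrix (observations x variables).
rng = np.random.default_rng(0)
dense = rng.poisson(0.3, size=(1000, 50)).astype(np.float32)
X_csc = sparse.csc_matrix(dense)


def column_from_csc(X, j):
    """Return column j as a dense 1-D array.

    Only the entries between indptr[j] and indptr[j + 1] are read,
    so the cost scales with the non-zeros of that single column.
    """
    start, end = X.indptr[j], X.indptr[j + 1]
    out = np.zeros(X.shape[0], dtype=X.dtype)
    out[X.indices[start:end]] = X.data[start:end]
    return out


gene_7 = column_from_csc(X_csc, 7)
```

With CSR storage the same request would have to scan every row, which is why a per-column interface pairs best with CSC or dense layouts.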

Some more detail

For the dataframe interface over observations, the available columns are the union of .obs.columns, var_names, and keys like obsm/pca/0.
We should be able to pick an alias for var_names.
We should be able to choose which layer is being accessed.
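The lookup described above could be sketched roughly as follows. Everything here is an assumption for illustration: `resolve_obs_column`, its lookup order, the `var_alias` mapping, and the toy identifiers are hypothetical, and plain pandas/numpy containers stand in for an AnnData object.

```python
import numpy as np
import pandas as pd


def resolve_obs_column(name, obs, var_names, X, layers, obsm,
                       layer=None, var_alias=None):
    """Hypothetical resolver: obs columns, then obsm/<key>/<i>, then variables."""
    if name in obs.columns:
        return obs[name].to_numpy()
    if name.startswith("obsm/"):
        _, key, idx = name.split("/")
        return np.asarray(obsm[key])[:, int(idx)]
    # var_alias maps user-facing names (e.g. gene symbols) onto var_names.
    var_name = var_alias.get(name, name) if var_alias else name
    matrix = X if layer is None else layers[layer]
    j = pd.Index(var_names).get_loc(var_name)  # raises KeyError if unknown
    return np.asarray(matrix)[:, j]


# Toy data; the identifiers are made up.
obs = pd.DataFrame({"batch": ["a", "b", "a"]}, index=["c0", "c1", "c2"])
var_names = pd.Index(["ENSG01", "ENSG02"])
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
layers = {"counts": X * 10}
obsm = {"pca": np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])}
alias = {"GeneB": "ENSG02"}

batch = resolve_obs_column("batch", obs, var_names, X, layers, obsm)
pc0 = resolve_obs_column("obsm/pca/0", obs, var_names, X, layers, obsm)
gene_b = resolve_obs_column("GeneB", obs, var_names, X, layers, obsm,
                            layer="counts", var_alias=alias)
```

One open design question this makes visible is precedence: if an obs column and a variable share a name, some rule (here, obs wins) has to decide which one a key refers to.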

Implementation

I think it would make sense for this to start out as a proof of concept outside of the main implementation. It may require pyarrow as a dependency to work. In theory pyarrow will be a dependency of pandas v3 early next year, so that may not be an issue.

cc: @ilan-gold


ivirshup · 2023-09-05T22:41:30Z

Very rough proof of concept:

from __future__ import annotations

import pandas as pd
from pandas.core.interchange.column import PandasColumn
from pandas.core.interchange.dataframe import PandasDataFrameXchg
from pandas.core.interchange.dataframe_protocol import DataFrame as DataFrameXchg

import anndata as ad
import scanpy as sc


class ObsDF(DataFrameXchg):
    """Interchange-protocol view over the observations of an AnnData object."""

    def __init__(self, adata: ad.AnnData, layer: str | None = None, allow_copy: bool = True):
        self.adata = adata
        self.layer = layer
        self.allow_copy = allow_copy

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        return ObsDF(self.adata, self.layer, allow_copy=allow_copy)

    @property
    def metadata(self) -> dict[str, pd.Index]:
        # `index` isn't a regular column, and the protocol doesn't support row
        # labels - so we export it as Pandas-specific metadata here.
        return {"pandas.index": self.adata.obs_names}

    def get_chunks(self, n_chunks=None):
        if n_chunks and n_chunks > 1:
            size = self.adata.n_obs
            # Ceiling division so that n_chunks slices cover every observation.
            step = size // n_chunks
            if size % n_chunks != 0:
                step += 1
            for start in range(0, step * n_chunks, step):
                yield ObsDF(
                    self.adata[start : start + step, :],
                    layer=self.layer,
                    allow_copy=self.allow_copy,
                )
        else:
            yield self

    def get_columns(self):
        # Left unimplemented: this would materialize every obs column and every variable.
        raise NotImplementedError()

    def column_names(self):
        return list(self.adata.obs.columns) + list(self.adata.var_names)

    def num_chunks(self):
        return 1

    def get_column_by_name(self, name: str):
        # obs_vector resolves `name` against obs columns and var_names.
        values = self.adata.obs_vector(name, layer=self.layer)
        return PandasColumn(
            pd.Series(values, index=self.adata.obs_names),
            allow_copy=self.allow_copy,
        )

    def get_column(self, i: int):
        return self.get_column_by_name(self.column_names()[i])

    def num_columns(self) -> int:
        return len(self.column_names())

    def num_rows(self) -> int:
        return self.adata.n_obs

    def select_columns_by_name(self, names: list[str]):
        # Only the requested keys are pulled into a pandas DataFrame.
        return PandasDataFrameXchg(
            sc.get.obs_df(self.adata, names, layer=self.layer),
            allow_copy=self.allow_copy,
        )

    def select_columns(self, indices):
        all_names = self.column_names()
        return self.select_columns_by_name([all_names[i] for i in indices])

It looks like altair/vegafusion currently don't support the protocol well enough for us to be able to use them:

Push down column selections when using __dataframe__ protocol vega/vegafusion#386

ivirshup · 2023-09-05T22:48:31Z

Sadly, it looks like the same is true for seaborn. It just uses the interchange protocol to convert whatever type you pass into a pandas dataframe.
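The difference between the two consumer behaviours can be shown with pandas alone. This is a sketch under assumptions: the column values are made up, and a plain pandas DataFrame stands in for the producer. An eager consumer converts the whole exchange object and subsets afterwards, while a consumer that pushes the selection down calls select_columns_by_name first, so only the needed columns are ever requested from the producer.

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

# Stand-in producer with one column the plot never uses.
wide = pd.DataFrame(
    {
        "log1p_total_counts": [7.1, 6.8, 7.4],
        "pct_counts_mito": [2.0, 5.5, 3.1],
        "unused": [0, 1, 2],
    }
)
xchg = wide.__dataframe__()
needed = ["log1p_total_counts", "pct_counts_mito"]

# Eager: every column is converted, then the unused one is thrown away.
eager = from_dataframe(xchg)[needed]

# Push-down: the producer is asked for just the columns the plot needs.
pushed_down = from_dataframe(xchg.select_columns_by_name(needed))
```

For a pandas producer the two results are the same and the cost difference is small. For an AnnData-backed producer with thousands of variables, the eager path would request every gene as a column, which is what makes push-down support in the plotting libraries a prerequisite here.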

ivirshup added enhancement type: dataframe 🧮 labels Aug 29, 2023

ivirshup mentioned this issue Sep 7, 2023

Datashader as plotting backend scverse/scanpy#2656
