Idea: __dataframe__ interchange protocol for anndata #1111

Open
ivirshup opened this issue Aug 29, 2023 · 2 comments

ivirshup (Member) commented Aug 29, 2023

Please describe your wishes and possible alternatives to achieve the desired result.

https://data-apis.org/dataframe-protocol/latest/index.html

It could be nice if AnnData supported the __dataframe__ interchange protocol, especially for consuming libraries that use the select_columns_by_name and get_column_by_name interfaces.
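For reference, here is a minimal sketch of what those two interfaces look like from the consumer side, using a plain pandas DataFrame as the producer (pandas already implements the protocol). The column names are made up for illustration; nothing here is anndata-specific.

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

# Illustrative data only; any object with __dataframe__ could stand in here.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3], "c": [10, 20, 30]})

# The producer hands out an interchange object.
xchg = df.__dataframe__()

# A consumer that only needs some columns can ask for just those...
subset = xchg.select_columns_by_name(["a", "c"])

# ...or fetch a single column without touching the others.
col_b = xchg.get_column_by_name("b")

# Converting the subset to pandas only involves the selected columns.
small = from_dataframe(subset)
print(list(small.columns))
```

A consumer written this way would never need the producer to materialize columns it does not plot, which is the property that would make this attractive for AnnData.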

Use-case: plotting

The biggest use case is plotting. Both seaborn (mwaskom/seaborn#3369) and altair (vega/altair#2888) now support inputs in the dataframe protocol.

In scanpy we typically use the sc.get.obs_df method to create a dataframe for plotting. A major pain point in analysis code is that the user has to provide the keys they want to plot twice: once to create the dataframe, and again to the plotting interface. Instead of having to do:

sns.jointplot(
    data=sc.get.obs_df(adata, ["log1p_total_counts", "pct_counts_mito", "batch"]),
    x="log1p_total_counts",
    y="pct_counts_mito",
    hue="batch",
)

It could eventually be:

sns.jointplot(
    data=adata,  # Likely something more like `DFInterface(adata, dim="obs", layer=...)` for now
    x="log1p_total_counts",
    y="pct_counts_mito",
    hue="batch",
)

This should also work for plots of gene expression values, especially if the underlying plotting library selects columns through the dataframe interface and the matrix was stored as CSC or dense.

This could even be a nice interface to on-disk data, especially when X / layers are stored as CSC.
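To illustrate why CSC storage suits per-column access, here is a pure-Python sketch of reading one column out of the three CSC arrays. The function name `csc_column` and the toy matrix are hypothetical; a real implementation would go through scipy.sparse or the on-disk arrays rather than Python lists.

```python
def csc_column(data, indices, indptr, n_rows, j):
    """Return column j of a CSC matrix as a dense list.

    Only the slice indptr[j]:indptr[j + 1] of `data` and `indices` is read,
    so the cost depends on the non-zeros in that one column.
    """
    col = [0.0] * n_rows
    for k in range(indptr[j], indptr[j + 1]):
        col[indices[k]] = data[k]
    return col


# Toy 3 x 3 matrix:
# [[1, 0, 2],
#  [0, 0, 3],
#  [4, 0, 0]]
data = [1.0, 4.0, 2.0, 3.0]   # non-zero values, column by column
indices = [0, 2, 0, 1]        # row index of each value
indptr = [0, 2, 2, 4]         # where each column starts in `data`

print(csc_column(data, indices, indptr, n_rows=3, j=2))
```

With CSR the same request would have to scan every row, which is why the storage format matters for a column-oriented interface.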

Some more detail

  • For the dataframe interface over observations, the available columns are the union of .obs.columns, var_names, and keys like obsm/pca/0.
  • We should be able to pick an alias for var_names
  • We should be able to choose which layer is being accessed
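One way the three points above could fit together is a small name resolver. This is only a sketch under assumptions: the function `resolve_obs_key`, its arguments, the `obsm/<name>/<index>` parsing, and the choice to raise on ambiguous names are all hypothetical and not part of anndata.

```python
def resolve_obs_key(key, obs_columns, var_names, obsm_widths,
                    var_alias=None, layer=None):
    """Map a column name to a (source, key, index) triple.

    var_alias: optional labels aligned with var_names (e.g. gene symbols).
    obsm_widths: mapping of obsm key to its number of columns.
    """
    labels = list(var_alias) if var_alias is not None else list(var_names)
    in_obs = key in obs_columns
    in_var = key in labels
    if in_obs and in_var:
        # Assumption: refuse to guess when a name exists in both places.
        raise KeyError(f"{key!r} is in both obs columns and variable labels")
    if in_obs:
        return ("obs", key, None)
    if in_var:
        # Translate the alias back to the real var_name; read from the
        # chosen layer, or X when no layer was requested.
        return (layer or "X", var_names[labels.index(key)], None)
    parts = key.split("/")
    if len(parts) == 3 and parts[0] == "obsm" and parts[2].isdigit():
        name, idx = parts[1], int(parts[2])
        if name in obsm_widths and idx < obsm_widths[name]:
            return ("obsm", name, idx)
    raise KeyError(key)


obs_columns = ["batch", "pct_counts_mito"]
var_names = ["ENSG1", "ENSG2"]
obsm_widths = {"pca": 2}

print(resolve_obs_key("batch", obs_columns, var_names, obsm_widths))
print(resolve_obs_key("CD4", obs_columns, var_names, obsm_widths,
                      var_alias=["CD3E", "CD4"], layer="counts"))
print(resolve_obs_key("obsm/pca/1", obs_columns, var_names, obsm_widths))
```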

Implementation

I think it would make sense for this to start out as a POC outside of the main implementation. It may require pyarrow as a dependency to work. In theory pyarrow will be a dependency of pandas v3 early next year, so this may not be an issue.

cc: @ilan-gold

ivirshup (Member, Author) commented Sep 5, 2023

Very rough proof of concept:

import pandas as pd
from pandas.core.interchange.column import PandasColumn
from pandas.core.interchange.dataframe import PandasDataFrameXchg

import anndata as ad
import scanpy as sc

class ObsDF(pd.core.interchange.dataframe_protocol.DataFrame):
    def __init__(self, adata: ad.AnnData, layer: str | None = None, allow_copy: bool = True):
        self.adata = adata
        self.layer = layer
        self.allow_copy = allow_copy

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        return ObsDF(self.adata, self.layer, allow_copy=allow_copy)

    @property
    def metadata(self) -> dict[str, pd.Index]:
        # `index` isn't a regular column, and the protocol doesn't support row
        # labels - so we export it as Pandas-specific metadata here.
        return {"pandas.index": self.adata.obs_names}

    def get_chunks(self, n_chunks=None):
        if n_chunks and n_chunks > 1:
            size = self.adata.n_obs  # number of rows to split across chunks
            step = size // n_chunks
            if size % n_chunks != 0:
                step += 1
            for start in range(0, step * n_chunks, step):
                yield ObsDF(
                    self.adata[start : start + step, :],
                    layer=self.layer,
                    allow_copy=self.allow_copy,
                )
        else:
            yield self

    def get_columns(self):
        raise NotImplementedError()

    def column_names(self):
        return list(self.adata.obs.columns) + list(self.adata.var_names)

    def num_chunks(self):
        return 1

    def get_column_by_name(self, name: str):
        return PandasColumn(pd.Series(self.adata.obs_vector(name, layer=self.layer), index=self.adata.obs_names))

    def get_column(self, i: int):
        return self.get_column_by_name(self.column_names()[i])

    def num_columns(self) -> int:
        return len(self.column_names())

    def num_rows(self) -> int:
        return self.adata.n_obs

    def select_columns_by_name(self, names: list[str]):
        return PandasDataFrameXchg(sc.get.obs_df(self.adata, names, layer=self.layer))

    def select_columns(self, indices):
        all_names = self.column_names()
        return self.select_columns_by_name([all_names[i] for i in indices])
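The chunk arithmetic in get_chunks is easy to misread, so here is a standalone sketch of the row ranges it would produce. The helper name `chunk_bounds` is hypothetical; it mirrors the step calculation in the proof of concept and clips the ranges to the number of rows, which slicing the AnnData object does implicitly.

```python
def chunk_bounds(size, n_chunks):
    """Return (start, stop) row ranges for splitting `size` rows into chunks."""
    step = size // n_chunks
    if size % n_chunks != 0:
        step += 1  # round up so n_chunks steps cover every row
    return [
        (min(start, size), min(start + step, size))
        for start in range(0, step * n_chunks, step)
    ]


print(chunk_bounds(10, 4))
print(chunk_bounds(5, 4))
```

Note that when the rounded-up step overshoots (5 rows in 4 chunks, for example), the trailing range is empty, so a consumer may receive a chunk with zero rows.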

Looks like altair/ data fusion currently don't support the protocol well enough for us to be able to use them.

ivirshup (Member, Author) commented Sep 5, 2023

Sadly, it looks like the same is true for seaborn. It just uses the interchange protocol to convert whatever type you pass into a pandas dataframe.
