Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

address issue with __dataframe_namespace__ and non-compliant dataframes #169

Closed
rgommers opened this issue May 11, 2023 · 13 comments · Fixed by #172
Closed

address issue with __dataframe_namespace__ and non-compliant dataframes #169

rgommers opened this issue May 11, 2023 · 13 comments · Fixed by #172
Assignees
Labels
API design bug Something isn't working

Comments

@rgommers
Copy link
Member

The __dataframe_namespace__ method is currently modeled on the corresponding method in the array API. However, we're in a slightly different situation here, with a separate developer-facing API. There is a need to convert to/from the DataFrame and namespace from a non-standard-compliant user facing dataframe object. We already have the reverse, but this bit is missing - see #156 (comment).

Probably the pattern should be:

pdx, df_compliant = df.__dataframe_namespace__()
# use API standard ....
return df_compliant.dataframe

when going from public API in a df-using library to the standard-using internals. Inside the internals there should then be only usage of the standard-compliant dataframe ideally.

@rgommers rgommers added bug Something isn't working API design labels May 11, 2023
@rgommers rgommers self-assigned this May 11, 2023
@jbrockmendel
Copy link
Contributor

Inside the internals there should then be only usage of the standard-compliant dataframe ideally.

Can you expand on this? This sounds like the exact opposite of what I expect to be actually implemented in the near term.

@rgommers
Copy link
Member Author

Hmm, I'm not sure why? I think looking at @MarcoGorelli's Seaborn POC (MarcoGorelli/seaborn#1) shows that internally, the API standard methods are used. You kinda have to do that, otherwise how is it going to work with multiple dataframe libraries?

So it's something like:

# in a library that consumes dataframes:

def is_dataframe_api_obj(x):
    return hasattr(x, '__dataframe_namespace__')

def func1(df):
    if is_dataframe_api_obj(df):
        pdx, dfx = df.__dataframe_namespace__()
    else:
        # do whatever the library did pre-standard here if we can't convert, or raise?
        # a shim library a la array_api_compat should help to ensure we don't
        # really need a second code path here for, say pandas 1.5.0

    return _func2(dfx).dataframe   # same df type as input

def _func2(dfx):
    "internal function, use only API standard methods on dfx"
    ....
    return dfx_out

@jbrockmendel
Copy link
Contributor

The tail end of the OP reads to me like it makes a difference whether pd.DataFrame is built on top of pd.consortium.DataFrame or the other way around. I don't think it should make a difference for seaborn.

@rgommers
Copy link
Member Author

The tail end of the OP reads to me like it makes a difference whether pd.DataFrame is built on top of pd.consortium.DataFrame or the other way around.

A few thoughts:

  • my rough sketch above is the long-term picture. In the nearer term, I think there's a separate standalone pure Python package that pandas does not need to know about, and that provides a compliant implementation and the required to/from pd.DataFrame helpers
  • long-term, either one works, it doesn't really matter. I assume that won't be two separate full implementations of underlying functionality, they're simply two Python-level APIs, with the new standards-based one being library author focused
  • given that the non-standard df's already exist, I expect that that will remain the primary (compiled-code-carrying etc.) implementation

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented May 11, 2023

I think all you need is something like

df_compliant = df.__dataframe_standard__()

Then, if you want the namespace, you can do

namespace = df_compliant.__dataframe_namespace__()

and then use it as

namespace.concat([df_compliant, other_df_compliant])

EDIT: the suggestion in the original post seems fine too, I just think this might be a bit cleaner than returning a tuple. Both are fine though
EDIT2: actually, one issue with non-compliant having __dataframe_namespace__ is that it prevents __dataframe_namespace__ from being used if you want to check whether a DataFrame is compliant or not

@MarcoGorelli
Copy link
Contributor

Any thoughts on the above?

I'd suggest:

  • non-compliant DataFrames (e.g. pandas.DataFrame, polars.DataFrame, other_lib.DataFrame, ...) have __dataframe_standard__, which returns a compliant DataFrame
  • compliant DataFrames have __dataframe_namespace, which returns the namespace and can be used to check if they are compliant

@jbrockmendel
Copy link
Contributor

dataframe_standard->dataframe_consortium

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented May 17, 2023

let's go further - how about __consortium_api__?

naming aside, OK for separating between __consortium_api__ and __dataframe_namespace__, and have the former just return a Standard-compliant DataFrame, and the latter just return the Standard's namespace?

EDIT: having slept on it, I do like __dataframe_consortium__

@rgommers
Copy link
Member Author

EDIT: the suggestion in the original post seems fine too, I just think this might be a bit cleaner than returning a tuple.

Agreed.

EDIT2: actually, one issue with non-compliant having __dataframe_namespace__ is that it prevents __dataframe_namespace__ from being used if you want to check whether a DataFrame is compliant or not

I agree with this too - and that is already documented indeed: https://data-apis.org/dataframe-api/draft/purpose_and_scope.html#checking-a-dataframe-object-for-compliance.

EDIT: having slept on it, I do like __dataframe_consortium__

I think that reads a little strangely. It's the name of an organization, and a pretty long/cumbersome name at that (I list my affiliation as data-apis.org sometimes, because the full name is too long). The semantics of this are "give me the standard-compliant version of this dataframe", so I like the original __dataframe_standard__ much better, or otherwise perhaps something like __dataframe_api_compliant__ - some name that hints at what it is/does, rather than the org that helped design it.

@MarcoGorelli
Copy link
Contributor

I like the original dataframe_standard much better

Sure, shall we go back to that then?

@jbrockmendel what's your objection to dataframe_standard?

I don't really mind, hardly anyone would see this method anyway, just hoping we can agree on something and document it

@jbrockmendel
Copy link
Contributor

"standard" implies everything else is non-standard, which has normative connotations.

@rgommers
Copy link
Member Author

The whole spec is called dataframe API standard though? It's just what the thing is? I don't understand the reasoning, but do you like __dataframe_api_compliant__ better?

@jbrockmendel
Copy link
Contributor

never mind, i dont care that much

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants