Add a summary document for the dataframe interchange protocol #30

Closed

Conversation

@rgommers (Member) commented Sep 15, 2020

Summarizes the various discussions about, and the goals/non-goals and requirements for, the `__dataframe__` data interchange protocol.

The intended audience for this document is Consortium members and dataframe library maintainers who may want to support this protocol. @datapythonista will add a companion document that's a more gentle introduction/tutorial in a "from zero to a protocol" style.

The aim is to keep updating this until we have captured all the requirements and answered all the FAQs, so that we can then design the protocol and verify that it meets all our requirements.

Closes gh-29, xref gh-25.

The limitations seem to be:
- No device support (@kkraus14 will bring this up on the Arrow dev mailing list)
- Specific to columnar data (_at least, this is what its docs say_).
TODO: are there any concerns for, e.g., Koalas or Ibis.
Member

I am not sure what this limitation points at. Isn't columnar data the scope of this proposal? (above it says "treat dataframes as a collection of columns")

Member Author

Yes true, maybe the TODO belongs more with that first statement, or it's a non-issue. I put it here because the Arrow doc is so adamant about only dealing with columnar data, so I thought about it when writing this section.

I think this all still works fine for a row-based dataframe, it's just a more expensive conversion?

Member

In my head, this whole document is about columnar conversions? But maybe that is a wrong assumption? (in any case something to discuss then in the meeting tomorrow)

For example, the "Possible direction for implementation" (the comment of @kkraus14 in the other issue) is also about a columnar exchange.
(And indeed, your dataframe doesn't necessarily need to be implemented in a columnar fashion to support a column-based exchange protocol; it might just be less efficient.)

Other exchange types, like row-based, might also be interesting, but I think they will be sufficiently different that they warrant a separate discussion / document.

Member Author

> In my head, this whole document is about columnar conversions?

I think you're right, it's just a matter of being careful about terminology. It's about columnar conversions, and treating dataframes as a collection of columns from the user's point of view, but I think we should be careful to avoid implying that the implementation must use columnar storage, or that a column is an array.

Comment on lines +116 to +118
What we are aiming for is quite similar to the Arrow C Data Interface (see
the [rationale for the Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html#rationale)),
except `__dataframe__` is a Python-level rather than C-level interface.
Collaborator

One key thing is that the Arrow C Data Interface relies on providing a deletion / finalization method, similar to DLPack. That is something that hasn't been discussed much yet, but we should iron it out for this proposal.

Member Author

Hmm, that's a good one. Since we're discussing a Python-level API, a deletion / finalization method seems a bit "foreign" / C-specific. But I agree that it's important. I have to admit I haven't fully figured out what all the observable behaviour differences to a Python user are between the deleter method and refcounting - should write a set of tests for that.
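
A minimal sketch of the two lifetime-management models being contrasted here; all class and method names below are hypothetical, not part of any proposal:

```python
# Sketch only: contrasting refcount-based lifetime management with an explicit
# release callback (Arrow C Data Interface / DLPack style). Names are made up.

class RefcountedColumn:
    """Memory stays alive as long as any Python reference to this object exists;
    the interpreter frees it implicitly via refcounting/GC."""

    def __init__(self, buffer):
        self._buffer = buffer  # e.g. a NumPy array owning the memory


class ReleasableColumn:
    """The producer hands out memory plus a release callback; the consumer must
    call release() exactly once when it is done with the memory."""

    def __init__(self, buffer, release_callback):
        self._buffer = buffer
        self._release = release_callback

    def release(self):
        if self._release is not None:
            self._release()      # producer frees / unpins the underlying memory
            self._release = None
            self._buffer = None
```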

Member

Another question here: is "Python-level" a design requirement? (if so, probably should be added in the list above)
For example also DLPack, considered as the exchange protocol format for the array standard, is a C-level interface?

(to be clear, I am not very familiar with those aspects. It might also be that you can have a Python interface to a C-level exchange format?)

Member Author

Added a TODO for desired semantics (deletion/finalization vs. buffer protocol type behaviour) here.

> Another question here: is "Python-level" a design requirement? (if so, probably should be added in the list above)

I'd say yes, added to item 1 of the list of design requirements. A C-only interface would probably be asking too much from consumers here, and maximal performance doesn't seem too important compared to having this functionality available in the first place.

Member Author

After looking into the semantics some more, I came to the conclusion that the Arrow spec doesn't define any particular semantics that matter to Python users. The release callback semantics matter a lot to library implementers (both producer and consumer), but in the end it doesn't say anything about whether memory is shared or not. It allows for zero-copy but doesn't mandate it, so the copy/view + mutation ambiguity is similar to the one we had the large discussion about for arrays.

@jorisvandenbossche (Member) left a comment

Thanks for writing this up! That is going to be useful as a basis for further discussions ;)

Is it within the scope of the first iteration to be able to select specific columns to exchange/convert?
E.g. a plotting library currently working with pandas (e.g. plot("x", "y", data=df)) might not need to convert a full dataframe-like object to a pandas DataFrame, but only the specific columns to plot.
For example, in the prototype of wesm/dataframe-protocol#1 this is possible (due to the lazy "proxy" object being returned, which allows converting only a specific column).
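
A minimal sketch of that plotting use case, assuming a hypothetical `get_column` accessor on the object returned by `__dataframe__` (nothing here is agreed API):

```python
import numpy as np

def plot(x, y, data):
    """Hypothetical: only the two needed columns are converted, not the whole
    dataframe. `get_column` is a placeholder accessor, not agreed API."""
    dfx = data.__dataframe__()                 # any dataframe supporting the protocol
    x_values = np.asarray(dfx.get_column(x))   # hypothetical column accessor
    y_values = np.asarray(dfx.get_column(y))
    # ... draw x_values against y_values ...
```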

but doesn't Pandas use object dtype to represent strings?_).
2. Heterogeneous/structured dtypes within a single column does not need to be
supported.
_Rationale: not used a lot, additional design complexity not justified._
Member

The Arrow C Data Interface supports this through the Arrow struct type (not exactly the same as numpy's structured dtype, though).

But we probably need to say something in general about nested types. Because you mention "heterogeneous", but you could e.g. also have a homogeneous list type ("ragged"-array-like).

Member Author

Would anyone have a use for that, and even if that is the case, would there be enough demand that multiple dataframe libraries would implement it in a way that works the same for a user?

I think we should just extend the out-of-scope section to be clearer that nested dtypes, custom/extension dtypes, etc. are all out of scope. And in the in-scope part, be more explicit about the dtypes that are included (I made a summary of the discussions we had, so I didn't spell out things we all seem to agree on and are obvious, like "integer, floating point, ... dtypes are in scope").

Member

I think there would certainly be use cases for it (e.g. cudf already supports nested types, in pandas we are considering adding a list dtype, and I suppose e.g. Koalas (since it is Spark-based) supports nested types as well). But I also think it is certainly fine to leave it out for a first iteration.

(Leveraging the Arrow type definitions / interface would basically give it for free.)

Member Author

I had a look at what Arrow does here. Nested types and a list dtype seem to me to be very similar to object arrays: they're basically a variable-size container that's very flexible; you can stuff pretty much anything into it.

> (leveraging the Arrow type definitions / interface would basically give it for free)

Free as in "we have a spec already", but it seems complex implementation-wise.

> But I also think it is certainly fine to leave it out for a first iteration.

I'd be inclined to do that, and just clarify the statement on this topic a bit.

Member Author

Edited, and added extension dtypes here too; that's in the same bucket, I'd say. Possible to add in the future, but a bridge too far for the first version.

Member

> Nested types and a list dtype seem to me to be very similar to object arrays

I am not sure that is an accurate description (at least on a memory-layout level, but I assume that is what we are talking about).
Object arrays are arrays with pointers to Python objects that can live anywhere in memory (AFAIK?), while nested types in Arrow consist of several contiguous arrays (e.g. a list array is one plain array with the actual values, and one array with the offsets where each list starts).

Also, e.g., a struct type array consists of one array per key of the struct, which is also not that complex implementation-wise, I would say.
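
For concreteness, the Arrow layouts described above can be inspected with pyarrow; a list array exposes a flat values child plus offsets, and a struct array exposes one child array per field:

```python
import pyarrow as pa

lst = pa.array([[1, 2], [], [3, 4, 5]])   # list array
lst.values     # flat, contiguous child array: [1, 2, 3, 4, 5]
lst.offsets    # where each list starts/ends: [0, 2, 2, 5]

struct = pa.array([{"x": 1, "y": 2.5}, {"x": 3, "y": 4.0}])  # struct array
struct.field("x")   # one contiguous child array per field: [1, 3]
```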

Member Author

> Object arrays are arrays with pointers to python objects that can live anywhere in memory (AFAIK?)

Yes indeed.

Instead, it should be dataframe consumers that rely on NumPy or Arrow, since
they are the ones that need such a particular format. So, it can call the
constructor it needs. For example, `x = np.asarray(df['colname'])` (where
`df` supports `__dataframe__`).
Member

This ignores an important aspect brought up in the discussion, I think. One of the arguments for having such dedicated methods is that you might need a NumPy array in a specific memory layout (e.g. because your Cython algorithm requires it). NumPy's type support is less rich than what is found in dataframe libraries (e.g. no categorical, string, decimal, ...), and NumPy doesn't support missing values. So as the consumer you might want to have control over how the conversion is done.

For example in pandas, the to_numpy method has a na_value keyword to control what value (compatible with the numpy dtype) is used for missing values.

This of course doesn't necessarily require a to_numpy method, as we might be able to give this kind of control in other ways. But I think having this kind of control is an important use case (at least for compatibility with numpy-based (or general array-based) libraries).
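
For example, with pandas' existing to_numpy the consumer-side control looks like this:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")    # nullable integer column
s.to_numpy(dtype="float64", na_value=np.nan)  # consumer chooses dtype and fill value
# -> array([ 1.,  2., nan])
```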

Member Author

That's a good point. At the moment na_value is the one thing I see in pandas. Is there anything else that other dataframe libraries have, or that you know may be needed in the future?

Member

Also the actual dtype, I think. For example, if you have a categorical column, do you want to get a "densified" array (e.g. a string array if the categories are strings) or the integer indices? If you have a string column, do you want to get a numpy str dtype or an object dtype array? If you have a decimal column, do you want a numpy float array or an object array with decimal objects? Etc.
Basically, any data type that has no direct mapping to a numpy dtype potentially has multiple options for how to convert it to numpy.
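
For instance, a pandas categorical column already has at least two reasonable NumPy conversions:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

np.asarray(s)           # "densified": array(['a', 'b', 'a'], dtype=object)
s.cat.codes.to_numpy()  # integer indices into the categories: array([0, 1, 0], dtype=int8)
```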

@maartenbreddels

Could we avoid this discussion by solving this in the future? e.g.

x = np.array(df['col'].fill_na(4))
y = np.array(df['col'].as_type(pdx.types.float32))

Member Author

This suggestion makes sense I think @maartenbreddels. Longer-term this does seem like the cleaner solution.

This is actually a bit of a side discussion, triggered by the presence of to_numpy and to_arrow in wesm/dataframe-protocol#1. In that prototype it's actually not connected to __dataframe__. And converting one dataframe into another kind of dataframe is a different goal/conversation than "convert a column into an array". We do not have a design requirement for the latter. So I'd say we clarify that in the doc so that this FAQ item has some more context, and then just leave it out.

Member

> And converting one dataframe into another kind of dataframe is a different goal/conversation than "convert a column into an array"

But often, converting one dataframe into another dataframe will go through converting each column of dataframe 1 to an array, and assembling dataframe 2 from those arrays?
At least, that's how I imagine dataframe exchange would work based on the protocol we are discussing (if you want to avoid specific knowledge about the type of dataframe 1).
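
A rough sketch of that column-by-column path, using placeholder column-access names (get_column / column_names are not a proposed API):

```python
import numpy as np
import pandas as pd

def to_pandas(df_other):
    """Assemble a pandas DataFrame column by column from any object exposing
    __dataframe__. The column-access methods used here are hypothetical."""
    dfx = df_other.__dataframe__()
    return pd.DataFrame(
        {name: np.asarray(dfx.get_column(name)) for name in dfx.column_names()}
    )
```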

Member Author

Yes, that's fair enough. That's completely implicit right now and should probably be made more explicit; that will both make this section make more sense and help when implementing. I'll retract my "this is a different goal" comment and will try to fix this up.

@rgommers (Member Author)

> Is it within the scope of the first iteration to be able to select specific columns to exchange/convert?
> Eg a plotting library now working with pandas (eg plot("x", "y", data=df) might not need to convert to full dataframe-like to a pandas DataFrame, but only specific columns to plot.

I would say yes, I forgot that part. There was (I believe) already agreement on being able to extract columns, and minimal dataframe metadata like number of rows/columns and column names. And it's helpful; if some code knows it's only going to use one or two columns, it would like to construct a dataframe with only those columns rather than converting a complete (possibly very large) dataframe into the format it uses internally.

This topic didn't come up again in the recent discussions, so I forgot to list it. Will update.

@jorisvandenbossche (Member)

Circling back to some of the discussions we had on the last call. One of the (many) aspects we discussed was the specific memory layout, and the potential need to allow different memory layouts instead of standardizing on a single memory layout.

The typical example here is missing value support: there are several ways missing values can be represented (sentinel, boolean mask, bit mask), and conversions between them can be costly.

But are there actually other examples of such cases where different memory layouts might be needed?
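
To make the missing-value example concrete, here is the same logical int8 column (second value missing) under the three representations mentioned above:

```python
import numpy as np

values   = np.array([1, -128, 3], dtype="int8")       # sentinel: -128 marks missing
mask     = np.array([False, True, False])             # boolean mask: True marks missing
validity = np.packbits([1, 0, 1], bitorder="little")  # bit mask (Arrow-style validity bitmap): a 0 bit marks missing
```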

@maartenbreddels

Good point Joris, and more concretely, is there a case that cannot be represented using Arrow?

@jorisvandenbossche (Member) commented Oct 1, 2020

Yes, that is somewhat the reason I am asking. Because in theory, sentinel and boolean mask are representable in numpy, and all three in arrow.
So we could have an API to get an object with the numpy/arrow array interface where you can specify which representation you want (and in the case of a boolean mask, it would return two array-like objects), and a way to inspect the "native" representation.

If missing values are the only such case where we might need multiple memory layouts, I think the combination of numpy and arrow array interface should be able to cover most/all use cases.

(to clarify, I am using "numpy array interface" here to mean __array_interface__ representation in numpy, which doesn't need to be tied to the numpy package)

@kkraus14 (Collaborator) commented Oct 1, 2020

How could one specify a sentinel value using __array_interface__ or __arrow_array_interface__ today?

@jorisvandenbossche (Member)

I am assuming here that we would have a method in the standard that returns an object with the array interface (let's call it to_array_interface() as a placeholder for now).
But that's something to discuss of course; there have been varying proposals on this front. Personally I think we need some way to specify certain conversion options, whether it is directly in __dataframe__(..., keyword=..), or in a method on the object returned by __dataframe__.

But so if we have a method like to_array_interface() to get back an object that supports __array_interface__ (or that directly returns the contents of __array_interface__), it could have keywords like na_value=..., or return_mask=False/True, or ..
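
Putting that placeholder together, usage might look roughly like this (to_array_interface and its keywords are the placeholders from above, not agreed API):

```python
import numpy as np

# Placeholder names from the comment above; none of this is agreed API.
# Assume `df` is some dataframe object supporting __dataframe__.
col = df.__dataframe__()['colname']                       # some way to get one column

arr = col.to_array_interface(na_value=np.nan)             # sentinel-filled values array
values, mask = col.to_array_interface(return_mask=True)   # values plus a boolean mask
```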

@kkraus14 (Collaborator) commented Oct 1, 2020

This brings us back to the problem of giving out memory as it exists if it's in a compatible format. I.e., if a library uses -128 as a sentinel value in an int8 column, how can it convey that information to a downstream library, as opposed to the downstream library specifying "I want an array with -128 as the sentinel value"?

@jorisvandenbossche (Member)

In my comment above #30 (comment), I have ".. and a way to inspect the "native" representation.".
We could have a standard way to ask the object what its missing value representation is (exact API to be discussed).

And then we can still choose whether the default behaviour of to_array_interface() (without keywords specified by the user) is to return the native representation, or to follow a chosen standard as default for the keywords. But in both cases the caller of the API can inspect the native representation, and based on that (if they want) ask to get the array in its native representation.

If we want to go the route of allowing multiple possible memory layouts (instead of standardizing on a single memory layout), I think it is inevitable that we need to provide both the ability to inspect the format of the object and the ability to receive the data in a specific format.
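
A sketch of what that inspection could look like for the -128 sentinel example above (all names hypothetical):

```python
# Hypothetical inspection API; all names are illustrative only.
col = df.__dataframe__()['colname']      # an int8 column that uses -128 as its sentinel

kind, value = col.describe_null()        # e.g. ("sentinel", -128): the native representation
if kind == "sentinel":
    # Asking for exactly the native representation could then be zero-copy.
    arr = col.to_array_interface(na_value=value)
```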

@jorisvandenbossche (Member)

In the last dataframe meeting, we spent quite some time discussing another potential requirement: being able to exchange the data in chunks. (I don't know if there are notes from this meeting? I don't see them at https://github.com/data-apis/workgroup/tree/master/meeting_minutes, and they are also not in the hackmd anymore.)
Some notes about this (from before the meeting) are at #29 (comment) by @maartenbreddels
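
For reference, a chunked exchange might look roughly like this (get_chunks is a hypothetical name, just to pin down what "exchange in chunks" means):

```python
def consume_in_chunks(df_other, process):
    """Hypothetical: the producer hands out the data chunk by chunk, so the
    consumer never needs to materialize the full dataframe at once."""
    dfx = df_other.__dataframe__()
    for chunk in dfx.get_chunks():   # hypothetical chunk iterator
        process(chunk)
```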

@rgommers (Member Author) commented Nov 5, 2020

> I don't know if there are notes from this meeting?

Thanks for the reminder about this topic. Yes, there are notes. I've been out of the running for the last 6 weeks, so I haven't opened any PRs with meeting minutes since mid-September; will try to get to it within the next few days. Or at least summarize the chunk discussion from the raw notes.

EDIT: those meeting minutes are at https://github.com/data-apis/workgroup/blob/main/meeting_minutes/workgroup_meeting_2020_09_17.md

@rgommers closed this on Nov 14, 2020
@rgommers deleted the master branch on Nov 14, 2020, 17:30
@rgommers (Member Author)

Ah, this PR was auto-closed by moving master to main, will fix up tomorrow morning.
