
Requirements document for the dataframe interchange protocol #35

Merged: 20 commits from dataframe-interchange-protocol into main, Jun 25, 2021

Conversation

@rgommers (Member) commented Jan 5, 2021

Continued from gh-30. Main changes:

Also added one important TBD about the connection with the array data interchange protocol, with a suggestion to add that if and only if the supported dtypes allow it (i.e., only boolean, integer and floating-point dtypes, with missing data as separate boolean masks).

Up-to-date rendered version: https://github.com/data-apis/dataframe-api/blob/dataframe-interchange-protocol/protocol/dataframe_protocol_summary.md

Summarizes the various discussions about, and the goals/non-goals
and requirements for, the `__dataframe__` data interchange
protocol.

The intended audience for this document is Consortium members
and dataframe library maintainers who may want to support this
protocol. @datapythonista will add a companion document that's
a more gentle introduction/tutorial in a "from zero to a protocol"
style.

The aim is to keep updating this until we have captured all
the requirements and answered all the FAQs, so that we can then
design the protocol and verify it meets all our requirements.

Closes gh-29
Also address some smaller review comments.
Also add more details on the Arrow C Data Interface.
Comment on lines 176 to 180
The main (only?) limitation seems to be that it does not have device support
- @kkraus14 will bring this up on the Arrow dev mailing list. Also note that
that interface only talks about arrays; dataframes, chunking and the metadata
inspection can all be layered on top in this Python-level protocol, but are
not discussed in the interface itself.
Collaborator:

I'm -1 on building Python-only pieces on top of the C data interface. Ideally, if we're adopting a C data interface, we should work on extending that interface in the ways we need. If we can't extend such a C data interface, maybe we shouldn't be building on top of it to begin with.

There's a lot of good discussion going on related to DLPack and extending it to support asynchronous device execution:

Member Author (@rgommers):

If we can't extend such a C data interface, maybe we shouldn't be building on top of it to begin with.

Rather than building on top of the C spec, we could adopt the dtype format specification, which seems complete for our needs and well-designed.

Either way, I'm not really clear on why that interface only talks about arrays and not dataframes, I found that surprising given Arrow's focus. It seemed to me like chunking was thought about and a main reason for separating ArrowArray and ArrowSchema, but it's all very implicit. The idea seems to be that both multiple chunks and multiple columns are trivial to implement on top, so let's not bother with writing C structures for them.

Or maybe I'm missing some history here and there are other reasons. @jorisvandenbossche do you know what the story is?

There's a lot of good discussion going on related to DLPack and extending it to support asynchronous device execution

Yep that's starting to look good. Really we need the union of DLPack and the Arrow C Data Interface, plus support for multiple columns and chunking.
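For reference, a few entries from the Arrow dtype format-string specification mentioned above, paraphrased from the linked C Data Interface docs and shown here as a Python mapping purely for illustration:

# A small, non-exhaustive sample of Arrow C Data Interface format strings,
# paraphrased from the spec (not an official mapping from any library).
ARROW_FORMAT_EXAMPLES = {
    "b": "boolean",
    "l": "int64",
    "g": "float64",
    "u": "utf-8 string",
    "tsn:UTC": "timestamp in nanoseconds, timezone 'UTC'",
}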

Member:

Either way, I'm not really clear on why that interface only talks about arrays and not dataframes, I found that surprising given Arrow's focus. It seemed to me like chunking was thought about and a main reason for separating ArrowArray and ArrowSchema, but it's all very implicit. The idea seems to be that both multiple chunks and multiple columns are trivial to implement on top, so let's not bother with writing C structures for them.

Yes, the basic ABI is only an array and schema interface (ArrowArray and ArrowSchema), but those can be used to cover the more complex use cases (chunked arrays, dataframes (record batches)). Arrow itself (the C++ library and its Python and R bindings) already supports exporting/importing an Arrow Table or RecordBatch using the interface.
A very short note about this in the docs is at https://arrow.apache.org/docs/format/CDataInterface.html#record-batches

The array and schema structs are indeed separated to allow sending a single schema for multiple arrays (e.g. to support chunked arrays): https://arrow.apache.org/docs/format/CDataInterface.html#why-two-distinct-structures
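To illustrate the "one schema, many arrays" point with pyarrow (a quick sketch using only public pyarrow APIs; exact printed output may vary by version):

import pyarrow as pa

# A pyarrow Table keeps a single schema while each column is a ChunkedArray,
# i.e. multiple ArrowArray-style chunks sharing one ArrowSchema-style type.
t1 = pa.table({"a": [1, 2], "b": ["x", "y"]})
t2 = pa.table({"a": [3], "b": ["z"]})
tbl = pa.concat_tables([t1, t2])

print(tbl.schema)                  # defined once for the whole table
print(tbl.column("a").num_chunks)  # 2: both chunks reuse the same field type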

Member Author (@rgommers):

A very short notice about this in the docs is at https://arrow.apache.org/docs/format/CDataInterface.html#record-batches

I interpreted "A record batch can be trivially considered as an equivalent struct array with additional top-level metadata" to mean it's about nested dtypes in a single column - that's what a "struct array" in NumPy would be. The example at https://arrow.apache.org/docs/format/CDataInterface.html#exporting-a-struct-float32-utf8-array seems to imply that as well.

@jorisvandenbossche (Member) commented Jan 6, 2021

Yes, but a StructArray is simply a collection of named fields with child arrays (and those child arrays can again be of any type, including nested types), which is basically the same as a "dataframe" (at least in the way we are defining it here for the interchange protocol).
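A small pyarrow sketch of that equivalence (illustrative only):

import pyarrow as pa

# A StructArray is a set of named child arrays, essentially the columns of a
# small "dataframe" in the sense used for the interchange protocol.
struct = pa.StructArray.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["ints", "strs"],
)
print(struct.type)           # struct<ints: int64, strs: string>
print(struct.field("strs"))  # the child array backing the "strs" field

# The same data as a RecordBatch, which the C Data Interface docs describe as
# "an equivalent struct array with additional top-level metadata".
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])], names=["ints", "strs"]
)
print(batch.schema)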

Member Author (@rgommers):

Thanks, that makes sense. That does mean nested/structured dtypes have different memory layouts in Arrow and NumPy, which is worth noting explicitly in the item discussing those dtypes.

Member:

Personally, I don't think we should support "structured dtypes" (record arrays) in numpy's sense (which are not columnar), and when considering nested data, only consider the Arrow-like notion of nested data types.
(but that's maybe a discussion for later, since that item is currently in the list of non-requirements. It is indeed worth noting, though, that numpy's structured dtype is something completely different from arrow's struct type)

Member Author (@rgommers):

I'm not a huge fan of NumPy's way of doing this, but they are columnar. It's more that the definition of "dtype" is different: a NumPy dtype is seen as a single fixed-size unit, and a 1-D array is one contiguous memory block in which elements of that dtype are repeated in adjacent memory locations.

In contrast, a 1-D Arrow array can consist of discontiguous blocks of memory because of its looser definition of "dtype".
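A quick comparison of the two layouts (a sketch; the pyarrow part shows Arrow's struct type, not NumPy's structured dtype):

import numpy as np
import pyarrow as pa

# NumPy structured dtype: each record is one fixed-size block, so the "a" and
# "b" fields are interleaved in a single contiguous buffer.
rec = np.zeros(3, dtype=[("a", "<i4"), ("b", "<f8")])
print(rec.dtype.itemsize)  # 12 bytes per record (4 + 8)
print(rec.strides)         # (12,): one stride over interleaved records

# Arrow struct type: each child field keeps its own buffer(s), so the fields
# themselves remain columnar and the column as a whole spans several buffers.
struct = pa.StructArray.from_arrays(
    [pa.array([1, 2, 3], type=pa.int32()), pa.array([1.0, 2.0, 3.0])],
    names=["a", "b"],
)
print(struct.type)        # struct<a: int32, b: double>
print(struct.field("a"))  # contiguous int32 child array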

@jorisvandenbossche (Member) commented Jan 7, 2021

Yep, true (with my "not columnar" I wanted to say that the individual "fields" of the nested array are not columnar).

@rgommers force-pushed the dataframe-interchange-protocol branch from 3d1c690 to b465e39 on January 6, 2021 12:45
@rgommers force-pushed the dataframe-interchange-protocol branch from b465e39 to 1708e03 on January 6, 2021 13:19
@rgommers (Member Author) commented Jan 6, 2021

Added an image of the conceptual model of a dataframe including chunks:

[image: conceptual model of a dataframe, including chunks]

as well as a table with an overview of the properties of relevant protocols for an implementation:

  • buffer protocol
  • __array_interface__
  • DLPack
  • Arrow C Data Interface

Rendered version: https://github.com/data-apis/dataframe-api/blob/dataframe-interchange-protocol/protocol/dataframe_protocol_summary.md
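For a rough feel of what a few of those protocols expose on a NumPy array (illustrative only; the DLPack dunder only exists on NumPy >= 1.22, which postdates this comment):

import numpy as np

x = np.arange(4, dtype=np.int64)

# Buffer protocol: zero-copy view on the underlying memory.
mv = memoryview(x)
print(mv.format, mv.shape)   # format char is platform-dependent ('l' or 'q'), shape (4,)

# __array_interface__: Python-level dict describing the memory layout.
ai = x.__array_interface__
print(ai["typestr"], ai["shape"], ai["strides"])  # '<i8', (4,), None

# DLPack: newer NumPy versions also expose __dlpack__ / __dlpack_device__.
if hasattr(x, "__dlpack__"):
    capsule = x.__dlpack__()  # PyCapsule that a consumer's from_dlpack can ingest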

@jorisvandenbossche (Member) left a comment:

Thanks for the update!


| categoricals | (6) | (6) | (7) | (6) |

1. The Python API is only an interface to call the C API under the hood, it
doesn't contain a description of how the data is laid out in memory.
Member:

In that sense, there is also some "python" functionality for the Arrow C Data Interface (e.g. you can, in pure Python, move data from Python to R with rpy2; pyarrow also includes a cffi submodule: https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_cffi.py)

Member Author (@rgommers):

That doesn't look like a standardized API to me, so I'm not sure it's worth mentioning. I'd be looking for something like:

  • from_dlpack + __dlpack__
  • from_dataframe + __dataframe__

i.e. something that is clearly meant as an API which, if an object implements it, does the right thing.
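A minimal consumer-side sketch of that pattern (the from_dataframe name is borrowed from the bullet above; the exact __dataframe__ signature was still undecided at this point):

# Sketch only: the dunder-based interchange pattern, analogous to
# from_dlpack + __dlpack__.
def from_dataframe(obj):
    """Build this library's dataframe from any object exposing __dataframe__."""
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("object does not support the __dataframe__ protocol")
    interchange_df = obj.__dataframe__()  # producer hands over the interchange object
    # A real consumer would now walk the columns/chunks/buffers of
    # `interchange_df` and assemble its own dataframe type from them.
    ...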

Member Author (@rgommers):

To add to that: the DLPack version is just a thin standardized layer that connects to C; it does not contain the actual description of the memory layout. The only protocol that does that is __array_interface__, which has all the data and metadata available in Python; the C side of it is __array_struct__, which contains identical information.

I'll bring this up in the call today. It's important to be clear on what "Python level API" means exactly, and whether we want/need the full Python/C equivalence (I believe that is what we were aiming for, but it got lost a little in this PR and discussion).

Member Author (@rgommers):

The outcome of this was that Python-only is enough; no one had a real need for a C API.

2. Can be done only via separate masks of boolean arrays.
3. `__array_interface__` has a `mask` attribute, which is a separate boolean array also implementing the `__array_interface__` protocol.
4. Only fixed-length strings as sequence of char or unicode.
5. Only NumPy datetime and timedelta, which are limited compared to what the Arrow format offers.
Member:

For datetime and timedelta, I think that's actually quite similar (Arrow additionally supports missing values for them, but that's the same for all data types). On top of that, Arrow also provides date and time data types.

Collaborator:

We also support datetime and timedelta values in cuDF, though, similar to Arrow, we don't support NaT. Is NaT something we should call out here, or does that fit into the bin of specifying missing values?

Member Author (@rgommers):

There are 16 Arrow format strings for datetime functionality in https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings, so I'd hope that does a little more than just 'm', 'M' in https://numpy.org/devdocs/reference/arrays.interface.html#python-side. I've never actually tried using datetime support in __array_interface__, so it's possible I'm missing something.

Member Author (@rgommers):

Ah, I think this explains it - Pandas (and NumPy) use nanosecond precision (hence only one format string for datetime and one for timedelta), while the Arrow strings can change resolution.

NaT in NumPy is nan rather than NA. I can't find it in the Arrow docs, but given there's a separate null type I imagine NaT is nan as well?
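For reference, Arrow parametrizes resolution (and timezone) in the type itself, which is what the many "ts*" format strings encode. A quick pyarrow illustration (output may print slightly differently across versions):

import pyarrow as pa

# Arrow timestamp/duration types carry their unit (and optional timezone),
# so each combination gets its own format string in the C Data Interface.
for unit in ("s", "ms", "us", "ns"):
    print(pa.timestamp(unit))        # timestamp[s], timestamp[ms], ...
print(pa.timestamp("ns", tz="UTC"))  # timestamp[ns, tz=UTC]
print(pa.duration("ms"))             # duration[ms]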

@kkraus14 (Collaborator) commented Jan 6, 2021

Apologies, what I wrote was a bit ambiguous. NumPy / Pandas have NaT, which follows the rules of NaN. Arrow / cuDF have null, which follows Kleene logic null semantics. There is not a separate binary representation of NaT, so NumPy / Pandas use min(int64) to represent it.

Do we give a way for a library to indicate a NaT value as opposed to a null value or is that out of scope?

For Pandas in its current state, would they specify min(int64) as a sentinel value for missing values or is that considered conceptually different than null?

Member:

Ah, I think this explains it - Pandas (and NumPy) use nanosecond precision (hence only one format string for datetime and one for timedelta), while the Arrow strings can change resolution.

Pandas cannot change resolution (at the moment), but numpy has support for all the resolutions (and more) that arrow has. I don't know how that can be represented in the __array_interface__, but apparently it's just using the parametrized "[unit]" syntax there as well:

In [51]: arr = np.array(["2012-01-01", "NaT"], dtype="datetime64[s]")

In [52]: arr.__array_interface__
Out[52]: 
{'data': (93858349892448, False),
 'strides': None,
 'descr': [('', '<M8[s]')],
 'typestr': '<M8[s]',
 'shape': (2,),
 'version': 3}

So that was what I wanted to say in my original comment above: numpy's datetime64 dtypes can cover all arrow's timestamp types (minus support for nulls, as for other types). I think it's mainly lack of support for timezones that makes numpy's datetime64 types more limited compared to arrow's timestamp types.


Regarding NaT, arrow indeed doesn't support that (only standard nulls). Just checked, and apparently we convert NaT to null upon conversion:

In [53]: arr = np.array(["2012-01-01", "NaT"], dtype="datetime64[s]")

In [54]: pa.array(arr)
Out[54]: 
<pyarrow.lib.TimestampArray object at 0x7fe5b7929e50>
[
  2012-01-01 00:00:00,
  null
]

Do we give a way for a library to indicate a NaT value as opposed to a null value or is that out of scope?

For Pandas in its current state, would they specify min(int64) as a sentinel value for missing values or is that considered conceptually different than null?

I suppose this is covered by number 10 of the requirements currently:

  10. Must allow the consumer to inspect the representation for missing values
      that the producer uses for each column or data type.
      Rationale: this enables the consumer to control how conversion happens,
      for example if the producer uses -128 as a sentinel value in an int8
      column while the consumer uses a separate bit mask, that information
      allows the consumer to make this mapping.

That of course only communicates the memory representation, and not the actual semantics (NaN-like semantics vs null-like semantics). I am not sure we should get into that ...

Also for the float dtypes you could ask that question: pandas (mis)uses NaN as a missing value indicator. So in this exchange interface, should it indicate that it uses NaN as the null sentinel? A float column coming from cudf, on the other hand, would not indicate NaN as the sentinel, since cudf/arrow uses a bitmask (but it can still contain NaNs).
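To make that concrete, a small consumer-side sketch of what requirement 10 enables (values and names are illustrative, not part of the protocol):

import numpy as np

# The producer reports how it encodes missing values; the consumer maps that
# onto its own representation (here: a boolean mask).
col = np.array([1, -128, 5], dtype=np.int8)  # producer's int8 column
missing_repr = ("sentinel", -128)            # as reported by the producer

kind, value = missing_repr
if kind == "sentinel":
    mask = col == value                      # True where the value is missing

# Float case discussed above: NaN is only a missing-value marker if the
# producer says so; otherwise it is an ordinary not-a-number value.
fcol = np.array([1.0, np.nan, 2.5])
nan_is_null = True                           # again, reported by the producer
fmask = np.isnan(fcol) if nan_is_null else np.zeros(len(fcol), dtype=bool)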

Member Author (@rgommers):

Also for float dtype you could ask that question: pandas (mis)uses NaN as missing value indicator. So in this exchange interface, should it indicate that it uses NaN as the null sentinel?

I think it should. Given the new masked float dtypes in Pandas 1.2 that were just released as an experimental feature, I'd expect there may be a future with good NA support where nan starts to mean not-a-number again?

Collaborator:

Should we consider introducing an argument like nan_as_null that can be passed to the interchange protocol function, allowing a library to decide whether to pass its NaN / NaT values as null sentinels or not? NaN is a bit of a special case, as I imagine there are different usages where sometimes it represents missing values and other times it represents invalid values.

Member Author (@rgommers):

I like the nan_as_null idea, agreed that it's common enough that it deserves special-casing.
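A sketch of how such a keyword could surface on the producer side (signature is illustrative; the final API was still to be designed at this point):

# Illustrative producer-side signature for the nan_as_null idea discussed above.
class MyDataFrame:
    def __dataframe__(self, nan_as_null: bool = False):
        """Return the interchange object for this dataframe.

        When nan_as_null is True, the producer reports NaN/NaT values through
        its null/missing-value representation instead of passing them through
        as ordinary float/datetime values.
        """
        ...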

rgommers added commits referencing this pull request on Feb 9, 2021 (three commits, each with the same note):

Related to requirements in gh-35.

TBD (to be discussed) comments and design decisions at the top
of the file indicate topics for closer review/discussion.
@rgommers added the "enhancement" label on Jun 25, 2021
@rgommers merged commit 2e3914f into main on Jun 25, 2021
@rgommers deleted the dataframe-interchange-protocol branch on June 25, 2021 at 19:45
@rgommers (Member Author):

Went through this again, and made some small updates for things that had changed or were decided on since the last review. The remaining open comments on this PR did not require changes, but do contain interesting content so I decided to leave them unresolved.

@jbrockmendel (Contributor):
@rgommers in the image above (#35 (comment)), all the columns in the third DataFrame have the same chunking structure (is there another term for this?). I expect this is not actually required, just easiest to draw. Am I right here, and if so, should that be made explicit?

@rgommers (Member Author) commented Jul 14, 2022

Hmm good question. The main motivation (from requirement 12 - maybe easier to read in rendered form in https://data-apis.org/dataframe-protocol/latest/design_requirements.html) is: Must support chunking, i.e. accessing the data in “batches” of rows.

Given that columns are contiguous in memory and rows are not, I don't think having different chunks on different columns is a problem, but there are probably some constraints. The reason for that is that both the DataFrame and the Column objects have num_chunks; if both are used, then I'd say a "dataframe chunk" is the outer container, and within that each column can be chunked. If the dataframe has num_chunks() == 1, then each column can in principle have an independent/arbitrary number of chunks.
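A sketch of the two-level chunking described here, using method names from the protocol draft (get_chunks, column_names, get_column_by_name); this is illustrative, not a spec:

# Outer loop: dataframe-level chunks; inner loop: per-column chunks.
def iter_column_chunks(df):
    for df_chunk in df.get_chunks():              # "dataframe chunk" = outer container
        for name in df_chunk.column_names():
            col = df_chunk.get_column_by_name(name)
            for col_chunk in col.get_chunks():    # columns may be chunked further
                yield name, col_chunk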

Maybe a question to others - is there a reason to forbid this?

@rgommers (Member Author):

We had a discussion on this in a call 2 weeks ago, and the outcome was:

  • Internally, it is possible for a dataframe to have different chunk sizes for different columns.
    • One example given was that it could have performance advantages in some cases when using a string column (it looks like this happens in cuDF).
  • For interchange, uniform chunks are needed. The question is only whether the re-chunking should be done on the producer or the consumer side.
  • Preference was for this to be on the consumer side.
  • Text and diagram here already show equal chunking, and at least the Modin implementation adhered to this (enforced on the producer side).
  • What's needed is a doc clarification stating that the producer must produce equal-size chunks.

I'll open a follow-up PR.
