
Requirements document for the dataframe interchange protocol #35

Merged: 20 commits from dataframe-interchange-protocol into main, Jun 25, 2021

Conversation

@rgommers (Member) commented Jan 5, 2021

Continued from gh-30. Main changes:

Also added one important TBD about the connection with the array data interchange protocol, with a suggestion to add that if and only if the supported dtypes allow it (i.e., only boolean, integer and floating-point dtypes, with missing data as separate boolean masks).

Up-to-date rendered version: https://github.com/data-apis/dataframe-api/blob/dataframe-interchange-protocol/protocol/dataframe_protocol_summary.md

Summarizes the various discussions about, and the goals/non-goals
and requirements for, the `__dataframe__` data interchange
protocol.

The intended audience for this document is Consortium members
and dataframe library maintainers who may want to support this
protocol. @datapythonista will add a companion document that's
a more gentle introduction/tutorial in a "from zero to a protocol"
style.

The aim is to keep updating this until we have captured all
the requirements and answered all the FAQs, so that we can then
design the protocol and verify it meets all our requirements.

Closes gh-29
Also address some smaller review comments.
Also add more details on the Arrow C Data Interface.
Comment on lines 176 to 180
The main (only?) limitation seems to be that it does not have device support
- @kkraus14 will bring this up on the Arrow dev mailing list. Also note that
that interface only talks about arrays; dataframes, chunking and the metadata
inspection can all be layered on top in this Python-level protocol, but are
not discussed in the interface itself.
Collaborator:

I'm -1 on building Python-only pieces on top of the C data interface. Ideally, if we're adopting a C data interface, we should work on extending that interface in the ways we need. If we can't extend such a C data interface, maybe we shouldn't be building on top of it to begin with.

There's a lot of good discussion going on related to DLPack and extending it to support asynchronous device execution:

Member Author (@rgommers):

If we can't extend such a C data interface, maybe we shouldn't be building on top of it to begin with.

Rather than building on top of the C spec, we could adopt the dtype format specification, which seems complete for our needs and well-designed.

Either way, I'm not really clear on why that interface only talks about arrays and not dataframes, I found that surprising given Arrow's focus. It seemed to me like chunking was thought about and a main reason for separating ArrowArray and ArrowSchema, but it's all very implicit. The idea seems to be that both multiple chunks and multiple columns are trivial to implement on top, so let's not bother with writing C structures for them.

Or maybe I'm missing some history here and there are other reasons. @jorisvandenbossche do you know what the story is?

There's a lot of good discussion going on related to DLPack and extending it to support asynchronous device execution

Yep that's starting to look good. Really we need the union of DLPack and the Arrow C Data Interface, plus support for multiple columns and chunking.
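For reference, a few entries from the Arrow dtype format-string specification mentioned above, paraphrased from the linked C Data Interface docs and shown here as a Python mapping purely for illustration:

# A small, non-exhaustive sample of Arrow C Data Interface format strings,
# paraphrased from the spec (not an official mapping from any library).
ARROW_FORMAT_EXAMPLES = {
    "b": "boolean",
    "l": "int64",
    "g": "float64",
    "u": "utf-8 string",
    "tsn:UTC": "timestamp in nanoseconds, timezone 'UTC'",
}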

Member:

Either way, I'm not really clear on why that interface only talks about arrays and not dataframes, I found that surprising given Arrow's focus. It seemed to me like chunking was thought about and a main reason for separating ArrowArray and ArrowSchema, but it's all very implicit. The idea seems to be that both multiple chunks and multiple columns are trivial to implement on top, so let's not bother with writing C structures for them.

Yes, the basic ABI is only an array and schema interface (ArrowArray and ArrowSchema), but those can be used to cover the more complex use cases (chunked arrays, dataframes (record batches)). Arrow itself (the C++ library and its Python and R bindings) already supports exporting/importing an Arrow Table or RecordBatch using the interface.
A very short note about this in the docs is at https://arrow.apache.org/docs/format/CDataInterface.html#record-batches

The array and schema structs are indeed separated to allow sending a single schema for multiple arrays (e.g. to support chunked arrays): https://arrow.apache.org/docs/format/CDataInterface.html#why-two-distinct-structures
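To illustrate the "one schema, many arrays" point with pyarrow (a quick sketch using only public pyarrow APIs; exact printed output may vary by version):

import pyarrow as pa

# A pyarrow Table keeps a single schema while each column is a ChunkedArray,
# i.e. multiple ArrowArray-style chunks sharing one ArrowSchema-style type.
t1 = pa.table({"a": [1, 2], "b": ["x", "y"]})
t2 = pa.table({"a": [3], "b": ["z"]})
tbl = pa.concat_tables([t1, t2])

print(tbl.schema)                  # defined once for the whole table
print(tbl.column("a").num_chunks)  # 2: both chunks reuse the same field type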

Member Author (@rgommers):

A very short notice about this in the docs is at https://arrow.apache.org/docs/format/CDataInterface.html#record-batches

I interpreted "A record batch can be trivially considered as an equivalent struct array with additional top-level metadata" to mean it's about nested dtypes in a single column - that's what a "struct array" in NumPy would be. The example at https://arrow.apache.org/docs/format/CDataInterface.html#exporting-a-struct-float32-utf8-array seems to imply that as well.

@jorisvandenbossche (Member) commented Jan 6, 2021

Yes, but a StructArray is simply a collection of named fields with child arrays (and those child arrays can again be of any type, including nested types), which is basically the same as a "dataframe" (at least in the way we are defining it here for the interchange protocol).
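A small pyarrow sketch of that equivalence (illustrative only):

import pyarrow as pa

# A StructArray is a set of named child arrays, essentially the columns of a
# small "dataframe" in the sense used for the interchange protocol.
struct = pa.StructArray.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["ints", "strs"],
)
print(struct.type)           # struct<ints: int64, strs: string>
print(struct.field("strs"))  # the child array backing the "strs" field

# The same data as a RecordBatch, which the C Data Interface docs describe as
# "an equivalent struct array with additional top-level metadata".
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])], names=["ints", "strs"]
)
print(batch.schema)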

Member Author (@rgommers):

Thanks, that makes sense. That does mean nested/structured dtypes have different memory layouts in Arrow and NumPy, which is worth noting explicitly in the item discussing those dtypes.

Member:

Personally, I don't think we should support "structured dtypes" (record arrays) in numpy's sense (which are not columnar), and when considering nested data, only consider the Arrow-like notion of nested data types.
(but that's maybe a discussion for later, since that item is currently in the list of non-requirements. It is indeed worth noting, though, that numpy's structured dtype is something completely different from arrow's struct type)

Member Author (@rgommers):

I'm not a huge fan of NumPy's way of doing this, but they are columnar. It's more that the definition of "dtype" is different: a NumPy dtype is seen as a single fixed-size unit, and a 1-D array is one contiguous memory block in which elements of that dtype are repeated in adjacent memory locations.

In contrast, a 1-D Arrow array can consist of discontiguous blocks of memory because of its looser definition of "dtype".
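A quick comparison of the two layouts (a sketch; the pyarrow part shows Arrow's struct type, not NumPy's structured dtype):

import numpy as np
import pyarrow as pa

# NumPy structured dtype: each record is one fixed-size block, so the "a" and
# "b" fields are interleaved in a single contiguous buffer.
rec = np.zeros(3, dtype=[("a", "<i4"), ("b", "<f8")])
print(rec.dtype.itemsize)  # 12 bytes per record (4 + 8)
print(rec.strides)         # (12,): one stride over interleaved records

# Arrow struct type: each child field keeps its own buffer(s), so the fields
# themselves remain columnar and the column as a whole spans several buffers.
struct = pa.StructArray.from_arrays(
    [pa.array([1, 2, 3], type=pa.int32()), pa.array([1.0, 2.0, 3.0])],
    names=["a", "b"],
)
print(struct.type)        # struct<a: int32, b: double>
print(struct.field("a"))  # contiguous int32 child array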

@jorisvandenbossche (Member) commented Jan 7, 2021

Yep, true (with my "not columnar" I wanted to say that the individual "fields" of the nested array are not columnar).

@rgommers force-pushed the dataframe-interchange-protocol branch from 3d1c690 to b465e39 on January 6, 2021 12:45
@rgommers force-pushed the dataframe-interchange-protocol branch from b465e39 to 1708e03 on January 6, 2021 13:19
@rgommers (Member Author) commented Jan 6, 2021

Added an image of the conceptual model of a dataframe including chunks:

[image: conceptual model of a dataframe, including chunks]

as well as a table with an overview of the properties of relevant protocols for an implementation:

  • buffer protocol
  • __array_interface__
  • DLPack
  • Arrow C Data Interface

Rendered version: https://github.com/data-apis/dataframe-api/blob/dataframe-interchange-protocol/protocol/dataframe_protocol_summary.md
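For a rough feel of what a few of those protocols expose on a NumPy array (illustrative only; the DLPack dunder only exists on NumPy >= 1.22, which postdates this comment):

import numpy as np

x = np.arange(4, dtype=np.int64)

# Buffer protocol: zero-copy view on the underlying memory.
mv = memoryview(x)
print(mv.format, mv.shape)   # format char is platform-dependent ('l' or 'q'), shape (4,)

# __array_interface__: Python-level dict describing the memory layout.
ai = x.__array_interface__
print(ai["typestr"], ai["shape"], ai["strides"])  # '<i8', (4,), None

# DLPack: newer NumPy versions also expose __dlpack__ / __dlpack_device__.
if hasattr(x, "__dlpack__"):
    capsule = x.__dlpack__()  # PyCapsule that a consumer's from_dlpack can ingest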

@jorisvandenbossche (Member) left a comment:

Thanks for the update!


| categoricals | (6) | (6) | (7) | (6) |

1. The Python API is only an interface to call the C API under the hood, it
doesn't contain a description of how the data is laid out in memory.
Member:

In that sense, there is also some "python" functionality for the Arrow C Data Interface (e.g. you can, in pure Python, move data from Python to R with rpy2; pyarrow also includes a cffi submodule: https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_cffi.py)

Member Author (@rgommers):

That doesn't look like a standardized API to me, so I'm not sure it's worth mentioning. I'd be looking for something like:

  • from_dlpack + __dlpack__
  • from_dataframe + __dataframe__

i.e. something that is clearly meant as an API which, if an object implements it, does the right thing.
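A minimal consumer-side sketch of that pattern (the from_dataframe name is borrowed from the bullet above; the exact __dataframe__ signature was still undecided at this point):

# Sketch only: the dunder-based interchange pattern, analogous to
# from_dlpack + __dlpack__.
def from_dataframe(obj):
    """Build this library's dataframe from any object exposing __dataframe__."""
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("object does not support the __dataframe__ protocol")
    interchange_df = obj.__dataframe__()  # producer hands over the interchange object
    # A real consumer would now walk the columns/chunks/buffers of
    # `interchange_df` and assemble its own dataframe type from them.
    ...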

Member Author (@rgommers):

To add to that: the DLPack version is just a thin standardized layer that connects to C; it does not contain the actual description of the memory layout. The only protocol that does that is __array_interface__, which has all the data and metadata available in Python; the C side of it is __array_struct__, which contains identical information.

I'll bring this up in the call today. It's important to be clear on what "Python level API" means exactly, and whether we want/need the full Python/C equivalence (I believe that is what we were aiming for, but it got lost a little in this PR and discussion).

Member Author (@rgommers):

The outcome of this was that Python-only is enough; no one had a real need for a C API.

2. Can be done only via separate masks of boolean arrays.
3. `__array_interface__` has a `mask` attribute, which is a separate boolean array also implementing the `__array_interface__` protocol.
4. Only fixed-length strings as sequence of char or unicode.
5. Only NumPy datetime and timedelta, which are limited compared to what the Arrow format offers.
Member:

For datetime and timedelta, I think that's actually quite similar (Arrow additionally supports missing values for them, but that's the same for all data types). On top of that, Arrow also provides date and time data types.

Collaborator:

We also support datetime and timedelta values in cuDF, though, similar to Arrow, we don't support NaT. Is NaT something we should call out here, or does that fit into the bin of specifying missing values?

Member Author (@rgommers):

There are 16 Arrow format strings for datetime functionality in https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings, so I'd hope that does a little more than just 'm', 'M' in https://numpy.org/devdocs/reference/arrays.interface.html#python-side. I've never actually tried using datetime support in __array_interface__, so it's possible I'm missing something.

Member Author (@rgommers):

Ah, I think this explains it - Pandas (and NumPy) use nanosecond precision (hence only one format string for datetime and one for timedelta), while the Arrow strings can change resolution.

NaT in NumPy is nan rather than NA. I can't find it in the Arrow docs, but given there's a separate null type I imagine NaT is nan as well?
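For reference, Arrow parametrizes resolution (and timezone) in the type itself, which is what the many "ts*" format strings encode. A quick pyarrow illustration (output may print slightly differently across versions):

import pyarrow as pa

# Arrow timestamp/duration types carry their unit (and optional timezone),
# so each combination gets its own format string in the C Data Interface.
for unit in ("s", "ms", "us", "ns"):
    print(pa.timestamp(unit))        # timestamp[s], timestamp[ms], ...
print(pa.timestamp("ns", tz="UTC"))  # timestamp[ns, tz=UTC]
print(pa.duration("ms"))             # duration[ms]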

@kkraus14 (Collaborator) commented Jan 6, 2021

Apologies, what I wrote was a bit ambiguous. NumPy / Pandas have NaT, which follows the rules of NaN. Arrow / cuDF have null, which follows Kleene logic null semantics. There is not a separate binary representation of NaT, so NumPy / Pandas use min(int64) to represent it.

Do we give a way for a library to indicate a NaT value as opposed to a null value or is that out of scope?

For Pandas in its current state, would they specify min(int64) as a sentinel value for missing values or is that considered conceptually different than null?

Member:

Ah, I think this explains it - Pandas (and NumPy) use nanosecond precision (hence only one format string for datetime and one for timedelta), while the Arrow strings can change resolution.

Pandas cannot change resolution (at the moment), but numpy has support for all the resolutions (and more) that arrow has. I don't know how that can be represented in the __array_interface__, but apparently it's just using the parametrized "[unit]" syntax there as well:

In [51]: arr = np.array(["2012-01-01", "NaT"], dtype="datetime64[s]")

In [52]: arr.__array_interface__
Out[52]: 
{'data': (93858349892448, False),
 'strides': None,
 'descr': [('', '<M8[s]')],
 'typestr': '<M8[s]',
 'shape': (2,),
 'version': 3}

So that was what I wanted to say in my original comment above: numpy's datetime64 dtypes can cover all arrow's timestamp types (minus support for nulls, as for other types). I think it's mainly lack of support for timezones that makes numpy's datetime64 types more limited compared to arrow's timestamp types.


Regarding NaT, arrow indeed doesn't support that (only standard nulls). Just checked, and apparently we convert NaT to null upon conversion:

In [53]: arr = np.array(["2012-01-01", "NaT"], dtype="datetime64[s]")

In [54]: pa.array(arr)
Out[54]: 
<pyarrow.lib.TimestampArray object at 0x7fe5b7929e50>
[
  2012-01-01 00:00:00,
  null
]

Do we give a way for a library to indicate a NaT value as opposed to a null value or is that out of scope?

For Pandas in its current state, would they specify min(int64) as a sentinel value for missing values or is that considered conceptually different than null?

I suppose this is covered by number 10 of the requirements currently:

  10. Must allow the consumer to inspect the representation for missing values
      that the producer uses for each column or data type.
      Rationale: this enables the consumer to control how conversion happens,
      for example if the producer uses -128 as a sentinel value in an int8
      column while the consumer uses a separate bit mask, that information
      allows the consumer to make this mapping.

That of course only communicates the memory representation, and not the actual semantics (NaN-like semantics vs null-like semantics). I am not sure we should get into that ...

Also for the float dtypes you could ask that question: pandas (mis)uses NaN as a missing value indicator. So in this exchange interface, should it indicate that it uses NaN as the null sentinel? A float column coming from cudf, on the other hand, would not indicate NaN as the sentinel, since cudf/arrow uses a bitmask (but it can still contain NaNs).
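To make that concrete, a small consumer-side sketch of what requirement 10 enables (values and names are illustrative, not part of the protocol):

import numpy as np

# The producer reports how it encodes missing values; the consumer maps that
# onto its own representation (here: a boolean mask).
col = np.array([1, -128, 5], dtype=np.int8)  # producer's int8 column
missing_repr = ("sentinel", -128)            # as reported by the producer

kind, value = missing_repr
if kind == "sentinel":
    mask = col == value                      # True where the value is missing

# Float case discussed above: NaN is only a missing-value marker if the
# producer says so; otherwise it is an ordinary not-a-number value.
fcol = np.array([1.0, np.nan, 2.5])
nan_is_null = True                           # again, reported by the producer
fmask = np.isnan(fcol) if nan_is_null else np.zeros(len(fcol), dtype=bool)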

Member Author (@rgommers):

Also for float dtype you could ask that question: pandas (mis)uses NaN as missing value indicator. So in this exchange interface, should it indicate that it uses NaN as the null sentinel?

I think it should. Given the new masked float dtypes in Pandas 1.2 that were just released as an experimental feature, I'd expect there may be a future with good NA support where nan starts to mean not-a-number again?

Collaborator:

Should we consider introducing an argument like nan_as_null that can be passed to the interchange protocol function, allowing a library to decide whether to pass its NaN / NaT values as null sentinels or not? NaN is a bit of a special case, as I imagine there are different usages where sometimes it represents missing values and other times it represents invalid values.

Member Author (@rgommers):

I like the nan_as_null idea, agreed that it's common enough that it deserves special-casing.
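A sketch of how such a keyword could surface on the producer side (signature is illustrative; the final API was still to be designed at this point):

# Illustrative producer-side signature for the nan_as_null idea discussed above.
class MyDataFrame:
    def __dataframe__(self, nan_as_null: bool = False):
        """Return the interchange object for this dataframe.

        When nan_as_null is True, the producer reports NaN/NaT values through
        its null/missing-value representation instead of passing them through
        as ordinary float/datetime values.
        """
        ...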

rgommers added commits referencing this pull request on Feb 9, 2021 (three commits, each with the same note):

Related to requirements in gh-35.

TBD (to be discussed) comments and design decisions at the top
of the file indicate topics for closer review/discussion.
@rgommers added the "enhancement" label on Jun 25, 2021
@rgommers merged commit 2e3914f into main on Jun 25, 2021
@rgommers deleted the dataframe-interchange-protocol branch on June 25, 2021 at 19:45
@rgommers (Member Author):

Went through this again, and made some small updates for things that had changed or were decided on since the last review. The remaining open comments on this PR did not require changes, but do contain interesting content so I decided to leave them unresolved.

@jbrockmendel (Contributor):
@rgommers in the image above (#35 (comment)), all the columns in the third DataFrame have the same chunking structure (is there another term for this?). I expect this is not actually required, just easiest to draw. Am I right here, and if so, should that be made explicit?

@rgommers (Member Author) commented Jul 14, 2022

Hmm good question. The main motivation (from requirement 12 - maybe easier to read in rendered form in https://data-apis.org/dataframe-protocol/latest/design_requirements.html) is: Must support chunking, i.e. accessing the data in “batches” of rows.

Given that columns are contiguous in memory and rows are not, I don't think having different chunks on different columns is a problem, but there are probably some constraints. The reason for that is that both the DataFrame and the Column objects have num_chunks; if both are used, then I'd say a "dataframe chunk" is the outer container, and within that each column can be chunked. If the dataframe has num_chunks() == 1, then each column can in principle have an independent/arbitrary number of chunks.
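A sketch of the two-level chunking described here, using method names from the protocol draft (get_chunks, column_names, get_column_by_name); this is illustrative, not a spec:

# Outer loop: dataframe-level chunks; inner loop: per-column chunks.
def iter_column_chunks(df):
    for df_chunk in df.get_chunks():              # "dataframe chunk" = outer container
        for name in df_chunk.column_names():
            col = df_chunk.get_column_by_name(name)
            for col_chunk in col.get_chunks():    # columns may be chunked further
                yield name, col_chunk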

Maybe a question to others - is there a reason to forbid this?

@rgommers (Member Author):

We had a discussion on this in a call 2 weeks ago, and the outcome was:

  • Internally, it is possible for a dataframe to have different chunk sizes for different columns.
    • One example given was that it could have performance advantages in some cases when using a string column (it looks like this happens in cuDF).
  • For interchange, uniform chunks are needed. The question is only whether the re-chunking should be done on the producer or the consumer side.
  • Preference was for this to be on the consumer side.
  • Text and diagram here already show equal chunking, and at least the Modin implementation adhered to this (enforced on the producer side).
  • What's needed is a doc clarification stating that the producer must produce equal-size chunks.

I'll open a follow-up PR.
