
Add a prototype of the dataframe interchange protocol #38

Merged: 13 commits merged into main from the protocol-impl branch on Jun 25, 2021

Conversation

@rgommers (Member) commented Feb 9, 2021

Related to requirements in gh-35.

TBD (to be discussed) comments and design decisions at the top of the file indicate topics for closer review/discussion.

@rgommers added the enhancement label Feb 9, 2021
Design decisions
----------------

**1. Use a separate column abstraction in addition to a dataframe interface.**
Member

If this is an interchange protocol, I do not see value in having a column abstraction. If we want a way to treat a dataframe as an array, we should make that explicit rather than doing it through a column interface. The Modin implementation does not support a "column" abstraction, but rather a dataframe with 1 column.

Member Author

Thanks Devin, this is a good discussion to have. As the note 7 lines below says, this is not a user-facing API. It seems a useful abstraction for eventually getting down to a description of memory. There are multiple reasonable ways to model a dataframe though. What this design does is use 5 concepts in 3 classes:

  1. A Buffer class. A "buffer" is a contiguous block of memory - this is the only thing that actually maps to a 1-D array in a sense that it could be converted to NumPy, CuPy, et al.
  2. A Column class. A "column" has a name and a single dtype. It can consist of multiple "chunks". A single chunk of a column (which may be the whole column if num_chunks == 1) is modeled as again a Column instance, and contains 1 data buffer and (optionally) one "mask" for missing data.
  3. A DataFrame class. A "data frame" is an ordered collection of columns. It has a single device, and all its rows are the same length. It can consist of multiple "chunks". A single chunk of a data frame is modeled as again a DataFrame instance.
  4. A "mask" concept. A mask of a single-chunk column is a buffer.
  5. A "chunk" concept. A chunk is a sub-dividing element that can be applied to a data frame or a column.

These I think are all the actual abstractions we use when talking about data frames. It's fairly explicit and easy to understand. An alternative approach would be to do something like what Arrow did - its "buffer" is the same as in this design, but it doesn't use any of the other concepts - its "array" (which is a little like Column here and not at all like the "array" we use in the array API standard) and "children" are ways of combining buffers rather than direct mappings to data frame concepts. Arrow's approach seems to be to use the minimal number of concepts needed to describe the memory layout of a data frame, and leave it up to its users to map that to their actual data frame concepts.

Taking into account our previous discussions of this topic as well as the early prototypes sketched out by @wesm and @kkraus14, I believe we did want something that looked more like a minimal dataframe API with explicitly named methods aimed at library developers. Given that, I think this is at least in the right direction. The main question regarding these concepts that I wasn't sure about is: should a column (within a single chunk) be allowed to have multiple buffers or not?
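To make that concrete, here is a rough sketch (not code from this PR) of the three classes and the concepts they carry; the method names are illustrative guesses based on this discussion (get_data_buffer and get_mask are mentioned further down), not a settled spec:

from typing import Optional, Sequence

class Buffer:
    """A contiguous block of memory: essentially a raw pointer plus a size in bytes."""
    ptr: int      # pointer to the start of the buffer
    bufsize: int  # buffer size in bytes

class Column:
    """Named, single-dtype data; may itself consist of multiple chunks."""
    name: str
    dtype: tuple  # e.g. (kind, bit_width, format_str, endianness)

    def num_chunks(self) -> int: ...
    def get_data_buffer(self) -> Buffer: ...     # the data itself
    def get_mask(self) -> Optional[Buffer]: ...  # missing-data mask, if any

class DataFrame:
    """An ordered collection of equal-length columns on a single device."""
    device: str

    def num_chunks(self) -> int: ...
    def column_names(self) -> Sequence[str]: ...
    def get_column_by_name(self, name: str) -> "Column": ...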

Member

As the note 7 lines below says, this is not a user-facing API

Yes, I understood this. A better wording of my statement would have been "I do not see value in having a column abstraction in an interchange protocol."

I guess I do not understand the rationales here and how that maps to the interchange protocol.

- This is how it works in R, Julia and Apache Arrow.
- Semantically most existing applications and users treat a column similar to a 1-D array
- We should be able to connect a column to the array data interchange mechanism(s)

Here you have given examples, and make a point about applications and users, along with the array API. This does not seem to be about the interchange protocol.

Maybe I misunderstood the purpose of this PR based on your comment. It seems you intend to extend this interchange protocol into the developer API we have been talking about. Is that correct?

I may have missed a week or two scrambling toward paper/talk deadlines 😄. Perhaps I would have pushed back (or at least forced a more directed discussion) on some things that were decided if I were there.

Member Author

I may have missed a week or two scrambling toward paper/talk deadlines 😄. Perhaps I would have pushed back (or at least forced a more directed discussion) on some things that were decided if I were there.

Now is a good time to push:) Nothing is set in stone.

I do not see value in having a column abstraction in an interchange protocol.

So what would you do with the methods that are now on Column but not on DataFrame? Attach them to DataFrame and make them raise an exception unless it's a single-column dataframe?

I guess I do not understand the rationales here and how that maps to the interchange protocol.

Good point. Those rationales were a summary of opinions on wesm/dataframe-protocol#1, I didn't put enough thought into whether they did or did not apply here.

Maybe I misunderstood the purpose of this PR based on your comment. It seems you intend to extend this interchange protocol into the developer API we have been talking about. Is that correct?

In our last meeting we discussed that. There's no such intention to extend currently I'd say, although it's a possibility. The mood was more "let's get this interchange protocol done, and then see based on experience and adoption what the right approach is for a more complete API".

Personally I think a developer API may have some similarities, but will not be a superset of this interchange protocol. For example, I'm not convinced it should have a column abstraction, and certainly not memory-oriented APIs like get_data_buffer or get_mask.


Though I am one of the people pushing to get things out, I think we need to be careful when adding new abstractions. I am for a minimally viable protocol in the first version, and I second that adding a Column and even a Buffer seems to follow the current state of affairs without critically challenging whether they are needed in the first version. This is especially dangerous as it has a tendency to become another design that tries to please everyone, especially Pandas' accustomed user base. IMHO we need to keep the balance between usefulness and adoption.

Member

Maybe we can call it Array instead of Column, then we don't have a Column abstraction any more ;-)

Member Author

GitHub does not have the range of emojis needed to accurately reflect my initial reaction to that :)

This is better for implementations that may rely on, for example, lazy
computation.

**3. No row names. If a library uses row names, use a regular column for them.**
Member

I disagree with this. If two libraries support row labels, it would generally be better for both if there were a way to convert row labels between them. In general, row labels are metadata, and moving metadata into the data can be costly and require copying. I think this places too much of a constraint on dataframe implementations and assumes that moving metadata into data is cheap, which is certainly not the case in Modin.

Member Author

I don't have a strong opinion on this, this was my interpretation of the linked discussion. Happy to change, let's wait for a few more opinions though - @kkraus14, @maartenbreddels, @TomAugspurger and @datapythonista all seemed to lean towards "no row labels".

Maybe my adding "use a regular column for them" isn't correct though - perhaps it should simply have been "is not supported".

Collaborator

I strongly voiced my opinion on row names: they should either be required and part of the spec, or entirely not part of the spec. Making them optional just ends up pushing everyone to implement them, which is what we've seen with everyone following the Pandas API.

In the early days of cuDF we tried to take the stance that we wouldn't implement row names (indexes). Eventually enough people pushed back that it was needed for Pandas compatibility, so we added a single-column Index, which then succumbed to the same argument and led to supporting MultiIndexes as well. Generally all of the arguments were always just about Pandas compatibility / minimizing changes to existing code, as opposed to better implementations or better code paths in any way.

+1 to just saying "is not supported"

Member

@kkraus14 I don't agree with this idea:

Generally all of the arguments were always just for Pandas compatibility / minimizing changes to existing code as opposed to better implementations or better code paths in any way.

All of the discussions against row labels ignore the users and the question of why they exist in the first place. Yes, it is easier to have an implementation without them, I agree. I would argue that it is not about the "legacy compatibility" of changing existing code. It has become a part of the mental model of (millions of) users, and row names were a part of S and R dataframes long before pandas. I have never been a fan of the idea that users should change their mental models of a system to fit what the library authors found easy to implement or optimize. Maybe this is not the place to discuss the validity of row labels as a concept, because it likely boils down to deep-seated opinions on user-centered designs vs performance-centered designs. Sometimes those views are not compatible.

I do not think making row labels optional will force everyone to implement them. This is a developer API, so if something is optional, the developers using it will understand that and handle things appropriately. If this were an end user API things might be different.

I have a concern that narrowing this down into the least common denominator will result in something that is not particularly useful. I mentioned this in the live meetings but throwing specific exceptions or warnings should absolutely be a part of the spec and can help with this question about optional implementation details.

If there is no support for labels then the only two options are to (1) implicitly move the metadata into the data, which as I mentioned might be expensive, or (2) drop the row labels altogether, which might contain relevant data. Without row labels, there is no operator to explicitly move row labels into the data (reset_index in pandas), so I don't see any other way to handle it.

Collaborator

All of the discussions against row labels ignore the users and the question of why they exist in the first place. Yes, it is easier to have an implementation without them, I agree. I would argue that it is not about the "legacy compatibility" of changing existing code. It has become a part of the mental model of (millions of) users, and row names were a part of S and R dataframes long before pandas. I have never been a fan of the idea that users should change their mental models of a system to fit what the library authors found easy to implement or optimize. Maybe this is not the place to discuss the validity of row labels as a concept, because it likely boils down to deep-seated opinions on user-centered designs vs performance-centered designs. Sometimes those views are not compatible.

95+% of the use cases we saw were people not actually using indexes in any meaningful way. They were constantly using reset_index to get back to RangeIndex objects and then using __getitem__ calls to access row numbers instead of using iloc. Similarly, with MultiIndex we almost always saw it appear just because an operation like groupby or join returns a MultiIndex by default; instead of changing the code to use an as_index=False option or something similar, they'd just add a reset_index immediately afterwards.

I do not think making row labels optional will force everyone to implement them. This is a developer API, so if something is optional, the developers using it will understand that and handle things appropriately. If this were an end user API things might be different.

If row labels are optional then presumably Pandas will support them and developers will end up continuing to use them unnecessarily and inefficiently, which will then push other libraries to implement them unnecessarily because they want to be able to integrate.

For what it's worth, on the cuDF side we now handle the indexing / multi-indexing cases somewhat nicely, but I am pushing to keep new DataFrame libraries from having to go through the same pain that we did, unless they explicitly want developers to use indexing features.

Member Author

Again, absent the argument is any discussion on the user's mental model.

I'm not sure that's the argument. My experience is the same as @kkraus14's with code I got from less experienced colleagues in the past: they just use the defaults (which can be as_index=True), then get to "this stuff is weird, how do I get rid of it", then hit upon .reset_index() via a StackOverflow post. There may be some advanced users who make good use of row labels, but I suspect it's a small-ish minority. For the rest, row labels do not seem to be part of their mental model.

Would be good to get some pandas maintainer opinions on this one.

Member

(I am a bit of a biased pandas maintainer, as I would like to see pandas get rid of default row indices; "default" to be clear - we would still have the functionality if you explicitly ask for it)

I think regarding data exchange, the index can be passed through just like another column? In which case it might be a question of whether we want to make it possible to indicate that certain column(s) can be used as the index (e.g. a describe_index() that returns the names of the index columns; someone not interested in that can just ignore that method).

Member

Revisiting this, passing the row labels as a separate buffer (or a "column buffer") would work for me as long as the labels would not need to be moved into the data.

Member Author

Cool, that's helpful @devin-petersohn. I think it could perhaps be combined with a metadata escape hatch, which would also be separate from the regular data, and which we discussed last time as being desirable to add.

Member Author

Update on this, since we had multiple discussions on this: in the end we settled on going with metadata (added in gh-43), and decided that libraries could store anything there (it's a Dict[str, Any] attribute) so indeed things like row labels or what the index column is can be stored there and it doesn't need to be moved into the set of regular columns. This seemed to work for everyone.
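A hedged illustration of that escape hatch (the key name "mylib.row_labels" and the class internals are hypothetical; only the Dict[str, Any] metadata attribute comes from gh-43):

class ProtocolDataFrame:
    def __init__(self, row_labels):
        self._row_labels = row_labels

    @property
    def metadata(self) -> dict:
        # Dict[str, Any]; library-prefixed keys help avoid collisions between producers.
        return {"mylib.row_labels": self._row_labels}

# Consumer side: use the labels if present, otherwise just ignore the key.
labels = ProtocolDataFrame(["a", "b", "c"]).metadata.get("mylib.row_labels")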

protocol/dataframe_protocol.py (resolved comment thread)

def __array_interface__(self):
"""
TBD: implement or not? Will work for all dtypes except bit masks.
Collaborator

Given this is just a buffer, we don't need to worry about bit masks at this level. That being said, what would the behavior of a GPU-memory buffer be here? I'd push that, since we're making a new protocol, we should make it device-agnostic from the beginning to prevent people from using things that aren't device-agnostic, but I'm obviously biased 😄

Member Author

Yes the whole design is definitely meant to be device-agnostic.

__array_interface__ itself doesn't have a device concept, but I think by design it's similar to DLPack and the shape, typestr, offset and version of https://numpy.org/doc/stable/reference/arrays.interface.html will work just fine on GPUs?

That said, that is a bit of an abuse of the protocol and the only reason to add it would be to save people that don't want to implement DLPack support some effort (__array_interface__ is the only pure-Python memory description protocol).

Collaborator

If we shoved a GPU pointer into __array_interface__ and then a library like numpy tried to read it, it would likely segfault or otherwise explode in a not-great way. That's the whole reason we implemented __cuda_array_interface__.

Member Author (@rgommers, Feb 10, 2021)

If we shoved a GPU pointer into __array_interface__ and then a library like numpy tried to read it it would likely segfault or otherwise explode in a not great way.

That'd be consumer library error. Since there's DataFrame.device, the consuming dataframe library must not be calling NumPy unless the device is CPU.

Collaborator

But this is an array protocol, and presumably there are plenty of array libraries that implement it without knowing to check DataFrame.device. If this were something like __column_interface__ then I could imagine it, but this is an existing protocol, so from my perspective we can't add extra pieces to it unless we push those pieces into the actual protocol.

Member Author

Unresolving:) That was my initial thinking @jorisvandenbossche. __array_interface__ should work, especially if device moves to the buffer level. Could be implemented as a property that raises unless device == cpu.

(or __array__, or __arrow_array__).

Those are a less good idea, because they rely on NumPy or Arrow being present, which we cannot force array libraries to depend on.
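As a sketch of that idea (mine, not the PR's code; the ptr/bufsize/device attribute names are assumptions based on this prototype), a Buffer could expose __array_interface__ only for CPU memory:

import numpy as np

class CPUOnlyBuffer:
    def __init__(self, ptr: int, bufsize: int, typestr: str, device: str = "cpu"):
        self.ptr = ptr          # raw pointer to the start of the buffer
        self.bufsize = bufsize  # size in bytes
        self.typestr = typestr  # e.g. "<f8" for little-endian float64
        self.device = device

    @property
    def __array_interface__(self):
        if self.device != "cpu":
            raise TypeError("__array_interface__ only describes CPU memory")
        itemsize = np.dtype(self.typestr).itemsize
        return {
            "shape": (self.bufsize // itemsize,),
            "typestr": self.typestr,
            "data": (self.ptr, False),  # (pointer, read-only flag)
            "version": 3,
        }

# np.asarray(buf) would then work for CPU buffers and raise otherwise.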

Member Author

that will be easier to interface with the interchange protocol than manually converting the buffers.

That doesn't sound pleasant. The idea would be not to do manual memory management, but implement __dlpack__ in NumPy (NEP will be ready soon), and then Pandas et al. can simply call np.from_dlpack(df) if they want to turn the buffer into a NumPy array.

Member

I think for me a big unclear question is how people are going to use this. For example, I assume I am a dataframe library that uses pyarrow under the hood, and I get some object using this protocol. With the current interface, I would still need to manually use the buffers to construct arrow arrays?

Those are a less good idea, because they rely on NumPy or Arrow being present, which we cannot force array libraries to depend on.

It only "forces" the consumer to have it, I think? (and if they already use it, it is not forcing anybody, but just convenient IMO)

It's true that you need a lazy import of numpy/arrow, of course. __array_interface__ is also fine for me, the end result is the same for the user (np.asarray(buffer) working).
We could discuss a similar __arrow_array_interface__ if that would be more acceptable than __arrow_array__

Member Author

It only "forces" the consumer to have it, I think?

No, the semantics of __array__() are: it must return a numpy array. The producer has to do the work here. What you're saying is true for __array_interface__, not __array__.

__arrow_array__ works like __array__: https://arrow.apache.org/docs/python/extending_types.html#controlling-conversion-to-pyarrow-array-with-the-arrow-array-protocol

We could discuss a similar __arrow_array_interface__ if

If it's at the buffer level, that would be identical to __array_interface__ - it's just a description of memory.

I think for me a big unclear question is how people are going to use this. For example, I assume I am a dataframe library that uses pyarrow under the hood, and I get some object using this protocol. With the current interface, I would still need to manually use the buffers to construct arrow arrays?

Yep that's a great question. Right now you need to convert data buffer by buffer. I imagine Pandas would have a from_dataframe function that would:

  1. iterate over columns
  2. per column, check dtype and number of buffers
  3. obtain column data:
    a. if a single buffer, do col = np.from_dlpack(buffer)
    b. if multiple buffers/chunks, then instantiate a numpy array of the correct dtype for the column, e.g. with np.empty, (or an Arrow array if you have dtypes that are Arrow-backed?), and copy the buffers into the array
  4. call a pd.DataFrame constructor with all the column arrays and column names/dtypes.

(add an iteration over chunks if num_chunks > 1)

There's other checks to be made:

  • raise if device isn't CPU
  • if there's bit masks and you want a numpy array, need to either figure out how to do that with np.frombuffer (better), or be lazy and call __dataframe__ again with nan_as_null = True.

It will be instructive to write this implementation, shouldn't be that hard for the simpler dtypes/cases. I could perhaps have a go at that over the weekend.
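A condensed sketch of the steps above (not the eventual pandas code; it assumes single-chunk columns, CPU-only data, a working np.from_dlpack, and method names like column_names/get_column_by_name/get_data_buffer that follow this prototype):

import numpy as np
import pandas as pd

def from_dataframe(df_obj) -> pd.DataFrame:
    df = df_obj.__dataframe__()
    if df.device != "cpu":                   # check: raise if device isn't CPU
        raise RuntimeError("only CPU dataframes are supported in this sketch")
    columns = {}
    for name in df.column_names():           # 1. iterate over columns
        col = df.get_column_by_name(name)
        if col.num_chunks() > 1:             # chunk handling left out here
            raise NotImplementedError("chunked columns not handled in this sketch")
        buffer = col.get_data_buffer()       # 2./3. single data buffer per column
        columns[name] = np.from_dlpack(buffer)  # zero-copy view via DLPack
    return pd.DataFrame(columns)             # 4. construct the pandas DataFrame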

Member Author (@rgommers, Feb 11, 2021)

The most interesting topic here is actually not "shall we have __array__ or not, __array_interface__ or not", but "how do you effectively iterate over all chunks/columns/buffers with as little data copying as possible".

protocol/dataframe_protocol.py (outdated; resolved comment thread)
Comment on lines 150 to 158
Kind :

- 0 : signed integer
- 1 : unsigned integer
- 2 : IEEE floating point
- 20 : boolean
- 21 : string (UTF-8)
- 22 : datetime
- 23 : categorical
Collaborator

Why do we need both Kind and the format string? Seems like duplicate information, except Kind is less flexible.

Member Author

Format strings are quite annoying to parse, so for anything besides datetime I think library authors would have a preference for using kind + bit_width.
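As a sketch of the kind + bit_width route (the values are taken from the diff above; the helper below is hypothetical and only covers the simple kinds), a consumer could branch on an enum instead of parsing format strings:

import enum

class DtypeKind(enum.IntEnum):
    INT = 0           # signed integer
    UINT = 1          # unsigned integer
    FLOAT = 2         # IEEE floating point
    BOOL = 20         # boolean
    STRING = 21       # string (UTF-8)
    DATETIME = 22     # datetime
    CATEGORICAL = 23  # categorical

def numpy_typestr(kind: DtypeKind, bit_width: int) -> str:
    # Map the simple kinds to a NumPy type string, e.g. (INT, 64) -> "i8".
    prefix = {DtypeKind.INT: "i", DtypeKind.UINT: "u",
              DtypeKind.FLOAT: "f", DtypeKind.BOOL: "b"}[kind]
    return f"{prefix}{bit_width // 8}"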

Collaborator

Should the answer then be for us to expand Kind in order to handle all of the cases that the format string handles? The way I read this spec, everyone will have to produce the format strings as of now, so they'll have to deal with the annoyance anyway.

Member Author

True, however producing a format string is much easier than parsing one - I think it's a minor annoyance.

I don't see how we'd cover all the datetime formats including time zones without a format string.

Member Author

Maybe use 16 kind integers for the 16 formats in https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings, then a separate string field for timezone?

Collaborator

Do we care about supporting things like Decimal or Fixed Size Lists in the future, where we need to specify some semi-arbitrary attributes alongside the dtype? If we do, then we should make sure that whatever specification we decide on here can eventually support those gracefully.

Member

I think we should at least keep the possibility open to support more data types like decimal and nested types (which probably means we need format strings?)

Comment on lines +203 to +204
- "mapping" : dict, Python-level only (e.g. ``{int: str}``).
None if not a dictionary-style categorical.
Collaborator

Is this the actual dictionary mapping? I.e. if I had a column of ['gold', 'bronze', 'silver', 'gold', 'silver', 'bronze'] with a dictionary of ['gold' < 'silver' < 'bronze'] would this mapping key be: {0: 'gold', 1: 'silver', 2: 'bronze'}?

Member Author

Yes indeed.

Collaborator

I'm a HUGE -1 on this. We'd want our dictionary to just be another column which is on the device for example.

Member Author (@rgommers, Feb 10, 2021)

How do you put a dictionary in a single column? It'd be a dataframe of two columns, index and value?

Collaborator

Take a categorical column of ['gold', 'bronze', 'silver', null, 'bronze', 'silver', 'gold'] with categories of ['gold' < 'silver' < 'bronze'] using Arrow formatting:

categorical column: {
    mask_buffer: [119], # 01110111 in binary
    data_buffer: [0, 2, 1, 127, 2, 1, 0], # the 127 value in here is undefined since it's null
    children: [
        string column: {
            mask_buffer: None,
            offsets_buffer: [0, 4, 10, 16],
            data_buffer: [103, 111, 108, 100, 115, 105, 108, 118, 101, 114, 98, 114, 111, 110, 122, 101]
        }
    ]
}

In the case of cuDF all of these buffers are on the GPU, and getting a host mapping of {0: 'gold', 1: 'silver', 2: 'bronze'} would be extremely expensive for any sufficiently large dictionary.

Member Author

A note on the above example: children is actually named dictionary in ArrowSchema (it is a separate struct in the schema rather than being one of the child arrays; the history for that decision is a little unclear).

protocol/dataframe_protocol.py (outdated; resolved comment threads)
Comment on lines 264 to 270
# # NOTE: not needed unless one considers nested dtypes
# def get_children(self) -> Iterable[Column]:
# """
# Children columns underneath the column, each object in this iterator
# must adhere to the column specification
# """
# pass
Collaborator

How are we handling strings without this?

Member Author

I may be missing some context for this question. NumPy-like fixed length strings would be handled as a single buffer; for variable-length strings it's unclear to me how you would handle them even with get_children.

Collaborator (@kkraus14, Feb 10, 2021)

For variable-length strings the way Arrow handles them is by having a Column of bytes (int8 / uint8 for example) and then a Column of integer offsets into the bytes. So for example, say you had a column of strings of: ['joe', null, 'bob', ''] then under the hood the column would look like:

string column: {
    data_buffer: None,
    mask_buffer: [13], # 00001101 in binary
    children: [
        characters int8 column: {
            data_buffer: [106, 111, 101, 98, 111, 98],
            mask_buffer: None
        },
        offsets int32 column: {
            data_buffer: [0, 3, 3, 6, 6],
            mask_buffer: None
        }
    ]
}
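To make the offsets/validity mechanics concrete, here is a small sketch that decodes the example above back into Python strings (the buffer values are copied from the example; in the real protocol these would be raw memory rather than Python lists):

data_buffer = bytes([106, 111, 101, 98, 111, 98])  # b"joebob"
offsets = [0, 3, 3, 6, 6]                          # n + 1 offsets for n strings
validity = 0b00001101                              # bit i set => row i is valid

strings = []
for i in range(len(offsets) - 1):
    if validity & (1 << i):
        strings.append(data_buffer[offsets[i]:offsets[i + 1]].decode("utf-8"))
    else:
        strings.append(None)

print(strings)  # ['joe', None, 'bob', '']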

Collaborator

This is also how we handle strings in cuDF and I suspect that generally people will want variable length strings as opposed to fixed length strings.

Member Author

For variable-length strings the way Arrow handles them is by having a Column of bytes (int8 / uint8 for example)

It has multiple ways of doing this? There's exactly zero specification of this in https://arrow.apache.org/docs/format/CDataInterface.html, and I can't find a separate description either. All I see is it uses format strings 'u' and 'U' for regular and large utf-8 strings.

Would you expect all libraries to handle multiple flavors of variable-length strings? So if the type specifiers say "utf-8 string", then there's some self-describing format for how that's laid out with multiple child buffers, each of which can have a different dtype?

Member Author

a specialization on Variable Binary https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout.

:( It's like they want people to not understand them. The writing is so terse - there's zero connection to strings in the docs there.

So Arrow doesn't necessarily use child Columns in the case of Strings, but 3 buffers: validity, offsets, and data.

Thanks for the explanation. I understand now. Okay, that indeed requires a change from what I have in this PR currently. There's three options:

  1. add children columns
  2. add multiple buffers per column (actually we have data and mask, so just adding offsets would be enough).
  3. do both (1) and (2) - perhaps handy to future-proof it for nested dtypes.

Collaborator

I'd be in favor of option 3 😄

Member

:( It's like they want people to not understand them. The writing is so terse - there's zero connection to strings in the docs there.

You are welcome to open an issue about that ;)
(but to be clear, you are totally right. I think the explanation of variable size binary is decent (https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout), but if you don't know that string is just variable size binary where the bytes represent UTF8, that doesn't help much ..)

Also in favor of option 3 (I would personally just follow Arrow's layout of multiple buffers + children arrays)

Member Author

You are welcome to open an issue about that ;)

I know:) Maybe more thinking about a whole docs section with examples and mapping dataframe concepts to Arrow concepts.

Also in favor of option 3 (I would personally just follow Arrow's layout of multiple buffers + children arrays)

A worry may be the amount of complexity that a consumer is expected to parse. Arbitrarily many buffers and nested child arrays are hugely complex; right now there only seems to be a need for:

  • <=3 buffers (data, offsets, mask)
  • one child level for categoricals or strings.

So it might make sense to do it the same way as Arrow, but restrict it to that - everything else is either a producer error or an unsupported dtype?

Collaborator

If we take the 3-buffer approach for string columns, we actually don't need a child level for strings, as they would be represented directly by the 3 buffers.

If we don't take the 3-buffer approach for string columns and instead go with the children approach, then we'd need 2 child levels to handle a categorical column whose categories are a string column.

protocol/dataframe_protocol.py (resolved comment thread)
A column object, with only the methods and properties required by the
interchange protocol defined.

A column can contain one or more chunks. Each chunk can contain either one
Member

I find it a bit confusing that we use the same object for a collection of chunks as the chunk itself.

Eg what does get_data_buffer return for a Column with multiple chunks?

(in Arrow there is the concept of Array and ChunkedArray (and this represents a column of a DataFrame), where the second is just a vector of same-typed arrays)

Member Author

Eg what does get_data_buffer return for a Column with multiple chunks?

That should raise.

I find it a bit confusing that we use the same object for a collection of chunks as the chunk itself.

It's possible to have a separate object like ChunkedColumn, but then for symmetry it should also have ChunkedDataFrame probably.

That'll be an interesting discussion to have today. @devin-petersohn preferred fewer concepts (no Column). I think the two extremes here are:

  • Only DataFrame and Buffer
  • All of DataFrame, ChunkedDataFrame, Column, ChunkedColumn, and Buffer.

In the end fewer separate classes just means more methods/properties on classes that will raise when they're called in a situation that's not appropriate.
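A small sketch of what that means for a consumer under the single-class design (num_chunks/get_chunks/get_data_buffer are this prototype's names or assumptions, not a settled API):

def iter_single_chunks(col):
    if col.num_chunks() == 1:
        yield col                  # the column is its own (only) chunk
    else:
        # each chunk is again a Column instance with num_chunks() == 1
        yield from col.get_chunks()

def data_buffers(col):
    # get_data_buffer() is only valid on single-chunk columns (it raises otherwise)
    return [chunk.get_data_buffer() for chunk in iter_single_chunks(col)]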

Member (@jorisvandenbossche, Feb 11, 2021)

The DataFrame could always be chunked, so DataFrame -> ChunkedColumn (with potentially only 1 chunk), to avoid the additional class. But always having to first access the single chunk (if you don't work with multiple chunks) also has its disadvantages, of course.

"""


class Buffer:
Member Author

One thing that didn't occur to me in the discussion we just had: if we get rid of this Buffer class and change it to a plain dict, then we cannot attach __dlpack__ to it.

Member

It could still be attached to the Column chunk, in which case it would only work if the column is backed by a single buffer. But I suppose that is limiting the use of dlpack too much? (because that would mean there is no easy way to get a separate buffer as a (numpy) array)

# either make a copy or hold on to a reference of the column or
# buffer! (not done yet, this is pretty awful ...)
x = np.ctypeslib.as_array(data_pointer,
                          shape=(_buffer.bufsize // (bitwidth//8),))
Member Author

The ugliness here is the main problem with passing around pointers in a Python-only protocol. If anyone has better ideas, I'd love to hear them.

Collaborator

In __cuda_array_interface__ we've generally stated that holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well.

I'd argue this is a place where we should really align with the array interchange protocol though as the same problem is being solved there.

Member Author

In __cuda_array_interface__ we've generally stated that holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well.

Yes, that works and I've thought about it. The trouble is where to hold the reference. You really need one reference per buffer, rather than just storing a reference to the whole exchange dataframe object (buffers can end up elsewhere, outside the new pandas dataframe here). And given that a buffer just has a raw pointer plus a size, there's nothing to hold on to. I don't think there's a sane pure-Python solution.

__cuda_array_interface__ is directly attached to the object you need to hold on to, which is not the case for this Buffer.

I'd argue this is a place where we should really align with the array interchange protocol though as the same problem is being solved there.

Yep, for numerical data types the solution can simply be: hurry up with implementing __dlpack__, and the problem goes away. The dtypes that DLPack does not support are more of an issue.
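One possible pure-Python mitigation, mirroring the __cuda_array_interface__ convention above (a sketch, not this PR's solution; it assumes a Buffer with ptr and bufsize attributes): return the producing Buffer alongside the NumPy view and require the consumer to keep that reference alive.

import ctypes
import numpy as np

def buffer_to_ndarray(_buffer, bitwidth: int, ctype=ctypes.c_double):
    data_pointer = ctypes.cast(_buffer.ptr, ctypes.POINTER(ctype))
    x = np.ctypeslib.as_array(data_pointer,
                              shape=(_buffer.bufsize // (bitwidth // 8),))
    # The caller must keep ``_buffer`` referenced (e.g. stash it on the
    # resulting dataframe) for as long as ``x`` is in use.
    return x, _buffer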

@rgommers (Member Author) commented Mar 2, 2021

I added an implementation of the protocol for Pandas, to the point where it round-trips for columns with regular numpy dtypes. Hopefully that makes things more concrete.

Most of the API feels okay, just the buffer object doesn't.

protocol/dataframe_protocol.py (outdated; resolved comment thread)
2. A `Column` class. A *column* has a name and a single dtype. It can consist
of multiple *chunks*. A single chunk of a column (which may be the whole
column if ``num_chunks == 1``) is modeled as again a `Column` instance, and
contains 1 data *buffer* and (optionally) one *mask* for missing data.
Collaborator

Saying a Column contains "1 data buffer" is a bit ambiguous. For a Column of strings is the characters buffer or the offsets buffer the data buffer?

Member Author (@rgommers, Mar 3, 2021)

I didn't update this bit yet, sorry - we already decided that we need Arrow-like children to properly support strings and categoricals.

protocol/dataframe_protocol.py (outdated; resolved comment threads)
protocol/pandas_implementation.py (outdated; resolved comment thread)

protocol/pandas_implementation.py (outdated; resolved comment thread)
This shows the simple design doesn't fully work (see the FIXMEs
in the diff). Instead, the `children` concept is needed.
That way the categorical encoded data values can be returned
as a child Column rather than a Buffer, and hence there's the
necessary Column.dtype to interpret the buffer backing the column.
else:
raise NotImplementedError(f"Data type {self._col.dtype} not handled yet")

return buffer, dtype
Member Author

Pointing this out: the method to get at a buffer must also return the dtype needed to interpret that buffer. Because when you have multiple buffers on a column (we decided to have data, mask, and offsets buffers), you cannot just look at Column.dtype - that may be categorical or string, which doesn't help in interpreting an individual buffer.

This was slightly counterintuitive, but it was the thing necessary to make a categorical dtype work (see roundtrip test at end of this file).

Note: variable-length strings not included yet, that'll require children.
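A sketch of that point (the class and the exact tuple values are illustrative; kind 23 = categorical and kind 0 = signed integer follow the kind values earlier in this PR, and "c" is Arrow's format character for int8):

class CategoricalColumnSketch:
    def __init__(self, codes_buffer):
        self._codes_buffer = codes_buffer  # buffer holding the integer codes

    @property
    def dtype(self):
        # The column's logical dtype: (kind, bit_width, format_str, endianness)
        return (23, 8, "c", "=")

    def get_data_buffer(self):
        # Return the raw codes buffer together with the dtype describing *this
        # buffer* (signed 8-bit integers), not the column's logical dtype.
        return self._codes_buffer, (0, 8, "c", "=")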

Collaborator

This could be an argument to use a child column instead of a buffer in this situation. What is the data buffer for a Categorical Column, the codes?

Member Author

Yes, it seemed natural to make it the codes (nb. Pandas calls it codes, Arrow "index"), and have the mapping separately. However the issue is that there should be something on the column buffer, even with child columns. And the dtype only says "categorical", which is unrelated to the buffer itself.

I investigated how Arrow does it:

A data type is described using a format string. The format string only encodes information about the top-level type; for nested type, child types are described separately.
...
Dictionary-encoded types do not have a specific format string. Instead, the format string of the base array represents the dictionary index type, and the value type can be read from the dependent dictionary array

....
For dictionary-encoded arrays, the ArrowSchema.format string encodes the index type. The dictionary value type can be read from the ArrowSchema.dictionary structure.

The same holds for ArrowArray structure: while the parent structure points to the index data, the ArrowArray.dictionary points to the dictionary values array.

It works, but conceptually I dislike this approach. The format string is the equivalent to our Column.dtype. If that says int64 (or l format string for Arrow), then you cannot tell if it's just a column of integers or categoricals without looking at other attributes. That makes things unnecessarily complex. In this way, you need a separate dictionary property, which should be checked for being null before using the dtype:

struct ArrowSchema {
  // Array type description
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  // (release callback and private_data fields omitted)
};

This feels much more Pythonic:

>>> df = pd.DataFrame({"A": [1, 2, 5, 1]})
>>> df["B"] = df["A"].astype("category")
>>> df.dtypes
A       int64
B    category   <-- there is an actual categorical dtype here
dtype: object

So my impression is that it's best to have both child columns and to always return a buffer together with the dtype needed to parse it. That makes sense to me, because a buffer without any dtype info is completely useless - so why look up that dtype info somewhere else rather than passing it along together?

The only (minor) reason not to have the dtype on the buffer directly is to make it possible to have one buffer used in different columns with different interpretations.

Member Author

There's some discussion in gh-26 about categorical dtypes, which links to apache/arrow#4316 for the rationale for why Arrow does it this way (seems related to the limitations of the C++ API).

Member

Note: variable-length strings not included yet, that'll require children.

In Arrow variable-length strings actually use an offsets and data buffer, not a child array (https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout)
(the data buffer is not an array in typical sense like a numpy array)

However the issue is that there should be something on the column buffer, even with child columns. And the dtype only says "categorical", which is unrelated to the buffer itself.

I think that's the main difference between how pandas and arrow handle it: whether the codes/indices dtype (int8, int16, etc) is part of the column dtype or not.
If it's part of the column dtype, then you know how to interpret the buffer, and the buffer doesn't need its own dtype to describe it.

Member Author

If it's part of the column dtype

I can imagine it being part of the dtype, the question is how. Possibilities that come to mind are:

  • in the format string,
  • extending the dtype tuple with a 5th field that's normally None, except for compound types like this
  • making .dtype have some kind of nesting to describe both the logical dtype and the buffer dtypes.

@rgommers (Member Author) commented May 20, 2021

We had a lot of discussion in various issues and calls. To get an overview, I went through all the minutes and other content, and made a list of TODOs and questions for finishing up this PR (in addition to resolving the review comments on this PR):

  • Categorical dtypes: we should allow having null as a category; it should not have a specified meaning, it's just another category that should (e.g.) roundtrip correctly. See conversation in 8 Apr meeting.

  • Categorical dtypes: should they be a dtype in themselves, or should they be a part of the dtype tuple? Currently dtype is (kind, bitwidth, format_str, endianness), with categorical being a value of the kind enum. Is making a 5th element in the dtype, with that element being another dtype 4-tuple, thereby allowing for nesting, sensible?

  • Add a metadata attribute that can be used to store library-specific things. For example, Vaex should be able to store expressions for its virtual columns there.

  • Add a flag to throw an exception if the export cannot be zero-copy (e.g. for pandas this is possible due to the block manager, where rows are contiguous and columns are not - add a test for that). A rough sketch follows this list.

  • Make an update in the requirements doc: strided buffers do not need to be supported, people are comfortable with that needing to make a copy. Arrow does not support striding.

  • Signature of the from_dataframe protocol? See Signature for a standard from_dataframe constructor function #42 and meeting of 20 May.

  • What can be reused between implementations in different libraries, and can/should we have a reference implementation? --> question needs answering somewhere.

  • What is the ownership for buffers, who owns the memory? This should be clearly spelled out in the docs. An owner attribute is perhaps needed. See meeting minutes 4 March, How to consume a single buffer & connection to array interchange #39, and comments on this PR.

  • Add a string dtype, with variable-length strings implemented with the same scheme as Arrow uses (an offsets and a data buffer, see Add a prototype of the dataframe interchange protocol #38 (comment)).

  • Add more roundtrip tests.
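Rough sketch of the zero-copy-flag item above (the keyword name allow_copy is an assumption here, not something decided in this PR; nan_as_null is the flag already discussed earlier in the thread):

def export_zero_copy(df):
    # The producer would raise (e.g. RuntimeError) if it cannot export the
    # requested data without making a copy.
    return df.__dataframe__(nan_as_null=False, allow_copy=False)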

@rgommers (Member Author)
As discussed yesterday, we'll merge this in its current state, open a tracking issue (done in gh-46), and address the remaining features in separate follow-up PRs (several of which are already open). I went through the review comments again, and the remaining unresolved ones contain a lot of interesting discussion so I'll leave them unresolved.

@rgommers rgommers merged commit 52abf7a into main Jun 25, 2021
@rgommers rgommers deleted the protocol-impl branch June 25, 2021 20:19