
Add a prototype of the dataframe interchange protocol #38

Merged: 13 commits merged into main from the protocol-impl branch on Jun 25, 2021

Conversation

@rgommers (Member) commented Feb 9, 2021

Related to requirements in gh-35.

TBD (to be discussed) comments and design decisions at the top of the file indicate topics for closer review/discussion.

@rgommers added the enhancement label Feb 9, 2021
Design decisions
----------------

**1. Use a separate column abstraction in addition to a dataframe interface.**
Member

If this is an interchange protocol, I do not see value in having a column abstraction. If we want a way to treat a dataframe as an array, we should make that explicit rather than doing it through a column interface. The Modin implementation does not support a "column" abstraction, but rather a dataframe with 1 column.

Member Author

Thanks Devin, this is a good discussion to have. As the note 7 lines below says, this is not a user-facing API. It seems a useful abstraction for eventually getting down to a description of memory. There are multiple reasonable ways to model a dataframe though. What this design does is use 5 concepts in 3 classes:

  1. A Buffer class. A "buffer" is a contiguous block of memory - this is the only thing that actually maps to a 1-D array in a sense that it could be converted to NumPy, CuPy, et al.
  2. A Column class. A "column" has a name and a single dtype. It can consist of multiple "chunks". A single chunk of a column (which may be the whole column if num_chunks == 1) is modeled as again a Column instance, and contains 1 data buffer and (optionally) one "mask" for missing data.
  3. A DataFrame class. A "data frame" is an ordered collection of columns. It has a single device, and all its rows are the same length. It can consist of multiple "chunks". A single chunk of a data frame is modeled as again a DataFrame instance.
  4. A "mask" concept. A mask of a single-chunk column is a buffer.
  5. A "chunk" concept. A chunk is a sub-dividing element that can be applied to a data frame or a column.

These I think are all the actual abstractions we use when talking about data frames. It's fairly explicit and easy to understand. An alternative approach would be to do something like what Arrow did - its "buffer" is the same as in this design, but it doesn't use any of the other concepts - its "array" (which is a little like Column here and not at all like the "array" we use in the array API standard) and "children" are ways of combining buffers rather than direct mappings to data frame concepts. Arrow's approach seems to be to use the minimal number of concepts needed to describe the memory layout of a data frame, and leave it up to its users to map that to their actual data frame concepts.

Taking into account our previous discussions of this topic as well as the early prototypes sketched out by @wesm and @kkraus14, I believe we did want something that looked more like a minimal dataframe API with explicitly named methods aimed at library developers. Given that, I think this is at least in the right direction. The main question regarding these concepts that I wasn't sure about is: should a column (within a single chunk) be allowed to have multiple buffers or not?
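To make that concrete, here is a rough sketch (not code from this PR) of the three classes and the concepts they carry; the method names are illustrative guesses based on this discussion (get_data_buffer and get_mask are mentioned further down), not a settled spec:

from typing import Optional, Sequence

class Buffer:
    """A contiguous block of memory: essentially a raw pointer plus a size in bytes."""
    ptr: int      # pointer to the start of the buffer
    bufsize: int  # buffer size in bytes

class Column:
    """Named, single-dtype data; may itself consist of multiple chunks."""
    name: str
    dtype: tuple  # e.g. (kind, bit_width, format_str, endianness)

    def num_chunks(self) -> int: ...
    def get_data_buffer(self) -> Buffer: ...     # the data itself
    def get_mask(self) -> Optional[Buffer]: ...  # missing-data mask, if any

class DataFrame:
    """An ordered collection of equal-length columns on a single device."""
    device: str

    def num_chunks(self) -> int: ...
    def column_names(self) -> Sequence[str]: ...
    def get_column_by_name(self, name: str) -> "Column": ...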

Member

As the note 7 lines below says, this is not a user-facing API

Yes, I understood this. A better wording of my statement would have been "I do not see value in having a column abstraction in an interchange protocol."

I guess I do not understand the rationales here and how that maps to the interchange protocol.

- This is how it works in R, Julia and Apache Arrow.
- Semantically most existing applications and users treat a column similar to a 1-D array
- We should be able to connect a column to the array data interchange mechanism(s)

Here you have given examples, and make a point about applications and users, along with the array API. This does not seem to be about the interchange protocol.

Maybe I misunderstood the purpose of this PR based on your comment. It seems you intend to extend this interchange protocol into the developer API we have been talking about. Is that correct?

I may have missed a week or two scrambling toward paper/talk deadlines 😄. Perhaps I would have pushed back (or at least forced a more directed discussion) on some things that were decided if I were there.

Member Author

I may have missed a week or two scrambling toward paper/talk deadlines 😄. Perhaps I would have pushed back (or at least forced a more directed discussion) on some things that were decided if I were there.

Now is a good time to push:) Nothing is set in stone.

I do not see value in having a column abstraction in an interchange protocol.

So what would you do with the methods that are now on Column but not on DataFrame? Attach them to DataFrame and make them raise an exception unless it's a single-column dataframe?

I guess I do not understand the rationales here and how that maps to the interchange protocol.

Good point. Those rationales were a summary of opinions on wesm/dataframe-protocol#1, I didn't put enough thought into whether they did or did not apply here.

Maybe I misunderstood the purpose of this PR based on your comment. It seems you intend to extend this interchange protocol into the developer API we have been talking about. Is that correct?

In our last meeting we discussed that. There's no such intention to extend currently I'd say, although it's a possibility. The mood was more "let's get this interchange protocol done, and then see based on experience and adoption what the right approach is for a more complete API".

Personally I think a developer API may have some similarities, but will not be a superset of this interchange protocol. For example, I'm not convinced it should have a column abstraction, and certainly not memory-oriented APIs like get_data_buffer or get_mask.


Though I am one of the people pushing to get things out, I think we need to be careful when adding new abstractions. I am for a minimally viable protocol in the first version, and I second that adding a Column and even a Buffer seems to follow the current state of affairs without critically challenging whether they are needed in the first version. This is especially dangerous as it has a tendency to become another design that tries to please everyone, especially Pandas' accustomed user base. IMHO we need to keep the balance between usefulness and adoption.

Member

Maybe we can call it Array instead of Column, then we don't have a Column abstraction any more ;-)

Member Author

GitHub does not have the range of emojis needed to accurately reflect my initial reaction to that :)

This is better for implementations that may rely on, for example, lazy
computation.

**3. No row names. If a library uses row names, use a regular column for them.**
Member

I disagree with this. If two libraries support row labels, it would generally be better for both if there were a way to convert row labels between them. In general, row labels are metadata, and moving metadata into the data can be costly and require copying. I think this places too much of a constraint on dataframe implementations and assumes that moving metadata into data is cheap, which is certainly not the case in Modin.

Member Author

I don't have a strong opinion on this, this was my interpretation of the linked discussion. Happy to change, let's wait for a few more opinions though - @kkraus14, @maartenbreddels, @TomAugspurger and @datapythonista all seemed to lean towards "no row labels".

Maybe my adding "use a regular column for them" isn't correct though - perhaps it should simply have been "is not supported".

Collaborator

I strongly voiced my opinion on row names: they should either be required and part of the spec, or entirely not part of the spec. Making them optional just ends up pushing everyone to implement them, which is what we've seen with everyone following the Pandas API.

In the early days of cuDF we tried to take the stance that we wouldn't implement row names (indexes). Eventually enough people pushed back that it was needed for Pandas compatibility, so we added a single-column Index, which then succumbed to the same argument and led to supporting MultiIndexes as well. Generally all of the arguments were always just about Pandas compatibility / minimizing changes to existing code, as opposed to better implementations or better code paths in any way.

+1 to just saying "is not supported"

Member

@kkraus14 I don't agree with this idea:

Generally all of the arguments were always just for Pandas compatibility / minimizing changes to existing code as opposed to better implementations or better code paths in any way.

All of the discussions against row labels ignore the users and the question of why they exist in the first place. Yes, it is easier to have an implementation without them, I agree. I would argue that it is not about the "legacy compatibility" of changing existing code. It has become a part of the mental model of (millions of) users, and row names were a part of S and R dataframes long before pandas. I have never been a fan of the idea that users should change their mental models of a system to fit what the library authors found easy to implement or optimize. Maybe this is not the place to discuss the validity of row labels as a concept, because it likely boils down to deep-seated opinions on user-centered designs vs performance-centered designs. Sometimes those views are not compatible.

I do not think making row labels optional will force everyone to implement them. This is a developer API, so if something is optional, the developers using it will understand that and handle things appropriately. If this were an end user API things might be different.

I have a concern that narrowing this down into the least common denominator will result in something that is not particularly useful. I mentioned this in the live meetings but throwing specific exceptions or warnings should absolutely be a part of the spec and can help with this question about optional implementation details.

If there is no support for labels then the only two options are to (1) implicitly move the metadata into the data, which as I mentioned might be expensive, or (2) drop the row labels altogether, which might contain relevant data. Without row labels, there is no operator to explicitly move row labels into the data (reset_index in pandas), so I don't see any other way to handle it.

Collaborator

All of the discussions against row labels ignore the users and the question of why they exist in the first place. Yes, it is easier to have an implementation without them, I agree. I would argue that it is not about the "legacy compatibility" of changing existing code. It has become a part of the mental model of (millions of) users, and row names were a part of S and R dataframes long before pandas. I have never been a fan of the idea that users should change their mental models of a system to fit what the library authors found easy to implement or optimize. Maybe this is not the place to discuss the validity of row labels as a concept, because it likely boils down to deep-seated opinions on user-centered designs vs performance-centered designs. Sometimes those views are not compatible.

95+% of the use cases we saw were people not actually using indexes in any meaningful way. They were constantly using reset_index to get back to RangeIndex objects and then using __getitem__ calls to access row numbers instead of using iloc. Similarly, with MultiIndex we almost always saw it appear just because an operation like groupby or join returns a MultiIndex by default; instead of changing the code to use an as_index=False option or something similar, they'd just add a reset_index immediately afterwards.

I do not think making row labels optional will force everyone to implement them. This is a developer API, so if something is optional, the developers using it will understand that and handle things appropriately. If this were an end user API things might be different.

If row labels are optional then presumably Pandas will support them and developers will end up continuing to use them unnecessarily and inefficiently, which will then push other libraries to implement them unnecessarily because they want to be able to integrate.

For what it's worth, on the cuDF side we now handle the indexing / multi-indexing cases somewhat nicely, but I am pushing to keep new DataFrame libraries from having to go through the same pain that we did, unless they explicitly want developers to use indexing features.

Member Author

Again, absent the argument is any discussion on the user's mental model.

I'm not sure that's the argument. My experience is the same as @kkraus14's with code I got from less experienced colleagues in the past: they just use the defaults (which can be as_index=True), then get to "this stuff is weird, how do I get rid of it", then hit upon .reset_index() via a StackOverflow post. There may be some advanced users who make good use of row labels, but I suspect it's a small-ish minority. For the rest, row labels do not seem to be part of their mental model.

Would be good to get some pandas maintainer opinions on this one.

Member

(I am a bit of a biased pandas maintainer, as I would like to see pandas get rid of default row indices; "default" to be clear - we would still have the functionality if you explicitly ask for it)

I think regarding data exchange, the index can be passed through just like another column? In which case it might be a question of whether we want to make it possible to indicate that certain column(s) can be used as the index (e.g. a describe_index() that returns the names of the index columns; someone not interested in that can just ignore that method).

Member

Revisiting this, passing the row labels as a separate buffer (or a "column buffer") would work for me as long as the labels would not need to be moved into the data.

Member Author

Cool, that's helpful @devin-petersohn. I think it could perhaps be combined with a metadata escape hatch, which would also be separate from the regular data, and which we discussed last time as being desirable to add.

Member Author

Update on this, since we had multiple discussions on this: in the end we settled on going with metadata (added in gh-43), and decided that libraries could store anything there (it's a Dict[str, Any] attribute) so indeed things like row labels or what the index column is can be stored there and it doesn't need to be moved into the set of regular columns. This seemed to work for everyone.
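A hedged illustration of that escape hatch (the key name "mylib.row_labels" and the class internals are hypothetical; only the Dict[str, Any] metadata attribute comes from gh-43):

class ProtocolDataFrame:
    def __init__(self, row_labels):
        self._row_labels = row_labels

    @property
    def metadata(self) -> dict:
        # Dict[str, Any]; library-prefixed keys help avoid collisions between producers.
        return {"mylib.row_labels": self._row_labels}

# Consumer side: use the labels if present, otherwise just ignore the key.
labels = ProtocolDataFrame(["a", "b", "c"]).metadata.get("mylib.row_labels")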

protocol/dataframe_protocol.py (resolved comment thread)

def __array_interface__(self):
"""
TBD: implement or not? Will work for all dtypes except bit masks.
Collaborator

Given this is just a buffer, we don't need to worry about bit masks at this level. That being said, what would the behavior of a GPU-memory buffer be here? I'd push that, since we're making a new protocol, we should make it device-agnostic from the beginning to prevent people from using things that aren't device-agnostic, but I'm obviously biased 😄

Member Author

Yes the whole design is definitely meant to be device-agnostic.

__array_interface__ itself doesn't have a device concept, but I think by design it's similar to DLPack and the shape, typestr, offset and version of https://numpy.org/doc/stable/reference/arrays.interface.html will work just fine on GPUs?

That said, that is a bit of an abuse of the protocol and the only reason to add it would be to save people that don't want to implement DLPack support some effort (__array_interface__ is the only pure-Python memory description protocol).

Collaborator

If we shoved a GPU pointer into __array_interface__ and then a library like numpy tried to read it, it would likely segfault or otherwise explode in a not-great way. That's the whole reason we implemented __cuda_array_interface__.

Member Author (@rgommers, Feb 10, 2021)

If we shoved a GPU pointer into __array_interface__ and then a library like numpy tried to read it it would likely segfault or otherwise explode in a not great way.

That'd be consumer library error. Since there's DataFrame.device, the consuming dataframe library must not be calling NumPy unless the device is CPU.

Collaborator

But this is an array protocol, and presumably there are plenty of array libraries that implement it without knowing to check DataFrame.device. If this were something like __column_interface__ then I could imagine it, but this is an existing protocol, so from my perspective we can't add extra pieces to it unless we push those pieces into the actual protocol.

Member Author

Unresolving:) That was my initial thinking @jorisvandenbossche. __array_interface__ should work, especially if device moves to the buffer level. Could be implemented as a property that raises unless device == cpu.

(or __array__, or __arrow_array__).

Those are a less good idea, because they rely on NumPy or Arrow being present, which we cannot force array libraries to depend on.
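As a sketch of that idea (mine, not the PR's code; the ptr/bufsize/device attribute names are assumptions based on this prototype), a Buffer could expose __array_interface__ only for CPU memory:

import numpy as np

class CPUOnlyBuffer:
    def __init__(self, ptr: int, bufsize: int, typestr: str, device: str = "cpu"):
        self.ptr = ptr          # raw pointer to the start of the buffer
        self.bufsize = bufsize  # size in bytes
        self.typestr = typestr  # e.g. "<f8" for little-endian float64
        self.device = device

    @property
    def __array_interface__(self):
        if self.device != "cpu":
            raise TypeError("__array_interface__ only describes CPU memory")
        itemsize = np.dtype(self.typestr).itemsize
        return {
            "shape": (self.bufsize // itemsize,),
            "typestr": self.typestr,
            "data": (self.ptr, False),  # (pointer, read-only flag)
            "version": 3,
        }

# np.asarray(buf) would then work for CPU buffers and raise otherwise.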

Member Author

that will be easier to interface with the interchange protocol than manually converting the buffers.

That doesn't sound pleasant. The idea would be not to do manual memory management, but implement __dlpack__ in NumPy (NEP will be ready soon), and then Pandas et al. can simply call np.from_dlpack(df) if they want to turn the buffer into a NumPy array.

Member

I think for me a big unclear question is how people are going to use this. For example, I assume I am a dataframe library that uses pyarrow under the hood, and I get some object using this protocol. With the current interface, I would still need to manually use the buffers to construct arrow arrays?

Those are a less good idea, because they rely on NumPy or Arrow being present, which we cannot force array libraries to depend on.

It only "forces" the consumer to have it, I think? (and if they already use it, it is not forcing anybody, but just convenient IMO)

It's true that you need a lazy import of numpy/arrow, of course. __array_interface__ is also fine for me, the end result is the same for the user (np.asarray(buffer) working).
We could discuss a similar __arrow_array_interface__ if that would be more acceptable than __arrow_array__

Member Author

It only "forces" the consumer to have it, I think?

No, the semantics of __array__() are: it must return a numpy array. The producer has to do the work here. What you're saying is true for __array_interface__, not __array__.

__arrow_array__ works like __array__: https://arrow.apache.org/docs/python/extending_types.html#controlling-conversion-to-pyarrow-array-with-the-arrow-array-protocol

We could discuss a similar __arrow_array_interface__ if

If it's at the buffer level, that would be identical to __array_interface__ - it's just a description of memory.

I think for me a big unclear question is how people are going to use this. For example, I assume I am a dataframe library that uses pyarrow under the hood, and I get some object using this protocol. With the current interface, I would still need to manually use the buffers to construct arrow arrays?

Yep that's a great question. Right now you need to convert data buffer by buffer. I imagine Pandas would have a from_dataframe function that would:

  1. iterate over columns
  2. per column, check dtype and number of buffers
  3. obtain column data:
    a. if a single buffer, do col = np.from_dlpack(buffer)
    b. if multiple buffers/chunks, then instantiate a numpy array of the correct dtype for the column, e.g. with np.empty, (or an Arrow array if you have dtypes that are Arrow-backed?), and copy the buffers into the array
  4. call a pd.DataFrame constructor with all the column arrays and column names/dtypes.

(add an iteration over chunks if num_chunks > 1)

There's other checks to be made:

  • raise if device isn't CPU
  • if there's bit masks and you want a numpy array, need to either figure out how to do that with np.frombuffer (better), or be lazy and call __dataframe__ again with nan_as_null = True.

It will be instructive to write this implementation, shouldn't be that hard for the simpler dtypes/cases. I could perhaps have a go at that over the weekend.
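A condensed sketch of the steps above (not the eventual pandas code; it assumes single-chunk columns, CPU-only data, a working np.from_dlpack, and method names like column_names/get_column_by_name/get_data_buffer that follow this prototype):

import numpy as np
import pandas as pd

def from_dataframe(df_obj) -> pd.DataFrame:
    df = df_obj.__dataframe__()
    if df.device != "cpu":                   # check: raise if device isn't CPU
        raise RuntimeError("only CPU dataframes are supported in this sketch")
    columns = {}
    for name in df.column_names():           # 1. iterate over columns
        col = df.get_column_by_name(name)
        if col.num_chunks() > 1:             # chunk handling left out here
            raise NotImplementedError("chunked columns not handled in this sketch")
        buffer = col.get_data_buffer()       # 2./3. single data buffer per column
        columns[name] = np.from_dlpack(buffer)  # zero-copy view via DLPack
    return pd.DataFrame(columns)             # 4. construct the pandas DataFrame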

Member Author (@rgommers, Feb 11, 2021)

The most interesting topic here is actually not "shall we have __array__ or not, __array_interface__ or not", but "how do you effectively iterate over all chunks/columns/buffers with as little data copying as possible".

protocol/dataframe_protocol.py (outdated; resolved comment thread)
Comment on lines 150 to 158
Kind :

- 0 : signed integer
- 1 : unsigned integer
- 2 : IEEE floating point
- 20 : boolean
- 21 : string (UTF-8)
- 22 : datetime
- 23 : categorical
Collaborator

Why do we need both Kind and the format string? Seems like duplicate information, except Kind is less flexible.

Member Author

Format strings are quite annoying to parse, so for anything besides datetime I think library authors would have a preference for using kind + bit_width.
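As a sketch of the kind + bit_width route (the values are taken from the diff above; the helper below is hypothetical and only covers the simple kinds), a consumer could branch on an enum instead of parsing format strings:

import enum

class DtypeKind(enum.IntEnum):
    INT = 0           # signed integer
    UINT = 1          # unsigned integer
    FLOAT = 2         # IEEE floating point
    BOOL = 20         # boolean
    STRING = 21       # string (UTF-8)
    DATETIME = 22     # datetime
    CATEGORICAL = 23  # categorical

def numpy_typestr(kind: DtypeKind, bit_width: int) -> str:
    # Map the simple kinds to a NumPy type string, e.g. (INT, 64) -> "i8".
    prefix = {DtypeKind.INT: "i", DtypeKind.UINT: "u",
              DtypeKind.FLOAT: "f", DtypeKind.BOOL: "b"}[kind]
    return f"{prefix}{bit_width // 8}"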

Collaborator

Should the answer then be for us to expand Kind in order to handle all of the cases that the format string handles? The way I read this spec, everyone will have to produce the format strings as of now, so they'll have to deal with the annoyance anyway.

Member Author

True, however producing a format string is much easier than parsing one - I think it's a minor annoyance.

I don't see how we'd cover all the datetime formats including time zones without a format string.

Member Author

Maybe use 16 kind integers for the 16 formats in https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings, then a separate string field for timezone?

Collaborator

Do we care about supporting things like Decimal or Fixed Size Lists in the future, where we need to specify some semi-arbitrary attributes alongside the dtype? If we do, then we should make sure that whatever specification we decide on here can eventually support those gracefully.

Member

I think we should at least keep the possibility open to support more data types like decimal and nested types (which probably means we need format strings?)

Comment on lines +203 to +204
- "mapping" : dict, Python-level only (e.g. ``{int: str}``).
None if not a dictionary-style categorical.
Collaborator

Is this the actual dictionary mapping? I.e. if I had a column of ['gold', 'bronze', 'silver', 'gold', 'silver', 'bronze'] with a dictionary of ['gold' < 'silver' < 'bronze'] would this mapping key be: {0: 'gold', 1: 'silver', 2: 'bronze'}?

Member Author

Yes indeed.

Collaborator

I'm a HUGE -1 on this. We'd want our dictionary to just be another column which is on the device for example.

Member Author (@rgommers, Feb 10, 2021)

How do you put a dictionary in a single column? It'd be a dataframe of two columns, index and value?

Collaborator

Take a categorical column of ['gold', 'bronze', 'silver', null, 'bronze', 'silver', 'gold'] with categories of ['gold' < 'silver' < 'bronze'] using Arrow formatting:

categorical column: {
    mask_buffer: [119], # 01110111 in binary
    data_buffer: [0, 2, 1, 127, 2, 1, 0], # the 127 value in here is undefined since it's null
    children: [
        string column: {
            mask_buffer: None,
            offsets_buffer: [0, 4, 10, 16],
            data_buffer: [103, 111, 108, 100, 115, 105, 108, 118, 101, 114, 98, 114, 111, 110, 122, 101]
        }
    ]
}

In the case of cuDF all of these buffers are on the GPU, and getting a host mapping of {0: 'gold', 1: 'silver', 2: 'bronze'} would be extremely expensive for any sufficiently large dictionary.

Member Author

A note on the above example: children is actually named dictionary in ArrowSchema (it is a separate struct in the schema rather than being one of the child arrays; the history for that decision is a little unclear).

protocol/dataframe_protocol.py (outdated; resolved comment threads)
Comment on lines 264 to 270
# # NOTE: not needed unless one considers nested dtypes
# def get_children(self) -> Iterable[Column]:
# """
# Children columns underneath the column, each object in this iterator
# must adhere to the column specification
# """
# pass
Collaborator

How are we handling strings without this?

Member Author

I may be missing some context for this question. NumPy-like fixed length strings would be handled as a single buffer; for variable-length strings it's unclear to me how you would handle them even with get_children.

Collaborator (@kkraus14, Feb 10, 2021)

For variable-length strings the way Arrow handles them is by having a Column of bytes (int8 / uint8 for example) and then a Column of integer offsets into the bytes. So for example, say you had a column of strings of: ['joe', null, 'bob', ''] then under the hood the column would look like:

string column: {
    data_buffer: None,
    mask_buffer: [13], # 00001101 in binary
    children: [
        characters int8 column: {
            data_buffer: [106, 111, 101, 98, 111, 98],
            mask_buffer: None
        },
        offsets int32 column: {
            data_buffer: [0, 3, 3, 6, 6],
            mask_buffer: None
        }
    ]
}
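To make the offsets/validity mechanics concrete, here is a small sketch that decodes the example above back into Python strings (the buffer values are copied from the example; in the real protocol these would be raw memory rather than Python lists):

data_buffer = bytes([106, 111, 101, 98, 111, 98])  # b"joebob"
offsets = [0, 3, 3, 6, 6]                          # n + 1 offsets for n strings
validity = 0b00001101                              # bit i set => row i is valid

strings = []
for i in range(len(offsets) - 1):
    if validity & (1 << i):
        strings.append(data_buffer[offsets[i]:offsets[i + 1]].decode("utf-8"))
    else:
        strings.append(None)

print(strings)  # ['joe', None, 'bob', '']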

Collaborator

This is also how we handle strings in cuDF and I suspect that generally people will want variable length strings as opposed to fixed length strings.

Member Author

For variable-length strings the way Arrow handles them is by having a Column of bytes (int8 / uint8 for example)

It has multiple ways of doing this? There's exactly zero specification of this in https://arrow.apache.org/docs/format/CDataInterface.html, and I can't find a separate description either. All I see is it uses format strings 'u' and 'U' for regular and large utf-8 strings.

Would you expect all libraries to handle multiple flavors of variable-length strings? So if the type specifiers say "utf-8 string", then there's some self-describing format for how that's laid out with multiple child buffers, each of which can have a different dtype?

Member Author

a specialization on Variable Binary https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout.

:( It's like they want people to not understand them. The writing is so terse - there's zero connection to strings in the docs there.

So Arrow doesn't necessarily use child Columns in the case of Strings, but 3 buffers: validity, offsets, and data.

Thanks for the explanation. I understand now. Okay, that indeed requires a change from what I have in this PR currently. There's three options:

  1. add children columns
  2. add multiple buffers per column (actually we have data and mask, so just adding offsets would be enough).
  3. do both (1) and (2) - perhaps handy to future-proof it for nested dtypes.

Collaborator

I'd be in favor of option 3 😄

Member

:( It's like they want people to not understand them. The writing is so terse - there's zero connection to strings in the docs there.

You are welcome to open an issue about that ;)
(but to be clear, you are totally right. I think the explanation of variable size binary is decent (https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout), but if you don't know that string is just variable size binary where the bytes represent UTF8, that doesn't help much ..)

Also in favor of option 3 (I would personally just follow Arrow's layout of multiple buffers + children arrays)

Member Author

You are welcome to open an issue about that ;)

I know:) Maybe more thinking about a whole docs section with examples and mapping dataframe concepts to Arrow concepts.

Also in favor of option 3 (I would personally just follow Arrow's layout of multiple buffers + children arrays)

A worry may be the amount of complexity that a consumer is expected to parse. Arbitrarily many buffers and nested child arrays are hugely complex; right now there only seems to be a need for:

  • <=3 buffers (data, offsets, mask)
  • one child level for categoricals or strings.

So it might make sense to do it the same way as Arrow, but restrict it to that - everything else is either a producer error or an unsupported dtype?

Collaborator

If we take the 3-buffer approach for string columns, we actually don't need a child level for strings, as they would be represented directly by the 3 buffers.

If we don't take the 3-buffer approach for string columns and instead go with the children approach, then we'd need 2 child levels to handle a categorical column whose categories are a string column.

protocol/dataframe_protocol.py (resolved comment thread)
A column object, with only the methods and properties required by the
interchange protocol defined.

A column can contain one or more chunks. Each chunk can contain either one
Member

I find it a bit confusing that we use the same object for a collection of chunks as the chunk itself.

Eg what does get_data_buffer return for a Column with multiple chunks?

(in Arrow there is the concept of Array and ChunkedArray (and this represents a column of a DataFrame), where the second is just a vector of same-typed arrays)

Member Author

Eg what does get_data_buffer return for a Column with multiple chunks?

That should raise.

I find it a bit confusing that we use the same object for a collection of chunks as the chunk itself.

It's possible to have a separate object like ChunkedColumn, but then for symmetry it should also have ChunkedDataFrame probably.

That'll be an interesting discussion to have today. @devin-petersohn preferred fewer concepts (no Column). I think the two extremes here are:

  • Only DataFrame and Buffer
  • All of DataFrame, ChunkedDataFrame, Column, ChunkedColumn, and Buffer.

In the end fewer separate classes just means more methods/properties on classes that will raise when they're called in a situation that's not appropriate.
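A small sketch of what that means for a consumer under the single-class design (num_chunks/get_chunks/get_data_buffer are this prototype's names or assumptions, not a settled API):

def iter_single_chunks(col):
    if col.num_chunks() == 1:
        yield col                  # the column is its own (only) chunk
    else:
        # each chunk is again a Column instance with num_chunks() == 1
        yield from col.get_chunks()

def data_buffers(col):
    # get_data_buffer() is only valid on single-chunk columns (it raises otherwise)
    return [chunk.get_data_buffer() for chunk in iter_single_chunks(col)]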

Member (@jorisvandenbossche, Feb 11, 2021)

The DataFrame could always be chunked, so DataFrame -> ChunkedColumn (with potentially only 1 chunk), to avoid the additional class. But always having to first access the single chunk (if you don't work with multiple chunks) also has its disadvantages, of course.

"""


class Buffer:
Member Author

One thing that didn't occur to me in the discussion we just had: if we get rid of this Buffer class and change it to a plain dict, then we cannot attach __dlpack__ to it.

Member

It could still be attached to the Column chunk, in which case it would only work if the column is backed by a single buffer. But I suppose that is limiting the use of dlpack too much? (because that would mean there is no easy way to get a separate buffer as a (numpy) array)

# either make a copy or hold on to a reference of the column or
# buffer! (not done yet, this is pretty awful ...)
x = np.ctypeslib.as_array(data_pointer,
                          shape=(_buffer.bufsize // (bitwidth//8),))
Member Author

The ugliness here is the main problem with passing around pointers in a Python-only protocol. If anyone has better ideas, I'd love to hear them.

Collaborator

In __cuda_array_interface__ we've generally stated that holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well.

I'd argue this is a place where we should really align with the array interchange protocol though as the same problem is being solved there.

Member Author

In __cuda_array_interface__ we've generally stated that holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well.

Yes, that works and I've thought about it. The trouble is where to hold the reference. You really need one reference per buffer, rather than just storing a reference to the whole exchange dataframe object (buffers can end up elsewhere, outside the new pandas dataframe here). And given that a buffer just has a raw pointer plus a size, there's nothing to hold on to. I don't think there's a sane pure-Python solution.

__cuda_array_interface__ is directly attached to the object you need to hold on to, which is not the case for this Buffer.

I'd argue this is a place where we should really align with the array interchange protocol though as the same problem is being solved there.

Yep, for numerical data types the solution can simply be: hurry up with implementing __dlpack__, and the problem goes away. The dtypes that DLPack does not support are more of an issue.
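One possible pure-Python mitigation, mirroring the __cuda_array_interface__ convention above (a sketch, not this PR's solution; it assumes a Buffer with ptr and bufsize attributes): return the producing Buffer alongside the NumPy view and require the consumer to keep that reference alive.

import ctypes
import numpy as np

def buffer_to_ndarray(_buffer, bitwidth: int, ctype=ctypes.c_double):
    data_pointer = ctypes.cast(_buffer.ptr, ctypes.POINTER(ctype))
    x = np.ctypeslib.as_array(data_pointer,
                              shape=(_buffer.bufsize // (bitwidth // 8),))
    # The caller must keep ``_buffer`` referenced (e.g. stash it on the
    # resulting dataframe) for as long as ``x`` is in use.
    return x, _buffer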

@rgommers (Member Author) commented Mar 2, 2021

I added an implementation of the protocol for Pandas, to the point where it round-trips for columns with regular numpy dtypes. Hopefully that makes things more concrete.

Most of the API feels okay, just the buffer object doesn't.

protocol/dataframe_protocol.py (outdated; resolved comment thread)
2. A `Column` class. A *column* has a name and a single dtype. It can consist
of multiple *chunks*. A single chunk of a column (which may be the whole
column if ``num_chunks == 1``) is modeled as again a `Column` instance, and
contains 1 data *buffer* and (optionally) one *mask* for missing data.
Collaborator

Saying a Column contains "1 data buffer" is a bit ambiguous. For a Column of strings is the characters buffer or the offsets buffer the data buffer?

Member Author (@rgommers, Mar 3, 2021)

I didn't update this bit yet, sorry - we already decided that we need Arrow-like children to properly support strings and categoricals.

protocol/dataframe_protocol.py (outdated; resolved comment threads)
protocol/pandas_implementation.py (outdated; resolved comment thread)

protocol/pandas_implementation.py (outdated; resolved comment thread)
This shows the simple design doesn't fully work (see the FIXMEs
in the diff). Instead, the `children` concept is needed.
That way the categorical encoded data values can be returned
as a child Column rather than a Buffer, and hence there's the
necessary Column.dtype to interpret the buffer backing the column.
else:
raise NotImplementedError(f"Data type {self._col.dtype} not handled yet")

return buffer, dtype
Member Author

Pointing this out: the method to get at a buffer must also return the dtype needed to interpret that buffer. Because when you have multiple buffers on a column (we decided to have data, mask, and offsets buffers), you cannot just look at Column.dtype - that may be categorical or string, which doesn't help in interpreting an individual buffer.

This was slightly counterintuitive, but it was the thing necessary to make a categorical dtype work (see roundtrip test at end of this file).

Note: variable-length strings not included yet, that'll require children.
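A sketch of that point (the class and the exact tuple values are illustrative; kind 23 = categorical and kind 0 = signed integer follow the kind values earlier in this PR, and "c" is Arrow's format character for int8):

class CategoricalColumnSketch:
    def __init__(self, codes_buffer):
        self._codes_buffer = codes_buffer  # buffer holding the integer codes

    @property
    def dtype(self):
        # The column's logical dtype: (kind, bit_width, format_str, endianness)
        return (23, 8, "c", "=")

    def get_data_buffer(self):
        # Return the raw codes buffer together with the dtype describing *this
        # buffer* (signed 8-bit integers), not the column's logical dtype.
        return self._codes_buffer, (0, 8, "c", "=")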

Collaborator

This could be an argument to use a child column instead of a buffer in this situation. What is the data buffer for a Categorical Column, the codes?

Member Author

Yes, it seemed natural to make it the codes (nb. Pandas calls it codes, Arrow "index"), and have the mapping separately. However the issue is that there should be something on the column buffer, even with child columns. And the dtype only says "categorical", which is unrelated to the buffer itself.

I investigated how Arrow does it:

A data type is described using a format string. The format string only encodes information about the top-level type; for nested type, child types are described separately.
...
Dictionary-encoded types do not have a specific format string. Instead, the format string of the base array represents the dictionary index type, and the value type can be read from the dependent dictionary array

....
For dictionary-encoded arrays, the ArrowSchema.format string encodes the index type. The dictionary value type can be read from the ArrowSchema.dictionary structure.

The same holds for ArrowArray structure: while the parent structure points to the index data, the ArrowArray.dictionary points to the dictionary values array.

It works, but conceptually I dislike this approach. The format string is the equivalent to our Column.dtype. If that says int64 (or l format string for Arrow), then you cannot tell if it's just a column of integers or categoricals without looking at other attributes. That makes things unnecessarily complex. In this way, you need a separate dictionary property, which should be checked for being null before using the dtype:

struct ArrowSchema {
  // Array type description
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  // (release callback and private_data fields omitted)
};

This feels much more Pythonic:

>>> df = pd.DataFrame({"A": [1, 2, 5, 1]})
>>> df["B"] = df["A"].astype("category")
>>> df.dtypes
A       int64
B    category   <-- there is an actual categorical dtype here
dtype: object

So my impression is that it's best to have both child columns and to always return a buffer together with the dtype needed to parse it. That makes sense to me, because a buffer without any dtype info is completely useless - so why look up that dtype info somewhere else rather than passing it along together?

The only (minor) reason not to have the dtype on the buffer directly is to make it possible to have one buffer used in different columns with different interpretations.

Member Author

There's some discussion in gh-26 about categorical dtypes, which links to apache/arrow#4316 for the rationale for why Arrow does it this way (seems related to the limitations of the C++ API).

Member

Note: variable-length strings not included yet, that'll require children.

In Arrow variable-length strings actually use an offsets and data buffer, not a child array (https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout)
(the data buffer is not an array in typical sense like a numpy array)

However the issue is that there should be something on the column buffer, even with child columns. And the dtype only says "categorical", which is unrelated to the buffer itself.

I think that's the main difference between how pandas and arrow handle it: whether the codes/indices dtype (int8, int16, etc) is part of the column dtype or not.
If it's part of the column dtype, then you know how to interpret the buffer, and the buffer doesn't need its own dtype to describe it.

Member Author

If it's part of the column dtype

I can imagine it being part of the dtype, the question is how. Possibilities that come to mind are:

  • in the format string,
  • extending the dtype tuple with a 5th field that's normally None, except for compound types like this
  • making .dtype have some kind of nesting to describe both the logical dtype and the buffer dtypes.

@rgommers (Member Author) commented May 20, 2021

We had a lot of discussion in various issues and calls. To get an overview, I went through all the minutes and other content, and made a list of TODOs and questions for finishing up this PR (in addition to resolving the review comments on this PR):

  • Categorical dtypes: we should allow having null as a category; it should not have a specified meaning, it's just another category that should (e.g.) roundtrip correctly. See conversation in 8 Apr meeting.

  • Categorical dtypes: should they be a dtype in themselves, or should they be a part of the dtype tuple? Currently dtype is (kind, bitwidth, format_str, endianness), with categorical being a value of the kind enum. Is making a 5th element in the dtype, with that element being another dtype 4-tuple, thereby allowing for nesting, sensible?

  • Add a metadata attribute that can be used to store library-specific things. For example, Vaex should be able to store expressions for its virtual columns there.

  • Add a flag to throw an exception if the export cannot be zero-copy (e.g. for pandas this is possible due to the block manager, where rows are contiguous and columns are not - add a test for that). A rough sketch follows this list.

  • Make an update in the requirements doc: strided buffers do not need to be supported, people are comfortable with that needing to make a copy. Arrow does not support striding.

  • Signature of the from_dataframe protocol? See Signature for a standard from_dataframe constructor function #42 and meeting of 20 May.

  • What can be reused between implementations in different libraries, and can/should we have a reference implementation? --> question needs answering somewhere.

  • What is the ownership for buffers, who owns the memory? This should be clearly spelled out in the docs. An owner attribute is perhaps needed. See meeting minutes 4 March, How to consume a single buffer & connection to array interchange #39, and comments on this PR.

  • Add a string dtype, with variable-length strings implemented with the same scheme as Arrow uses (an offsets and a data buffer, see Add a prototype of the dataframe interchange protocol #38 (comment)).

  • Add more roundtrip tests.
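Rough sketch of the zero-copy-flag item above (the keyword name allow_copy is an assumption here, not something decided in this PR; nan_as_null is the flag already discussed earlier in the thread):

def export_zero_copy(df):
    # The producer would raise (e.g. RuntimeError) if it cannot export the
    # requested data without making a copy.
    return df.__dataframe__(nan_as_null=False, allow_copy=False)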

@rgommers (Member Author)
As discussed yesterday, we'll merge this in its current state, open a tracking issue (done in gh-46), and address the remaining features in separate follow-up PRs (several of which are already open). I went through the review comments again, and the remaining unresolved ones contain a lot of interesting discussion so I'll leave them unresolved.

@rgommers rgommers merged commit 52abf7a into main Jun 25, 2021
@rgommers rgommers deleted the protocol-impl branch June 25, 2021 20:19