From b465e39b75101c3e0e54d0fd1dae977b448e7ce6 Mon Sep 17 00:00:00 2001
From: Ralf Gommers
Date: Wed, 6 Jan 2021 13:36:56 +0100
Subject: [PATCH] Add details on implementation options

---
 protocol/dataframe_protocol_summary.md | 31 ++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/protocol/dataframe_protocol_summary.md b/protocol/dataframe_protocol_summary.md
index 6b8d5a45..30b6fc7d 100644
--- a/protocol/dataframe_protocol_summary.md
+++ b/protocol/dataframe_protocol_summary.md
@@ -255,8 +255,39 @@ computational graph approach like Dask uses, etc.)._
 
 ## Possible direction for implementation
 
+### Rough prototypes
+
 The `cuDFDataFrame`, `cuDFColumn` and `cuDFBuffer` sketched out by @kkraus14
 [here](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-685123386)
 seems to be in the right direction.
 
+[This prototype](https://github.com/wesm/dataframe-protocol/pull/1) by Wes
+McKinney was the first attempt, and has some useful features.
+
 TODO: work this out after making sure we're all on the same page regarding requirements.
+
+
+### Relevant existing protocols
+
+Here are the four most relevant existing protocols, and what requirements they support:
+
+| *supports*          | buffer protocol | `__array_interface__` | DLPack | Arrow C Data Interface |
+|---------------------|:---------------:|:---------------------:|:------:|:----------------------:|
+| Python API          |                 |           Y           |   Y    |                        |
+| C API               |        Y        |           Y           |   Y    |           Y            |
+| arrays              |        Y        |           Y           |   Y    |           Y            |
+| dataframes          |                 |                       |        |                        |
+| chunking            |                 |                       |        |                        |
+| devices             |                 |                       |   Y    |                        |
+| bool/int/uint/float |        Y        |           Y           |   Y    |           Y            |
+| missing data        |       (1)       |          (2)          |  (3)   |           Y            |
+| string dtype        |       (3)       |          (3)          |        |           Y            |
+| datetime dtypes     |                 |          (4)          |        |           Y            |
+| categoricals        |       (5)       |          (5)          |  (6)   |          (5)           |
+
+1. Can be done only via separate masks of boolean arrays.
+2. `__array_interface__` has a `mask` attribute, which is a separate boolean array also implementing the `__array_interface__` protocol.
+3. Only fixed-length strings as sequence of char or unicode.
+4. Only NumPy datetime and timedelta, which are limited compared to what the Arrow format offers.
+5. No explicit support, however categoricals can be mapped to either integers or strings.
+6. No explicit support, categoricals can only be mapped to integers.
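
Footnote 1 in the table above says that the buffer protocol can only express missing data through a separate mask. The following is a minimal sketch of that idea using only the Python standard library. The names `values`, `valid` and `to_optional_list` are hypothetical, chosen for illustration, and are not part of any of the protocols listed. The mask convention used here (1 = present, 0 = missing) is likewise an assumption.

```python
from array import array

# Hypothetical example data: a float64 buffer plus a separate byte mask.
# The buffer protocol itself carries no notion of "missing", so the
# consumer has to be told about the mask out of band.
values = array("d", [1.5, 0.0, 3.0])  # 0.0 is only a placeholder value
valid = bytearray([1, 0, 1])          # assumed convention: 1 = present, 0 = missing


def to_optional_list(data, mask):
    """Combine a data buffer and a byte mask into a list, using None for missing."""
    data_view = memoryview(data)  # works for any object exporting a buffer
    mask_view = memoryview(mask)
    if len(data_view) != len(mask_view):
        raise ValueError("data and mask must have the same number of elements")
    return [x if m else None for x, m in zip(data_view, mask_view)]


result = to_optional_list(values, valid)
print(memoryview(values).format, memoryview(values).itemsize)
print(result)
```

The data buffer and the mask travel as two unrelated objects here, which is the limitation the footnote points at: nothing in the buffer protocol ties them together, so producer and consumer must agree on the pairing and on the mask convention separately.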