[Python] Arrow PyCapsule Protocol: standard way to get the schema of a "data" (array or stream) object? #39689

Open
Tracked by #39195
jorisvandenbossche opened this issue Jan 18, 2024 · 9 comments


@jorisvandenbossche
Member

Follow-up discussion on the Arrow PyCapsule Protocol semantics added in #37797 (and overview issue promoting it: #39195). Current docs: https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html

This topic came up on the PR itself as well: I brought it up in #37797 (review), and we then mostly discussed it (eventually removing __arrow_c_schema__ from the array) in the thread at #37797 (comment).
Rephrasing my question from the PR discussion:

Should "data" objects also expose their schema through adding a __arrow_c_schema__? (in addition to __arrow_c_array/stream__, on the same object)

So in the merged implementation of the protocol in pyarrow itself, we cleanly separated this: the Array/ChunkedArray/RecordBatch/Table classes have __arrow_c_array/stream__, and the DataType/Field/Schema classes have __arrow_c_schema__.

But not all libraries have a clear concept of a "schema", or at least not as an accessible/dedicated Python object.

For example, consider the two cases for which I have an open PR to add the protocol. A pandas.DataFrame does have a .dtypes attribute, but that's not a custom object that could expose the schema protocol (it's just a plain Series with data types as the values) (pandas-dev/pandas#56587). And the interchange protocol DataFrame object only exposes column names; you need to access a column itself to get the dtype, which is then a plain Python tuple (so again not something the dunder could be added to, and it also doesn't live at the dataframe level) (data-apis/dataframe-api#342).

Personally, I think it would be useful to be able to inspect the schema of a "data" object before asking for the actual data. For pyarrow objects you could check the .type or .schema attributes and then call __arrow_c_schema__ on those, but that puts something library-specific in the middle, which is exactly what we want to avoid.

Summarizing the different arguments from our earlier thread about having __arrow_c_schema__ on an array/stream object:

Pro:

  • Library-agnostic way to get the schema of an Arrow(Array/Stream)Exportable object, before getting the actual data
  • Reasons you might want to do this:
    • To be able to inspect the schema without data conversions, because getting the data is not necessarily zero-copy (for libraries that are not exactly 1:1 aligned with the Arrow format)
    • If you want to pass a requested_schema, you first need to know the schema you would get, before you can create your desired schema to pass to __arrow_c_array/stream__
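To make the second point concrete, here is a toy sketch of that negotiation flow. ToyProducer, the string "schemas", and the returned tuples are hypothetical stand-ins for the real PyCapsule exchange; only the dunder names and the requested_schema parameter come from the protocol.

```python
# Toy sketch of schema negotiation. Plain strings stand in for real
# Arrow schema capsules; only the dunder names follow the protocol.

class ToyProducer:
    def __init__(self, schema):
        self._schema = schema  # e.g. "string_view"

    def __arrow_c_schema__(self):
        return self._schema

    def __arrow_c_array__(self, requested_schema=None):
        # A real producer would export (schema_capsule, array_capsule),
        # casting to requested_schema if it can honor the request.
        schema = requested_schema or self._schema
        return schema, ["a", "b"]

def consume(obj):
    # Step 1: inspect the schema without materializing any data.
    schema = obj.__arrow_c_schema__()
    # Step 2: if we don't understand the type, request a fallback.
    if schema == "string_view":
        return obj.__arrow_c_array__(requested_schema="string")
    return obj.__arrow_c_array__()
```

A real consumer would compare actual Arrow schemas (e.g. by parsing the schema capsule with a library) rather than strings; without __arrow_c_schema__ on the producer, step 1 is impossible and there is nothing to base the request on.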

Con:

  • Being able to pass an array or stream where a schema is expected is a bit too loose (Quote from Antoine); e.g. it is weird that passing an Array or RecordBatch to pa.schema(..) would work and return a schema (although sidenote from myself: if we want, we can still disallow this, and only accept objects that only have __arrow_c_schema__ in pa.schema(..))
  • Getting the schema of a stream may involve I/O and is a fallible operation, so I think that's more reason to separate them (Quote from David)

I think it would be nice if we can have some guidance for projects about what the best practice is.
(Right now I was planning to add __arrow_c_schema__ in the above-mentioned PRs, because those projects don't have a "schema" object, but ideally I can follow a recommendation, so that consumer libraries can base their usage on the expectation that a schema is or isn't available.)

cc @wjones127 @pitrou @lidavidm

and also cc @kylebarron and @WillAyd as I know you both have been experimenting with the capsule protocol and might have some user experience with it

@lidavidm
Member

I think it's reasonable to allow for objects to expose a schema, so long as it's clear what the expectations are (whether this is expected to be a simple accessor, or if it may block, perform I/O, raise exceptions, etc.)

@wjones127
Member

It seems like we need to differentiate between "object is a schema" and "object has a schema". One way to do that would be to create an alternative variant of __arrow_c_schema__. The other would be a convention: if an object only has __arrow_c_schema__, it's an "is a" relationship, while if it also has either __arrow_c_array__ or __arrow_c_stream__, it's a "has a" relationship. The latter is what Joris is suggesting above, and I think it's a fine idea.
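That convention can be checked with plain hasattr. A minimal sketch; the helper names here are mine, not part of any spec:

```python
def is_schema_object(obj):
    """'Is a' schema: exposes __arrow_c_schema__ but no data export."""
    return hasattr(obj, "__arrow_c_schema__") and not (
        hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__")
    )

def has_schema(obj):
    """'Has a' schema: a data object that also exposes its schema."""
    return hasattr(obj, "__arrow_c_schema__") and (
        hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__")
    )
```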

@kylebarron
Contributor

although sidenote from myself: if we want, we can still disallow this, and only accept objects that only have __arrow_c_schema__ in pa.schema(..)

I'm not sure how this would work. If you're a pyarrow consumer and want to create a requested_schema to pass into your producer, then wouldn't you need to call pa.schema() on your input object first, inspect that schema, then pass that schema back into your input's __arrow_c_array__? In that case you explicitly want pa.schema to work on any object that exports __arrow_c_stream__, even if it also exports __arrow_c_array__, right?

  • If you want to pass a requested_schema, you first need to know the schema you would get, before you can create your desired schema to pass to __arrow_c_array/stream__

I admit I was scratching my head for how the requested_schema was expected to work if you don't have a reliable way of first getting the schema of the data.

@jorisvandenbossche
Member Author

although sidenote from myself: if we want, we can still disallow this, and only accept objects that only have __arrow_c_schema__ in pa.schema(..)

I'm not sure how this would work. If you're a pyarrow consumer and want to create a requested_schema to pass into your producer, then wouldn't you need to call pa.schema() on your input object first, inspect that schema, then pass that schema back into your input's __arrow_c_array__? In that case you explicitly want pa.schema to work on any object that exports __arrow_c_stream__, even if it also exports __arrow_c_array__, right?

That's a good point. For that use case, we indeed want pa.schema(..) to work on an object implementing both the array/stream and schema protocol

@WillAyd
Contributor

WillAyd commented Jan 18, 2024

But not all libraries have a clear concept of a "schema", or at least not as an accessible/dedicated Python object.

For example, for two cases for which I have an open PR to add the protocol: a pandas.DataFrame does have a .dtypes attribute, but that's not a custom object that can expose the schema protocol

Maybe pandas should create a new class for its dtypes though that does expose this? Do you know what other libraries have this same limitation?

I would like to add, on the con about accepting a stream/array where a schema is expected, that if in the future we add dunders like __from_arrow_array_schema__ to third-party libraries, it would be much more straightforward for them to only have to deal with schema objects

@paleolimbot
Member

I just noticed that __arrow_c_schema__ was missing when working on #39985. This is an interesting read, but I do think that adding __arrow_c_schema__ will be beneficial.

One of the problems is that there are two reasons you might want to call obj.__arrow_c_schema__(), which have been discussed above: either obj is a data type-like object (e.g., a pyarrow.DataType, a nanoarrow.Schema, or a numpy.dtype), or obj has a data type (e.g., pyarrow.Array, pandas.Series, numpy.ndarray).

You might want to use the second version if you are a consumer that doesn't understand one of the new types that were just added to the spec and doesn't have the ability to cast. For example:

def split_lines(array):
  schema_src = array.__arrow_c_schema__()
  if nanoarrow.c_schema_view(schema_src).type == "string_view":
    schema_src, array_src = array.__arrow_c_array__(requested_schema=nanoarrow.string())
  else:
    schema_src, array_src = array.__arrow_c_array__()

  if nanoarrow.c_schema_view(schema_src).type != "string":
    raise TypeError("array must be string or string_view")

In that case, you really do need the ability to get the data type from the producer in the event you have to request something else. This type of negotiation is (in my view) far superior to maintaining a spec for keyword arguments to __arrow_c_array__() that would help simple consumers get Arrow data they understand (while freeing producers to take advantage of newer/higher-performance types without worrying about compatibility).

You might want to use the first one if you have a function like:

def cast(array, schema):
  schema_dst = schema.__arrow_c_schema__()
  schema_src, array_src = array.__arrow_c_array__()
  # ...do some casting stuff, maybe in C

Here, it would be very strange if you could pass a pyarrow.Array as the schema argument without an error. I think this can be disambiguated by checking hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__"):

def cast(array, schema):
  if hasattr(schema, "__arrow_c_array__") or hasattr(schema, "__arrow_c_stream__"):
    raise TypeError("Can't pass array-like object as schema")

  schema_dst = schema.__arrow_c_schema__()
  schema_src, array_src = array.__arrow_c_array__()
  # ...do some casting stuff, maybe in C

I will probably bake this into nanoarrow.c_schema(), perhaps using another argument or another function to enable the case where you do want the data type of something that is array-like.

@kylebarron
Contributor

It looks like there's consensus here? If I'm understanding it right:

Defining a "data object" as one that has either an __arrow_c_array__ or __arrow_c_stream__ method:

  • It is suggested that all data objects also have an __arrow_c_schema__ method to describe the existing schema of the Arrow data
  • If data objects support schema negotiation when exporting data (i.e. they respect requested_schema input), then they must have an __arrow_c_schema__ method
  • It is permissible for data objects to not have an __arrow_c_schema__ method, but then schema negotiation will not be possible.
  • An object with only __arrow_c_schema__ is a schema object
  • An object with __arrow_c_schema__ and one of the data methods has a schema object
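Taken together, these rules amount to a simple classification based on which dunders an object exposes. A sketch; the function name and labels are mine, for illustration only:

```python
def classify(obj):
    """Classify an object per the conventions summarized above."""
    is_data = hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__")
    exposes_schema = hasattr(obj, "__arrow_c_schema__")
    if is_data and exposes_schema:
        return "data object with schema"     # schema negotiation possible
    if is_data:
        return "data object without schema"  # negotiation not possible
    if exposes_schema:
        return "schema object"
    return "not Arrow PyCapsule exportable"
```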

I'm implementing the PyCapsule Interface for polars, and this came up because polars uses the newer view types internally and it was unclear what to export: pola-rs/polars#17676 (comment)

@kylebarron
Contributor

One other question about schema negotiation: it seems most helpful when Arrow adds new data types to the spec. I.e. some libraries might be on older Arrow format versions and not yet support string view and binary view types. In that case, the consumer might want to ask for the producer to cast to standard string and binary types. But this does rely on the consumer being able to interpret the producer's schema, right? A library that supports only an older Arrow format version, and thus that doesn't support view types, might error just in reading the schema produced by a newer version?

So this on its own doesn't solve cross-version Arrow format issues, right?

@paleolimbot
Copy link
Member

I think we had assumed that producers would always produce the most compatible output possible by default unless requested otherwise, although it is probably more natural for a producer to want to produce the output that involves the least amount of copying (which would lead to a situation like the one you described). We still might need a request flag (like PyBUF_SIMPLE), since it is reasonable that a producer would only want to export the exact layout they have (as opposed to doing extra work to make it potentially easier to consume).
