
How to expose API to downstream libraries? #16

Closed
saulshanabrook opened this issue Aug 10, 2020 · 13 comments

@saulshanabrook
Contributor

saulshanabrook commented Aug 10, 2020

I wanted to open a discussion on how the Array API (and potentially the dataframe API) will be exposed to downstream libraries.

For example, let's say I am the author of scikit-learn. How do I get access to an "Array compatible API"? Or let's say I am a downstream user, using scikit-learn in a notebook. How can I tell it to use Tensorflow over NumPy?

Options

I present three options here, but I would appreciate any suggestions on further ideas:

Manual

The default option is the current status quo where there is no standard way to get access to some array conformant API backend.

Different downstream libraries, like scikit-learn, could introduce their own mechanisms, like a backend kwarg to functions, if they wanted to support different backends.

Local Dispatch

Another approach would be to provide access to the related module from particular instances of the array objects; this is the approach taken by NEP 37.

In this case, scikit-learn would either call some x.__array_module__() method on its inputs, or we would provide an array-api Python package with a helper function like get_array_module(x), similar to the NEP.

There is an open PR in scikit-learn (scikit-learn/scikit-learn#16574) to add support for NEP 37.
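To make the local-dispatch flow concrete, here is a minimal sketch of a NEP 37-style get_array_module helper. DuckArray and duck_math are hypothetical stand-ins for a real array library and its API module; they are not part of any existing package.

```python
import types

# Hypothetical stand-ins for a real array library and its API module.
duck_math = types.SimpleNamespace(name="duck_math")

class DuckArray:
    """A toy array type that advertises its API module (NEP 37 style)."""
    def __array_module__(self, arg_types):
        # A real implementation would inspect arg_types and return
        # NotImplemented for type combinations it cannot handle.
        return duck_math

def get_array_module(*arrays):
    """Helper similar to the get_array_module(x) proposed in NEP 37."""
    arg_types = tuple(type(a) for a in arrays)
    for a in arrays:
        method = getattr(a, "__array_module__", None)
        if method is not None:
            result = method(arg_types)
            if result is not NotImplemented:
                return result
    raise TypeError("no argument provides __array_module__")

xp = get_array_module(DuckArray())
```

A library like scikit-learn would call get_array_module on its array arguments at the top of a function and then use the returned namespace throughout, never importing numpy directly.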

Global Dispatch

Instead of requiring an object to inspect, we could instead rely on a global context to store the "active array api" and provide ways of getting and setting this. Some form of this is implemented by scipy, with their scipy.fft.set_backend, which uses uarray.

This would probably be heavier weight than we need, but it does illustrate the general concept. If we implemented this, we could use Context Variables, as Python's built-in decimal module does, i.e. something like this:

from array_api import set_backend, get_backend

import cupy

def some_fn():
    xp = get_backend()
    return xp.arange(10)

with set_backend(cupy):
    some_fn()

The advantage of global dispatch is that you don't need to pass in an instance of some custom class to set the backend.
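For illustration, here is a self-contained sketch of how such a set_backend/get_backend pair could be built on Context Variables, in the style of the decimal module. The array_api package does not exist; fake_cupy stands in for a real backend so the sketch runs on its own.

```python
import contextlib
import contextvars
import types

_backend = contextvars.ContextVar("array_backend", default=None)

@contextlib.contextmanager
def set_backend(module):
    """Make `module` the active backend for the extent of the with-block."""
    token = _backend.set(module)
    try:
        yield module
    finally:
        _backend.reset(token)

def get_backend():
    """Return the active backend, failing loudly if none is set."""
    module = _backend.get()
    if module is None:
        raise RuntimeError("no array backend is active")
    return module

# Stand-in for a real backend (e.g. cupy) so the sketch is self-contained.
fake_cupy = types.SimpleNamespace(arange=lambda n: list(range(n)))

def some_fn():
    xp = get_backend()
    return xp.arange(10)

with set_backend(fake_cupy):
    result = some_fn()
```

Because ContextVar is scoped per-context, this also behaves sensibly with threads and async code, which a plain module-level global would not.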

Static Typing

This is slightly tangential, but one question that comes up for me is how we could properly statically type options 2 or 3. It seems like what we need is a typing.Protocol but for modules. I raised this as a discussion point on the typing-sig mailing list.

@rgommers
Member

Thanks for summarizing the options @saulshanabrook. I can't think of other reasonable options right now.

I suspect that the local dispatch based on an __array_module__ type approach is the way to go. One question is whether that needs to be in the standard itself or not - perhaps in an "adoption" section, but it seems a little meta to put it in the list of array API functions directly.

Also, this may be worth putting a "tentative" label on for the time being (whatever the outcome of this issue), because numpy/scipy/scikit-learn are still playing with the approaches here.

@saulshanabrook
Contributor Author

One other thing I forgot to mention is the ability to have a standardized way of getting an "array API conformant object" from some existing object. i.e. an __array__ method that should return some object that conforms to the proposed spec.

If we include this, then it seems like we have three possible things we could possibly document in the spec:

  1. An __array_module__ method on Array instances that should return an array compatible module.
  2. An __array__ method on objects that should return an array spec compatible object. If it doesn't exist, or returns NotImplemented, we know the object cannot be coerced to an array. I think for things that already are an array, it should return itself, similar to how __iter__ works.
  3. A library we distribute that provides a way to set and get a global array module instance.
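A sketch of point 2, the coercion hook. The method is named __duck_array__ here purely to avoid clashing with NumPy's existing __array__; both classes and the helper are hypothetical.

```python
class DuckArray:
    """A toy spec-conformant array."""
    def __duck_array__(self):
        # Already an array: return self, much like __iter__ on iterators.
        return self

class Table:
    """A toy non-array container that can be coerced to an array."""
    def __init__(self, data):
        self.data = data
    def __duck_array__(self):
        return DuckArray()

def as_duck_array(obj):
    """Coerce obj to a spec-conformant array, or raise TypeError."""
    method = getattr(obj, "__duck_array__", None)
    if method is None or (result := method()) is NotImplemented:
        raise TypeError(f"cannot coerce {type(obj).__name__} to an array")
    return result

a = DuckArray()
same = as_duck_array(a)                 # an array coerces to itself
coerced = as_duck_array(Table([1, 2]))  # a container yields a new array
```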

Does anyone have opinions on whether we should include these in the spec?

@teoliphant

Shouldn't we add the option of creating a package that a downstream author uses and must configure at installation time to use a particular backend?

This is similar to Global Dispatch but implemented using import hooks rather than contexts. For example, what if

import array_api

used the flexible import mechanism of Python to intercept the call and register which library is doing the importing and then return the API-compliant module previously registered for that package in the system.

Another approach would be to encourage libraries that implement the API to create a namespace that uses it and tell downstream authors to prefer that namespace if they want to ensure API compliance.

@rgommers
Member

I think the two most sensible options are:

  1. a separate namespace
  2. a simple function, e.g. numpy.get_array_api_module(...)

Something like (2) seems needed anyway, because one must indeed be able to get at the module (or module-like object) both as a user choice via regular imports/function calls, as well as from a particular array instance in library code that supports multiple array types.

Shouldn't we add the option of creating a package that a downstream author uses and must configure at installation time to use a particular backend?

I'd really like to avoid that; the ergonomics of install-time configuration are pretty bad. Sounds like a recipe for lots of bug reports.

An __array__ method on objects that should return an array spec compatible object.

This seems orthogonal to the topic of this issue. __array__ exists and has a well-defined meaning, let's not mess with it.

Another approach would be to encourage libraries that implement the API to create a namespace that uses it and tell downstream authors to prefer that namespace if they want to ensure API compliance.

I suspect that that's what most libraries will want to do. See, e.g., jax.numpy and tf.experimental.numpy for how libraries currently approach getting at something mostly numpy-compatible.

@saulshanabrook
Contributor Author

saulshanabrook commented Aug 19, 2020

This seems orthogonal to the topic of this issue. __array__ exists and has a well-defined meaning, let's not mess with it.

Ah ok, well we could call it something else? I agree it's a slightly different issue, but it's related to the question of what "meta" APIs we provide to allow downstream libraries to get at standard array objects and modules.

@rgommers
Member

This seems orthogonal to the topic of this issue. __array__ exists and has a well-defined meaning, let's not mess with it.

Ah ok, well we could call it something else? I agree it's a slightly different issue, but it's related to the question of what "meta" APIs we provide to allow downstream libraries to get at standard array objects and modules.

Well, I think __array__ kind of does what you want. Except when it doesn't (e.g. the semantics people want are memory sharing and no silent copies unless one adds a force keyword), and then I think this becomes the __dlpack__ conversation in data-apis/consortium-feedback#1.

But I would not mix that in here, array interchange and exposing the whole API as a module are both large discussions, let's do one at a time.

@rgommers
Member

Also for context: __array_module__ (i.e. NEP 37) is in flux right now; JAX merged in experimental support for it this week, and @shoyer is seeing if TensorFlow may want to do that as well. Then we need to actually road-test it. I'd really like to postpone discussion for a bit. It's pretty clear __array_module__ is the leading contender I think, but real-world testing will help confirm that or flag issues.

@saulshanabrook
Contributor Author

One other alternative to __array_module__ would be a similar method that takes no args and just operates on the array object itself, i.e. x.__array_module__() instead of x.__array_module__((type(x),)). This seems like it would be a bit more straightforward for libraries to implement?

But as you said, this conversation is happening outside of this forum.

@shoyer
Contributor

shoyer commented Aug 19, 2020

One other alternative to __array_module__ would be a similar method that takes no args and just operates on the array object itself, i.e. x.__array_module__() instead of x.__array_module__((type(x),)). This seems like it would be a bit more straightforward for libraries to implement?

This is indeed appealing in its simplicity, and would suffice for many use cases, i.e., code that only uses one array type.

It doesn't solve the bigger "multiple library dispatch" problem, but for many projects that isn't so important. Multi-library dispatch could perhaps be added separately with another protocol that determines which array takes priority, and which perhaps could get reused for Python binary arithmetic.

@saulshanabrook
Contributor Author

We talked about this in the meeting today. There was support for at least including local dispatch, and some support for global dispatch if we included local dispatch as well.

Adam brought up the point that you might sometimes have to implement things differently on different libraries for performance reasons.

Ralf said that if we get to that point, we are in a better place than where we are today. For now, we could just try different libraries globally and compare the performance results.

So I think one next step would be to try to write up what a combination of global and local dispatch could look like in the spec.

@rgommers
Member

I think we formulated a solution for this, in https://data-apis.github.io/array-api/latest/purpose_and_scope.html#how-to-adopt-this-api.

@rgommers
Member

This is indeed appealing in its simplicity, and would suffice for many use cases, i.e., code that only uses one array type.

The answer we arrived at here is that even if there are multiple array types involved, those should come from the same library - in which case there is no dispatch problem.

I think we can close this?

@kgryte
Contributor

kgryte commented Nov 10, 2021

As this has been addressed via __array_namespace__(), will close out.
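For reference, the adopted __array_namespace__() protocol looks like the following in downstream code. DuckArray and its tiny namespace are hypothetical stand-ins; a conforming library returns its full standard namespace.

```python
import types

class DuckArray:
    """A toy array implementing the adopted __array_namespace__ protocol."""

    def __init__(self, data):
        self.data = data

    def __array_namespace__(self, *, api_version=None):
        # A real library returns its full standard-conformant namespace;
        # this stand-in exposes just one function for illustration.
        return types.SimpleNamespace(
            asarray=lambda obj: DuckArray(list(obj)),
        )

def library_function(x):
    # Downstream library code stays array-library-agnostic: it pulls the
    # namespace off its input instead of importing numpy directly.
    xp = x.__array_namespace__()
    return xp.asarray([1, 2, 3])

result = library_function(DuckArray([0]))
```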
