
PyData prototype backend dispatching #24

Open

eric-czech opened this issue Apr 20, 2020 · 3 comments

Comments

@eric-czech (Collaborator)

Two immediately necessary uses for this are:

  • Dispatching to IO backends
  • Dispatching to array backends

The second is far more complicated, and a separate framework may not be necessary for the first, but it would be great to support both in the same way.
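
For the IO case, a small registry may be all that is needed. The sketch below is hypothetical (the format names and the `register_io_backend`/`read` entry points are stand-ins, not an existing API):

```python
# Hypothetical sketch: a minimal registry for IO backends, keyed by a
# format name. Readers register themselves; callers dispatch by format.
_IO_BACKENDS = {}

def register_io_backend(fmt, reader):
    """Associate a format name (e.g. "vcf", "plink") with a reader callable."""
    _IO_BACKENDS[fmt] = reader

def read(fmt, path, **kwargs):
    """Dispatch to the reader registered for the given format."""
    try:
        return _IO_BACKENDS[fmt](path, **kwargs)
    except KeyError:
        raise ValueError(f"no IO backend registered for format {fmt!r}")
```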

On array dispatch, I think dispatching based on argument types alone is not enough. We will likely have many functions that take multiple array arguments, and if those are a mix of dask/numpy/sparse arrays, a better solution is probably to have the user declare which backend API should be preferred and then special-case coercion where necessary.
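
A minimal sketch of that idea, assuming hypothetical `set_preferred_backend`/`coerce` helpers (none of this is an existing API):

```python
import numpy as np

# Coercion functions for each backend API; dask is optional.
_COERCERS = {"numpy": np.asarray}
try:
    import dask.array as da
    _COERCERS["dask"] = da.asarray
except ImportError:
    pass

_preferred = "numpy"

def set_preferred_backend(name):
    """Declare which backend API should be preferred for computation."""
    global _preferred
    if name not in _COERCERS:
        raise ValueError(f"unknown or uninstalled backend {name!r}")
    _preferred = name

def coerce(*arrays):
    """Coerce a mix of dask/numpy/sparse inputs to the preferred backend."""
    to_backend = _COERCERS[_preferred]
    return tuple(to_backend(a) for a in arrays)
```

A method taking multiple array arguments would then call `coerce` on its inputs once, rather than inspecting argument types at every call site.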

At a minimum, I think we should keep CuPy, Dask, and NumPy backends in mind, since we already know from Alistair's skallel v2 prototype how different the backend implementations of genetics methods are going to be. Each backend will definitely need some API-specific functionality, but a lot of operations will be dispatchable purely through the NumPy API too. A good question to answer is whether literally using numpy is better for the latter, or whether unumpy will make more sense. The backend dispatching model in unumpy seems like a good fit, but I don't know if aligning to it long-term is worth the extra dependencies; I think that will depend on how much non-API-specific code we actually need.
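
For the operations that are dispatchable purely through the NumPy API, "literally using numpy" can look like the sketch below: with the `__array_function__` protocol (NEP 18), code written against the `np` namespace runs unchanged on numpy, dask, or cupy arrays. The function and its encoding are illustrative assumptions, not anything from this thread:

```python
import numpy as np

def alt_allele_frequency(calls):
    """calls: (variants, samples) array of 0/1/2 alt allele dosages.

    Uses only the NumPy API, so np.ndarray, dask.array.Array, and
    cupy.ndarray inputs all dispatch to their own implementations
    via __array_function__.
    """
    return np.sum(calls, axis=1) / (2 * calls.shape[1])
```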

@eric-czech (Collaborator, Author)

cf. this thread on dispatch in Xarray: pydata/xarray#1938

@hammer commented Apr 20, 2020

Some interesting discussion is also happening at pydata/xarray#3213 (comment) regarding scipy.sparse and pydata/sparse, which may be "backends" to consider as well.

@eric-czech (Collaborator, Author)

Some more notes/questions:

  • Do file readers load data as chunked arrays or not?

    • I think it's useful to distinguish between dispatching to a "platform" (for lack of a better generic term) for array computation and dispatching to individual array backends (Dask is both).
  • Do file readers load chunks using a specific backend?

    • It would not be unreasonable to expect users to run da.map_blocks(backend_module.asarray) after reads, but some readers may be much more efficient if they aren't loading chunks as some default duck array type (probably numpy) and then undergoing conversion (see the sketches after this list).
  • Should our genetics methods assume that the same target array backend can be used for all n-ary numpy functions?

    • This would be an argument against using unumpy/uarray
    • For example, given a numpy array of call data and a sparse mask array for missing calls, should an element-wise multiplication of the two produce sparse.COO or np.ndarray (or possibly a masked array)? If the sparsity is high enough, it should be the former. But assuming a user has specified this by setting the SparseBackend, and that a method produces dense results, attempting to stack those results will fail:

    ```python
    import numpy as np
    import unumpy as unp
    import uarray as ua
    import unumpy.sparse_backend as SparseBackend

    with ua.set_backend(SparseBackend):
        unp.stack([np.array([1]), np.ones([1])])
    # ValueError: All arrays must be instances of SparseArray.
    ```
    • An alternative would be to make something like our "CuPyBackend" more of a loose contract: the bulk of the work will be done with CuPy, and whatever is left is up to us to implement, choosing array backends as we see fit within the scope of what is installed. I think this is more realistic given the scope of what our more complicated methods will encompass.
  • Xarray and xgcm use a "duck_array_ops" module that is essentially a switch like getattr(dask.array if dask_installed else numpy, numpy_function)(*args, **kwargs) for handling chunked vs. unchunked dispatching (see the sketches after this list).

    • For Xarray in particular, this also includes special cases for inconsistencies between dask and numpy, as well as implementations of some things outside the scope of the __array_ufunc__ and __array_function__ protocols.
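
On the post-read conversion point above, here is a sketch of what users would run today, with a `da.from_array` stand-in in place of a real reader:

```python
import dask.array as da
import numpy as np
import sparse

# Stand-in for a reader that returns a Dask array backed by NumPy chunks.
calls = da.from_array(np.zeros((4, 6), dtype=np.int8), chunks=(2, 3))

# The extra conversion pass a backend-aware reader could avoid: map every
# NumPy chunk to the target backend (here sparse.COO) after the read.
sparse_calls = calls.map_blocks(sparse.COO)
```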
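
And a minimal sketch of the duck_array_ops pattern (the helper names are mine; Xarray's actual module handles many more cases):

```python
import numpy as np

try:
    import dask.array as dask_array
except ImportError:
    dask_array = None

def _module_for(x):
    """Pick dask.array for chunked inputs, numpy otherwise."""
    if dask_array is not None and isinstance(x, dask_array.Array):
        return dask_array
    return np

def _dispatch(name):
    def op(x, *args, **kwargs):
        return getattr(_module_for(x), name)(x, *args, **kwargs)
    return op

# Ops usable on both chunked and unchunked arrays.
sum = _dispatch("sum")
mean = _dispatch("mean")
where = _dispatch("where")
```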
