
PyData prototype backend dispatching #24

Open

eric-czech opened this issue Apr 20, 2020 · 3 comments

Comments

@eric-czech (Collaborator)

Two immediately necessary uses for this are:

  • Dispatching to IO backends
  • Dispatching to array backends

The second is far more complicated, and a separate framework may not be necessary for the first, but it would be great to support both in the same way.
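
For the IO case, a small registry may be all that is needed. The sketch below is hypothetical (the format names and the `register_io_backend`/`read` entry points are stand-ins, not an existing API):

```python
# Hypothetical sketch: a minimal registry for IO backends, keyed by a
# format name. Readers register themselves; callers dispatch by format.
_IO_BACKENDS = {}

def register_io_backend(fmt, reader):
    """Associate a format name (e.g. "vcf", "plink") with a reader callable."""
    _IO_BACKENDS[fmt] = reader

def read(fmt, path, **kwargs):
    """Dispatch to the reader registered for the given format."""
    try:
        return _IO_BACKENDS[fmt](path, **kwargs)
    except KeyError:
        raise ValueError(f"no IO backend registered for format {fmt!r}")
```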

On array dispatch, I think dispatching based on argument types alone is not enough. We will likely have many functions that take multiple array arguments, and if those are a mix of dask/numpy/sparse arrays, a better solution is probably to have the user declare which backend API should be preferred and then special-case coercion where necessary.
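
A minimal sketch of that idea, assuming hypothetical `set_preferred_backend`/`coerce` helpers (none of this is an existing API):

```python
import numpy as np

# Coercion functions for each backend API; dask is optional.
_COERCERS = {"numpy": np.asarray}
try:
    import dask.array as da
    _COERCERS["dask"] = da.asarray
except ImportError:
    pass

_preferred = "numpy"

def set_preferred_backend(name):
    """Declare which backend API should be preferred for computation."""
    global _preferred
    if name not in _COERCERS:
        raise ValueError(f"unknown or uninstalled backend {name!r}")
    _preferred = name

def coerce(*arrays):
    """Coerce a mix of dask/numpy/sparse inputs to the preferred backend."""
    to_backend = _COERCERS[_preferred]
    return tuple(to_backend(a) for a in arrays)
```

A method taking multiple array arguments would then call `coerce` on its inputs once, rather than inspecting argument types at every call site.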

At a minimum, I think we should keep CuPy, Dask, and NumPy backends in mind, since we already know from Alistair's skallel v2 prototype how different the backend implementations of genetics methods are going to be. Each backend will definitely need some API-specific functionality, but a lot of operations will be dispatchable purely through the NumPy API too. A good question to answer is whether literally using numpy is better for the latter, or whether unumpy will make more sense. The backend dispatching model in unumpy seems like a good fit, but I don't know if aligning to it long-term is worth the extra dependencies; I think that will depend on how much non-API-specific code we actually need.
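
For the operations that are dispatchable purely through the NumPy API, "literally using numpy" can look like the sketch below: with the `__array_function__` protocol (NEP 18), code written against the `np` namespace runs unchanged on numpy, dask, or cupy arrays. The function and its encoding are illustrative assumptions, not anything from this thread:

```python
import numpy as np

def alt_allele_frequency(calls):
    """calls: (variants, samples) array of 0/1/2 alt allele dosages.

    Uses only the NumPy API, so np.ndarray, dask.array.Array, and
    cupy.ndarray inputs all dispatch to their own implementations
    via __array_function__.
    """
    return np.sum(calls, axis=1) / (2 * calls.shape[1])
```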

@eric-czech (Collaborator, Author)

cf. this thread on dispatch in Xarray: pydata/xarray#1938

@hammer commented Apr 20, 2020

Some interesting discussion is also happening at pydata/xarray#3213 (comment) regarding scipy.sparse and pydata/sparse, which may be "backends" to consider as well.

@eric-czech (Collaborator, Author)

Some more notes/questions:

  • Do file readers load data as chunked arrays or not?

    • I think it's useful to distinguish between dispatching to a "platform" (for lack of a better generic term) for array computation and dispatching to individual array backends (Dask is both).
  • Do file readers load chunks using a specific backend?

    • It would not be unreasonable to expect users to run da.map_blocks(backend_module.asarray) after reads, but some readers may be much more efficient if they aren't loading chunks as some default duck array type (probably numpy) and then undergoing conversion (see the sketches after this list).
  • Should our genetics methods assume that the same target array backend can be used for all n-ary numpy functions?

    • This would be an argument against using unumpy/uarray
    • For example, given a numpy array of call data and a sparse mask array for missing calls, should an element-wise multiplication of the two produce sparse.COO or np.ndarray (or possibly a masked array)? If the sparsity is high enough, it should be the former. But assuming a user has specified this by setting the SparseBackend, and that a method produces dense results, attempting to stack those results will fail:

    ```python
    import numpy as np
    import unumpy as unp
    import uarray as ua
    import unumpy.sparse_backend as SparseBackend

    with ua.set_backend(SparseBackend):
        unp.stack([np.array([1]), np.ones([1])])
    # ValueError: All arrays must be instances of SparseArray.
    ```
    • An alternative would be to make something like our "CuPyBackend" more of a loose contract: the bulk of the work will be done with CuPy, and whatever is left is up to us to implement, choosing array backends as we see fit within the scope of what is installed. I think this is more realistic given the scope of what our more complicated methods will encompass.
  • Xarray and xgcm use a "duck_array_ops" module that is essentially a switch like getattr(dask.array if dask_installed else numpy, numpy_function)(*args, **kwargs) for handling chunked vs. unchunked dispatching (see the sketches after this list).

    • For Xarray in particular, this also includes special cases for inconsistencies between dask and numpy, as well as implementations of some things outside the scope of the __array_ufunc__ and __array_function__ protocols.
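
On the post-read conversion point above, here is a sketch of what users would run today, with a `da.from_array` stand-in in place of a real reader:

```python
import dask.array as da
import numpy as np
import sparse

# Stand-in for a reader that returns a Dask array backed by NumPy chunks.
calls = da.from_array(np.zeros((4, 6), dtype=np.int8), chunks=(2, 3))

# The extra conversion pass a backend-aware reader could avoid: map every
# NumPy chunk to the target backend (here sparse.COO) after the read.
sparse_calls = calls.map_blocks(sparse.COO)
```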
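
And a minimal sketch of the duck_array_ops pattern (the helper names are mine; Xarray's actual module handles many more cases):

```python
import numpy as np

try:
    import dask.array as dask_array
except ImportError:
    dask_array = None

def _module_for(x):
    """Pick dask.array for chunked inputs, numpy otherwise."""
    if dask_array is not None and isinstance(x, dask_array.Array):
        return dask_array
    return np

def _dispatch(name):
    def op(x, *args, **kwargs):
        return getattr(_module_for(x), name)(x, *args, **kwargs)
    return op

# Ops usable on both chunked and unchunked arrays.
sum = _dispatch("sum")
mean = _dispatch("mean")
where = _dispatch("where")
```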
