Alternative parallel execution frameworks in xarray #6807

TomNicholas · 2022-07-18T21:48:10Z

Is your feature request related to a problem?

Since early in the project, xarray has supported wrapping dask.array objects in a first-class manner. However, recent work on flexible array wrapping has made it possible to wrap all sorts of array types (and with #6804 we should support wrapping any array that conforms to the array API standard).

Currently, though, the only way to parallelize array operations with xarray "automatically" is to use dask. (You could use xarray-beam or other options too, but they don't "automatically" generate the computation for you like dask does.)

When dask was the only parallel framework exposing an array-like API there was no need for flexibility, but now we have nascent projects like cubed to consider too. @tomwhite

Describe the solution you'd like

Refactor the internals so that dask is one option among many, and so that newer options can plug in in an extensible way.

In particular cubed deliberately uses the same API as dask.array, exposing:

- the methods needed to conform to the array API standard
- a .chunk and .compute method, which we could dispatch to
- dask-like functions to create computation graphs including blockwise, map_blocks, and rechunk

I would like to see xarray able to wrap any array-like object which offers this set of methods / functions, and call the corresponding version of that method for the correct library (i.e. dask vs cubed) automatically.

That way users could try different parallel execution frameworks simply via a switch like

ds.chunk(**chunk_pattern, manager="dask")

and see which one works best for their particular problem.
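As a rough illustration of the switch described above, here is a minimal sketch of dispatching on a manager name through a registry. Everything in it is an assumption made for illustration: `ChunkedArray`, `register_manager`, and `chunk` are hypothetical names, the "dask" and "cubed" entries are toy stand-ins that do not import either library, and none of this is xarray's actual implementation.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical registry: manager name -> function that builds a chunked array.
_CHUNK_MANAGERS: Dict[str, Callable[[Sequence[float], int], "ChunkedArray"]] = {}


class ChunkedArray:
    """Toy stand-in for a dask- or cubed-like array: a list of blocks."""

    def __init__(self, blocks: List[List[float]], manager: str) -> None:
        self.blocks = blocks
        self.manager = manager

    def compute(self) -> List[float]:
        # Materialise the result by flattening the blocks in order.
        return [x for block in self.blocks for x in block]


def register_manager(name: str):
    """Decorator that records a chunking function under the given name."""

    def decorator(func):
        _CHUNK_MANAGERS[name] = func
        return func

    return decorator


def _split(data: Sequence[float], size: int) -> List[List[float]]:
    # Cut the data into consecutive blocks of at most `size` elements.
    return [list(data[i:i + size]) for i in range(0, len(data), size)]


@register_manager("dask")
def _chunk_with_toy_dask(data, size):
    # A real backend would build a dask.array here; this is only a toy.
    return ChunkedArray(_split(data, size), manager="dask")


@register_manager("cubed")
def _chunk_with_toy_cubed(data, size):
    # A real backend would build a cubed array here; this is only a toy.
    return ChunkedArray(_split(data, size), manager="cubed")


def chunk(data: Sequence[float], size: int, manager: str = "dask") -> ChunkedArray:
    """Look up the requested manager and let it create the chunked array."""
    try:
        factory = _CHUNK_MANAGERS[manager]
    except KeyError:
        known = sorted(_CHUNK_MANAGERS)
        raise ValueError(f"unknown manager {manager!r}; known: {known}") from None
    return factory(data, size)
```

With a registry like this, a newer framework would only need to register its own factory; the calling code (`chunk(data, 2, manager="cubed")`) would not change.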

Describe alternatives you've considered

If we leave it the way it is now then xarray will not be truly flexible in this respect.

Any library can wrap (or subclass if they are really brave) xarray objects to provide parallelism but that's not the same level of flexibility.

Additional context

cubed repo

PR about making xarray able to wrap objects conforming to the new array API standard

cc @shoyer @rabernat @dcherian @keewis


dcherian · 2022-07-18T21:56:58Z

This sounds great! We should finish up #4972 to make it easier to test.

dcherian · 2022-07-19T01:29:28Z

Another parallel framework would be Ramba

cc @DrTodd13

shoyer · 2022-07-19T02:18:03Z

Sounds good to me. The challenge will be defining a parallel computing API that works across all these projects, with their slightly different models.

andersy005 · 2022-07-19T03:22:07Z

At SciPy I learned of fugue, which tries to provide a unified API for distributed DataFrames on top of Spark and Dask. It could be a great source of inspiration.

tomwhite · 2022-07-19T10:58:18Z

Thanks for opening this @TomNicholas

> The challenge will be defining a parallel computing API that works across all these projects, with their slightly different models.

Agreed. I feel like there's already an implicit set of "chunked array" methods that xarray expects from Dask that could be formalised a bit and exposed as an integration point.
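One way to picture formalising that implicit set of methods is a structural interface that any candidate array type can be checked against. The sketch below is only an assumption about what such an integration point might look like: `ChunkedArrayLike`, `ToyChunked`, and `is_chunked_array` are hypothetical names, and the method set is limited to the `.chunk` and `.compute` methods mentioned in the issue description.

```python
from typing import Any, List, Protocol, Tuple, runtime_checkable


@runtime_checkable
class ChunkedArrayLike(Protocol):
    """Hypothetical minimal interface for a 'chunked array' type."""

    def chunk(self, chunks: Tuple[int, ...]) -> "ChunkedArrayLike":
        ...

    def compute(self) -> Any:
        ...


class ToyChunked:
    """Toy array that happens to satisfy the interface above."""

    def __init__(self, data: List[float], chunks: Tuple[int, ...]) -> None:
        self.data = list(data)
        self.chunks = chunks

    def chunk(self, chunks: Tuple[int, ...]) -> "ToyChunked":
        # Return a new object with the requested chunking recorded.
        return ToyChunked(self.data, chunks)

    def compute(self) -> List[float]:
        # Nothing is deferred in this toy, so just hand back the data.
        return self.data


def is_chunked_array(obj: Any) -> bool:
    # runtime_checkable protocols only check that the methods exist,
    # not their signatures or behaviour.
    return isinstance(obj, ChunkedArrayLike)
```

A check like `is_chunked_array(ToyChunked([1.0, 2.0], (1,)))` passes while a plain list fails, which is roughly the kind of duck-typed test a wrapper could use before dispatching.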

sdbachman · 2022-09-14T20:46:52Z

Might I propose Arkouda?

https://github.com/Bears-R-Us/arkouda
https://chapel-lang.org/presentations/Arkouda_SIAM_PP-22.pdf

DrTodd13 · 2022-09-14T20:57:35Z

> Might I propose Arkouda?
>
> https://github.com/Bears-R-Us/arkouda
> https://chapel-lang.org/presentations/Arkouda_SIAM_PP-22.pdf

Have they improved recently to support more than 1D arrays?

benbovy · 2022-10-13T09:22:27Z

Not really a generic and parallel execution back-end, but Open-EO looks like an interesting use case too (it is a framework for managing remote execution of processing tasks on multiple big Earth observation cloud back-ends via a common API). I've suggested the idea of reusing the Xarray API here: Open-EO/openeo-python-client#334.

TomNicholas · 2022-10-20T19:22:11Z

@rabernat just pointed out to me that in order for this to work well we might also need lazy concatenation of arrays.

Xarray currently has its own internal wrappers that allow lazy indexing, but they don't yet allow lazy concatenation. Instead, dask is what does lazy concatenation under the hood right now.

This is a problem - it means that concatenating two cubed-backed DataArrays will trigger loading both into memory, whereas concatenating two dask-backed DataArrays will not. If #4628 were implemented then xarray would never load the underlying array into memory, regardless of the backend.

shoyer · 2022-10-21T03:49:18Z

Cubed should define a concatenate function, so that should be OK

tomwhite · 2022-10-21T09:31:29Z

Cubed implements concat, but perhaps xarray needs richer concat functionality than that?

dcherian · 2022-10-21T15:38:27Z

IIUC the issue Ryan & Tom are talking about is tied to reading from files.

For example, we read from a zarr store using zarr, then wrap that zarr.Array (or h5py Dataset) with a large number of ExplicitlyIndexed classes that enable more complicated indexing, lazy decoding, etc.

IIUC #4628 is about concatenating such arrays i.e. neither zarr.Array nor ExplicitlyIndexed support concatenation, so we end up calling np.array and forcing a disk read.

With dask or cubed we would have dask(ExplicitlyIndexed(zarr)) or cubed(ExplicitlyIndexed(zarr)) so as long as dask and cubed define concat and we dispatch to them, everything is 👍🏾

PS: This is what I was attempting to explain (not very clearly) in the distributed arrays meeting. We don't ever use dask.array.from_zarr (for example). We use zarr to read, then wrap in ExplicitlyIndexed and then pass to dask.array.from_array.
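The difference between the two code paths can be sketched with toy classes. This is an illustration under stated assumptions, not xarray, dask, or cubed code: `LazyFileArray` stands in for a lazily indexed on-disk array, `ToyChunkedArray` stands in for a dask- or cubed-like wrapper created via a `from_array`-style call, and `concat` is a hypothetical dispatcher.

```python
class LazyFileArray:
    """Toy stand-in for a lazily indexed on-disk array (hypothetical)."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0  # counts simulated disk reads

    def load(self):
        # Simulate forcing a read from disk into memory.
        self.reads += 1
        return list(self._values)


class ToyChunkedArray:
    """Toy stand-in for a dask- or cubed-like array of deferred blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # each block is an unloaded LazyFileArray

    @classmethod
    def from_array(cls, lazy):
        # Wrap the lazy array as a single block without reading it.
        return cls([lazy])

    def compute(self):
        # Only here is the underlying data actually read.
        out = []
        for block in self.blocks:
            out.extend(block.load())
        return out


def concatenate_chunked(arrays):
    # Joins the block lists only; no data is read.
    blocks = []
    for arr in arrays:
        blocks.extend(arr.blocks)
    return ToyChunkedArray(blocks)


def concat(arrays):
    """Dispatch to the chunked implementation when possible."""
    if all(isinstance(arr, ToyChunkedArray) for arr in arrays):
        return concatenate_chunked(arrays)
    # Fallback: the bare lazy wrappers cannot concatenate, so load eagerly.
    out = []
    for arr in arrays:
        out.extend(arr.load())
    return out
```

In this sketch, concatenating bare `LazyFileArray` objects increments `reads` immediately, while concatenating their `ToyChunkedArray.from_array(...)` wrappers leaves `reads` at zero until `compute()` is called.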

TomNicholas added the topic-internals, enhancement, topic-dask, and topic-arrays (related to flexible array support) labels Jul 18, 2022

This was referenced Jul 18, 2022

Supporting xarray.apply_ufunc cubed-dev/cubed#67

Closed

Automatic duck array testing - reductions #4972

Draft

keewis mentioned this issue Jul 27, 2022

Clarify difference between .load() and .compute() #6837

Open

TomNicholas mentioned this issue Aug 8, 2022

Public testing framework for duck array integration #6894

Open

19 tasks

tomwhite mentioned this issue Sep 5, 2022

Comparison to Xarray-Beam cubed-dev/cubed#117

Open

TomNicholas mentioned this issue Sep 10, 2022

Generalize handling of chunked array types #7019

Merged

15 tasks

tomwhite mentioned this issue Sep 22, 2022

Run on Cubed sgkit-dev/sgkit#908

Open

benbovy mentioned this issue Oct 13, 2022

Idea: Xarray interface Open-EO/openeo-python-client#334

Open

TomNicholas mentioned this issue Feb 8, 2023

Aesara as an array backend in Xarray #7515

Open

TomNicholas mentioned this issue Mar 13, 2023

Using Flox with cubed xarray-contrib/flox#224

Open

TomNicholas added the topic-backends label Mar 15, 2023

TomNicholas added the topic-zarr (related to zarr storage library) and topic-lazy array labels Mar 15, 2023

TomNicholas mentioned this issue Apr 28, 2023

Allow in-memory arrays with open_mfdataset #5704

Open

6 tasks

dcherian closed this as completed in #7019 May 18, 2023

TomNicholas mentioned this issue May 18, 2023

Compatibility with the Array API standard #7848

Open
