Pass .chunk/rechunk calls through for chunked arrays without ChunkManagers #9286
Conversation
xarray/namedarray/core.py (Outdated)
```python
if is_chunked_array(data_old):
    print(f"problematic chunks = {chunks}")
    # if is_dict_like(chunks) and chunks != {}:
    #     chunks = tuple(chunks.get(n, s) for n, s in enumerate(data_old.shape))  # type: ignore[assignment]

    print(f"hopefully normalized chunks = {chunks}")
```
This is really irritating: if I keep these lines commented out then my `test_rechunk` on the `DummyChunkedArray` fails. But if I uncomment these lines (thereby doing exactly what happens in the other branch of the `if is_chunked_array(data_old):` statement) then the dask rechunk tests fail!

There are so many possible valid argument types for `chunks` here, some of which are dicts but completely different, e.g. `{0: (2, 3)}` vs `{'x': (2, 3)}`.

It would be much nicer for all possible `chunks` to go through a single `normalize_chunks` function, but I'm getting confused even trying to work out what the current behaviour is.

The `ChunkManager` has a `.normalize_chunks` method, which calls out to `dask.array.normalize_chunks`. Cubed vendors this function too, so perhaps xarray should instead vendor `dask.array.normalize_chunks` and remove it from the `ChunkManager` class?
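For illustration, a minimal sketch (a hypothetical helper, not this PR's actual code) of funnelling both dict forms through `dask.array.core.normalize_chunks`:

```python
# Hypothetical sketch: map dim-name keys to axis indices, fill missing axes,
# then let dask do the rest. Not the PR's implementation.
from dask.array.core import normalize_chunks


def normalize_any_chunks(chunks, shape, dims):
    if isinstance(chunks, dict):
        # {'x': (2, 3)} -> {0: (2, 3)}; integer keys pass through unchanged
        chunks = {
            dims.index(k) if isinstance(k, str) else k: v
            for k, v in chunks.items()
        }
        # Fill axes missing from the dict with their current extent, which
        # also sidesteps the empty-dict edge case (dask/dask#11261)
        chunks = tuple(chunks.get(axis, size) for axis, size in enumerate(shape))
    return normalize_chunks(chunks, shape=shape)


print(normalize_any_chunks({"x": (2, 3)}, shape=(5, 4), dims=["x", "y"]))
# ((2, 3), (4,))
```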
> `test_rechunk` on the `DummyChunkedArray` fails

This was actually mostly my fault for having a bug in that test, fixed by 0296f92.

> It would be much nicer for all possible chunks to go through a single normalize_chunks function

But there is still some unnecessary complexity that would be nice to remove. The main reason the weird `is_dict_like(chunks):` sections that turn dicts of chunks into tuples are currently needed is this bug in `dask.array.core.normalize_chunks`: dask/dask#11261. Otherwise we could just use that.

(If we do just use that, we should perhaps vendor it, as cubed already does.)
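To make the edge case concrete, this is the behaviour the linked issue asks for (whether you see it depends on your dask version; see dask/dask#11261):

```python
import dask.array as da

a = da.ones((4, 4), chunks=(2, 2))
# An empty dict mentions no axes, so no axis should change: a no-op.
b = a.rechunk({})
print(b.chunks)  # expected ((2, 2), (2, 2)); buggy dask versions get this wrong
```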
I managed to sort this all out, so now everything goes through `dask.array.core.normalize_chunks`, which is much neater.

The question now is, do I:

1. Vendor `dask.array.core.normalize_chunks` (like cubed does), and use the vendored version no matter which `ChunkManager` is called, or
2. Make all chunkmanagers define a `normalize_chunks` method and refer to that (what the `main` code currently does)?

I think we actually have to do (1), because we now have a codepath which will try to call `normalize_chunks` even on chunked arrays that do not define a chunkmanager. But we want to vendor it without introducing any more dependencies (e.g. `toolz`).
@dcherian I would appreciate your input on this vendoring question before I move ahead with it ^
Vendor it! Sorry for the delay. We can generalize if it's ever needed.
xarray/tests/test_parallelcompat.py (Outdated)
```python
def test_computation(self) -> None:
    dummy_arr = DummyChunkedArray(shape=(4,), chunks=((1,),))
    na: NamedArray = NamedArray(data=dummy_arr, dims=["x"])
    na.mean()
```
I'm not entirely sure what the intended behaviour should be here. This test checks what happens if you try to compute an array that has `.chunks` but is not registered via any chunkmanager.

In virtualizarr's case this situation should just raise immediately, because `ManifestArray`s are not computable, so from virtualizarr's PoV it doesn't really matter what happens here.

@hmaarrfk what is the preferred behaviour for your chunked arrays?

I guess if it does attempt to pass computation through here, that could cause issues when computing on a cubed array with cubed-xarray not installed... (That scenario can't happen for dask, because the equivalent DaskManager (i.e. dask-xarray) is effectively bundled inside xarray.)
The users of my array (the rest of our team) feel like all of these should do something.

It might make things REALLY slow, but I feel like `mean` should compute... Your chunked array should know how best to compute it for itself. How should it compute intermediate results? In what order should it go through the array?
> It might make things REALLY slow, but I feel like mean should compute....

Yes, I agree.

> How should it compute intermediate results? In what order should it go through the array.

I'm not sure I understand you here.

> I guess if it does attempt to pass computation through here that could cause issues when computing on a cubed array with cubed-xarray not installed...

Thinking about this more, I don't think it's a big deal. Not recognising a chunked array type will just mean that xarray falls back to calling numpy functions on it (e.g. `da.mean()` will call `np.mean(arr)`), which will call `__array__` on the underlying type, coercing it to numpy and, in the case of cubed arrays, simply eagerly computing it.
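As a toy illustration of that fallback (the class here is hypothetical, not a real xarray type):

```python
import numpy as np


class UnregisteredChunkedArray:
    """Duck array exposing .chunks but with no ChunkManager registered for it."""

    def __init__(self, data, chunks):
        self._data = data
        self.chunks = chunks
        self.shape = data.shape
        self.dtype = data.dtype

    def __array__(self, dtype=None):
        # numpy reductions end up here: coercion to numpy *is* the "compute"
        return np.asarray(self._data, dtype=dtype)


arr = UnregisteredChunkedArray(np.arange(6.0), chunks=((3, 3),))
print(np.mean(arr))  # 2.5 -- falls back to eager numpy computation
```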
> cubed arrays, simply eagerly computing it.

Maybe cubed would help us in our lazy arrays.
Me personally, that is.
> Thinking about this more I don't think it's a big deal.

You also kinda have to go out of your way to even create an xarray-wrapped cubed array without cubed-xarray installed, because you can only use `open_dataset` and `.chunk` to get cubed arrays if you have cubed-xarray installed.

> maybe cubed would help us in our lazy arrays.

Maybe! The `cubed.Plan` model is super nice.
```diff
@@ -19,6 +20,7 @@
 from xarray.tests import has_dask, requires_dask

+# TODO can I subclass the chunkedduckarray protocol here?
```
@Illviljan I hope I'm using all your cool duckarray type protocols correctly!
```python
def has_chunkmanager(x) -> bool:
    try:
        get_chunked_array_type(x)
    except TypeError as e:
        if str(e).startswith("Could not find a Chunk Manager which recognises type"):
            return False
        elif str(e) == "Expected a chunked array but none were found":
            return False
        else:
            raise  # something else went wrong
    else:
        return True
```
This might be a code smell, in which case `has_chunkmanager`, `guess_chunkmanager`, and `get_chunked_array_type` should be refactored.
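One way such a refactor could look (a sketch only; `list_chunkmanagers` is xarray's existing registry helper, the rest is hypothetical): have the lookup return `None` instead of raising, so no string-matching on exception messages is needed:

```python
from xarray.namedarray.parallelcompat import list_chunkmanagers


def find_chunkmanager(x):
    # Return the first registered ChunkManager that recognises x, else None
    for manager in list_chunkmanagers().values():
        if manager.is_chunked_array(x):
            return manager
    return None


def has_chunkmanager(x) -> bool:
    return find_chunkmanager(x) is not None
```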
```diff
@@ -183,7 +183,7 @@ def char_to_bytes(arr):
         # can't make an S0 dtype
         return np.zeros(arr.shape[:-1], dtype=np.bytes_)

-    if is_chunked_array(arr):
+    if is_chunked_array(arr) and has_chunkmanager(arr):
```
This `is_chunked_array(arr) and has_chunkmanager(arr)` pattern becomes necessary because we are now considering the possibility that `is_chunked_array(arr) == True` but `has_chunkmanager(arr) == False`, whereas previously these were assumed to always be consistent.
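In other words (a simplified sketch; xarray's real `is_chunked_array` does more than this attribute check):

```python
def is_chunked_array(x) -> bool:
    # simplified: anything exposing .chunks counts as "chunked"
    return hasattr(x, "chunks")


class ManifestArrayLike:
    """Stand-in for virtualizarr.ManifestArray: chunked, but no ChunkManager."""

    chunks = ((2, 2),)


arr = ManifestArrayLike()
assert is_chunked_array(arr)  # True: it has .chunks
# ...but has_chunkmanager(arr) is False, since no ChunkManager recognises it,
# so code guarded by `is_chunked_array(arr) and has_chunkmanager(arr)` skips it.
```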
@headtr1ck I got a notification saying you commented:

> But doesn't `has_chunkmanager(arr) == True` imply `is_chunked_array(arr) == True`?

(But I can't find your comment.)

It's a good question though. I think there are some array types that don't define `.chunks` where you might still want to use other `ChunkManager` methods.

In particular, JAX is interesting: it has a top-level `pmap` function which applies a function over multiple axes of an array, similar to `apply_gufunc`. It distributes computation, but not over `.chunks` (which JAX doesn't define); instead it distributes over the available devices, `jax.local_device_count()`.
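For example (requires `jax`; a minimal sketch of the point above):

```python
import jax
import jax.numpy as jnp

n = jax.local_device_count()           # distribution unit: devices, not .chunks
x = jnp.arange(4.0 * n).reshape(n, 4)  # leading axis must equal device count
# pmap runs the function once per device, in parallel over the leading axis
out = jax.pmap(lambda row: row.mean())(x)
print(out)  # one mean per device
```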
This is why I think we should rename `ChunkManager` to `ComputeManager`.
cc @alxmrs
> @headtr1ck I got a notification saying you commented:
>
> > But doesn't `has_chunkmanager(arr) == True` imply `is_chunked_array(arr) == True`?
>
> (But I can't find your comment.)
>
> It's a good question though. I think there are some array types that don't define `.chunks` where you might still want to use other `ChunkManager` methods.

I came to the same conclusion, that's why I deleted the comment, sorry.
No worries! I prefer to leave all my half-baked thoughts in the open and double- or triple-post 😅 If you were wondering about it, then other people will definitely have the same question!

> This is why I think we should rename `ChunkManager` to `ComputeManager`.

I could leave this to a second PR, to isolate the breaking changes.
@TomNicholas FYI, JAX does now support something a bit like chunking via sharding of `jax.Array`; there's a good summary here: https://jax.readthedocs.io/en/latest/notebooks/Distributed_arrays_and_automatic_parallelization.html

IIUC this is now preferred over `pmap`.
This reverts commit 556161d.
When rechunking with a dict that doesn't contain all axes, then the chunking should be unchanged for those axes that are missing. In particular, `a.rechunk({})` should be a no-op. This is consistent with Dask (dask/dask#11261) and Xarray (pydata/xarray#9286)
Basically implements @dcherian's suggestion from #8733 (comment):

Needed to fix zarr-developers/VirtualiZarr#199 (comment).

The actual fix is in just the first two commits; the rest defines a new `has_chunkmanager` function and uses it everywhere to distinguish between arrays that have `.chunks` (e.g. `virtualizarr.ManifestArray`) and arrays that actually need to call out to a ChunkManager (i.e. dask/cubed).

- `whats-new.rst`
- New functions/methods are listed in `api.rst`