Add benchmark for climatology #1552

hendrikmakait · 2024-09-20T09:55:14Z

This benchmark doesn't even run at the small scale yet.

hendrikmakait · 2024-09-20T12:53:41Z

tests/geospatial/test_climatology.py

+        .expand_dims(hour=hours, dayofyear=daysofyear)
+        .assign_coords(hour=hours, dayofyear=daysofyear)
+    )
+    working = working.map_blocks(compute_hourly_climatology, template=template)


I wonder if people typically compute climatology with complex-enough calculations that warrant the use of map_blocks or typically use calculations that leverage Xarray's higher-level API.

People use rechunk+map_blocks because it usually would not work otherwise. It's basically "manual fusion of tasks" to not overwhelm the scheduler. And convenient because you can use Xarray API in there.

I strongly recommend running a version of this without those steps. Ideally, these tricks should not be necessary for this calculation.

Notice that the workload rechunks back to pancakes... The chunking to pencils is purely to get this to work with dask+distributed

Just to make sure, what I'm hearing you say is: The computations people generally execute for calculating climatology, be it mean, std, quantile (or SEEP?) can be expressed in Xarray's high-level API and do not fundamentally require a custom function mapped over all blocks.

In that case, I fully agree that we should avoid map_blocks but express these calculations in Xarray. We should probably benchmark both a computation based on a decomposable aggregation like mean and one based on a holistic aggregation like quantile.

The computations people generally execute for calculating climatology,

Yes. generally. The trick is to look at the function being mapped: in this case it's all Xarray ops and no for loops. That indicates the map_blocks is a workaround for some Xarray and/or dask/distributed inefficiency.

We should probably benchmark both a computation based on a decomposable aggregation like mean and one based on a holistic aggregation like quantile.

Yes totally agree with adding quantile. Dask will force a rechunk :)

PS: What is SEEP?

PPS: I notice this is based on a Xarray-beam workload, it may be that rechunk+map_blocks is the only way to express this in that framework.

Yes. generally. The trick is to look at the function being mapped: in this case it's all Xarray ops and no for loops. That indicates the map_blocks is a workaround for some Xarray and/or dask/distributed inefficiency.

Thanks for the clarification!

PS: What is SEEP?

All I can offer is https://github.com/google-research/weatherbench2/blob/47d72575cf5e99383a09bed19ba989b718d5fe30/scripts/compute_climatology.py#L147-L175. Maybe @shoyer can elaborate on SEEP and how common similar calculations are? Just looking at the code, this can probably be expressed in Xarray's high-level API as well.

PPS: I notice this is based on a Xarray-beam workload, it may be that rechunk+map_blocks is the only way to express this in that framework.

Yes, that's what it's based on, but not what we should be limited by.

In my experience, less experienced users usually try without rechunk+map_blocks first, and then grudgingly resort to that. It'd be nice to model that interaction and improve things if possible.

Yes, rechunk+map_blocks is definitely not the obvious solution. It would be quite nice if we could automatically optimize high-level Xarray code to use a rechunk + map_blocks style implementation when it will be more efficient.

I agree that we won't beat hand-optimized code. However, a major goal of these benchmarks is to understand how we can improve the performance of code that relies on high-level APIs to improve the end-user experience. We all seem to agree that it would be great if users wouldn't have to worry about rewriting their high-level code in a rechunk+map_blocks-style code if it can be avoided. So, wherever possible, I think we should try and benchmark code without optimization and analyze where and how this falls apart.

I'll also add that we want rechunk+map_blocks to always succeed, even if slow, so that it can be a dependable fallback, particularly when the algorithms get more complicated. This would be a major improvement over the dask.array experience today.

@dcherian: Could you elaborate on that last comment? In what situations does that combination not succeed today?

I can imagine this failing if the block sizes or the temporary memory footprint of the mapped function end up being too large. However, this seems like a situation that is beyond the control of higher-level APIs.

It will fail whenever rechunking fails... agree that after that, the blockwise is pretty stable, which is why people use this solution in combination with some kind of batching to avoid the rechunking failure. I don't have an example at hand. This is anecdotal experience from looking at lot of non-expert code at NCAR.

tests/geospatial/test_climatology.py

hendrikmakait · 2024-10-04T14:26:39Z

This should be good for a review now. (I expect that we will iterate on this once we have improved on some things within Xarray/Dask.)

jrbourbeau

Thanks @hendrikmakait

hendrikmakait added 3 commits September 20, 2024 11:54

Add benchmark for climatology

4821a63

Staged changes

b01a554

Make things work

4cff7b5

hendrikmakait commented Sep 20, 2024

View reviewed changes

jrbourbeau reviewed Sep 20, 2024

View reviewed changes

tests/geospatial/test_climatology.py Outdated Show resolved Hide resolved

hendrikmakait added 2 commits September 20, 2024 18:13

Use client factory

98ce10e

high-level code

9fc12f5

hendrikmakait mentioned this pull request Sep 26, 2024

rolling(...).construct(...) blows up chunk size pydata/xarray#9550

Closed

5 tasks

Add both versions

2830a91

hendrikmakait marked this pull request as ready for review October 1, 2024 14:23

hendrikmakait added 2 commits October 1, 2024 16:24

comment

2d64d13

Chunking

b826999

hendrikmakait changed the title ~~[WIP] Add benchmark for climatology~~ Add benchmark for climatology Oct 4, 2024

hendrikmakait requested a review from jrbourbeau October 4, 2024 14:26

jrbourbeau approved these changes Oct 4, 2024

View reviewed changes

jrbourbeau merged commit d2ec39f into main Oct 4, 2024
5 checks passed

jrbourbeau deleted the climatology-benchmark branch October 4, 2024 16:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add benchmark for climatology #1552

Add benchmark for climatology #1552

hendrikmakait commented Sep 20, 2024

hendrikmakait Sep 20, 2024

dcherian Sep 20, 2024 •

edited

Loading

hendrikmakait Sep 20, 2024 •

edited

Loading

dcherian Sep 20, 2024

hendrikmakait Sep 20, 2024 •

edited

Loading

shoyer Sep 20, 2024

hendrikmakait Sep 23, 2024

dcherian Sep 23, 2024

hendrikmakait Sep 23, 2024

dcherian Sep 23, 2024

hendrikmakait commented Oct 4, 2024

jrbourbeau left a comment

Add benchmark for climatology #1552

Add benchmark for climatology #1552

Conversation

hendrikmakait commented Sep 20, 2024

hendrikmakait Sep 20, 2024

Choose a reason for hiding this comment

dcherian Sep 20, 2024 • edited Loading

Choose a reason for hiding this comment

hendrikmakait Sep 20, 2024 • edited Loading

Choose a reason for hiding this comment

dcherian Sep 20, 2024

Choose a reason for hiding this comment

hendrikmakait Sep 20, 2024 • edited Loading

Choose a reason for hiding this comment

shoyer Sep 20, 2024

Choose a reason for hiding this comment

hendrikmakait Sep 23, 2024

Choose a reason for hiding this comment

dcherian Sep 23, 2024

Choose a reason for hiding this comment

hendrikmakait Sep 23, 2024

Choose a reason for hiding this comment

dcherian Sep 23, 2024

Choose a reason for hiding this comment

hendrikmakait commented Oct 4, 2024

jrbourbeau left a comment

Choose a reason for hiding this comment

dcherian Sep 20, 2024 •

edited

Loading

hendrikmakait Sep 20, 2024 •

edited

Loading

hendrikmakait Sep 20, 2024 •

edited

Loading