Support chunking multiple assets together in the time/band dimensions #106

Closed
gjoseph92 opened this issue Dec 16, 2021 · 0 comments · Fixed by #116

Currently, stackstac is built around each STAC Asset being its own chunk in the dask array: the time and band dimensions always have a chunksize of 1.

However, there are cases where you might want to load multiple Assets in one chunk of the array. Most commonly, you'd do this when you have a huge graph, need to cut down on tasks, and can give up some granularity. In particular, you might be happy to combine the time dimension into fewer chunks if you know you're going to compute a composite right away anyway. See microsoft/PlanetaryComputer#12 (comment) for a motivating example.

So let's support extending the chunksize= argument to stackstac.stack to take up to 4-tuples (time, band, y, x), so you can specify the chunking along all dimensions.

Note that this isn't #66 (though that could be a follow-on): we're not talking about flattening/pre-mosaicing the data. We'd still load every asset as usual, it's just that the chunks of the dask array might be (4, 2, Y, X) instead of always (1, 1, Y, X).
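
To make the effect on graph size concrete, here is a minimal sketch in plain Python. `normalize_chunksize` and `num_chunks` are hypothetical helpers written for this illustration, not part of stackstac, and the rule that shorter tuples fill in from the right (so a 2-tuple means `(y, x)`) is an assumption about how the extended `chunksize=` might be interpreted.

```python
import math
from typing import Tuple, Union

ChunkSpec = Union[int, Tuple[int, ...]]


def normalize_chunksize(chunksize: ChunkSpec) -> Tuple[int, int, int, int]:
    """Hypothetical helper: expand a chunksize into (time, band, y, x).

    Assumption: an int means square spatial chunks, and tuples of length
    2 to 4 are right-aligned, with missing leading dimensions set to 1
    (today's one-asset-per-chunk behavior).
    """
    if isinstance(chunksize, int):
        return (1, 1, chunksize, chunksize)
    if not 2 <= len(chunksize) <= 4:
        raise ValueError(f"expected an int or a 2- to 4-tuple, got {chunksize!r}")
    padding = (1,) * (4 - len(chunksize))
    time, band, y, x = padding + tuple(chunksize)
    return (time, band, y, x)


def num_chunks(shape: Tuple[int, int, int, int], chunksize: ChunkSpec) -> int:
    """Count the chunks needed to cover an array of the given shape."""
    chunks = normalize_chunksize(chunksize)
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


# 8 timesteps, 4 bands, 2048 x 2048 pixels
shape = (8, 4, 2048, 2048)
per_asset = num_chunks(shape, 1024)                # chunks of (1, 1, 1024, 1024)
combined = num_chunks(shape, (4, 2, 1024, 1024))   # chunks of (4, 2, 1024, 1024)
print(per_asset, combined)
```

Under these assumptions, grouping 4 timesteps and 2 bands per chunk cuts the chunk count by a factor of 8, while every asset is still read.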

This should be done/considered as a part of #105.

Questions:

  • When a chunk contains multiple assets, should they be loaded serially or in parallel? We could create our own internal threadpool, since most of the work is IO-bound rather than CPU-bound. However, because we have to duplicate the GDAL Dataset and file descriptor per thread, that might be expensive in memory. I suppose the total runtime of T threads reading N single-asset chunks is about the same as T threads reading N / C chunks of C assets each, where each chunk takes C times longer to read. So probably serially. Sure would be nice to just have an aiocogeo Reader for this 😁
  • How will combining multiple bands into a single chunk interact with multi-band COG support (#62)?
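
The runtime argument in the first question can be checked with a back-of-the-envelope model. This is only a sketch: `estimated_runtime` is a hypothetical function, and it assumes every asset takes the same time to read, that chunks are processed in rounds of T at a time, and that assets within a chunk are read serially.

```python
import math


def estimated_runtime(
    n_assets: int,
    assets_per_chunk: int,
    n_threads: int,
    seconds_per_asset: float,
) -> float:
    """Hypothetical model: wall-clock time to read all assets.

    Each chunk is one task that reads its assets one after another;
    n_threads tasks run at once, in lockstep rounds.
    """
    n_chunks = math.ceil(n_assets / assets_per_chunk)
    rounds = math.ceil(n_chunks / n_threads)
    return rounds * assets_per_chunk * seconds_per_asset


# N = 64 assets, T = 4 threads, 0.5 s per asset, for C = 1, 4, 8
for c in (1, 4, 8):
    print(c, estimated_runtime(64, c, 4, 0.5))
```

When N divides evenly into chunks and rounds, the model gives the same total for every C, which matches the intuition that serial loading within a chunk costs nothing as long as there are enough chunks to keep all T threads busy. It does not capture the case where combining leaves fewer chunks than threads.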