
WIP: Multiscale use-case #23

Closed

joshmoore opened this issue May 8, 2019 · 23 comments

@joshmoore
Member

Motivation

In imaging applications, especially interactive ones, the usability of a data array is greatly increased by having pre-computed sub-resolutions of the array. For example, an array of size (10**5, 10**5) might have halving-steps pre-computed, providing arrays of sizes 50000, 25000, 12500, 6250, 3125, etc. Users can quickly load a low-resolution representation to choose which regions are worth loading in higher or even full resolution. A few examples of this trend in imaging file formats are provided under Related Reading.
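As a rough illustration of the idea (the array names, sizes, and the crude subsampling below are placeholders, not anything specified by zarr), such halving steps could be pre-computed and stored as sibling arrays:

```python
import numpy as np
import zarr

# Illustrative only: build a naive halving pyramid by subsampling.
root = zarr.open_group("pyramid.zarr", mode="w")
base = np.random.randint(0, 256, size=(4096, 4096), dtype="uint8")
root.create_dataset("Resolution_0", data=base, chunks=(512, 512))

level, current = 1, base
while min(current.shape) > 512:
    current = current[::2, ::2]  # crude 2x downsampling; a real pipeline would average
    root.create_dataset(f"Resolution_{level}", data=current, chunks=(512, 512))
    level += 1
```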

The current zarr spec has the following issues when trying to naively specify such sub-resolutions:

  • Arrays of differing size can only represent the individual resolution by naming convention
    ("Reslolution_0", "Resolution_1", etc.) This issue exists in a number of existing formats.
  • Storing data of differing dimensions in the same chunk is not intended.
  • Even if data of differing dimensions (compression)

Generalization

In other domains, a generalization of this functionality might enable "summary data" to be stored,
where along a given dimension a function has been applied, e.g. averaging. This is usually most
beneficial when the function is sufficiently time-costly that it's worth trading storage for speed.
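For instance (a minimal sketch; the block size and the choice of a mean are arbitrary), averaging along both dimensions might look like:

```python
import numpy as np

def block_mean(a, factor=2):
    # Average non-overlapping factor x factor blocks of a 2D array.
    # Assumes the shape is evenly divisible by `factor` (illustration only).
    h, w = a.shape
    return a.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

data = np.arange(16, dtype="float64").reshape(4, 4)
summary = block_mean(data)   # shape (2, 2); each value is the mean of a 2x2 block
```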

Potential implementations

Filter / Memory-layout

Each chunk could be passed to a function which stores or reads the multiscale representation
with a given chunk. (TBD)

Array relationships

Metadata on a given array could specify one or both inheritance relationships to other arrays.
For example, if a child array links to its parent, it might store the following metadata:

{
    "summary_of": {
        "key": "Resolution_0",
        "method": "halving",
        "dimensions": [0, 1]
    }
}
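A minimal sketch of attaching that metadata with plain zarr attributes today (array names, shapes, and chunking are illustrative only):

```python
import zarr

root = zarr.open_group("data.zarr", mode="a")
child = root.require_dataset("Resolution_1", shape=(50000, 50000),
                             chunks=(1000, 1000), dtype="uint8")
# Store the parent ("summary_of") relationship on the child array.
child.attrs["summary_of"] = {
    "key": "Resolution_0",
    "method": "halving",
    "dimensions": [0, 1],
}
```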

One issue with only having the parent relationship defined is how one determines the lowest
resolution. The child relationships could be represented with:

{
    "summarized_by": [
        {
            "key": "Resolution_1",
            "method": "having",
            "dimensions": [0, 1]
        }, ...

    ]
}

but this would require updating source arrays when creating a summary.

An alternative would be to provide a single source of metadata on the relationships between arrays.
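For example (the key names here are invented purely for illustration, not a proposal), the group containing all resolutions could carry one consolidated description:

```python
import zarr

root = zarr.open_group("data.zarr", mode="a")
# One place that describes every level and how it was derived.
root.attrs["multiscale"] = {
    "source": "Resolution_0",
    "levels": [
        {"key": "Resolution_1", "method": "halving", "dimensions": [0, 1]},
        {"key": "Resolution_2", "method": "halving", "dimensions": [0, 1]},
    ],
}
```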

Related reading

Possible synonyms / Related concepts

  • Global lossy compression
  • Progressive compression
  • Pyramidal images
  • Sub-resolutions
  • Summary views
@joshmoore changed the title from "Multiscale use-case (WIP)" to "WIP: Multiscale use-case" on May 8, 2019
@rabernat
Contributor

rabernat commented May 8, 2019

This is a good idea. Also probably worth looking at Cloud Optimized Geotiff:
https://www.cogeo.org/

COGs store multiscale imagery in a cloud-optimized way.

@forman

forman commented May 9, 2019

Very good idea!

We started developing a similar feature in our project xcube which provides a CLI command xcube level <zarr-cube-dataset>. It turns Zarr data cubes into a directory containing a spatial multi-resolution pyramid with chunking tailored for image tile processing.

There is also a xcube serve command that starts a WMTS server and a viewer that exploits the leveled datasets: xcube-viewer.

I'm happy to contribute to specs and implementations.

@rabernat
Contributor

rabernat commented May 9, 2019

@forman - xcube looks like a fascinating project, potentially of very broad interest. Could I convince you to give us a brief presentation about it at an upcoming Pangeo weekly call? (http://pangeo.io/meeting-notes.html)

@joshmoore
Member Author

I'd be happy to join that as well.

@forman

forman commented May 9, 2019

@rabernat Sure, thanks! May 29 should fit. Next week we are at the ESA Living Planet Symposium, maybe someone of you guys is there too?

@jakirkham
Member

jakirkham commented May 10, 2019

cc-ing @jni @sofroniewn, as this issue and xcube may be of interest. 😉

@sofroniewn

Thanks @jakirkham yes I'm highly interested in a feature like this and have had some chats with @joshmoore about it before, so glad to see an issue was made. The xcube project looks very interesting too, so also thanks for pointing me towards it.

My current use is viewing large multi-resolution pathology images. As @joshmoore mentioned, I had to adopt my own naming conventions for each layer, and I saw a large increase in the total file size relative to an optimized tiff format (I was working with data from the Camelyon16 challenge and visualizing it with napari).

@alimanfoo
Member

alimanfoo commented May 13, 2019 via email

@rabernat
Contributor

@forman - thanks for agreeing! I will put you on the agenda for May 29 and follow up closer to the meeting date.

@sofroniewn

> I'd be interested to know if you have any sense of why file size increased, i.e., what is tiff doing differently?

@alimanfoo I'm not quite sure. I tried to avoid anything foolish, like changing the dtype, but I saw one link imply some of these multiresolution tiffs use jpeg2000 compression which then doesn't quite make it a fair comparison! Also the data from the tiff looks like it has some compression artifacts in it, so they have done something to it at some point.

Here's a link to a google drive that has the tiffs in it and a README.md describing the data. The file training/tumor/tumor_001.tif is a fine example if you wanted to take a look.

I'm really not sure if they have done any clever optimization across layers of the pyramid or if they just treat them all independently and are using the JPEG2000 compression on each one. Sorry for not being more helpful!

@alimanfoo
Member

alimanfoo commented May 14, 2019 via email

@jakirkham
Member

This line from Wikipedia suggests JPEG2000 is optimized for handling different scales.

> The codestream obtained after compression of an image with JPEG 2000 is scalable in nature, meaning that it can be decoded in a number of ways; for instance, by truncating the codestream at any point, one may obtain a representation of the image at a lower resolution, or signal-to-noise ratio – see scalable compression.

Maybe there is more to the story. That said, a good first step might be to design a codec for working with JPEG 2000 compression.
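A minimal sketch of what such a codec might look like, assuming the numcodecs Codec interface and the jpeg2k_encode/jpeg2k_decode functions from the third-party imagecodecs package (the fixed-shape handling and the exact call signatures are simplifying assumptions, not a definitive implementation):

```python
import numpy as np
from numcodecs.abc import Codec
from numcodecs import register_codec
from imagecodecs import jpeg2k_encode, jpeg2k_decode  # third-party; assumed available

class Jpeg2k(Codec):
    """Illustrative JPEG 2000 codec for 2D chunks of a known shape and dtype."""

    codec_id = "jpeg2k_sketch"  # placeholder id, not registered upstream

    def __init__(self, shape, dtype="uint8"):
        self.shape = list(shape)            # the codec needs the 2D chunk shape
        self.dtype = str(np.dtype(dtype))

    def encode(self, buf):
        arr = np.frombuffer(buf, dtype=self.dtype).reshape(self.shape)
        return jpeg2k_encode(arr)

    def decode(self, buf, out=None):
        arr = jpeg2k_decode(bytes(buf))
        if out is not None:
            out[...] = arr.reshape(self.shape)
            return out
        return arr.tobytes()

register_codec(Jpeg2k)
```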

@alimanfoo
Member

alimanfoo commented May 14, 2019 via email

@meggart
Member

meggart commented May 16, 2019

> I wonder if you could do something like encode chunks of an array with a JPEG2000 codec,

It looks like the codec would need some information about the shape of the data it is going to compress in order to preserve the spatial structure. AFAIK a codec has so far been able to treat data simply as a stream of bytes, but that should not be a major problem. Also, I think JPEG2000 only works on 2D data, so when compressing one would have to tell the compressor which dimensions are the spatial ones.

> but then have multiple "virtual" zarr arrays which all share the same chunk objects, but which read the chunks to different resolutions. Not sure if that makes sense.

Wouldn't this be inefficient for object stores, because although you decode only part of the stream, you would still have to load the full compressed chunk first? There might be image codecs out there that can store the compressed data in separate files, where the compression levels form a new array axis along which more and more detail is added to the image. Don't know if this makes sense.

A very simple example for e.g. a time series would be to store the FFT of the time series and have the lowest frequencies in the first chunk, intermediate frequencies in the second and high frequencies in the third chunk. When reading back the data, depending on how many values you read from the frequency axis, you will get a more or less detailed time series.
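A rough sketch of that idea for a 1D signal (entirely illustrative; the three-way split and the zero-padded inverse transform are simplifications):

```python
import numpy as np

signal = np.random.rand(1024)
coeffs = np.fft.rfft(signal)            # complex coefficients, ordered low -> high frequency

# Conceptually, store the coefficients along a "detail" axis in three chunks.
low, mid, high = np.array_split(coeffs, 3)

# Reading only the first chunk gives a coarse reconstruction of the series.
padded = np.zeros_like(coeffs)
padded[: low.size] = low
coarse = np.fft.irfft(padded, n=signal.size)   # low-resolution version of the signal
```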

So it would be good to find out if a JPEG2000 encoded image can be split across several chunks, or if there are similar codecs which could do that.

@meggart
Member

meggart commented May 17, 2019

Ok, if my last reply was too confusing, this gist conceptually shows what I would suggest: https://gist.github.com/meggart/0f85bac03c66e321054288b121e423b5

The JPEG2000 equivalent would then be to split an encoded JPEG stream into chunks and store the chunks along a new axis. This way one could store the data without being redundant and still access data at low resolution without touching the high-res chunks.

@joshmoore
Member Author

@jakirkham : by truncating the codestream at any point, one may obtain a representation of the image at a lower resolution, or signal-to-noise ratio – see scalable compression.

This was one of my driving motivations for getting this issue opened. I imagine we could do something at the chunk level, but it's unclear to me how zarr could do this at the array level (which is where the "global lossy compression" moniker comes from).

@jakirkham : a good first step might be to design a codec for working with JPEG 2000 compression.

Another issue is going to be the 2D-ness of JPEG2000, as @meggart says.

@alimanfoo : then have multiple "virtual" zarr arrays which all
share the same chunk objects, but which read the chunks to different
resolutions.

I don't know if it makes sense either, but I'm hoping one of you guys would have one or more brilliant ideas 😉

@meggart : So it would be good to find out if a JPEG2000 encoded image can be split across several chunks, or if there are similar codecs which could do that.

From chatting with @melissalinkert et al.: OME-TIFF currently will compress either a whole 2D plane, getting the benefit of JPEG2000, or individual tiles, which then can't take advantage of the global context.

@jakirkham
Member

To @joshmoore and @meggart's points, I think JPEG 2000 is just using wavelet transforms to achieve multiresolution compression. So one can probably dig into this a bit more to identify an exact algorithm that works well.
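As an illustration of the wavelet point (using the third-party PyWavelets package; the wavelet and decomposition level are arbitrary choices):

```python
import numpy as np
import pywt  # PyWavelets, third-party

image = np.random.rand(512, 512)
# Multi-level 2D wavelet decomposition: coeffs[0] is a coarse approximation,
# followed by tuples of progressively finer detail coefficients.
coeffs = pywt.wavedec2(image, wavelet="haar", level=3)
approx = coeffs[0]   # roughly a 64x64 low-resolution view of the image
```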

Independently, it would be valuable to set up a codec based on JPEG 2000 (as this is something we can use today). This would help motivate work for handling the chunks (even in the limited 2D case). I imagine what we learn here will be useful for ND once the right codec shows up. 😉

@forman

forman commented May 31, 2019

> @forman - thanks for agreeing! I will put you on the agenda for May 29 and follow up closer to the meeting date.

@rabernat Sorry, I couldn't make it last Wednesday. We still definitely want to join one of the upcoming meetings. I'll let you know...

@joshmoore
Member Author

joshmoore commented Jul 17, 2019

@mmorehea

mmorehea commented Dec 2, 2019

Hi everyone,

We're very interested in zarr / n5 for visualizing large volumes. I spent a few hours today reading zarr documentation and putting together this toy example for creating optional pyramidal layers. As it's my first day with zarr, not everything is ideal, so zarr experts can probably do this way better.

https://github.com/mmorehea/zarrPyramid/blob/master/zarrPyramid.py

After writing it, I came here to find there have already been some great suggestions (xcube, @meggart 's pyramid code, etc) that do the work better, doh. Happy to support any variant that the zarr developers decide on. Cheers!

@jakirkham
Member

@sofroniewn, is this a good place for people to look at Napari's implementation ( napari/napari#545 )?

@joshmoore
Member Author

After various recent discussions, I've opened #50 to propose an initial format for multiscale arrays.

@joshmoore
Member Author

After #50 and now #125 I think this can be safely closed.
