
WIP: Multiscale use-case #23

Closed

joshmoore opened this issue May 8, 2019 · 23 comments

@joshmoore
Member

Motivation

In imaging applications, especially interactive ones, the usability of a data array is greatly increased by having pre-computed sub-resolutions of the array. For example, an array of size (10**5, 10**5) might have halving-steps pre-computed, providing arrays of sizes 50000, 25000, 12500, 6250, 3125, etc. Users can quickly load a low-resolution representation to choose which regions are worth loading in higher or even full resolution. A few examples of this trend in imaging file formats are provided under Related Reading.
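As a rough illustration of the idea (the array names, sizes, and the crude subsampling below are placeholders, not anything specified by zarr), such halving steps could be pre-computed and stored as sibling arrays:

```python
import numpy as np
import zarr

# Illustrative only: build a naive halving pyramid by subsampling.
root = zarr.open_group("pyramid.zarr", mode="w")
base = np.random.randint(0, 256, size=(4096, 4096), dtype="uint8")
root.create_dataset("Resolution_0", data=base, chunks=(512, 512))

level, current = 1, base
while min(current.shape) > 512:
    current = current[::2, ::2]  # crude 2x downsampling; a real pipeline would average
    root.create_dataset(f"Resolution_{level}", data=current, chunks=(512, 512))
    level += 1
```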

The current zarr spec has the following issues when trying to naively specify such sub-resolutions:

  • Arrays of differing size can only represent the individual resolution by naming convention
    ("Reslolution_0", "Resolution_1", etc.) This issue exists in a number of existing formats.
  • Storing data of differing dimensions in the same chunk is not intended.
  • Even if data of differing dimensions (compression)

Generalization

In other domains, a generalization of this functionality might enable "summary data" to be stored,
where along a given dimension a function has been applied, e.g. averaging. This is usually most
beneficial when the function is sufficiently time-costly that it's worth trading storage for speed.
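For instance (a minimal sketch; the block size and the choice of a mean are arbitrary), averaging along both dimensions might look like:

```python
import numpy as np

def block_mean(a, factor=2):
    # Average non-overlapping factor x factor blocks of a 2D array.
    # Assumes the shape is evenly divisible by `factor` (illustration only).
    h, w = a.shape
    return a.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

data = np.arange(16, dtype="float64").reshape(4, 4)
summary = block_mean(data)   # shape (2, 2); each value is the mean of a 2x2 block
```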

Potential implementations

Filter / Memory-layout

Each chunk could be passed to a function which stores or reads the multiscale representation
with a given chunk. (TBD)

Array relationships

Metadata on a given array could specify one or both inheritance relationships to other arrays.
For example, if a child array links to its parent, it might store the following metadata:

{
    "summary_of": {
        "key": "Resolution_0",
        "method": "halving",
        "dimensions": [0, 1]
    }
}
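A minimal sketch of attaching that metadata with plain zarr attributes today (array names, shapes, and chunking are illustrative only):

```python
import zarr

root = zarr.open_group("data.zarr", mode="a")
child = root.require_dataset("Resolution_1", shape=(50000, 50000),
                             chunks=(1000, 1000), dtype="uint8")
# Store the parent ("summary_of") relationship on the child array.
child.attrs["summary_of"] = {
    "key": "Resolution_0",
    "method": "halving",
    "dimensions": [0, 1],
}
```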

One issue with only having the parent relationship defined is how one determines the lowest
resolution. The child relationships could be represented with:

{
    "summarized_by": [
        {
            "key": "Resolution_1",
            "method": "having",
            "dimensions": [0, 1]
        }, ...

    ]
}

but this would require updating source arrays when creating a summary.

An alternative would be to provide a single source of metadata on the relationships between arrays.
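For example (the key names here are invented purely for illustration, not a proposal), the group containing all resolutions could carry one consolidated description:

```python
import zarr

root = zarr.open_group("data.zarr", mode="a")
# One place that describes every level and how it was derived.
root.attrs["multiscale"] = {
    "source": "Resolution_0",
    "levels": [
        {"key": "Resolution_1", "method": "halving", "dimensions": [0, 1]},
        {"key": "Resolution_2", "method": "halving", "dimensions": [0, 1]},
    ],
}
```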

Related reading

Possible synonyms / Related concepts

  • Global lossy compression
  • Progressive compression
  • Pyramidal images
  • Sub-resolutions
  • Summary views
@joshmoore changed the title from "Multiscale use-case (WIP)" to "WIP: Multiscale use-case" on May 8, 2019
@rabernat
Contributor

rabernat commented May 8, 2019

This is a good idea. Also probably worth looking at Cloud Optimized Geotiff:
https://www.cogeo.org/

COGs store multiscale imagery in a cloud-optimized way.

@forman

forman commented May 9, 2019

Very good idea!

We started developing a similar feature in our project xcube which provides a CLI command xcube level <zarr-cube-dataset>. It turns Zarr data cubes into a directory containing a spatial multi-resolution pyramid with chunking tailored for image tile processing.

There is also a xcube serve command that starts a WMTS server and a viewer that exploits the leveled datasets: xcube-viewer.

I'm happy to contribute to specs and implementations.

@rabernat
Contributor

rabernat commented May 9, 2019

@forman - xcube looks like a fascinating project, potentially of very broad interest. Could I convince you to give us a brief presentation about it at an upcoming Pangeo weekly call? (http://pangeo.io/meeting-notes.html)

@joshmoore
Member Author

I'd be happy to join that as well.

@forman

forman commented May 9, 2019

@rabernat Sure, thanks! May 29 should fit. Next week we are at the ESA Living Planet Symposium, maybe someone of you guys is there too?

@jakirkham
Member

jakirkham commented May 10, 2019

cc-ing @jni @sofroniewn, as this issue and xcube may be of interest. 😉

@sofroniewn

Thanks @jakirkham yes I'm highly interested in a feature like this and have had some chats with @joshmoore about it before, so glad to see an issue was made. The xcube project looks very interesting too, so also thanks for pointing me towards it.

My current use is viewing large multi-resolution pathology images. As @joshmoore mentioned, I had to adopt my own naming conventions for each layer, and I saw a large increase in the total file size relative to an optimized tiff format (I was working with data from the Camelyon16 challenge and visualizing it with napari).

@alimanfoo
Member

alimanfoo commented May 13, 2019 via email

@rabernat
Contributor

@forman - thanks for agreeing! I will put you on the agenda for May 29 and follow up closer to the meeting date.

@sofroniewn

> I'd be interested to know if you have any sense of why file size increased, i.e., what is tiff doing differently?

@alimanfoo I'm not quite sure. I tried to avoid anything foolish, like changing the dtype, but I saw one link imply some of these multiresolution tiffs use jpeg2000 compression which then doesn't quite make it a fair comparison! Also the data from the tiff looks like it has some compression artifacts in it, so they have done something to it at some point.

Here's a link to a google drive that has the tiffs in it and a README.md describing the data. The file training/tumor/tumor_001.tif is a fine example if you wanted to take a look.

I'm really not sure if they have done any clever optimization across layers of the pyramid or if they just treat them all independently and are using the JPEG2000 compression on each one. Sorry for not being more helpful!

@alimanfoo
Member

alimanfoo commented May 14, 2019 via email

@jakirkham
Member

This line from Wikipedia suggests JPEG2000 is optimized for handling different scales.

> The codestream obtained after compression of an image with JPEG 2000 is scalable in nature, meaning that it can be decoded in a number of ways; for instance, by truncating the codestream at any point, one may obtain a representation of the image at a lower resolution, or signal-to-noise ratio – see scalable compression.

Maybe there is more to the story. That said, a good first step might be to design a codec for working with JPEG 2000 compression.
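A minimal sketch of what such a codec might look like, assuming the numcodecs Codec interface and the jpeg2k_encode/jpeg2k_decode functions from the third-party imagecodecs package (the fixed-shape handling and the exact call signatures are simplifying assumptions, not a definitive implementation):

```python
import numpy as np
from numcodecs.abc import Codec
from numcodecs import register_codec
from imagecodecs import jpeg2k_encode, jpeg2k_decode  # third-party; assumed available

class Jpeg2k(Codec):
    """Illustrative JPEG 2000 codec for 2D chunks of a known shape and dtype."""

    codec_id = "jpeg2k_sketch"  # placeholder id, not registered upstream

    def __init__(self, shape, dtype="uint8"):
        self.shape = list(shape)            # the codec needs the 2D chunk shape
        self.dtype = str(np.dtype(dtype))

    def encode(self, buf):
        arr = np.frombuffer(buf, dtype=self.dtype).reshape(self.shape)
        return jpeg2k_encode(arr)

    def decode(self, buf, out=None):
        arr = jpeg2k_decode(bytes(buf))
        if out is not None:
            out[...] = arr.reshape(self.shape)
            return out
        return arr.tobytes()

register_codec(Jpeg2k)
```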

@alimanfoo
Member

alimanfoo commented May 14, 2019 via email

@meggart
Member

meggart commented May 16, 2019

> I wonder if you could do something like encode chunks of an array with a JPEG2000 codec,

It looks like the codec would need some information about the shape of the data it is going to compress in order to preserve the spatial structure. AFAIK a codec has so far been able to treat data simply as a stream of bytes, but that should not be a major problem. Also, I think JPEG2000 only works on 2D data, so when compressing one would have to tell the compressor which dimensions are the spatial ones.

> but then have multiple "virtual" zarr arrays which all share the same chunk objects, but which read the chunks to different resolutions. Not sure if that makes sense.

Wouldn't this be inefficient for object stores, because although you decode only part of the stream, you would still have to load the full compressed chunk first? There might be image codecs out there that can store the compressed data in separate files, where the compression levels form a new array axis along which more and more detail is added to the image. Don't know if this makes sense.

A very simple example for e.g. a time series would be to store the FFT of the time series and have the lowest frequencies in the first chunk, intermediate frequencies in the second and high frequencies in the third chunk. When reading back the data, depending on how many values you read from the frequency axis, you will get a more or less detailed time series.
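A rough sketch of that idea for a 1D signal (entirely illustrative; the three-way split and the zero-padded inverse transform are simplifications):

```python
import numpy as np

signal = np.random.rand(1024)
coeffs = np.fft.rfft(signal)            # complex coefficients, ordered low -> high frequency

# Conceptually, store the coefficients along a "detail" axis in three chunks.
low, mid, high = np.array_split(coeffs, 3)

# Reading only the first chunk gives a coarse reconstruction of the series.
padded = np.zeros_like(coeffs)
padded[: low.size] = low
coarse = np.fft.irfft(padded, n=signal.size)   # low-resolution version of the signal
```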

So it would be good to find out if a JPEG2000 encoded image can be split across several chunks, or if there are similar codecs which could do that.

@meggart
Member

meggart commented May 17, 2019

Ok, if my last reply was too confusing, this gist conceptually shows what I would suggest: https://gist.github.com/meggart/0f85bac03c66e321054288b121e423b5

The JPEG2000 equivalent would then be to split an encoded JPEG stream into chunks and store the chunks along a new axis. This way one could store the data without being redundant and still access data at low resolution without touching the high-res chunks.

@joshmoore
Member Author

@jakirkham : by truncating the codestream at any point, one may obtain a representation of the image at a lower resolution, or signal-to-noise ratio – see scalable compression.

This was one of my driving motivations for getting this issue opened. I imagine we could do something at the chunk level, but it's unclear to me how zarr could do this at the array level (which is where the "global lossy compression" moniker comes from).

@jakirkham : a good first step might be to design a codec for working with JPEG 2000 compression.

Another issue is going to be the 2D-ness of JPEG2000, as @meggart says.

@alimanfoo : then have multiple "virtual" zarr arrays which all
share the same chunk objects, but which read the chunks to different
resolutions.

I don't know if it makes sense either, but I'm hoping one of you guys would have one or more brilliant ideas 😉

@meggart : So it would be good to find out if a JPEG2000 encoded image can be split across several chunks, or if there are similar codecs which could do that.

From chatting with @melissalinkert et al.: OME-TIFF currently will compress either a whole 2D plane, getting the benefit of JPEG2000, or individual tiles, which then can't take advantage of the global context.

@jakirkham
Member

To @joshmoore and @meggart's points, I think JPEG 2000 is just using wavelet transforms to achieve multiresolution compression. So one can probably dig into this a bit more to identify an exact algorithm that works well.
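As an illustration of the wavelet point (using the third-party PyWavelets package; the wavelet and decomposition level are arbitrary choices):

```python
import numpy as np
import pywt  # PyWavelets, third-party

image = np.random.rand(512, 512)
# Multi-level 2D wavelet decomposition: coeffs[0] is a coarse approximation,
# followed by tuples of progressively finer detail coefficients.
coeffs = pywt.wavedec2(image, wavelet="haar", level=3)
approx = coeffs[0]   # roughly a 64x64 low-resolution view of the image
```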

Independently, it would be valuable to set up a codec based on JPEG 2000 (as this is something we can use today). This would help motivate work for handling the chunks (even in the limited 2D case). I imagine what we learn here will be useful for ND once the right codec shows up. 😉

@forman

forman commented May 31, 2019

> @forman - thanks for agreeing! I will put you on the agenda for May 29 and follow up closer to the meeting date.

@rabernat Sorry, I couldn't make it last Wednesday. We still definitely want to join one of the upcoming meetings. I'll let you know...

@joshmoore
Member Author

joshmoore commented Jul 17, 2019

@mmorehea

mmorehea commented Dec 2, 2019

Hi everyone,

We're very interested in zarr / n5 for visualizing large volumes. I spent a few hours today reading zarr documentation and putting together this toy example for creating optional pyramidal layers. As it's my first day with zarr, not everything is ideal, so zarr experts can probably do this way better.

https://github.com/mmorehea/zarrPyramid/blob/master/zarrPyramid.py

After writing it, I came here to find there have already been some great suggestions (xcube, @meggart 's pyramid code, etc) that do the work better, doh. Happy to support any variant that the zarr developers decide on. Cheers!

@jakirkham
Member

@sofroniewn, is this a good place for people to look at Napari's implementation ( napari/napari#545 )?

@joshmoore
Member Author

After various recent discussions, I've opened #50 to propose an initial format for multiscale arrays.

@joshmoore
Member Author

After #50 and now #125 I think this can be safely closed.
