Support HDF4? #216

TomNicholas · 2024-08-07T19:38:46Z

Could we support generating chunk manifests pointing to HDF4 files too? I know nothing about this format, but in #85 (comment) @jgallagher59701 mentioned that DMR++ can (or soon will) support it.

> I should add to the above that many of the newer features in DMR++ are there to support HDF4 - yes, '4' - and that requires some hackery in the interpreter. Look at how can now contain elements. In HDF4, a 'chunk' is not necessarily atomic. Also complicating the development of an interpreter is the use of fill values in both HDF4 and HDF5, even for scalar variables. That said, we have a full interpreter in C++, which i realize is not exactly enticing for many ;-), but that means there is code for this and this 'documentation' is 'verifiable' since it's running code.

If DMR++ can index HDF4, and DMR++ can be translated to zarr chunk manifests (see #85), then presumably a reader for HDF4 directly to chunk manifests would also be possible?

cc @ayushnag @betolink

jgallagher59701 · 2024-08-08T16:29:40Z

On Aug 7, 2024, at 13:39, Tom Nicholas wrote:

> Could we support generating chunk manifests pointing to HDF4 files too? I know nothing about this format, but in #85 (comment) @jgallagher59701 mentioned that DMR++ can (or soon will) support it.

We use the same code to interpret the DMR++ for HDF5 and HDF4.

> I should add to the above that many of the newer features in DMR++ are there to support HDF4 - yes, '4' - and that requires some hackery in the interpreter. Look at how can now contain elements. In HDF4, a 'chunk' is not necessarily atomic. Also complicating the development of an interpreter is the use of fill values in both HDF4 and HDF5, even for scalar variables. That said, we have a full interpreter in C++, which i realize is not exactly enticing for many ;-), but that means there is code for this and this 'documentation' is 'verifiable' since it's running code.

> If DMR++ can index HDF4, and DMR++ can be translated to zarr chunk manifests, then presumably a reader for HDF4 directly to chunk manifests would also be possible?

Yes. There’s quite a bit to HDF4, however, because it is a more complex format than HDF5. And, NASA’s HDF4 is not vanilla HDF4, so it has its own complexities on top of that. Bottom line, you will probably have to extend the interpreter you have, but it’s certainly possible and there is lots of data in HDF4. HTH, James

TomNicholas · 2024-08-20T14:46:03Z

@martindurant has an in-progress PR to kerchunk to add support for reading HDF4 directly. If that makes it in we can just call it from vz.open_virtual_dataset, which would fully close this issue.

martindurant · 2024-08-20T14:59:16Z

I should warn you that I am working to match only specific NASA data (provided by @maxrjones), not HDF4 in general, and I suspect that the chunks may generally be tiny.

jgallagher59701 · 2024-08-20T18:20:49Z

Older data in HDF4/5 almost always has small chunks (spinning disks, low-latency, small block sizes). But that is not a big problem. Group the contiguous chunks and transfer them in a single I/O operation and then decompress them in parallel. We call these grouped chunks 'Super Chunks.' It is an optimization that Patrick Quinn first implemented and we stumbled on later. This is far more efficient than transferring the small chunks in parallel (in general, exceptions exist).
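
As a rough illustration of the 'Super Chunk' idea described above, the sketch below reads a run of byte-contiguous chunks with one read and then decompresses the pieces in a thread pool. It is not the DMR++ or kerchunk implementation: the function name `read_super_chunk`, the `(offset, length)` chunk list, the choice of zlib compression, and the in-memory file standing in for a remote object are all assumptions made for the example.

```python
import io
import zlib
from concurrent.futures import ThreadPoolExecutor


def read_super_chunk(f, chunks):
    """Read byte-contiguous compressed chunks in one I/O call, then decompress.

    `chunks` is a list of (offset, length) pairs sorted by offset, assumed to
    sit back to back in the file and to be zlib-compressed (both assumptions).
    """
    start = chunks[0][0]
    end = chunks[-1][0] + chunks[-1][1]
    f.seek(start)
    blob = f.read(end - start)  # the single "super chunk" transfer
    # Slice the blob back into the individual compressed chunks.
    pieces = [blob[off - start: off - start + n] for off, n in chunks]
    # Decompress the pieces concurrently; map() preserves chunk order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(zlib.decompress, pieces))


# Build a tiny fake file: a 4-byte header followed by three compressed chunks.
raw = [b"a" * 100, b"b" * 200, b"c" * 50]
compressed = [zlib.compress(r) for r in raw]
header = b"HDR!"
fake_file = io.BytesIO(header + b"".join(compressed))

# Record where each compressed chunk landed in the fake file.
chunks, offset = [], len(header)
for c in compressed:
    chunks.append((offset, len(c)))
    offset += len(c)

decoded = read_super_chunk(fake_file, chunks)
```

Against a remote store, the `seek`/`read` pair would presumably become one HTTP range request; the slicing and decompression steps stay the same.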

martindurant · 2024-08-20T18:25:25Z

Yes, kerchunk also joins near-contiguous chunks; the problems I actually see are:

- the large number of references means relatively big reference stores
- relatively small gains for reading only select chunks compared to grabbing the whole file every time.

TomNicholas · 2024-08-20T18:27:55Z

> Group the contiguous chunks and transfer them in a single I/O operation and then decompress them in parallel.

> Yes, kerchunk also joins near-contiguous chunks

This is something interesting that I've not heard about before. By "grouping" or "joining" do you mean literally concatenating the byte ranges together? Or something else?

jgallagher59701 · 2024-08-20T18:29:58Z

> > Group the contiguous chunks and transfer them in a single I/O operation and then decompress them in parallel.
> > Yes, kerchunk also joins near-contiguous chunks
>
> This is something interesting that I've not heard about before. By "grouping" or "joining" do you mean literally concatenating the byte ranges together? Or something else?

I mean concatenating the byte ranges. Often in these files the chunks lie right next to each other (for a given array).

jgallagher59701 · 2024-08-20T18:32:13Z

> Yes, kerchunk also joins near-contiguous chunks; the problem I actually see
> ...
>
> relatively small gains for reading only select chunks compared to grabbing the whole file every time.

That's true for files with a small number of variables. Get the whole file. If there are O(10^2) variables and only 2-3 are needed, it's faster to get just those 2-3. Again, there are exceptions.

martindurant · 2024-08-20T18:33:00Z

In ReferenceFS, if you cat() with a number of references, those within a single file may be merged depending on the arguments

        max_gap=64_000,
        max_block=256_000_000,

For example, for references [remote://file, 10, 10] and [remote://file, 30, 10], the actual request will be bytes 10->40 if the gap is smaller than max_gap. The result is then sliced into two outputs.
Naturally, if max_gap=0, only truly contiguous parts are merged; use max_gap<0 for no merging at all. The requests would still be concurrent, however.
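
The merging rule described above can be sketched in a few lines. This is a simplified stand-in written for this thread, not the code fsspec actually runs: `Ref`, `merge_refs` and `slice_outputs` are hypothetical names, and treating a gap equal to `max_gap` as mergeable is an assumption chosen so that `max_gap=0` merges truly contiguous ranges.

```python
from typing import NamedTuple


class Ref(NamedTuple):
    url: str
    offset: int
    length: int


def merge_refs(refs, max_gap=64_000, max_block=256_000_000):
    """Group references into merged requests: (url, start, end, members)."""
    merged = []
    for ref in sorted(refs):  # sorts by url, then offset
        start, end = ref.offset, ref.offset + ref.length
        if merged:
            url, m_start, m_end, members = merged[-1]
            same_file = url == ref.url
            close_enough = start - m_end <= max_gap  # never true if max_gap < 0
            small_enough = max(end, m_end) - m_start <= max_block
            if same_file and close_enough and small_enough:
                merged[-1] = (url, m_start, max(m_end, end), members + [ref])
                continue
        merged.append((ref.url, start, end, [ref]))
    return merged


def slice_outputs(block, block_start, members):
    """Cut one fetched block back into the bytes of each original reference."""
    return [
        block[r.offset - block_start: r.offset - block_start + r.length]
        for r in members
    ]


refs = [Ref("remote://file", 10, 10), Ref("remote://file", 30, 10)]
url, start, end, members = merge_refs(refs)[0]  # one request for bytes 10->40
fake_block = bytes(range(start, end))  # stand-in for the fetched bytes
outputs = slice_outputs(fake_block, start, members)
```

With the defaults, the two example references collapse into a single request; with `max_gap=0` or a negative value they stay separate, because the 10-byte gap between them is no longer tolerated.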

mdsumner · 2024-08-20T21:15:00Z

> That said, we have a full interpreter in C++, which i realize is not exactly enticing for many

Why aren't we using DMR++? Is it not in good enough shape to bind to Python/R? Are there other challenges? There's plenty of C++ used seamlessly in Python, and calling out to the h5 libs is doing that anyway.

That sounds like the cross-language solution already? I only have a few HDF4 stores of interest outside of NASA, and maybe only one.

mdsumner · 2024-08-20T21:20:27Z

There's something I'm missing given #113 🙏 I'll keep exploring; I keep finding new aspects 👌.

martindurant · 2024-08-21T13:25:05Z

I'm sorry if I have done some duplication of work. I think it may be worthwhile to have a pure-python solution too, though, for the case that no dmr++ index files exist for some HDF4. Also, it has been (so far) nerdy fun, definitely worth a blog post.

maxrjones · 2024-08-21T16:31:14Z

https://github.com/fhs/pyhdf/ also reads HDF4 and SatPy uses it to read MODIS. I'm wondering if it could be helpful for Kerchunk as well.

jgallagher59701 · 2024-08-22T19:00:16Z

I wonder if Ayush's work on VirtualiZarr has a DMR++ parser (pure python) you could use? The DMR++ builder is C++ but we actually have a DMR++ Builder web service that we can expose for HDF5 and could do the same thing for HDF4.

It would be interesting to see how close we could get to valid Kerchunk from DMR++ using a simple transform. Just a thought, I don't see myself having time for that any time soon...

ayushnag · 2024-08-22T19:08:37Z

My code mostly extracts the necessary zarr metadata and then converts it into a virtualizarr data structure at the end of each function. So by just modifying the last step, creating a kerchunk reader is definitely possible. Also, interestingly, you could go dmrpp --> virtualizarr --> kerchunk, since virtualizarr supports writing out to kerchunk.

However, I have only developed and tested for netcdf4 and hdf5, so there will certainly be some work needed to support hdf4.
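
To give a feel for how small the last step of such a transform might be, here is a minimal sketch that turns a chunk manifest for one array into a kerchunk-style version 1 reference dict. It does not use VirtualiZarr's or kerchunk's own writers: `manifest_to_kerchunk` is a hypothetical helper, the manifest layout (chunk key mapped to `path`/`offset`/`length`) is assumed, and the file path, variable name and metadata values are made up for illustration.

```python
import json


def manifest_to_kerchunk(var_name, zarray, zattrs, manifest):
    """Build a kerchunk-style reference dict for a single Zarr v2 array."""
    refs = {
        ".zgroup": json.dumps({"zarr_format": 2}),
        f"{var_name}/.zarray": json.dumps(zarray),
        f"{var_name}/.zattrs": json.dumps(zattrs),
    }
    # Each chunk becomes a [url, offset, length] triple keyed by its chunk index.
    for chunk_key, entry in manifest.items():
        refs[f"{var_name}/{chunk_key}"] = [
            entry["path"], entry["offset"], entry["length"]
        ]
    return {"version": 1, "refs": refs}


# Hypothetical 10x20 float32 array split into two uncompressed 10x10 chunks.
manifest = {
    "0.0": {"path": "s3://bucket/file.hdf", "offset": 100, "length": 400},
    "0.1": {"path": "s3://bucket/file.hdf", "offset": 500, "length": 400},
}
zarray = {
    "shape": [10, 20], "chunks": [10, 10], "dtype": "<f4",
    "compressor": None, "filters": None, "fill_value": None,
    "order": "C", "zarr_format": 2,
}
zattrs = {"_ARRAY_DIMENSIONS": ["y", "x"]}

references = manifest_to_kerchunk("temperature", zarray, zattrs, manifest)
```

The hard part for HDF4 is presumably producing the manifest and the array metadata in the first place; once those exist, writing references is mostly bookkeeping.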

martindurant · 2024-08-22T19:11:19Z

> I have only developed and tested for netcdf4 and hdf5 so there will certainly be some work needed to support hdf4

Is there no hdf4 work? It is very different.

ayushnag · 2024-08-22T19:19:36Z

No, there isn't any hdf4 work yet. However, it seems like the goal is to make the hdf4 dmrpp spec very similar to the hdf5 one, which means it will require some sort of extension (as opposed to a rewrite), as James mentioned above:

> Bottom line, you will probably have to extend the interpreter you have

martindurant · 2024-08-22T19:52:32Z

My HDF4 branch in kerchunk is very nearly complete. Everyone welcome to look!

As for pyhdf: to use it, you need a very deep understanding of the specifics of the conventions used in a given file (maybe possible for MODIS) and of how the C API works. If I can make my version work, I prefer pure Python.

betolink · 2024-08-22T20:53:36Z

Is this the code? https://github.com/martindurant/fsspec-reference-maker/blob/df61060869e367da9674d33962631d81ead76865/kerchunk/hdf.py#L697 Seeing terms like "SDD" gave me flashbacks of the first time I opened one of these files. Thanks for all the work! Can we just throw some examples at it?

martindurant · 2024-08-22T20:56:16Z

Yes, that code. Please do play with it, but of course there are no guarantees.

TomNicholas added the labels "enhancement" (New feature or request) and "references generation" (Reading byte ranges from archival files) on Aug 7, 2024.

TomNicholas mentioned this issue on Aug 8, 2024 in "Listing every format that could be represented as virtual zarr" #218 (open, 14 tasks).

TomNicholas mentioned this issue on Aug 26, 2024 in "Improvements to the DMR++ parser" #230 (open).

martindurant mentioned this issue on Aug 27, 2024 in "kerchunk interop" pytroll/satpy#2889 (open).
