
Allow HDF Groups #424

Merged
merged 10 commits into from
Feb 29, 2024
18 changes: 15 additions & 3 deletions kerchunk/hdf.py
@@ -1,4 +1,5 @@
import base64
import io
import logging
from typing import Union, BinaryIO

@@ -47,7 +48,7 @@ class SingleHdf5ToZarr:
to BinaryIO is optional), in which case must also provide url. If a str,
file will be opened using fsspec and storage_options.
url : string
URI of the HDF5 file, if passing a file-like object
URI of the HDF5 file, if passing a file-like object or h5py File/dataset
Contributor
hi @martindurant many thanks for looking into this! Great minds think alike - this is how I made it ingest my subset of the multi-variate file myself, earlier today, on a scratch dev version of Kerchunk in my env: I passed an already extracted h5py.Group object. The only hitch with this approach is that if one passes an h5py.Dataset instead, Kerchunk (well, h5py in reality) will complain, since visititems is not a valid method of a Dataset but only of File or Group objects. So in my case I constructed an empty Group and plopped the Dataset of interest into it. The snag with that is that the new Group needs a name different from the Dataset's, hence introducing some extra unwanted overhead
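A minimal sketch of that wrapping trick (names here are made up for illustration; assumes h5py and numpy are available) - a lone Dataset has no visititems, so it is hard-linked into a fresh Group, which does:

```python
import h5py
import numpy as np

# in-memory HDF5 file so the sketch is self-contained
f = h5py.File("demo", mode="w", driver="core", backing_store=False)
f["temp"] = np.arange(4)                # the lone Dataset of interest
dset = f["temp"]
assert not hasattr(dset, "visititems")  # only File/Group objects have it

grp = f.create_group("temp ")           # name must differ from the Dataset's
grp["temp"] = dset                      # hard-link the Dataset into the Group

visited = []
grp.visititems(lambda name, obj: visited.append(name))
assert visited == ["temp"]
```

The trailing space in the Group name is exactly the naming overhead mentioned above.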

Member Author

Do you want to put it in a separate PR?

Contributor

the changes I made to Kerchunk are literally the ones you did here (including passing the variable name and looking for it), so it's not much done on the Kerchunk side, most of the other stuff (creating the new Group etc) I did at our end, but if you think that's useful, I'll plop in Kerchunk, no problemo. I still think it's a bit ugly TBF 😁

Member Author

Right, so we need something to cope with Dataset vs File - maybe just put the diff in here? Yes, I think it's useful.

Contributor

really, just a peasanty workaround to get Kerchunk to be able to run visititems(callback)

Contributor

good man! That's exactly the thing. I'll post them up tomorrow, have not committed them off my work machine yet, and am home now, dinner time 🍕

Contributor (@valeriupredoi, Feb 23, 2024)

hi @martindurant here is my approach in my package (PyActiveStorage):

    elif storage_type == "s3" and storage_options is not None:
        storage_options = storage_options.copy()
        storage_options['default_fill_cache'] = False
        # storage_options['default_cache_type'] = "none"  # big time drain this one
        fs = s3fs.S3FileSystem(**storage_options)
        fs2 = fsspec.filesystem('')
        with fs.open(file_url, 'rb') as s3file:
            s3file = h5py.File(s3file, mode="w")
            if isinstance(s3file[varname], h5py.Dataset):
                print("Looking only at a single Dataset", s3file[varname])
                s3file.create_group(varname + " ")
                s3file[varname + " "][varname] = s3file[varname]
            elif isinstance(s3file[varname], h5py.Group):
                print("Looking only at a single Group", s3file[varname])
                s3file = s3file[varname]
            h5chunks = SingleHdf5ToZarr(s3file, file_url, var=varname,
                                        inline_threshold=0)

and the bit changed in kerchunk/hdf.py is pretty much all you did here, with the added bit that the object becomes just the Group I want kerchunked, so in translate() I plopped in this hacky bit:

        if self.var and self._h5f[self.var + " "]:
            self._h5f = self._h5f[self.var + " "]
        print("Visiting the following object", self._h5f)
        self._h5f.visititems(self._translator)

Cheers 🍺

Contributor

a couple more details: I am using kerchunk==0.2.0 in my conda/mamba env (installed from conda-forge), so I can bypass the dependency issue with pinned numcodecs. Here are some timing results of this approach (changed Kerchunk + conversion to Group, limiting kerchunking to it) vs bog-standard kerchunking of my entire file (which has some 100 variables, with all manner of dimensions, the bigger ones of shape (30, 30, 350, 420)):

With changed Kerchunk + conversion Dataset to Group
---------------------------------------------------
Visititems took: 2.5403971672058105
Time to Translate and Dump Kerchunks to json file 4.393939018249512
Visititems took: 1.9200255870819092
Time to Translate and Dump Kerchunks to json file 2.7312347888946533
Visititems took: 2.005722761154175
Time to Translate and Dump Kerchunks to json file 2.588365316390991
Visititems took: 1.9823436737060547
Time to Translate and Dump Kerchunks to json file 2.7559237480163574
Visititems took: 1.9835329055786133
Time to Translate and Dump Kerchunks to json file 2.5909011363983154

With regular Kerchunk
---------------------
Visititems took: 4.841791152954102
Time to Translate and Dump Kerchunks to json file 5.548096656799316
Visititems took: 4.454912900924683
Time to Translate and Dump Kerchunks to json file 5.720059156417847
Visititems took: 3.8621530532836914
Time to Translate and Dump Kerchunks to json file 4.593475580215454
Visititems took: 4.457882881164551
Time to Translate and Dump Kerchunks to json file 5.079823732376099
Visititems took: 4.275482177734375
Time to Translate and Dump Kerchunks to json file 4.894218444824219

Kerchunking on a restricted space does indeed improve timings, by roughly a factor of 2 in my particular test case 👍

Contributor

the JSON file containing the Kerchunk indices/Zarr ref data drops from 300k (normal) to 8k with the restricted approach (this would matter if we were in 1992, though 🤣 )

spec : int
The version of output to produce (see README of this repo)
inline_threshold : int
@@ -69,11 +70,15 @@ class SingleHdf5ToZarr:
This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper
to write out parquet as the references get filled, or some other dictionary-like class
to customise how references get stored
var_pattern: str or None
If set, only variables with names matching this pattern (as regex) will be scanned
and included in this output. It is on the caller to ensure that all the coordinates
needed to represent a data variable are included.
"""

def __init__(
self,
h5f: "BinaryIO | str",
h5f: "BinaryIO | str | h5py.File",
url: str = None,
spec=1,
inline_threshold=500,
@@ -89,8 +94,15 @@ def __init__(
fs, path = fsspec.core.url_to_fs(h5f, **(storage_options or {}))
self.input_file = fs.open(path, "rb")
url = h5f
else:
self._h5f = h5py.File(self.input_file, mode="r")
elif isinstance(h5f, io.IOBase):
self.input_file = h5f
self._h5f = h5py.File(self.input_file, mode="r")
else:
# assume h5py object (File or group/dataset)
self._h5f = h5f
fs, path = fsspec.core.url_to_fs(url, **(storage_options or {}))
self.input_file = fs.open(path, "rb")
Contributor

I don't think you need these two lines anymore (they certainly mess up my use case, where the file is an S3 object), since the file is loaded as a File object up in the first branch of the conditional; if h5f is an h5py.Group then it should be kept that way, with self._h5f set to it

Member Author

_h5f is indeed set to the input two lines above. This exists for any inlining that might happen, which requires getting bytes directly from the original file, not going via h5py.

mess up my use case

What happens? I think providing the URL/options will certainly be required.

Contributor

in my case it's looking for a local file even if I pass valid S3 storage_options - leave it like this for now, I'll need to do a wee bit more testing to understand what's going on, and will get back to you if Kerchunk needs changing 👍

Member Author

The URL starts with "s3://"?

Contributor

yes and no 🤣 It's a very peculiar bucket; the storage_options dict that s3fs recognizes is

{'key': 'xxxx', 'secret': "xxxx", 'client_kwargs': {'endpoint_url': 'https://uor-aces-o.s3-ext.jc.rl.ac.uk'}, 'default_fill_cache': False, 'default_cache_type': 'first'}

and the s3fs call that is able to read such a strange bucket is as follows:

    fs = s3fs.S3FileSystem(**storage_options)
    with fs.open(file_url, 'rb') as s3file:
        ...

but file_url needs to be the truncated form (bucket + file name), ie bnl/da193a_25_day__198807-198807.nc in this case, and s3fs assembles its full URL from the endpoint URL plus that truncated bucket + filename. It's odd - not 100% sure why this type of s3 storage wants that configuration - but the bottom line is that when Kerchunk tries to open it as a regular s3 file it doesn't work: even if I prepend a correct full s3:// path to the file, I get Forbidden access since the storage identification is done wrongly

Member Author

s3://uor-aces-o.s3-ext.jc.rl.ac.uk/bnl/da193a_25_day__198807-198807.nc

This is definitely not the right URL: the first part should be the bucket, not a server name (I'm surprised it even attempts to connect). The URL should be "s3://bnl/da193a_25_day__198807-198807.nc", as the server/endpoint is already included in the storage options.
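The convention can be sketched with a small stdlib-only helper (hypothetical, for illustration): the endpoint lives in storage_options, so the s3:// URL carries only bucket and key:

```python
def split_s3_url(url: str) -> tuple[str, str]:
    """Split "s3://bucket/path/to/key" into (bucket, key)."""
    prefix = "s3://"
    if not url.startswith(prefix):
        raise ValueError(f"not an s3 URL: {url}")
    bucket, _, key = url[len(prefix):].partition("/")
    return bucket, key

bucket, key = split_s3_url("s3://bnl/da193a_25_day__198807-198807.nc")
assert bucket == "bnl"  # first path component is the bucket, never the server
```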

Contributor

blast! That worked! I knew I'm not doing something right 😆

Contributor

though am getting fairly long times from visititems() - very much comparable to the times where kerchunking is done not on a single Group but on the entire file

Contributor

ah that's because this self._h5f = h5py.File(self.input_file, mode="r") is a few lines down 😁

Member Author

(oops, fixed)

self.spec = spec
self.inline = inline_threshold
if vlen_encode not in ["embed", "null", "leave", "encode"]: