Allow HDF Groups #424
`kerchunk/hdf.py`:

```diff
@@ -1,4 +1,5 @@
 import base64
+import io
 import logging
 from typing import Union, BinaryIO
```
```diff
@@ -47,7 +48,7 @@ class SingleHdf5ToZarr:
         to BinaryIO is optional), in which case must also provide url. If a str,
         file will be opened using fsspec and storage_options.
     url : string
-        URI of the HDF5 file, if passing a file-like object
+        URI of the HDF5 file, if passing a file-like object or h5py File/dataset
     spec : int
         The version of output to produce (see README of this repo)
     inline_threshold : int
```
```diff
@@ -69,11 +70,15 @@ class SingleHdf5ToZarr:
         This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper
         to write out parquet as the references get filled, or some other dictionary-like class
         to customise how references get stored
+    var_pattern: str or None
+        If set, only variables with names matching this pattern (as regex) will be scanned
+        and included in this output. It is on the caller to ensure that all the coordinates
+        needed to represent a data variable are included.
     """

     def __init__(
         self,
-        h5f: "BinaryIO | str",
+        h5f: "BinaryIO | str | h5py.File",
         url: str = None,
         spec=1,
         inline_threshold=500,
```
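As a rough illustration of the `var_pattern` behaviour documented above, the sketch below filters a list of variable names with a regex. The helper name `select_vars` is invented for this example, and whether kerchunk anchors the pattern (`re.match`) or searches anywhere in the name (`re.search`) is an assumption here, not something the diff shows.

```python
import re

def select_vars(names, var_pattern=None):
    """Hypothetical helper: keep only names matching var_pattern (regex).

    Sketch of the documented behaviour only; not kerchunk's actual
    implementation. Using re.search rather than re.match is an assumption.
    """
    if var_pattern is None:
        return list(names)
    pat = re.compile(var_pattern)
    return [n for n in names if pat.search(n)]

# "tas" matches both the variable and its bounds, but a coordinate like
# "lat" is dropped: as the docstring warns, the caller must ensure the
# coordinates needed to represent a data variable are still included.
print(select_vars(["tas", "tas_bnds", "lat", "pr"], "tas"))
# → ['tas', 'tas_bnds']
```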
```diff
@@ -89,8 +94,15 @@ def __init__(
             fs, path = fsspec.core.url_to_fs(h5f, **(storage_options or {}))
             self.input_file = fs.open(path, "rb")
             url = h5f
-        else:
+            self._h5f = h5py.File(self.input_file, mode="r")
+        elif isinstance(h5f, io.IOBase):
             self.input_file = h5f
-        self._h5f = h5py.File(self.input_file, mode="r")
+            self._h5f = h5py.File(self.input_file, mode="r")
+        else:
+            # assume h5py object (File or group/dataset)
+            self._h5f = h5f
+            fs, path = fsspec.core.url_to_fs(url, **(storage_options or {}))
+            self.input_file = fs.open(path, "rb")
```
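The three-way dispatch introduced in this hunk can be summarised in isolation. The function below mirrors only the `isinstance` checks from the diff; the name `classify_input` is invented for illustration, and the real constructor opens files and h5py handles as a side effect rather than returning a label.

```python
import io

def classify_input(h5f):
    """Illustrative mirror of the dispatch in the diff above."""
    if isinstance(h5f, str):
        # a URL/path: opened via fsspec with storage_options
        return "url"
    elif isinstance(h5f, io.IOBase):
        # an already-open binary file-like object
        return "file-like"
    else:
        # anything else is assumed to be an h5py File/Group/Dataset;
        # url must then also be supplied so raw bytes can still be
        # read from the original file for inlining
        return "h5py object"

print(classify_input("s3://bucket/file.nc"))  # url
print(classify_input(io.BytesIO(b"")))        # file-like
print(classify_input(object()))               # h5py object
```

Note that the last branch is why the thread below insists the URL/options are still required when passing an h5py object: the raw bytes for inlining come from fsspec, not from h5py.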
**Reviewer:** I don't think you need these two lines anymore (they certainly mess up my use case, where the file is an S3 object), since the file is loaded as `_h5f`.

**martindurant:** `_h5f` is indeed set to the input two lines above. This exists for any inlining that might happen, which requires getting bytes directly from the original file, not going via h5py.

**martindurant:** What happens? I think providing the URL/options will certainly be required.

**Reviewer:** in my case it's looking for a local file even if I pass valid S3 …

**martindurant:** The URL starts with "s3://"?

**Reviewer:** yes and no 🤣 It's a very peculiar bucket; the storage options dict that s3fs recognizes is …, and the call to

```python
fs = s3fs.S3FileSystem(**storage_options)
with fs.open(file_url, 'rb') as s3file:
    ...
```

but …

**martindurant:** This is definitely not the right URL: the first part should be the bucket, not a server name (I'm surprised it even attempts to connect). The URL should be "s3://bnl/da193a_25_day__198807-198807.nc", as the server/endpoint is already included in the storage options.

**Reviewer:** blast! That worked! I knew I wasn't doing something right 😆

**Reviewer:** though am getting fairly long times from …

**martindurant:** ah, that's because this …

**martindurant:** (oops, fixed)
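The point about the URL shape can be checked with nothing but the standard library: in an `s3://` URL the authority component is the bucket name, and the server/endpoint belongs in the storage options instead (for s3fs, typically via `client_kwargs={"endpoint_url": ...}`).

```python
from urllib.parse import urlsplit

# The corrected URL from the thread above.
url = "s3://bnl/da193a_25_day__198807-198807.nc"
parts = urlsplit(url)
print(parts.scheme)  # s3
print(parts.netloc)  # bnl  -> the bucket, NOT a server name
print(parts.path)    # /da193a_25_day__198807-198807.nc  -> the object key
```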
```diff
 
         self.spec = spec
         self.inline = inline_threshold
         if vlen_encode not in ["embed", "null", "leave", "encode"]:
```
**Reviewer:** hi @martindurant, many thanks for looking into this! Great minds think alike - this is how I made it ingest my subset of the multi-variate file myself, earlier today, on a scratch dev version of Kerchunk in my env: I passed an already-extracted `h5py.Group` object. The only hitch with this approach is that if one passes an `h5py.Dataset` instead, Kerchunk (well, `h5py` in reality) will complain, since `visititems` is not a valid method of a `Dataset` but only of `File` or `Group` objects. So in my case, I constructed an empty group where I plopped the `Dataset` of interest. The issue with that approach is that one needs to name the new `Group` something other than the `Dataset`, hence introducing some extra unwanted overhead.
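The workaround described here (wrapping a `Dataset` in a new, differently-named `Group` so `visititems` can be called) can be sketched with an in-memory file. This is an illustration of the description above, not code from the PR; the names `buf`, `tas` and `tas_group` are invented for the example, and it assumes `h5py` is installed.

```python
import io
import h5py  # assumed available, as in the environment under discussion

# Build a small HDF5 file in memory with one dataset.
buf = io.BytesIO()
with h5py.File(buf, "w") as f:
    f.create_dataset("tas", data=[0.0, 1.0, 2.0, 3.0])

with h5py.File(buf, "a") as f:
    dset = f["tas"]
    # visititems exists on File/Group objects only, not on a Dataset:
    assert not hasattr(dset, "visititems")

    # Workaround: hard-link the dataset into a fresh group. The group
    # must be named differently from the dataset -- the "unwanted
    # overhead" mentioned above.
    grp = f.create_group("tas_group")
    grp["tas"] = dset

    visited = []
    grp.visititems(lambda name, obj: visited.append(name))
    print(visited)  # ['tas']
```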
**martindurant:** Do you want to put it in a separate PR?
**Reviewer:** the changes I made to Kerchunk are literally the ones you did here (including passing the variable name and looking for it), so it's not much done on the Kerchunk side; most of the other stuff (creating the new Group etc.) I did at our end. But if you think that's useful, I'll plop it in Kerchunk, no problemo. I still think it's a bit ugly TBF 😁
**martindurant:** Right, so we need something to cope with Dataset vs File; maybe just put the diff in here? Yes, I think it's useful.
**martindurant:** really, just a peasanty workaround to get Kerchunk to be able to run `visititems(callback)`
**Reviewer:** good man! That's exactly the thing. I'll post them up tomorrow; have not committed them off my work machine yet, and am home now, dinner time 🍕
**Reviewer:** hi @martindurant, here is my approach in my package (PyActiveStorage): … and the bit changed in `kerchunk/hdf.py` is pretty much all you did here, with the added bit that the object becomes just the Group I want to get kerchunked, so in `translate()` I plopped this hacky bit: … Cheers 🍺
**Reviewer:** a couple more details: I am using kerchunk==0.2.0 in my conda/mamba env (installed from conda-forge), so I can bypass the dep issue with pinned numcodecs. Here are some timing results of this approach (changed Kerchunk + conversion to Group and limiting kerchunking to it) vs bog-standard kerchunking of my entire file (which has some 100 variables, with all manner of dimensions, the bigger ones being shape (30, 30, 350, 420)): … Kerchunking on a restricted space does indeed improve timings, by roughly a factor of 2 in my particular test case 👍
**Reviewer:** the JSON file containing the Kerchunk indices/Zarr ref file data drops from 300k normally to 8k when I do the restricted approach (this would matter if we were in 1992, though 🤣)