DRAFT: Implement `open_datatree` in `BackendEntrypoint` for preliminary DataTree support #7437
Conversation
66ff624 to 77129f1: for more information, see https://pre-commit.ci
Holla when you want a review :)
Optionally, it shall implement:

- ``open_datatree`` method: it shall implement reading from file, variables
  decoding and it returns an instance of :py:class:`~datatree.DataTree`.
  It shall take in input at least ``filename_or_obj`` argument and
  ``drop_variables`` keyword argument.
  For more details see TODO.
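To make the quoted docs concrete, here is a minimal, hedged sketch of what a third-party backend implementing the optional ``open_datatree`` method might look like. The names `MyBackendEntrypoint`, `_read_group_as_dataset`, and `_list_groups` are hypothetical helpers for this illustration, and the exact signature may differ from whatever this PR settles on.

```python
# Hypothetical sketch of a backend exposing the proposed optional ``open_datatree``.
# ``MyBackendEntrypoint`` and the helper functions are illustrative, not part of
# xarray or this PR.
import xarray as xr
from xarray.backends import BackendEntrypoint


class MyBackendEntrypoint(BackendEntrypoint):
    def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
        # Required method: read a single (flat) file or group into a Dataset.
        return _read_group_as_dataset(filename_or_obj, drop_variables=drop_variables)

    def open_datatree(self, filename_or_obj, *, drop_variables=None, **kwargs):
        # Optional method proposed here: build one Dataset per group and
        # assemble them into a DataTree keyed by group path.
        from datatree import DataTree

        datasets = {
            group: _read_group_as_dataset(
                filename_or_obj, group=group, drop_variables=drop_variables
            )
            for group in _list_groups(filename_or_obj)  # hypothetical helper
        }
        return DataTree.from_dict(datasets)


def _read_group_as_dataset(filename_or_obj, group=None, drop_variables=None):
    # Placeholder for format-specific reading and variable decoding.
    return xr.Dataset()


def _list_groups(filename_or_obj):
    # Placeholder: return the group paths available in the file.
    return ["/"]
```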
I wonder if, instead of making `open_datatree` optional, it would be possible to expose a separate entrypoint?

My motivation is a composite backend where a `Dataset` entrypoint would not make sense (I'm using `open_dataset` to open several netcdf files and arrange them in a hierarchy encoded in the file names).
@keewis My instinct is that a separate entrypoint would give a messier API (would we need three entrypoints for each combination of datatree and/or dataset implementation, each subclassed from a private base class?). Perhaps there would be a clever way to just have at least one of `open_datatree` or `open_dataset` required, so a single entrypoint (i.e., the existing `BackendEntrypoint`) could handle every combination?

(Note, I haven't revisited this code since the initial draft, so I very well might be under mistaken impressions relative to the current state of the codebase!)
> Perhaps there would be a clever way to just have at least one of `open_datatree` or `open_dataset` required, so a single entrypoint (i.e., the existing `BackendEntrypoint`) could handle every combination?

Seems possible, see #7460 (comment)
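As a rough sketch of what such a check could look like (the helper name `_validate_entrypoint` and where it would be called from are assumptions, not the actual approach in #7460):

```python
# Hypothetical validation: accept a BackendEntrypoint subclass as long as it
# overrides at least one of ``open_dataset`` or ``open_datatree``.
from xarray.backends import BackendEntrypoint


def _validate_entrypoint(cls: type) -> None:
    implements_dataset = cls.open_dataset is not BackendEntrypoint.open_dataset
    implements_datatree = getattr(cls, "open_datatree", None) is not getattr(
        BackendEntrypoint, "open_datatree", None
    )
    if not (implements_dataset or implements_datatree):
        raise NotImplementedError(
            f"{cls.__name__} must implement open_dataset and/or open_datatree"
        )
```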
The idea I had was that you'd have two entirely separate entrypoints (as in, package metadata entrypoints that can be registered against), one for `Dataset` / `DataArray` and one for `DataTree`. The advantage would be to keep both entirely separate, so we wouldn't have a need for `open_dataset_parameters` and `open_datatree_parameters`.

However, that might make it a bit confusing since it would allow two different packages to register entrypoints under the same name (which may or may not be intentional), so I won't insist on it. Plus, keeping the functions in a single `BackendEntrypoint` makes raising helpful error messages a bit easier.

(And, one might argue that given any of the functions it is possible to implement the others, although it might not be the most efficient way to implement them.)
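Just to illustrate the two-group idea (the `xarray.datatree_backends` group name is invented for this sketch; only `xarray.backends` exists today):

```python
# Hypothetical discovery of backends registered under two separate entrypoint
# groups. The second group name is an assumption, not an existing xarray group.
from importlib.metadata import entry_points

dataset_backends = {ep.name: ep for ep in entry_points(group="xarray.backends")}
datatree_backends = {
    ep.name: ep for ep in entry_points(group="xarray.datatree_backends")
}

# Nothing stops two different packages from registering the same name in the
# two groups (e.g. "netcdf4" in both), which is the ambiguity mentioned above.
```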
When we're getting closer to the finish line I think we should put some effort into getting the type hints correct.
There are some odd things going on here that it is probably a good idea to sort out while we're doing larger changes in the backend design.
The files I look at usually have 2000+ variables spread out over ~20 groups, so for me it would be great to parallelize as much as possible. For brainstorming purposes, here's an example function I've been testing where the groups are opened in parallel, inspired by `open_mfdataset`:
```python
from __future__ import annotations

import os
from typing import Any

import xarray as xr

# NestedSequence lives in xarray's private typing module; its location may vary by version.
from xarray.core.types import NestedSequence


def import_datatrees(
    paths: str | os.PathLike | NestedSequence[str | os.PathLike],
    *,
    parallel: bool = False,
) -> DataTree:
    from datatree import DataTree

    if parallel:
        import dask

        # wrap open_dataset and getattr with delayed
        open_ = dask.delayed(xr.open_dataset)
        getattr_ = dask.delayed(getattr)
    else:
        open_ = xr.open_dataset
        getattr_ = getattr

    # Get the groups for each file (CustomBackendEntrypoint is a hypothetical
    # backend that exposes an ``open_groups`` helper, see below):
    paths_and_groups: dict[str, tuple[str | int | Any, ...]] = {
        path: CustomBackendEntrypoint.open_groups(path)
        for path in xr.backends.common._find_absolute_paths(paths)
    }

    keys = []
    datasets = []
    for path, groups_ in paths_and_groups.items():
        for group in groups_:
            keys.append(f"{path}/{group}")  # DataTree.from_dict expects "/"-separated paths
            datasets.append(
                open_(
                    path,
                    path=path,  # backend-specific kwarg forwarded to the custom entrypoint
                    engine=CustomBackendEntrypoint,
                    group=group,
                    # chunks={},
                    # parallel=True,
                )
            )
    closers = [getattr_(ds, "_close") for ds in datasets]

    if parallel:
        # calling compute here will return the datasets/file_objs lists,
        # the underlying datasets will still be stored as dask arrays
        datasets, closers = dask.compute(datasets, closers)

    return DataTree.from_dict({k: d for k, d in zip(keys, datasets)})
```
In this version the `CustomBackendEntrypoint` has to:

- be able to (quickly) get a list of all available groups in the file through `open_groups`.
- know what to do if the `group` argument is passed to `open_dataset`.
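For context, a minimal skeleton of the `CustomBackendEntrypoint` assumed above might look roughly like this; `open_groups` and its return value are this thread's invention, not an existing xarray API:

```python
# Hedged sketch of a backend meeting the two requirements above.
import xarray as xr
from xarray.backends import BackendEntrypoint


class CustomBackendEntrypoint(BackendEntrypoint):
    @classmethod
    def open_groups(cls, filename_or_obj) -> list[str]:
        # Cheaply list every group path in the file without reading data;
        # a real implementation would inspect the file's metadata here.
        return ["group0/group0", "group0/group1", "group1"]

    def open_dataset(self, filename_or_obj, *, group=None, drop_variables=None, **kwargs):
        # Read only the requested group into a Dataset.
        # Placeholder body; a real backend would decode variables from ``group``.
        return xr.Dataset(attrs={"group": group or "/"})
```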
```python
datasets[path] = ds

# Recursively add children to collector
for child_name, child_store in store.get_group_stores().items():
```
I had an idea here to use `dask.delayed` and open all groups in parallel in a similar fashion to `open_mfdataset`:

Lines 1000 to 1020 in 52f5cf1
```python
if parallel:
    import dask

    # wrap the open_dataset, getattr, and preprocess with delayed
    open_ = dask.delayed(open_dataset)
    getattr_ = dask.delayed(getattr)
    if preprocess is not None:
        preprocess = dask.delayed(preprocess)
else:
    open_ = open_dataset
    getattr_ = getattr

datasets = [open_(p, **open_kwargs) for p in paths]
closers = [getattr_(ds, "_close") for ds in datasets]
if preprocess is not None:
    datasets = [preprocess(ds) for ds in datasets]

if parallel:
    # calling compute here will return the datasets/file_objs lists,
    # the underlying datasets will still be stored as dask arrays
    datasets, closers = dask.compute(datasets, closers)
```
With that parallel mindset I was thinking that it would be better if `get_group_stores` returned a flat list of all groups in the file, e.g. `store.get_group_stores() == ["group0/group0", "group0/group1", "group1"]`.
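As a condensed variant of the brainstorming function above, a flat list of group paths could feed a parallel open directly; `open_group_as_dataset` here is a hypothetical stand-in for whatever per-group reader the backend provides:

```python
# Hedged sketch: open each group of a flat group list with dask.delayed and
# assemble the results into a DataTree.
import dask
import xarray as xr
from datatree import DataTree


def open_group_as_dataset(path, group):
    # Stand-in for the backend's per-group read; replace with a real reader.
    return xr.open_dataset(path, group=group)


def open_datatree_parallel(path, group_paths: list) -> DataTree:
    delayed_datasets = [
        dask.delayed(open_group_as_dataset)(path, group) for group in group_paths
    ]
    # One compute call resolves all groups; data variables can stay lazy if the
    # reader requested dask-backed arrays (e.g. via chunks={}).
    datasets = dask.compute(*delayed_datasets)
    return DataTree.from_dict(dict(zip(group_paths, datasets)))
```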
I was prompted to take a look back at this again after the update on datatree at SciPy (my apologies for letting this PR go stale for so long while lots of life stuff got in the way), and it looks like this PR, as titled, has been resolved by #8697 and #9014! So, would it make sense to close this?

That being said, it does look like the PR is referenced in a couple of places with respect to outside backend support and parallelization. Should any new/replacement issues be raised so we don't lose track of these through the course of #8572?

(Also, speaking of SciPy, I'll be at the sprints and would be happy to get back into working on this...perhaps trying to add datatree support on the cfgrib side, though with anticipation of #9137?)
Yes I think this can be closed!
What is there to keep track of? The issue is solved in the sense that external backends can now use the additions to the backend entrypoint class to implement their own.
Amazing! I'll be there on the Saturday at least, and was planning to do datatree and/or virtualizarr stuff (could even combine them together...)
Good point! I was thinking it was tracking the "has been used" rather than "can be used", but the latter makes more sense. @Illviljan had some comments above about parallelized reads with dask, but perhaps those are better suited for xarray-contrib/datatree#97 and xarray-contrib/datatree#196?
I think any further optimizations (dask or otherwise) deserve their own separate issues.
As discussed among folks at today's Pangeo working meeting (cc @jhamman, @TomNicholas), we are looking to try adding support for `DataTree` in the Backend API, so that backend engines can readily add `DataTree` capability. For example, with `cfgrib`, we could have something like the sketch below, given that `cfgrib` implements the appropriate method on its `BackendEntrypoint` subclass. Similarly, with NetCDF files or Zarr stores with groups, we could open as a `DataTree` to obviate the need to specify a single group.
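A hypothetical usage sketch of what this could enable (the file name is made up, and cfgrib `DataTree` support is aspirational rather than existing functionality):

```python
# Aspirational sketch; depends on cfgrib implementing the proposed
# open_datatree method on its BackendEntrypoint subclass.
from datatree import open_datatree

dt = open_datatree("observations.grib", engine="cfgrib")  # hypothetical file
print(dt)  # one node per group/hierarchy level, instead of a single Dataset
```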
Working Design Doc: https://hackmd.io/Oqeab-54TqOOHd5FdCb5DQ?edit

xref ecmwf/cfgrib#327, openradar/xradar#7
- Closes #xxxx
- `whats-new.rst`
- `api.rst`