DRAFT: Implement open_datatree in BackendEntrypoint for preliminary DataTree support #7437

Closed
wants to merge 6 commits

Conversation

jthielen
Contributor

As discussed among folks at today's Pangeo working meeting (cc @jhamman, @TomNicholas), we are looking to try adding support for DataTree in the Backend API, so that backend engines can readily add DataTree capability. For example, with cfgrib, we could have

import xarray as xr

dt = xr.open_datatree("path/to/gribfile.grib", engine="cfgrib")

provided that cfgrib adds the appropriate method to its BackendEntrypoint subclass. Similarly, with NetCDF files or Zarr stores with groups, we could open them as a DataTree, obviating the need to specify a single group.
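
For illustration, a minimal sketch of what such a backend subclass might look like (the GribBackendEntrypoint name, the _discover_groups helper, and the group handling are hypothetical assumptions, not existing cfgrib behaviour):

from datatree import DataTree
from xarray.backends import BackendEntrypoint


class GribBackendEntrypoint(BackendEntrypoint):
    def _discover_groups(self, filename_or_obj):
        # hypothetical helper: cheaply list the group-like subsets of the file
        raise NotImplementedError

    def open_dataset(self, filename_or_obj, *, drop_variables=None, group=None, **kwargs):
        # existing single-Dataset behaviour, restricted to one group
        raise NotImplementedError

    def open_datatree(self, filename_or_obj, *, drop_variables=None, **kwargs):
        # open every group as a Dataset and assemble the results into a tree
        datasets = {
            group: self.open_dataset(
                filename_or_obj, drop_variables=drop_variables, group=group, **kwargs
            )
            for group in self._discover_groups(filename_or_obj)
        }
        return DataTree.from_dict(datasets)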

Working Design Doc: https://hackmd.io/Oqeab-54TqOOHd5FdCb5DQ?edit

xref ecmwf/cfgrib#327, openradar/xradar#7

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@TomNicholas
Member

Holla when you want a review :)

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Jan 17, 2023
Comment on lines +379 to +385
Optionally, it shall implement:

- ``open_datatree`` method: it shall implement reading from file and variable
  decoding, and it shall return an instance of :py:class:`~datatree.DataTree`.
  It shall take as input at least the ``filename_or_obj`` argument and the
  ``drop_variables`` keyword argument.
  For more details see TODO.
Collaborator

I wonder if, instead of making open_datatree optional, it would be possible to expose a separate entrypoint?

My motivation is a composite backend where a Dataset entrypoint would not make sense (I'm using open_dataset to open several netcdf files and arrange them in a hierarchy encoded in the file names).
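
For context, a rough sketch of that kind of composite opening step, assuming the node path is encoded in each file name (the naming scheme and the open_composite_tree name are made up for illustration):

from pathlib import Path

import xarray as xr
from datatree import DataTree


def open_composite_tree(directory):
    # hypothetical: "surface.temperature.nc" becomes the node "surface/temperature"
    datasets = {
        path.stem.replace(".", "/"): xr.open_dataset(path)
        for path in sorted(Path(directory).glob("*.nc"))
    }
    return DataTree.from_dict(datasets)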

Contributor Author

@jthielen jthielen Jul 28, 2023

@keewis My instinct is that a separate entrypoint would give a messier API (would we need three entrypoints, one for each combination of datatree and/or dataset implementation, each subclassed from a private base class?). Perhaps there would be a clever way to just have at least one of open_datatree or open_dataset required, so a single entrypoint (i.e., the existing BackendEntrypoint) could handle every combination?

(Note, I haven't revisited this code since the initial draft, so I very well might be under mistaken impressions relative to the current state of the codebase!)

Member

Perhaps there would be a clever way to just have at least one of open_datatree or open_dataset required, so a single entrypoint (i.e., the existing BackendEntrypoint) could handle every combination?

Seems possible, see #7460 (comment)
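
One possible shape for that requirement, without splitting the entrypoint, would be a check at subclass-definition time; this is only a sketch (the FlexibleBackendEntrypoint name is made up), not what #7460 actually implements:

from xarray.backends import BackendEntrypoint


class FlexibleBackendEntrypoint(BackendEntrypoint):
    """Sketch of a base class requiring open_dataset and/or open_datatree."""

    def open_datatree(self, filename_or_obj, *, drop_variables=None, **kwargs):
        raise NotImplementedError()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # require concrete backends to override at least one of the two openers
        overrides_dataset = cls.open_dataset is not BackendEntrypoint.open_dataset
        overrides_datatree = cls.open_datatree is not FlexibleBackendEntrypoint.open_datatree
        if not (overrides_dataset or overrides_datatree):
            raise TypeError(
                f"{cls.__name__} must implement open_dataset and/or open_datatree"
            )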

Collaborator

The idea I had was that you'd have two entirely separate entrypoints (as in, package metadata entrypoints that can be registered against), one for Dataset / DataArray and one for DataTree. The advantage would be to keep both entirely separate, so we wouldn't have a need for open_dataset_parameters and open_datatree_parameters.

However, that might make it a bit confusing since it would allow two different packages to register entrypoints under the same name (which may or may not be intentional), so I won't insist on it. Plus, keeping the functions in a single BackendEntrypoint makes raising helpful error messages a bit easier.

(And, one might argue that given any of the functions it is possible to implement the others, although it might not be the most efficient way to implement them)
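
To illustrate that last point, a backend that only implements the tree case could derive open_dataset from open_datatree roughly like this (a sketch; the TreeOnlyBackendEntrypoint name and the group handling are assumptions):

from xarray.backends import BackendEntrypoint


class TreeOnlyBackendEntrypoint(BackendEntrypoint):
    def open_datatree(self, filename_or_obj, *, drop_variables=None, **kwargs):
        # the backend's "real" implementation lives here
        raise NotImplementedError

    def open_dataset(self, filename_or_obj, *, drop_variables=None, group=None, **kwargs):
        # fall back to opening the whole tree and selecting a single node;
        # correct, but clearly not the most efficient way to read one group
        tree = self.open_datatree(filename_or_obj, drop_variables=drop_variables, **kwargs)
        node = tree if group is None else tree[group]
        return node.to_dataset()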

Contributor

@Illviljan Illviljan left a comment

When we're getting closer to the finish line I think we should put some effort into getting the type hints correct.
There are some odd things going on here that it's probably a good idea to sort out while we're doing larger changes to the backend design.

The files I look at usually have 2000+ variables spread out over ~20 groups, so for me it would be great to parallelize as much as possible. So, for brainstorming purposes, here's an example function I've been testing, where the groups are opened in parallel, inspired by open_mfdataset:

from __future__ import annotations

import os
from typing import Any

import xarray as xr
from xarray.core.types import NestedSequence


def import_datatrees(
    paths: str | os.PathLike | NestedSequence[str | os.PathLike],
    *,
    parallel: bool = False,
) -> DataTree:
    from datatree import DataTree

    if parallel:
        import dask

        # wrap open_dataset and getattr with delayed
        open_ = dask.delayed(xr.open_dataset)
        getattr_ = dask.delayed(getattr)
    else:
        open_ = xr.open_dataset
        getattr_ = getattr

    # Get the groups for each file (CustomBackendEntrypoint is a user-defined
    # backend, see the notes below the example):
    paths_and_groups: dict[str, tuple[str | int | Any, ...]] = {
        path: CustomBackendEntrypoint.open_groups(path)
        for path in xr.backends.common._find_absolute_paths(paths)
    }

    keys = []
    datasets = []
    for path, groups_ in paths_and_groups.items():
        for group in groups_:
            keys.append(f"{path}/{group}")
            datasets.append(
                open_(
                    path,
                    engine=CustomBackendEntrypoint,
                    group=group,
                    # chunks={},
                )
            )
    closers = [getattr_(ds, "_close") for ds in datasets]

    if parallel:
        # calling compute here will return the datasets/closers lists;
        # the underlying datasets will still be stored as dask arrays
        datasets, closers = dask.compute(datasets, closers)

    return DataTree.from_dict(dict(zip(keys, datasets)))

In this version, the CustomBackendEntrypoint has to (a rough outline follows below):

  • be able to (quickly) get a list of all available groups in the file through open_groups.
  • know what to do if the group argument is passed to open_dataset.
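
A rough outline of what those two requirements mean for the hypothetical CustomBackendEntrypoint (none of this is existing xarray API):

from xarray.backends import BackendEntrypoint


class CustomBackendEntrypoint(BackendEntrypoint):
    @staticmethod
    def open_groups(filename_or_obj):
        # cheaply list every group path in the file, ideally by reading only
        # the file's metadata rather than any variable data
        raise NotImplementedError

    def open_dataset(self, filename_or_obj, *, drop_variables=None, group=None, **kwargs):
        # restrict reading to the requested group so that each (possibly
        # delayed) call only touches the variables it actually needs
        raise NotImplementedError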

datasets[path] = ds

# Recursively add children to collector
for child_name, child_store in store.get_group_stores().items():
Contributor

I had an idea here to use dask.delayed and open all groups in parallel in similar fashion to open_mfdataset:

xarray/xarray/backends/api.py

Lines 1000 to 1020 in 52f5cf1

    if parallel:
        import dask

        # wrap the open_dataset, getattr, and preprocess with delayed
        open_ = dask.delayed(open_dataset)
        getattr_ = dask.delayed(getattr)
        if preprocess is not None:
            preprocess = dask.delayed(preprocess)
    else:
        open_ = open_dataset
        getattr_ = getattr

    datasets = [open_(p, **open_kwargs) for p in paths]
    closers = [getattr_(ds, "_close") for ds in datasets]
    if preprocess is not None:
        datasets = [preprocess(ds) for ds in datasets]

    if parallel:
        # calling compute here will return the datasets/file_objs lists,
        # the underlying datasets will still be stored as dask arrays
        datasets, closers = dask.compute(datasets, closers)

With that parallel mindset, I was thinking it would be better if get_group_stores returned a flat list of all groups in the file, e.g. store.get_group_stores() == ['group0/group0', 'group0/group1', 'group1'].
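
A sketch of how that flattening could work, assuming each store exposes the get_group_stores() mapping from the diff above (everything else here is hypothetical):

def list_group_paths(store, prefix=""):
    # walk the store hierarchy and return every group as a flat,
    # "/"-separated path, e.g. ["group0", "group0/group0", "group0/group1", "group1"]
    paths = []
    for child_name, child_store in store.get_group_stores().items():
        child_path = f"{prefix}{child_name}"
        paths.append(child_path)
        paths.extend(list_group_paths(child_store, prefix=f"{child_path}/"))
    return paths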

@jthielen
Contributor Author

I was prompted to take a look back at this again after the update on datatree at SciPy (my apologies for letting this PR go stale for so long while lots of life stuff got in the way), and it looks like this PR, as titled, has been resolved by #8697 and #9014! So, would it make sense to close this?

That being said, it does look like the PR is referenced in a couple places with respect to outside backend support and parallelization. Should any new/replacement issues be raised to not lose track of these through the course of #8572?

(Also, speaking of SciPy, I'll be at the sprints and would be happy to get back into working on this...perhaps trying to add datatree support on the cfgrib side, though with anticipation of #9137?)

@TomNicholas
Member

TomNicholas commented Jul 12, 2024

Yes I think this can be closed!

That being said, it does look like the PR is referenced in a couple places with respect to outside backend support and parallelization.

What is there to keep track of? The issue is solved in the sense that external backends can now use the additions to the backend entrypoint class to implement their own open_datatree method.

speaking of SciPy, I'll be at the sprints and would be happy to get back into working on this...

Amazing! I'll be there on the Saturday at least, and was planning to do datatree and/or virtualizarr stuff (could even combine the two...)

@jthielen jthielen closed this Jul 12, 2024
@jthielen
Contributor Author

jthielen commented Jul 12, 2024

That being said, it does look like the PR is referenced in a couple places with respect to outside backend support and parallelization.

What is there to keep track of? The issue is solved in the sense that external backends can now use the additions to the backend entrypoint class to implement their own open_datatree method.

Good point! I was thinking it was tracking the "has been used" rather than "can be used", but the latter makes more sense. @Illviljan had some comments above about parallelized reads with dask, but perhaps those are likely better suited for xarray-contrib/datatree#97 and xarray-contrib/datatree#196?

@TomNicholas
Member

I think any further optimizations (dask or otherwise) deserve their own separate issues.

Labels
io topic-backends topic-DataTree Related to the implementation of a DataTree class
4 participants