DRAFT: Implement `open_datatree` in `BackendEntrypoint` for preliminary DataTree support #7437
Conversation
66ff624 to 77129f1: for more information, see https://pre-commit.ci
Holla when you want a review :)
Optionally, it shall implement:

- ``open_datatree`` method: it shall implement reading from file, variables
  decoding and it returns an instance of :py:class:`~datatree.DataTree`.
  It shall take in input at least ``filename_or_obj`` argument and
  ``drop_variables`` keyword argument.
  For more details see TODO.
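To make the quoted docs concrete, here is a minimal, hedged sketch of what a third-party backend implementing the optional ``open_datatree`` method might look like. The names `MyBackendEntrypoint`, `_read_group_as_dataset`, and `_list_groups` are hypothetical helpers for this illustration, and the exact signature may differ from whatever this PR settles on.

```python
# Hypothetical sketch of a backend exposing the proposed optional ``open_datatree``.
# ``MyBackendEntrypoint`` and the helper functions are illustrative, not part of
# xarray or this PR.
import xarray as xr
from xarray.backends import BackendEntrypoint


class MyBackendEntrypoint(BackendEntrypoint):
    def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
        # Required method: read a single (flat) file or group into a Dataset.
        return _read_group_as_dataset(filename_or_obj, drop_variables=drop_variables)

    def open_datatree(self, filename_or_obj, *, drop_variables=None, **kwargs):
        # Optional method proposed here: build one Dataset per group and
        # assemble them into a DataTree keyed by group path.
        from datatree import DataTree

        datasets = {
            group: _read_group_as_dataset(
                filename_or_obj, group=group, drop_variables=drop_variables
            )
            for group in _list_groups(filename_or_obj)  # hypothetical helper
        }
        return DataTree.from_dict(datasets)


def _read_group_as_dataset(filename_or_obj, group=None, drop_variables=None):
    # Placeholder for format-specific reading and variable decoding.
    return xr.Dataset()


def _list_groups(filename_or_obj):
    # Placeholder: return the group paths available in the file.
    return ["/"]
```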
I wonder if, instead of making `open_datatree` optional, it would be possible to expose a separate entrypoint?

My motivation is a composite backend where a `Dataset` entrypoint would not make sense (I'm using `open_dataset` to open several netcdf files and arrange them in a hierarchy encoded in the file names).
@keewis My instinct is that a separate entrypoint would give a messier API (would we need three entrypoints for each combination of datatree and/or dataset implementation, each subclassed from a private base class?). Perhaps there would be a clever way to just have at least one of `open_datatree` or `open_dataset` required, so a single entrypoint (i.e., the existing `BackendEntrypoint`) could handle every combination?

(Note, I haven't revisited this code since the initial draft, so I very well might be under mistaken impressions relative to the current state of the codebase!)
> Perhaps there would be a clever way to just have at least one of `open_datatree` or `open_dataset` required, so a single entrypoint (i.e., the existing `BackendEntrypoint`) could handle every combination?

Seems possible, see #7460 (comment)
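As a rough sketch of what such a check could look like (the helper name `_validate_entrypoint` and where it would be called from are assumptions, not the actual approach in #7460):

```python
# Hypothetical validation: accept a BackendEntrypoint subclass as long as it
# overrides at least one of ``open_dataset`` or ``open_datatree``.
from xarray.backends import BackendEntrypoint


def _validate_entrypoint(cls: type) -> None:
    implements_dataset = cls.open_dataset is not BackendEntrypoint.open_dataset
    implements_datatree = getattr(cls, "open_datatree", None) is not getattr(
        BackendEntrypoint, "open_datatree", None
    )
    if not (implements_dataset or implements_datatree):
        raise NotImplementedError(
            f"{cls.__name__} must implement open_dataset and/or open_datatree"
        )
```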
The idea I had was that you'd have two entirely separate entrypoints (as in, package metadata entrypoints that can be registered against), one for `Dataset` / `DataArray` and one for `DataTree`. The advantage would be to keep both entirely separate, so we wouldn't have a need for `open_dataset_parameters` and `open_datatree_parameters`.

However, that might make it a bit confusing since it would allow two different packages to register entrypoints under the same name (which may or may not be intentional), so I won't insist on it. Plus, keeping the functions in a single `BackendEntrypoint` makes raising helpful error messages a bit easier.

(And, one might argue that given any of the functions it is possible to implement the others, although it might not be the most efficient way to implement them.)
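Just to illustrate the two-group idea (the `xarray.datatree_backends` group name is invented for this sketch; only `xarray.backends` exists today):

```python
# Hypothetical discovery of backends registered under two separate entrypoint
# groups. The second group name is an assumption, not an existing xarray group.
from importlib.metadata import entry_points

dataset_backends = {ep.name: ep for ep in entry_points(group="xarray.backends")}
datatree_backends = {
    ep.name: ep for ep in entry_points(group="xarray.datatree_backends")
}

# Nothing stops two different packages from registering the same name in the
# two groups (e.g. "netcdf4" in both), which is the ambiguity mentioned above.
```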
When we're getting closer to the finish line I think we should put some effort into getting the type hints correct.
There are some odd things going on here that it is probably a good idea to sort out while we're doing larger changes in the backend design.
The files I look at usually have 2000+ variables spread out over ~20 groups, so for me it would be great to parallelize as much as possible. For brainstorming purposes, here's an example function I've been testing where the groups are opened in parallel, inspired by `open_mfdataset`:
```python
from __future__ import annotations

import os
from typing import Any

import xarray as xr

# NestedSequence lives in xarray's private typing module; its location may vary by version.
from xarray.core.types import NestedSequence


def import_datatrees(
    paths: str | os.PathLike | NestedSequence[str | os.PathLike],
    *,
    parallel: bool = False,
) -> DataTree:
    from datatree import DataTree

    if parallel:
        import dask

        # wrap open_dataset and getattr with delayed
        open_ = dask.delayed(xr.open_dataset)
        getattr_ = dask.delayed(getattr)
    else:
        open_ = xr.open_dataset
        getattr_ = getattr

    # Get the groups for each file (CustomBackendEntrypoint is a hypothetical
    # backend that exposes an ``open_groups`` helper, see below):
    paths_and_groups: dict[str, tuple[str | int | Any, ...]] = {
        path: CustomBackendEntrypoint.open_groups(path)
        for path in xr.backends.common._find_absolute_paths(paths)
    }

    keys = []
    datasets = []
    for path, groups_ in paths_and_groups.items():
        for group in groups_:
            keys.append(f"{path}/{group}")  # DataTree.from_dict expects "/"-separated paths
            datasets.append(
                open_(
                    path,
                    path=path,  # backend-specific kwarg forwarded to the custom entrypoint
                    engine=CustomBackendEntrypoint,
                    group=group,
                    # chunks={},
                    # parallel=True,
                )
            )
    closers = [getattr_(ds, "_close") for ds in datasets]

    if parallel:
        # calling compute here will return the datasets/file_objs lists,
        # the underlying datasets will still be stored as dask arrays
        datasets, closers = dask.compute(datasets, closers)

    return DataTree.from_dict({k: d for k, d in zip(keys, datasets)})
```
In this version the `CustomBackendEntrypoint` has to:

- be able to (quickly) get a list of all available groups in the file through `open_groups`.
- know what to do if the `group` argument is passed to `open_dataset`.
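For context, a minimal skeleton of the `CustomBackendEntrypoint` assumed above might look roughly like this; `open_groups` and its return value are this thread's invention, not an existing xarray API:

```python
# Hedged sketch of a backend meeting the two requirements above.
import xarray as xr
from xarray.backends import BackendEntrypoint


class CustomBackendEntrypoint(BackendEntrypoint):
    @classmethod
    def open_groups(cls, filename_or_obj) -> list[str]:
        # Cheaply list every group path in the file without reading data;
        # a real implementation would inspect the file's metadata here.
        return ["group0/group0", "group0/group1", "group1"]

    def open_dataset(self, filename_or_obj, *, group=None, drop_variables=None, **kwargs):
        # Read only the requested group into a Dataset.
        # Placeholder body; a real backend would decode variables from ``group``.
        return xr.Dataset(attrs={"group": group or "/"})
```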
```python
datasets[path] = ds

# Recursively add children to collector
for child_name, child_store in store.get_group_stores().items():
```
I had an idea here to use `dask.delayed` and open all groups in parallel in a similar fashion to `open_mfdataset`:

Lines 1000 to 1020 in 52f5cf1
```python
if parallel:
    import dask

    # wrap the open_dataset, getattr, and preprocess with delayed
    open_ = dask.delayed(open_dataset)
    getattr_ = dask.delayed(getattr)
    if preprocess is not None:
        preprocess = dask.delayed(preprocess)
else:
    open_ = open_dataset
    getattr_ = getattr

datasets = [open_(p, **open_kwargs) for p in paths]
closers = [getattr_(ds, "_close") for ds in datasets]
if preprocess is not None:
    datasets = [preprocess(ds) for ds in datasets]

if parallel:
    # calling compute here will return the datasets/file_objs lists,
    # the underlying datasets will still be stored as dask arrays
    datasets, closers = dask.compute(datasets, closers)
```
With that parallel mindset I was thinking that it would be better if `get_group_stores` returned a flat list of all groups in the file, e.g. `store.get_group_stores() == ["group0/group0", "group0/group1", "group1"]`.
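As a condensed variant of the brainstorming function above, a flat list of group paths could feed a parallel open directly; `open_group_as_dataset` here is a hypothetical stand-in for whatever per-group reader the backend provides:

```python
# Hedged sketch: open each group of a flat group list with dask.delayed and
# assemble the results into a DataTree.
import dask
import xarray as xr
from datatree import DataTree


def open_group_as_dataset(path, group):
    # Stand-in for the backend's per-group read; replace with a real reader.
    return xr.open_dataset(path, group=group)


def open_datatree_parallel(path, group_paths: list) -> DataTree:
    delayed_datasets = [
        dask.delayed(open_group_as_dataset)(path, group) for group in group_paths
    ]
    # One compute call resolves all groups; data variables can stay lazy if the
    # reader requested dask-backed arrays (e.g. via chunks={}).
    datasets = dask.compute(*delayed_datasets)
    return DataTree.from_dict(dict(zip(group_paths, datasets)))
```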
I was prompted to take a look back at this again after the update on datatree at SciPy (my apologies for letting this PR go stale for so long while lots of life stuff got in the way), and it looks like this PR, as titled, has been resolved by #8697 and #9014! So, would it make sense to close this?

That being said, it does look like the PR is referenced in a couple of places with respect to outside backend support and parallelization. Should any new/replacement issues be raised so we don't lose track of these through the course of #8572?

(Also, speaking of SciPy, I'll be at the sprints and would be happy to get back into working on this...perhaps trying to add datatree support on the cfgrib side, though with anticipation of #9137?)
Yes I think this can be closed!
What is there to keep track of? The issue is solved in the sense that external backends can now use the additions to the backend entrypoint class to implement their own.
Amazing! I'll be there on the Saturday at least, and was planning to do datatree and/or virtualizarr stuff (could even combine them together...)
Good point! I was thinking it was tracking the "has been used" rather than "can be used", but the latter makes more sense. @Illviljan had some comments above about parallelized reads with dask, but perhaps those are better suited for xarray-contrib/datatree#97 and xarray-contrib/datatree#196?
I think any further optimizations (dask or otherwise) deserve their own separate issues.
As discussed among folks at today's Pangeo working meeting (cc @jhamman, @TomNicholas), we are looking to try adding support for `DataTree` in the Backend API, so that backend engines can readily add `DataTree` capability. For example, with `cfgrib`, we could have something like the sketch below, given that `cfgrib` implements the appropriate method on its `BackendEntrypoint` subclass. Similarly, with NetCDF files or Zarr stores with groups, we could open as a `DataTree` to obviate the need to specify a single group.
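A hypothetical usage sketch of what this could enable (the file name is made up, and cfgrib `DataTree` support is aspirational rather than existing functionality):

```python
# Aspirational sketch; depends on cfgrib implementing the proposed
# open_datatree method on its BackendEntrypoint subclass.
from datatree import open_datatree

dt = open_datatree("observations.grib", engine="cfgrib")  # hypothetical file
print(dt)  # one node per group/hierarchy level, instead of a single Dataset
```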
Working Design Doc: https://hackmd.io/Oqeab-54TqOOHd5FdCb5DQ?edit

xref ecmwf/cfgrib#327, openradar/xradar#7
- Closes #xxxx
- `whats-new.rst`
- `api.rst`