
[FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation #6174

Open · tovogt opened this issue Jan 19, 2022 · 11 comments
Labels: enhancement, topic-backends, topic-DataTree

Comments

@tovogt (Contributor) commented Jan 19, 2022

Is your feature request related to a problem?

I know that there is a big discussion going on in #4118 about organizing hierarchies of datasets within xarray's data structures. But this issue is supposed to address only a comparatively simple aspect of that discussion.

Suppose that you have a list ds_list of xarray.Dataset objects with different dimensions etc., and you want to store them all in one NetCDF4 file using the group feature of the NetCDF4 format. The group name of each dataset is stored in ds_names. Obviously, you can do something like this:

for name, ds in zip(ds_names, ds_list):
    ds.to_netcdf(path, group=name)

However, this is really slow when you have many (hundreds or thousands of) small datasets because the file is opened and closed in every iteration.

Describe the solution you'd like

I would like to have a function xr.to_netcdf that writes a list (or a dictionary) of datasets to a single NetCDF4 file with a single open/close operation. Ideally there should also be a way to read many datasets at once from a single NetCDF4 file using xr.open_dataset.
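A rough sketch of what I have in mind; neither call exists in xarray today, and ds_a/ds_b are placeholder datasets:

import xarray as xr

# hypothetical: write a dict of {group_name: Dataset} with one open/close
xr.to_netcdf({"group_a": ds_a, "group_b": ds_b}, "output.nc")

# hypothetical counterpart: read every group back as {group_name: Dataset}
ds_dict = xr.open_dataset("output.nc", group="*")  # "*" is a made-up wildcard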

Describe alternatives you've considered

Currently, I'm using the following read/write functions to achieve the same:

import pathlib
from xarray.backends import NetCDF4DataStore
from xarray.backends.api import dump_to_store
from xarray.backends.common import ArrayWriter
from xarray.backends.store import StoreBackendEntrypoint

def _xr_to_netcdf_multi(path, ds_dict, encoding=None):
    """Write multiple xarray Datasets to separate groups in a single NetCDF4 file

    Parameters
    ----------
    path : str or Path
        Path of the target NetCDF file.
    ds_dict : dict whose keys are group names and values are xr.Dataset
        Each xr.Dataset in the dict is stored in the group identified by its key in the dict.
        Note that an empty string ("") is a valid group name and refers to the root group.
    encoding : dict whose keys are group names and values are dict, optional
        For each dataset/group, one dict that is compliant with the format of the `encoding`
        keyword parameter in `xr.Dataset.to_netcdf`. Default: None
    """
    path = str(pathlib.Path(path).expanduser().absolute())
    # Open the target file once; all groups are written through this single store.
    store = NetCDF4DataStore.open(path, "w", "NETCDF4", None)
    try:
        writer = ArrayWriter()
        for group, dataset in ds_dict.items():
            # Point the store at the target group (relies on a private attribute).
            store._group = group
            unlimited_dims = dataset.encoding.get("unlimited_dims", None)
            # Look up this group's encoding without clobbering the `encoding` dict itself.
            enc = None if encoding is None else encoding.get(group)
            dump_to_store(dataset, store, writer, encoding=enc, unlimited_dims=unlimited_dims)
    finally:
        store.close()
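
For illustration, writing a handful of made-up datasets with this helper (names and contents are placeholders):

import numpy as np
import xarray as xr

ds_dict = {
    f"group_{i}": xr.Dataset({"v": ("t", np.arange(4) * i)})
    for i in range(3)
}
_xr_to_netcdf_multi("many_groups.nc", ds_dict)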

def _xr_open_dataset_multi(path, prefix=""):
    """Read multiple xarray Datasets from groups contained in a single NetCDF4 file

    Warning: The data is loaded into memory!

    Parameters
    ----------
    path : str or Path
        Path of the NetCDF file to read.
    prefix : str, optional
        If given, only read groups whose name starts with this prefix. Default: ""

    Returns
    -------
    ds_dict : dict whose keys are group names and values are xr.Dataset
        Each xr.Dataset in the dict is taken from the group identified by its key in the dict.
        Note that an empty string ("") is a valid group name and refers to the root group.
    """
    path = str(pathlib.Path(path).expanduser().absolute())
    # Open the file once; all groups are read through this single store.
    store = NetCDF4DataStore.open(path, "r", "NETCDF4", None)
    ds_dict = {}
    try:
        groups = [g for g in _xr_nc4_groups_from_store(store) if g.startswith(prefix)]
        store_entrypoint = StoreBackendEntrypoint()
        for group in groups:
            store._group = group
            ds = store_entrypoint.open_dataset(store)
            # Load into memory so the data remains usable after the store is closed.
            ds.load()
            ds_dict[group] = ds
    finally:
        store.close()
    return ds_dict

def _xr_nc4_groups_from_store(store):
    """List all groups contained in the given NetCDF4 data store

    Parameters
    ----------
    store : xarray.backend.NetCDF4DataStore

    Returns
    -------
    list of str
    """
    def iter_groups(ds, prefix=""):
        groups = [""]
        for group_name, group_ds in ds.groups.items():
            groups.extend([f"{prefix}{group_name}{subgroup}"
                           for subgroup in iter_groups(group_ds, prefix="/")])
        return groups
    with store._manager.acquire_context(False) as root:
        return iter_groups(root)
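
And reading everything back in one pass, continuing the write example above:

ds_dict = _xr_open_dataset_multi("many_groups.nc", prefix="group_")
for group, ds in ds_dict.items():
    print(group, dict(ds.sizes))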


@TomNicholas added the topic-DataTree label on Jan 19, 2022
@TomNicholas (Member) commented Jan 19, 2022

I would like to have a function xr.to_netcdf that writes a list (or a dictionary) of datasets to a single NetCDF4 file.

If you've read through all of #4118 you will have seen that there is a prototype package providing a nested data structure which can handle groups. Using DataTree we can easily write a dictionary of datasets to a single netCDF file as groups:

from datatree import DataTree

dt = DataTree.from_dict(ds_dict)
dt.to_netcdf('filepath.nc')

(Here if you want groups within groups then the keys in the dictionary should be specified like filepaths, e.g. /group1/group2/ds_name.)
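For instance, with a hypothetical two-level layout (ds_a and ds_b are placeholder datasets):

from datatree import DataTree

dt = DataTree.from_dict({
    "/group1/ds_name": ds_a,
    "/group1/group2/ds_name": ds_b,
})
dt.to_netcdf('filepath.nc')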

Ideally there should also be a way to read many datasets at once from a single NetCDF4 file using xr.open_dataset.

Again DataTree allows you to open all the groups at once, returning a tree-like structure which contains all the groups:

from datatree import open_datatree

dt = open_datatree('filepath.nc')

To extract all the groups as individual datasets you can do this to recreate the dictionary of datasets:

ds_dict = {node.pathstr: node.ds for node in dt.subtree}

However, this is really slow when you have many (hundreds or thousands of) small datasets because the file is opened and closed in every iteration.

Currently, I'm using the following read/write functions to achieve the same:

Is your solution noticeably faster? We (@jhamman and I) haven't really thought about the speed of DataTree I/O yet, preferring to just make something simple which works for now. The current I/O code for DataTree is here.

Despite that project only being a prototype, it is still probably the best solution to your problem that we currently have (at least the neatest). If you are interested in trying it out and reporting any problems then that would be greatly appreciated!

EDIT: The idea discussed here might also be of interest to you.

@tovogt (Contributor, Author) commented Jan 20, 2022

Thanks for your quick response, Tom!

I'm sure that DataTree is a really neat solution for most people working with hierarchically structured data.

In my case, we are talking about a very unusual application of the NetCDF4 groups feature: We store literally thousands of very small NetCDF datasets in a single file. A file containing 3000 datasets is typically not larger than 100 MB.

With that setup, the I/O performance is critical. Opening and closing the file on each group read/write is very, very bad. On our cluster this means that writing that 100 MB file takes 10 hours with your DataTree implementation, and 30 minutes with my helper functions. For reading, the effect is smaller, but still noticeable.

So, my request is really about the I/O performance, and I don't need a full-fledged hierarchical data management API in xarray for that.

@TomNicholas (Member) commented Jan 20, 2022

In my case, we are talking about a very unusual application of the NetCDF4 groups feature: We store literally thousands of very small NetCDF datasets in a single file. A file containing 3000 datasets is typically not larger than 100 MB.

Ah - thanks for the clarification as to the context @tovogt !

So, my request is really about the I/O performance, and I don't need a full-fledged hierarchical data management API in xarray for that.

That's fair enough.

On our cluster this means that writing that 100 MB file takes 10 hours with your DataTree implementation, and 30 minutes with my helper functions. For reading, the effect is smaller, but still noticeable.

So are you asking whether:
a) we should add a function to xarray which uses the same trick your helper functions do, for when people have a similar problem to you?
b) we should use the same trick your helper functions do to rewrite the I/O implementation of DataTree to only require one open/close? (It seems to me that this could be the best of both worlds, once implemented.)
c) there is some other way to do this even faster than your helper functions?

EDIT: Tagging @alexamici / @aurghs for their backends expertise + interest in DataTree

@tovogt (Contributor, Author) commented Jan 21, 2022

When I first posted this issue, I thought the best solution would be to just implement my proposed helper functions as part of the official xarray API. I don't think our project would add DataTree as a new dependency just for this, as long as we have a very easy and viable solution of our own.

But now I have a new idea. First, I noticed that open_dataset doesn't actually close the file handle, but reuses it later if needed. So at least there is no performance problem with the current read setup. For writing, there should be an option in to_netcdf that ensures that xarray does not close the file handle. xarray already uses a CachingFileManager to open NetCDF4 files:

manager = CachingFileManager(
    netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
)

That means that manager already ensures that the same file handle is reused in subsequent to_netcdf operations on the same file, unless it's closed in the meantime. Closing is managed here:

xarray/backends/api.py, lines 1072 to 1094 at commit 0ffb0f4:

    dump_to_store(
        dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
    )
    if autoclose:
        store.close()

    if multifile:
        return writer, store

    writes = writer.sync(compute=compute)
    if path_or_file is None:
        store.sync()
        return target.getvalue()
finally:
    if not multifile and compute:
        store.close()

if not compute:
    import dask

    return dask.delayed(_finalize_store)(writes, store)
return None

It's a bit opaque when closing is actually triggered in practice, especially if you only look at the current docstrings. I found that, in fact, setting compute=False in to_netcdf will prevent the closing until you explicitly call compute on the returned object:

for name, ds in zip(ds_names, ds_list):
    # compute=False: the data is written through the cached file handle,
    # but the file is not closed; a dask.delayed finalizer is returned instead.
    delayed = ds.to_netcdf(path, group=name, compute=False)
# Computing the delayed object syncs and closes the shared file handle.
delayed.compute()

If this were communicated more transparently in the docstrings, it would bring us a big step closer to the solution of this issue 🙂

Apart from that, there is only one problem left: getting a full list of all groups contained in a NetCDF4 file so that we can read them all in. In DataTree, you fall back to using the NetCDF4 (or h5netcdf) API directly for that purpose: _get_nc_dataset_class and _iter_nc_groups. That's not the worst solution. However, I would insist that xarray should be able to do this. Maybe we need an open_datasets_from_groups function for that, or rather a function list_datasets. But it should somehow be solvable within the xarray API without requiring a two-year debate about the management and representation of hierarchical data structures.
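To make this concrete, here is a minimal sketch of such a group-listing helper built directly on the netCDF4 library (the name list_groups and its placement are hypothetical):

import netCDF4

def list_groups(path):
    """Return the paths of all groups in a NetCDF4 file ("" is the root group)."""
    def walk(nc, prefix=""):
        yield prefix
        for name, sub in nc.groups.items():
            yield from walk(sub, f"{prefix}/{name}")
    with netCDF4.Dataset(path, "r") as nc:
        return list(walk(nc))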

@TomNicholas (Member) commented

I don't think our project would add DataTree as a new dependency just for this as long as we have a very easy and viable solution of ourselves.

FYI the plan with DataTree is to eventually integrate the work upstream into xarray, so no new dependency would be required at that point. That might take a while, however.

If this would be communicated more transparently in the docstrings, it would bring us a big step closer to the solution of this issue

That's good at least! Do you have any suggestions for where the docs should be improved? PRs are of course always welcome too 😁

one problem left: Getting a full list of all groups contained in a NetCDF4 file so that we can read them all in.

I would insist that xarray should be able to do this. Maybe we need a open_datasets_from_groups function for that, or rather a function list_datasets. But it should somehow be solvable within the xarray API without requiring a two-year debate about the management and representation of hierarchical data structures.

I agree, and would be open to a function like this (even if eventually DataTree renders it redundant). It's definitely an omission on our part that xarray still doesn't provide an easy way to do this - I've found myself wanting to easily see all the groups multiple times. However, my understanding is that it's slightly tricky to implement, though suggestions/corrections are welcome!

@Illviljan (Contributor) commented

Is it that difficult to get a list of groups, though? I've been testing a backend engine that merges many groups into one dataset (dims/coords/variables renamed slightly to avoid duplicate names until they've been interpolated together) using h5py.

Getting the groups is about the first thing you have to do; the code would look something like this:

>>> f = h5py.File('foo.hdf5','w')
>>> f.name
'/'
>>> list(f.keys())
[]

https://docs.h5py.org/en/stable/high/group.html

Sure, it can be quite tiresome to navigate the backend engines and 3rd party modules in xarray to add this. But most of them use h5py or something quite similar at their core, so it shouldn't be THAT bad.

For example, one could add another method here that retrieves them in a quick and easy way:

class BackendEntrypoint:
    """
    ``BackendEntrypoint`` is a class container and it is the main interface
    for the backend plugins, see :ref:`RST backend_entrypoint`.
    It shall implement:
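A sketch of what such a method could look like (the method name, and doing it via h5py.File.visititems, are my assumptions rather than existing xarray API):

import h5py
from xarray.backends import BackendEntrypoint

class H5GroupListingEntrypoint(BackendEntrypoint):
    """Hypothetical backend entrypoint with a group-listing method."""

    def list_groups(self, filename):
        groups = ["/"]  # the root group is always present

        def collect(name, obj):
            # visititems walks every object in the file; keep only the groups
            if isinstance(obj, h5py.Group):
                groups.append("/" + name)

        with h5py.File(filename, "r") as f:
            f.visititems(collect)
        return groups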

@tovogt (Contributor, Author) commented Jan 24, 2022

It's not at all tricky to implement the listing of groups in a NetCDF4 file, at least not for the "netcdf4" engine. The code for that is in my OP above:

def _xr_nc4_groups_from_store(store):
    """List all groups contained in the given NetCDF4 data store

    Parameters
    ----------
    store : xarray.backend.NetCDF4DataStore

    Returns
    -------
    list of str
    """
    def iter_groups(ds, prefix=""):
        groups = [""]
        for group_name, group_ds in ds.groups.items():
            groups.extend([f"{prefix}{group_name}{subgroup}"
                           for subgroup in iter_groups(group_ds, prefix="/")])
        return groups
    with store._manager.acquire_context(False) as root:
        return iter_groups(root)

@tovogt (Contributor, Author) commented Jan 24, 2022

That's good at least! Do you have any suggestions for where the docs should be improved? PRs are of course always welcome too

Here is my PR for the docstring improvements: #6187

@shoyer (Member) commented Feb 2, 2022

Have you seen xarray.save_mfdataset?

In principle, it was designed for exactly this sort of thing.
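For instance, something along these lines; save_mfdataset accepts a groups argument, and the paths/names here are illustrative:

import xarray as xr

# one entry in `paths` per dataset, all pointing at the same file
xr.save_mfdataset(ds_list, [path] * len(ds_list), groups=list(ds_names))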

@tovogt (Contributor, Author) commented Feb 3, 2022

Have you seen xarray.save_mfdataset?

In principle, it was designed for exactly this sort of thing.

Thanks for the hint! Unfortunately, the docstring already says that "it is no different than calling to_netcdf repeatedly". And I explained in my OP that this causes repeated file open/close operations, which is the whole point of this issue.

Furthermore, when using save_mfdataset with my setup, it complains:

ValueError: cannot use mode='w' when writing multiple datasets to the same path

But when using mode='a' instead, it will complain that the file doesn't exist.

However, it might still be the way to go API-wise. So when talking about a solution to this issue, we could aim at fixing save_mfdataset: 1) writing to the same file should use a single open/close operation, and 2) mode='w' (or mode='w+') should be supported when writing several datasets to the same path.

@TomNicholas (Member) commented Oct 21, 2024

FYI, I think your _xr_open_dataset_multi function can now be replaced with xr.open_groups. That function opens the file only once and returns a dict[str, xr.Dataset].
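For example (filename illustrative):

import xarray as xr

ds_dict = xr.open_groups("many_groups.nc")  # one file open, {group_path: Dataset}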

Furthermore, when using save_mfdataset with my setup, it complains:

ValueError: cannot use mode='w' when writing multiple datasets to the same path

Is it possible for multiple writers to safely write to different groups of a netCDF file at once? This could be done with zarr (definitely with icechunk), in which case this error could be relaxed.

But when using mode='a' instead, it will complain that the file doesn't exist.

This seems reasonable.
