open_dict_of_datasets function to open any file containing nested groups #9137

Closed
TomNicholas opened this issue Jun 19, 2024 · 7 comments · Fixed by #9243
Labels
enhancement, io, topic-backends, topic-DataTree (Related to the implementation of a DataTree class)

Comments

@TomNicholas
Contributor

Is your feature request related to a problem?

In #9077 (comment) I suggested the idea of a function which could open any netCDF file with groups as a dictionary mapping group path strings to xr.Dataset objects.

The motivation is as follows:

  • People want the new xarray.DataTree class to support inheriting coordinates from parent groups,
  • This can only be done if the coordinates align with the variables in the child group (i.e. using xr.align),
  • The best time to enforce this alignment is at DataTree construction time,
  • This requirement is not enforced in the netCDF/Zarr model, so this would mean some files can no longer be opened by open_datatree directly, as doing so would raise an alignment error,
  • But we still really want users to have some way to open an arbitrary file with xarray and see what's inside (including displaying all the groups; see "Opening a dataset doesn't display groups", #4840).
  • A simpler intermediate structure of a dictionary mapping group paths to xarray.Dataset objects doesn't enforce alignment, so can represent any file.
  • We should add a new opening function to allow any file to be opened as this dict-of-datasets structure.
  • Users can then use this to inspect "untidy" data, and make changes to the dict returned before creating an aligned DataTree object via DataTree.from_dict if they like.
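
To make the motivation above concrete, here is a minimal toy sketch in plain Python. It is not xarray's alignment logic: each group is modelled as a plain dict mapping a dimension name to its coordinate values, and the helper names (`ancestors`, `find_misaligned`) are hypothetical. It only illustrates that a flat dict of groups can hold data that a tree with inherited coordinates would reject, and that the dict can be repaired before building a tree.

```python
from posixpath import dirname


def ancestors(path):
    """Yield the ancestor paths of ``path``, nearest first, ending at "/"."""
    while path != "/":
        path = dirname(path)
        yield path


def find_misaligned(groups):
    """Return (group_path, dim) pairs whose coordinate values differ from
    the nearest ancestor group that defines the same dimension."""
    problems = []
    for path, coords in groups.items():
        for dim, values in coords.items():
            for parent in ancestors(path):
                parent_coords = groups.get(parent, {})
                if dim in parent_coords:
                    if parent_coords[dim] != values:
                        problems.append((path, dim))
                    break  # only compare against the nearest definition
    return problems


# A flat dict can represent any file, including "untidy" ones.
groups = {
    "/": {"time": [0, 1, 2]},
    "/tidy": {"time": [0, 1, 2]},
    "/untidy": {"time": [0, 1]},  # clashes with the root "time" coordinate
}
problems = find_misaligned(groups)

# One possible repair before building a tree: rename the clashing dimension.
fixed = dict(groups)
fixed["/untidy"] = {"time_short": [0, 1]}
remaining = find_misaligned(fixed)
```

In the real workflow, the inspection and repair would happen on `xr.Dataset` objects, and the repaired dict would then be passed to `DataTree.from_dict`.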

Describe the solution you'd like

Add a function like this:

def open_dict_of_datasets(
    filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
    engine: T_Engine = None,
    group: str | None = None,
    **kwargs,
) -> dict[str, Dataset]:
    """
    Open and decode a file or file-like object, creating a dictionary containing one xarray Dataset for each group in the file.

    Useful when you have e.g. a netCDF file containing many groups, some of which are not alignable with their parents and so the file cannot be opened directly with ``open_datatree``.

    It is encouraged to use this function to inspect your data, then make the necessary changes to make the structure coercible to a `DataTree` object before calling `DataTree.from_dict()` and proceeding with your analysis.

    Parameters
    ----------
    filename_or_obj : str, Path, file-like, or DataStore
        Strings and Path objects are interpreted as a path to a netCDF file or Zarr store.
    engine : str, optional
        Xarray backend engine to use. Valid options include `{"netcdf4", "h5netcdf", "zarr"}`.
    group : str, optional
        Group to use as the root group to start reading from. Groups above this root group will not be included in the output.
    **kwargs : dict
        Additional keyword arguments passed to :py:func:`~xarray.open_dataset` for each group.

    Returns
    -------
    dict[str, xarray.Dataset]

    See Also
    --------
    open_datatree()
    DataTree.from_dict()
    """
    ...

This would live inside xarray/backends/api.py, and be exposed publicly as a top-level function along with the rest of open_datatree/DataTree etc. as part of #9033.
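
As an illustration of the `group` parameter described in the docstring above, the following sketch filters a list of group paths down to a chosen root group and its descendants. The helper name `select_subtree` is hypothetical, and keeping the full absolute paths as keys (rather than paths relative to `group`) is an assumption; the proposal does not specify this.

```python
def select_subtree(paths, group=None):
    """Keep only ``group`` and the groups nested below it.

    Assumption: paths stay absolute in the output; the issue does not say
    whether keys should instead be made relative to ``group``.
    """
    if group is None:
        return list(paths)
    root = "/" + group.strip("/")  # normalise "a", "/a" and "/a/" to "/a"
    if root == "/":
        return list(paths)
    # Match the root itself or anything strictly below it. The trailing "/"
    # stops "/ab" from being treated as a child of "/a".
    return [p for p in paths if p == root or p.startswith(root + "/")]


all_paths = ["/", "/a", "/a/b", "/ab", "/c"]
subtree = select_subtree(all_paths, group="a")
```

Groups above the chosen root, and siblings such as `/ab` and `/c`, are dropped from the result.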

The actual implementation could re-use the code for opening many groups of the same file performantly from #9014. Indeed we could add an open_dict_of_datasets method to the BackendEntrypoint class, which uses pretty much the same code as the existing open_datatree method added in #9014 but just doesn't actually create a DataTree object.

Describe alternatives you've considered

Really the main alternative to this is not to have coordinate inheritance in DataTree at all (see #9077), in which case open_datatree would be sufficient to open any file.


The name of the function is up for debate. I prefer nothing with the word "datatree" in it since this doesn't actually create a DataTree object at any point. (In fact we could and perhaps should have implemented this function years ago, even without the new DataTree class.) The reason for not calling it "open_as_dict_of_datasets" is that we don't use "as" in the existing open_dataset/open_dataarray etc.

Additional context

cc @eni-awowale @flamingbear @owenlittlejohns @keewis @shoyer @autydp

@TomNicholas TomNicholas added enhancement topic-DataTree Related to the implementation of a DataTree class labels Jun 19, 2024
@shoyer
Member

shoyer commented Jun 20, 2024

I'm not sure if we'd want to use exactly the same API, but an additional use case for opening a dict of Dataset objects is opening a glob/directory of netCDF files, i.e., the first part of open_mfdataset.

@keewis
Collaborator

keewis commented Jul 3, 2024

just to throw out more ideas: open_groups or open_group_dict could also work as names, and wouldn't be as long as open_as_dict_of_datasets.

@TomNicholas
Contributor Author

TomNicholas commented Jul 3, 2024

I quite like open_groups! It's succinct, communicates that if you don't have groups you won't need this, doesn't use the word datatree, and is plural to indicate that you will get back multiple objects. Only downside is that it breaks the pattern that open_dataset opens a Dataset, because we don't have a "group" object.

@shoyer
Member

shoyer commented Jul 3, 2024 via email

@eni-awowale
Collaborator

Talked to Tom about this but thought I would post it here too. I am working on a PR for this!

@eni-awowale
Collaborator

Okay so I have a question about getting open_groups to use the specified engine. I noticed that for open_dataset in xarray.backends.api we use plugins.guess_engine(filename_or_obj) (also here). Is that something we would want to do for open_groups?

@TomNicholas
Contributor Author

Is that something we would want to do for open_groups?

I think guessing the engine would be useful, but remember that if the implementation of open_groups relies on a codepath that only exists for certain backends (i.e. right now I think @aladinor only implemented the open_datatree method for the netcdf and zarr engines), then it should raise a NotImplementedError for the other backends.

Also note that the way to do this should presumably be to add an open_groups method to the BackendEntrypoint ABC, move most of the implementation in #9014 to be inside that method instead, and refactor the open_datatree method to just be

def open_datatree(self, filename_or_obj, **kwargs) -> DataTree:
    # open every group as a Dataset, without enforcing alignment between groups
    dict_of_datasets = self.open_groups(filename_or_obj, **kwargs)

    # if the user is trying to open a file with groups that can't be represented as a tree it will raise here
    tree_root = DataTree.from_dict(dict_of_datasets)

    return tree_root
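
The refactor suggested above can be sketched end to end with toy stand-ins. Everything here is hypothetical: `BackendSketch` stands in for the BackendEntrypoint ABC, `build_tree` stands in for `DataTree.from_dict` (it only nests groups by path and performs no alignment check), and the two subclasses are invented for illustration. The point is only the shape: `open_groups` does the reading, `open_datatree` is a thin wrapper, and backends that lack `open_groups` raise `NotImplementedError`.

```python
def build_tree(groups):
    """Toy stand-in for DataTree.from_dict: nest groups by their path."""
    tree = {"data": None, "children": {}}
    for path, data in sorted(groups.items()):
        node = tree
        for part in path.strip("/").split("/"):
            if part:  # the root path "/" yields an empty part; skip it
                node = node["children"].setdefault(
                    part, {"data": None, "children": {}}
                )
        node["data"] = data
    return tree


class BackendSketch:
    """Hypothetical stand-in for the BackendEntrypoint base class."""

    def open_groups(self, filename_or_obj, **kwargs):
        # Default: backends that have not implemented group reading say so.
        raise NotImplementedError(
            f"{type(self).__name__} does not implement open_groups"
        )

    def open_datatree(self, filename_or_obj, **kwargs):
        # Thin wrapper: read all groups, then build the tree from the dict.
        groups = self.open_groups(filename_or_obj, **kwargs)
        return build_tree(groups)


class InMemoryBackend(BackendSketch):
    """Toy backend that 'reads' groups from a dict instead of a file."""

    def __init__(self, store):
        self._store = store

    def open_groups(self, filename_or_obj, **kwargs):
        return dict(self._store)


class LegacyBackend(BackendSketch):
    """Toy backend with no group support; inherits the default behaviour."""


backend = InMemoryBackend({"/": "root data", "/a/b": "leaf data"})
flat = backend.open_groups("example.nc")     # dict of groups, no tree built
tree = backend.open_datatree("example.nc")   # same data, nested by path
```

With this layout, a user whose file cannot be represented as a tree can still call `open_groups` directly, since the failure point lives only in the tree-building step.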
