MultiIndex serialization to NetCDF #1077
This is a good question -- I don't think we've figured it out yet. Maybe you have ideas? The main question (to me) is whether we should store raw values for each level in a MultiIndex (closer to what you see), or category encoded values (closer to the MultiIndex implementation). To be more concrete, here is what these look like for an example MultiIndex:
Advantages of storing raw values:
Advantages of storing category encoded values:
Perhaps the best approach would be a hybrid: store raw values, as well as de-duplicated integer codes specifying the order of values in each MultiIndex. Either way, we will need to store an attribute or two with metadata for how to restore the levels (e.g., |
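To make the two candidate representations concrete, here is a minimal sketch using only pandas. The example index is hypothetical (the one from the original comment was not preserved), and the variable names `raw`, `levels` and `codes` are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical example index; not the one from the original comment.
index = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [10, 20, 10, 30]], names=("letter", "number")
)

# Raw values: one full-length array per level, as a user sees them.
raw = {name: index.get_level_values(name).to_numpy() for name in index.names}

# Category-encoded values: de-duplicated levels plus integer codes into them.
levels = {name: lev.to_numpy() for name, lev in zip(index.names, index.levels)}
codes = {name: code for name, code in zip(index.names, index.codes)}

for name in index.names:
    print(name, "raw:", raw[name], "levels:", levels[name], "codes:", codes[name])

# Either representation is enough to rebuild the index.
from_raw = pd.MultiIndex.from_arrays(list(raw.values()), names=index.names)
from_codes = pd.MultiIndex(
    levels=list(levels.values()), codes=list(codes.values()), names=index.names
)
```

A hybrid format would store both `raw` and `codes`, so that other tools can read the raw values while xarray could rebuild the index from the codes.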
Personally I'd vote for the category encoded values. If I make files with a newer xarray, I'll be reading them later with the same (or newer) xarray and I'd definitely want the exact MultiIndex back.

I don't want to be too self-centered in my perspective in all of this. But my applications are definitely in the large-scale scientific computing area that seems to be the community norm for xarray, so I would guess many others would have a similar situation. I generate data that are associated with nodes or elements in a mesh. The mesh is naturally split into named regions. Sometimes I need to operate on the entire dataset (including all regions) and sometimes I want to select one or more regions. So I make a MultiIndex where the first index is the region name strings, and the second index is the node (or element) number inside the region (i.e. it starts over counting from 1 for each region). So the full index is 1e5 to 1e7 long, of which there are only maybe a few hundred unique values in the string column. I would think that would greatly benefit from the category-encoded storage. And fast and reliable reconstruction of the MultiIndex is a big plus. Does this seem like a common user scenario?

The one thing I'm wondering is, what happens in an application like this if you select on one index (say, all data rows with region_name='FOOBAR-1') from the HDF5 file before doing anything else? Would it be hard to make the MultiIndex/NetCDF reader smart enough not to reconstruct the whole MultiIndex before picking out the relevant rows? And, a related question for us to think about: how would we make this all play nicely with dask?

Sorry for the long post. I've been very impressed and happy working with xarray, and I'm just eager to get the last bit of features I need so I can really start pushing my colleagues into using it. :)

Nuts and bolts questions: So each of index.levels would be easy to store as its own little DataArray, yeah? Then would each of the index.labels be in its own DataArray, or would you want them all in the same 2D DataArray? And then would the actual data in the original DataArray just have a generic integer index as a placeholder, to be replaced by the MultiIndex? For these dummy DataArrays and the multiindex_levels metadata attr, how do you feel about using a single leading underscore in the name? If I were to low-level grunge around in the file for some reason, that would indicate to me that they are private-by-convention implementation details. |
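The region/node scenario above can be sketched with a tiny mesh. This is an illustration only: the region names and sizes are made up, and the selection logic is an assumption about how a reader could pick rows from category-encoded storage, not something xarray implements. Note that `index.labels` in the comment above is the older pandas name for what current pandas calls `index.codes`.

```python
import numpy as np
import pandas as pd

# Hypothetical mesh: 3 named regions with 4, 2 and 3 nodes (real cases are far larger).
region_sizes = {"FOOBAR-1": 4, "FOOBAR-2": 2, "BAZ": 3}
region = np.repeat(list(region_sizes), list(region_sizes.values()))
node = np.concatenate([np.arange(1, n + 1) for n in region_sizes.values()])
index = pd.MultiIndex.from_arrays([region, node], names=("region_name", "node"))

# Category-encoded form: a short table of unique names plus one small integer per row.
region_levels = index.levels[0].to_numpy()
region_codes = np.asarray(index.codes[0])

# Select one region using only the codes, without rebuilding the MultiIndex.
wanted = int(np.flatnonzero(region_levels == "FOOBAR-1")[0])
rows = np.flatnonzero(region_codes == wanted)

data = np.arange(len(index)) * 10.0  # stand-in for a data variable
selected = data[rows]
print(rows, selected)
```

With this layout, selecting a region only needs the short level table and the integer code column, which is the property that makes category-encoded storage attractive for long indexes with few unique strings.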
Point taken -- let's see what others think! One consideration in favor of this is that it will soon be very easy to switch a MultiIndex back into separate coordinate variables, which could be our recommendation for how to save netCDF files for maximum portability.
We could do this, but note that we are contemplating switching xarray to always load indexes into memory eagerly, which would negate that advantage. See this PR and mailing list discussion:
pandas stores levels separately, automatically putting each of them in the smallest possible dtype (
Just a note: for interacting with backends, we use This means that we don't need the generic integer placeholder index (which will also be going away shortly in general, see #1017). |
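The remark above about pandas storing levels separately and choosing compact dtypes can be inspected directly. This is a small sketch with a made-up index; the exact dtypes depend on the pandas version and on the number of unique values per level, so no expected output is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical index: 2 unique letters crossed with 1000 unique numbers.
index = pd.MultiIndex.from_product(
    [["a", "b"], np.arange(1000)], names=("letter", "number")
)

# Each level keeps its unique values once; the per-row codes are plain integers.
for name, level, codes in zip(index.names, index.levels, index.codes):
    codes = np.asarray(codes)
    print(f"{name}: {len(level)} unique values, codes dtype {codes.dtype}")

code_dtypes = [np.asarray(c).dtype for c in index.codes]
```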
I have exactly the same applications as you, @tippetts, but I would also like to write netCDF files that are compatible with tools other than xarray. With the category encoded values as the default behavior, my concern is that xarray users may be unaware that they generate netCDF files which have limited compatibility with 3rd-party tools, unless a clear warning is given in the documentation.
This should be fine, but maybe it would be nice to allow handling this automatically (at read and write) by using a specific |
So if I'm properly understanding and synthesizing your ( @benbovy and @shoyer ) comments: We want the hybrid format for maximum compatibility, with the MultiIndex split into separate 1D raw value coordinates. Using the example above, these would be @shoyer , I see what you mean about @benbovy , what does the |
|
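The "separate 1D raw value coordinates" layout discussed above can be sketched with `reset_index` and `set_index`. The dataset below is hypothetical and borrows the `lat`/`lon`/`landpoint` names used elsewhere in this thread; writing to and reading from a file is left out, so this only shows the in-memory round trip.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: four points indexed by ("lat", "lon") level values.
ds = xr.Dataset(
    {"landsoilt": ("landpoint", np.arange(4.0))},
    coords={
        "lat": ("landpoint", ["a", "a", "b", "b"]),
        "lon": ("landpoint", [1, 2, 1, 2]),
    },
).set_index(landpoint=["lat", "lon"])

# Flatten the MultiIndex into plain 1D coordinates, e.g. before writing to disk.
flat = ds.reset_index("landpoint")

# After reading such a file back, the MultiIndex can be rebuilt from those coordinates.
restored = flat.set_index(landpoint=["lat", "lon"])
print(restored.indexes["landpoint"])
```

The flattened form stores every level value once per row, so it is portable but does not carry the level/code split of the original index.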
Yes that's what I mean, something like Trying to summarize the potential use cases mentioned above:
Note that point 3 is just for convenience; I wouldn't mind too much having to manually reset / refactorize the multi-index in that case. We indeed don't need options if point 3 is not important. |
Here's a new, related question: @shoyer , do you have any interest in adding a class to xarray that contains a hierarchical tree of Datasets, analogous to the groups in a netCDF or HDF5 file? Then opening or saving such an object would be an easy but powerful one-liner. Or is that something you would rather leave to someone else's module? |
Maybe? A minimal class for managing groups in an open file could
|
I've started writing a Currently, this is a minimal class that just implements an "immutable" tree of datasets (it only allows adding child nodes so that we can build a tree). |
Would it be too simplistic to think that I think that is similar to what you've done, @benbovy , but with inheritance rather than composition. I understand that is an often disfavored design pattern, but would it make sense in this case and keep the overall xarray interface simple? |
Yes I'm actually not very happy with the @shoyer do you think that a PR for such a |
Also, as written I don't see any aspects that need to live in core xarray -- it seems that it can mostly be done with the external interface. So I would suggest the separate package. |
Yes, I suppose it doesn't really need to live in core xarray, unless you did want to allow a @benbovy , do you plan to put your |
I would love to have this functionality as well. Unfortunately, I'm not knowledgeable enough to help decide on the internal structure for MultiIndexes. |
Let's recap the options, which I'll illustrate for the second level of my MultiIndex from above (#1077 (comment)):
3 uses only slightly less memory than 2 and can be easily achieved with
4 looks like a faithful roundtrip, but actually isn't in some rare edge cases. That seems like a recipe for disaster, so it should be ruled out.
This leaves 1 and 2. Both are reasonably performant and roundtrip xarray objects with complete fidelity, so I would be happy with either of them. In principle we could even support both, with an argument to switch between the modes (one would need to be the default). My inclination is to start with only supporting 1, because it has a potentially large advantage from a speed/memory perspective, and it's easy to achieve the "raw values" representation with |
This code isn't particularly pretty, and I'm not sure if it handles all cases, but it enables serialization of MultiIndex indices by calling
|
I just came across this issue, which still seems to be open. Are the statements made earlier still valid? Are there any concrete plans to fix this in the near future? |
Once we finish #1603, that may change our perspective here a little bit (and could indirectly solve this problem). |
This seems to be possible following http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#compression-by-gathering

Here is a quick proof of concept:

import numpy as np
import pandas as pd
import xarray as xr

# example 1
ds = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_product(
            [["a", "b"], [1, 2]], names=("lat", "lon")
        )
    },
)

# example 2
# ds = xr.Dataset(
#     {"landsoilt": ("landpoint", np.random.randn(4))},
#     {
#         "landpoint": pd.MultiIndex.from_arrays(
#             [["a", "b", "c", "d"], [1, 2, 4, 10]], names=("lat", "lon")
#         )
#     },
# )

# encode step
# detect using isinstance(index, pd.MultiIndex)
idxname = "landpoint"
encoded = ds.reset_index(idxname)
# store the unique values of each level as its own 1D coordinate
coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
for coord in coords:
    encoded[coord] = coords[coord].values
# flatten the per-level integer codes into one 1D index per row
shape = [encoded.sizes[coord] for coord in coords]
encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
encoded[idxname].attrs["compress"] = " ".join(ds.indexes[idxname].names)

# decode step
# detect using "compress" in var.attrs
idxname = "landpoint"
names = encoded[idxname].attrs["compress"].split(" ")
shape = [encoded.sizes[dim] for dim in names]
indices = np.unravel_index(encoded[idxname].values, shape)
arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
# pass the level names so the rebuilt index matches the original
mindex = pd.MultiIndex.from_arrays(arrays, names=names)
decoded = xr.Dataset({}, {idxname: mindex})
decoded["landsoilt"] = (idxname, encoded["landsoilt"].values)
xr.testing.assert_identical(decoded, ds)
|
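The encode and decode steps in the proof of concept above rest on `np.ravel_multi_index` and `np.unravel_index`. A minimal NumPy-only sketch of that mapping follows; the code arrays are written by hand here (an assumption standing in for `MultiIndex.codes`) and deliberately use an unsorted row order to show that the order survives the round trip.

```python
import numpy as np

# Integer codes for a two-level index with 2 unique values per level
# (the same layout as lat = ["a", "b", "b", "a"], lon = [1, 2, 1, 2]).
lat_codes = np.array([0, 1, 1, 0])
lon_codes = np.array([0, 1, 0, 1])
shape = (2, 2)  # number of unique values in each level

# Encode: one flat position into the conceptual (lat, lon) grid per row.
flat = np.ravel_multi_index((lat_codes, lon_codes), shape)
print(flat)  # [0 3 2 1]

# Decode: recover the per-level codes, in the original row order.
decoded_lat, decoded_lon = np.unravel_index(flat, shape)
```

Because each row keeps its own flat position, the decoded codes come back in the same order as the input rows, not in sorted order.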
@dcherian I think just using |
I agree with @fujiisoup. I think this "compression-by-gathering" representation makes more sense for sparse arrays than for a MultiIndex, per se. That said, MultiIndex and sparse arrays are basically two sides of the same idea. In the long term, it might make sense to try to support only one of the two. |
I may be missing something but @fujiisoup's concern is addressed by the scheme in the CF conventions.
The information about ordering is stored as 1D indexes of an ND array; constructed using
For example, see the dimension coordinate
Here is a cleaned up version of the code for easy testing:

import numpy as np
import pandas as pd
import xarray as xr


def encode_multiindex(ds, idxname):
    encoded = ds.reset_index(idxname)
    # store the unique values of each level as its own 1D coordinate
    coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
    for coord in coords:
        encoded[coord] = coords[coord].values
    # flatten the per-level integer codes into one 1D index per row
    shape = [encoded.sizes[coord] for coord in coords]
    encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
    encoded[idxname].attrs["compress"] = " ".join(ds.indexes[idxname].names)
    return encoded


def decode_to_multiindex(encoded, idxname):
    names = encoded[idxname].attrs["compress"].split(" ")
    shape = [encoded.sizes[dim] for dim in names]
    # use idxname rather than a hard-coded variable name
    indices = np.unravel_index(encoded[idxname].values, shape)
    arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
    # pass the level names so the rebuilt index matches the original
    mindex = pd.MultiIndex.from_arrays(arrays, names=names)
    decoded = xr.Dataset({}, {idxname: mindex})
    for varname in encoded.data_vars:
        if idxname in encoded[varname].dims:
            decoded[varname] = (idxname, encoded[varname].values)
    return decoded


ds1 = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_product(
            [["a", "b"], [1, 2]], names=("lat", "lon")
        )
    },
)
ds2 = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_arrays(
            [["a", "b", "c", "d"], [1, 2, 4, 10]], names=("lat", "lon")
        )
    },
)
ds3 = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4))},
    {
        "landpoint": pd.MultiIndex.from_arrays(
            [["a", "b", "b", "a"], [1, 2, 1, 2]], names=("lat", "lon")
        )
    },
)

idxname = "landpoint"
for dataset in [ds1, ds2, ds3]:
    xr.testing.assert_identical(
        decode_to_multiindex(encode_multiindex(dataset, idxname), idxname), dataset
    )
|
@dcherian, I think using this convention is the best way to save a MultiIndex in netCDF. |
It still isn't clear to me why this is a better representation for a MultiIndex than a sparse array. I guess it could work fine for either, but we would need to pick a convention. |
@shoyer I now understand your earlier comment. I agree that it should work with both sparse and MultiIndex but as such there's no way to decide whether this should be decoded to a sparse array or a MultiIndexed dense array. Following your comment in #3213 (comment)
How about using this "compression by gathering" idea for MultiIndexed dense arrays and "indexed ragged arrays" for sparse arrays? I do not know the internals of
PS: The CF convention for "indexed ragged arrays" is here: http://cfconventions.org/cf-conventions/cf-conventions.html#_indexed_ragged_array_representation |
I added the "compression by gathering" scheme to cf-xarray. |
Hi everyone, first of all, thanks for your amazing work! I came across this issue today because I have a dataset with multiple variables and multiple MultiIndex dimensions, some of which aren't used by some variables. I had to slightly adapt the workaround posted by @dcherian to get things to work. I'll post it here in case someone else finds the patch useful. I'm not sure if it would be a viable fix for the issue though; let me know if it is and I'll open a PR.

import numpy as np
import pandas as pd
import xarray as xr


def encode_multiindex(ds, idxname):
    encoded = ds.reset_index(idxname)
    coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
    for coord in coords:
        encoded[coord] = coords[coord].values
    shape = [encoded.sizes[coord] for coord in coords]
    encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
    encoded[idxname].attrs["compress"] = " ".join(ds.indexes[idxname].names)
    return encoded


def decode_to_multiindex(encoded, idxname):
    names = encoded[idxname].attrs["compress"].split(" ")
    shape = [encoded.sizes[dim] for dim in names]
    indices = np.unravel_index(encoded[idxname].values, shape)
    # Keep one array per level; stacking them into a single ndarray
    # would coerce levels of different dtypes (e.g. str and int) to one dtype.
    arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
    mindex = pd.MultiIndex.from_arrays(arrays, names=names)
    # carry over coordinates of dimensions unrelated to this MultiIndex
    decoded = xr.Dataset(
        {},
        dict(
            **{idxname: mindex},
            **{dim: encoded.coords[dim] for dim in encoded.dims if dim not in [idxname] + names}
        )
    )
    for varname in encoded.data_vars:
        if idxname in encoded[varname].dims:
            decoded[varname] = (encoded[varname].dims, encoded[varname].values)
        else:
            # variables that do not use the MultiIndex dimension pass through unchanged
            decoded[varname] = encoded[varname]
    return decoded
|
Thanks @lucianopaz. I fixed some errors when I added it to cf-xarray. It would be good to see if that version works for you. |
Should we close this? Now that there is a clear error message redirecting to either Once we drop the multi-index dimension coordinate, serialization by row level values should just work without explicitly resetting the index. |
I think so. Especially since we are moving away from special casing MultiIndex |
For the last remaining item on #719, do you (especially @shoyer and @benbovy ) have any thoughts about how the MultiIndex will be represented inside the netCDF4/HDF5 file? And how it will be recognized and reconstructed when read back in?
I am in the process of converting a lot of data with MultiIndexes into HDF5, which I will want to use xarray for reading later. So I would very much like to do it in a way that will be compatible once xarray has the MultiIndex to/from NetCDF built in. Thanks!