Don't set encoding attributes on bounds variables. #2965

dcherian · 2019-05-15T16:00:44Z

Here's a proposed fix for #2436 and #2921. Ping @spencerkclark @mathause @klindsay28

Removes certain attributes from bounds variables on encode.
open_mfdataset: Sets encoding on variables based on encoding in first file.

Closes #2921 (to_netcdf with decoded time can create file with inconsistent time:units and time_bounds:units)
Tests added
Fully documented, including whats-new.rst for all changes and api.rst for new API

pep8speaks · 2019-05-15T16:00:52Z

Hello @dcherian! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-24 21:51:22 UTC

mathause · 2019-05-15T16:19:04Z

Thanks for putting this together.

I think this can work as long as time has an encoding - however, if it doesn't we might end up with different units for time and time_bounds again. I am not sure, actually, just something to check.

dcherian · 2019-05-15T16:26:06Z

Yeah, this still doesn't work as @klindsay28 just pointed out to me.

I still don't understand how the units attribute/encoding is being treated. I'll get back to this soon.

spencerkclark

Thanks for taking this on @dcherian. I think this is on the right track. Do you have an explicit example of the edge case(s) you still have in mind?

xarray/conventions.py

dcherian · 2019-05-17T14:42:17Z

Thanks @spencerkclark, I'm going by the example in #2921. I've added that now as a test. I also test for the slight non-compliance case where time_bounds has different units already set.

@mathause would this test catch the issue you were encountering?

mathause · 2019-05-17T16:03:14Z

Great - this does indeed solve my problem (read multiple files with open_mfdataset and write as one). However, it still 'fails' (it is not CF-compliant, to be more precise) for a new dataset created in xarray, see:

import xarray as xr
import pandas as pd
time = pd.date_range('2000-01-16', periods=1)
time_bounds = pd.date_range('2000-01-01', periods=2, freq='MS')

ds = xr.Dataset(dict(time=time, time_bounds=time_bounds))
ds.time.attrs['bounds'] = 'time_bounds'

xr.conventions.cf_encoder(ds.variables, ds.attrs)

(compare the units). This might be difficult to solve, as xarray currently assumes all variables can be encoded independently.
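To make the mismatch concrete without depending on a particular xarray version, here is a minimal standard-library sketch. `infer_units` is a hypothetical, simplified stand-in for per-variable unit inference (it just uses the earliest date as the reference); it is not xarray's actual implementation.

```python
from datetime import datetime


def infer_units(dates):
    """Hypothetical stand-in: pick the earliest date as the reference date."""
    return f"days since {min(dates):%Y-%m-%d}"


# Same values as the example above: one time stamp, two monthly bounds.
time = [datetime(2000, 1, 16)]
time_bounds = [datetime(2000, 1, 1), datetime(2000, 2, 1)]

# Each variable is encoded in isolation, so each picks its own reference date.
units = {"time": infer_units(time), "time_bounds": infer_units(time_bounds)}
print(units)
# {'time': 'days since 2000-01-16', 'time_bounds': 'days since 2000-01-01'}
```

Because neither call sees the other variable, nothing forces the two reference dates to agree.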

spencerkclark · 2019-05-17T16:36:39Z

> This might be difficult to solve, as xarray currently assumes all variables can be encoded independently.

I was hoping someone wouldn't bring that case up :). But I agree I think it's something we should discuss.

I'm sort of torn on whether it is worth the complexity. Part of me feels like if a user is concerned/clever enough to explicitly create such linking attributes (e.g. 'bounds' pointing to 'time_bounds') in their Datasets that we could leave it as their responsibility to make sure the 'units' and 'calendar' encodings match for each of the variables as well. For example, while a bit verbose, this would work as desired:

import xarray as xr
import pandas as pd
time = pd.date_range('2000-01-16', periods=1)
time_bounds = pd.date_range('2000-01-01', periods=2, freq='MS')

ds = xr.Dataset(dict(time=time, time_bounds=time_bounds))
ds.time.encoding['bounds'] = 'time_bounds'
ds.time.encoding['units'] = 'days since 2000-01-01'
ds.time.encoding['calendar'] = 'proleptic_gregorian'

ds.time_bounds.encoding['units'] = ds.time.encoding['units']
ds.time_bounds.encoding['calendar'] = ds.time.encoding['calendar']

That said, that case is something that in principle we could address, and maybe it is worth thinking about if we ever consider increasing the functionality related to cell bounds coordinates (e.g. #1475).

mathause · 2019-05-17T16:51:00Z

xarray/conventions.py

+                if attr in new_vars[bounds].attrs and attr in var.attrs:
+                    if new_vars[bounds].attrs[attr] == var.attrs[attr]:
+                        new_vars[bounds].attrs.pop(attr)
+


Could we issue a warning here?

else:
    warnings.warn("The attribute 'units' is not the same in the variable "
                  "'time' and its associated bounds 'time_bnds', "
                  "which is not CF-compliant.")

or some such

dcherian · 2019-05-19

Do we need to? xarray allows writing CF-non-compliant files anyway...

spencerkclark · 2019-05-19

I think the warning you added is the perfect approach. It will still be issued in the case of @mathause's example, but will still allow a user to write a non-CF-compliant file without a warning if the encoding attributes do not need to be computed on the fly.
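For reference, the suggestion at the top of this thread (the diff hunk plus an `else` branch that warns) can be sketched on plain dicts. This is an illustration only, not the code that was merged: the function name `strip_duplicate_bounds_attrs` and the default attribute list are assumptions.

```python
import warnings


def strip_duplicate_bounds_attrs(var_attrs, bounds_attrs,
                                 names=("units", "calendar")):
    """Drop attrs on the bounds that repeat the parent's; warn if they differ.

    `names` is an assumed attribute list chosen for illustration.
    """
    for attr in names:
        if attr in bounds_attrs and attr in var_attrs:
            if bounds_attrs[attr] == var_attrs[attr]:
                # Identical value: redundant on the bounds variable.
                bounds_attrs.pop(attr)
            else:
                warnings.warn(
                    f"The attribute {attr!r} differs between a variable and "
                    "its bounds, which is not CF-compliant.",
                    UserWarning,
                )
    return bounds_attrs


time = {"units": "days since 2000-01-01", "calendar": "noleap"}
bounds = {"units": "days since 2000-01-01", "calendar": "gregorian"}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    strip_duplicate_bounds_attrs(time, bounds)
```

In this run the matching `units` entry is removed from `bounds`, while the conflicting `calendar` entry is kept and triggers one warning.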

mathause · 2019-05-17T16:52:05Z

Yes, I agree - I don't intend to actually create such a dataset by hand anytime soon ;) but it would be super hard to track this issue down - can we issue a warning instead? See my inline comment.

dcherian · 2019-05-17T17:16:19Z

This case might actually be really easy to fix since we already have xr.conventions._update_bounds_attributes() from Fabien's previous PR (see below).

I think we would just need to call this prior to calling encode_cf_variable() in

xarray/xarray/conventions.py

Line 622 in 612d390

new_vars = OrderedDict((k, encode_cf_variable(v, name=k))

def _update_bounds_attributes(variables):
    """Adds time attributes to time bounds variables.

    Variables handling time bounds ("Cell boundaries" in the CF
    conventions) do not necessarily carry the necessary attributes to be
    decoded. This copies the attributes from the time variable to the
    associated boundaries.

    See Also:

    http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/
         cf-conventions.html#cell-boundaries

    https://github.com/pydata/xarray/issues/2565
    """


    # For all time variables with bounds
    for v in variables.values():
        attrs = v.attrs
        has_date_units = 'units' in attrs and 'since' in attrs['units']
        if has_date_units and 'bounds' in attrs:
            if attrs['bounds'] in variables:
                bounds_attrs = variables[attrs['bounds']].attrs
                bounds_attrs.setdefault('units', attrs['units'])
                if 'calendar' in attrs:
                    bounds_attrs.setdefault('calendar', attrs['calendar'])

spencerkclark · 2019-05-17T17:53:25Z

> I think we would just need to call this prior to calling encode_cf_variable()

I'm afraid it's not quite that simple :).

In cases where someone creates a datetime-like variable in memory (e.g. with pd.date_range in @mathause's example), unless they explicitly add encoding attributes, 'units' and 'calendar' will need to be computed on the fly. This happens inside of encode_cf_variable, currently on a per-variable basis, so trying to make sure the attributes are equal before that step will not help, because they might not exist.
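A dict-based paraphrase of the `_update_bounds_attributes` helper quoted earlier shows why the pre-pass does not help here. The sketch assumes each variable is represented only by its attrs dict, and the name `update_bounds_attributes` is used for this illustration only.

```python
def update_bounds_attributes(variables):
    """Dict-based paraphrase: `variables` maps name -> attrs dict."""
    for attrs in variables.values():
        has_date_units = "since" in attrs.get("units", "")
        if has_date_units and attrs.get("bounds") in variables:
            bounds_attrs = variables[attrs["bounds"]]
            bounds_attrs.setdefault("units", attrs["units"])
            if "calendar" in attrs:
                bounds_attrs.setdefault("calendar", attrs["calendar"])


# Decoded from a file: units are already known, so they propagate.
from_file = {
    "time": {"units": "days since 2000-01-01", "bounds": "time_bounds"},
    "time_bounds": {},
}
update_bounds_attributes(from_file)

# Built in memory: 'units' does not exist yet, so there is nothing to copy.
in_memory = {"time": {"bounds": "time_bounds"}, "time_bounds": {}}
update_bounds_attributes(in_memory)
```

After both calls, `from_file["time_bounds"]` carries the parent's units, while `in_memory["time_bounds"]` is still empty, which is the case described above.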

dcherian · 2019-05-19T02:40:56Z

Thanks @spencerkclark & @mathause. I now understand the issue better. #2921 was confusing in that @klindsay28 was trying to write an encoded dataset.

I've updated the tests to use @mathause's example. The code now updates the time_bounds variable with the encoding 'units' and 'calendar' of the time variable. It also throws a warning when encoding.units is not specified for variables with a bounds attribute.

Do you think this is a good approach?
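A rough sketch of that approach on plain dicts follows. The layout (`name -> {"attrs": ..., "encoding": ...}`) and the name `propagate_time_encoding` are hypothetical, the check that the variable is datetime-like is omitted, and using `setdefault` (so explicitly set bounds encodings are not overwritten) is an assumption rather than something stated above.

```python
import warnings


def propagate_time_encoding(variables):
    """Copy 'units'/'calendar' encoding from a variable to its bounds.

    `variables` maps name -> {"attrs": {...}, "encoding": {...}}
    (a hypothetical layout used only for this sketch).
    """
    for name, var in variables.items():
        bounds_name = var["attrs"].get("bounds")
        if bounds_name not in variables:
            continue
        if "units" not in var["encoding"]:
            warnings.warn(
                f"Variable {name!r} has bounds {bounds_name!r} but no "
                "encoding['units']; the two may get different units on disk.",
                UserWarning,
            )
        bounds_encoding = variables[bounds_name]["encoding"]
        for key in ("units", "calendar"):
            if key in var["encoding"]:
                # Assumption: keep any value already set on the bounds.
                bounds_encoding.setdefault(key, var["encoding"][key])


ds = {
    "time": {
        "attrs": {"bounds": "time_bounds"},
        "encoding": {"units": "days since 2000-01-01",
                     "calendar": "proleptic_gregorian"},
    },
    "time_bounds": {"attrs": {}, "encoding": {}},
}
propagate_time_encoding(ds)
```

With the encoding set on `time`, both entries end up on `time_bounds`; with an empty encoding on `time`, the function only warns and copies nothing.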

spencerkclark

Thanks @dcherian -- @mathause's suggestion of adding a warning was helpful, and I think your addition to potentially propagate the 'units' and 'calendar' encoding parameters from the root to the bounds coordinate is also good. I just have a few more comments and a question.

spencerkclark · 2019-05-19T12:22:30Z

xarray/conventions.py

+
+    # For all time variables with bounds
+    for v in variables.values():
+        attrs = v.attrs


A general question -- would we consider 'bounds' to be an encoding parameter (like 'units' or 'calendar')? In other words should we expect it to be in the encoding dictionary or attrs dictionary at this stage? I feel like it may be more intuitive as part of encoding, but currently I know that we don't treat it that way when decoding files.

dcherian · 2019-05-19

In my mental model, encoding attributes are those that control on-disk representation of the data. bounds counts as an attr to my mind since it's an attribute that links the variable to another variable.

A definition or list of what goes in encoding and what goes in attrs would make a good addition to the docs.

spencerkclark · 2019-05-19

> In my mental model, encoding attributes are those that control on-disk representation of the data.

I think this is fair; I guess I was going off of the mental model of encoding parameters defined as "attributes that are potentially required for decoding all the variables in a file," in which case 'bounds' could qualify. I think your definition is probably cleaner, because it requires that encoding parameters control how the variable they are attached to is represented on disk (as opposed to another variable).

xarray/tests/test_coding_times.py

xarray/backends/api.py

* master: (31 commits)
  Add quantile method to GroupBy (pydata#2828)
  rolling_exp (nee ewm) (pydata#2650)
  Ensure explicitly indexed arrays are preserved (pydata#3027)
  add back dask-dev tests (pydata#3025)
  ENH: keepdims=True for xarray reductions (pydata#3033)
  Revert cmap fix (pydata#3038)
  Add "errors" keyword argument to drop() and drop_dims() (pydata#2994) (pydata#3028)
  More consistency checks (pydata#2859)
  Check types in travis (pydata#3024)
  Update issue templates (pydata#3019)
  Add pytest markers to avoid warnings (pydata#3023)
  Feature/merge errormsg (pydata#2971)
  More support for missing_value. (pydata#2973)
  Use flake8 rather than pycodestyle (pydata#3010)
  Pandas labels deprecation (pydata#3016)
  Pytest capture uses match, not message (pydata#3011)
  dask-dev tests to allowed failures in travis (pydata#3014)
  Fix 'to_masked_array' computing dask arrays twice (pydata#3006)
  str accessor (pydata#2991)
  fix safe_cast_to_index (pydata#3001)
  ...

Issue pydata#2921 is about mismatching time units between a time variable and its "bounds" companion. However, pydata#2965 does more than fix pydata#2921: it removes all duplicated attributes from "bounds" variables, which has the undesired side effect that there is currently no way to save them to netcdf with xarray. Since the mentioned link is a recommendation and not a hard requirement for CF compliance, it should be left to the caller to prepare the dataset variables appropriately if required. This reduces the surprise of attributes not being written to disk and fixes pydata#8368.

Don't set attributes on bounds variables.

ff2fd49

Fixes pydata#2921 1. Removes certain attributes from bounds variables on encode. 2. open_mfdataset: Sets encoding on variables based on encoding in first file.

dcherian added 2 commits May 15, 2019 10:01

remove whitespace stuff.

302ab63

Make sure variable exists in first file before assigning encoding

931f973

dcherian added 3 commits May 15, 2019 13:02

Make sure we iterate over coords too.

5526fe4

lint fix.

3889ba6

docs/comment fixes.

6f2bc05

spencerkclark reviewed May 17, 2019

xarray/conventions.py (outdated)

dcherian added 2 commits May 17, 2019 08:18

mfdataset encoding test.

b903e89

time_bounds attrs test + allow for slight CF non-compliance.

70c8c5c

dcherian changed the title ~~[WIP] Don't set attributes on bounds variables.~~ Don't set attributes on bounds variables. May 17, 2019

mathause reviewed May 17, 2019

I need to deal with encoding!

d637e9e

dcherian changed the title ~~Don't set attributes on bounds variables.~~ [WIP] Don't set encoding attributes on bounds variables. May 19, 2019

dcherian added 2 commits May 18, 2019 20:34

minor fixes.

2f1dd25

another minor fix.

12f3e55

spencerkclark reviewed May 19, 2019

dcherian added 2 commits May 19, 2019 14:13

review fixes.

f8789e7

lint fixes.

e0c49a4

dcherian changed the title ~~[WIP] Don't set encoding attributes on bounds variables.~~ Don't set encoding attributes on bounds variables. May 20, 2019

dcherian mentioned this pull request May 21, 2019

0.12.2 release #2977

Closed

15 tasks

Merge branch 'master' into fix/bounds_encode_2

674e5a5

shoyer reviewed Jun 23, 2019

xarray/backends/api.py (outdated)

dcherian and others added 4 commits June 24, 2019 17:47

Remove encoding changes and xfail test.

34d0e60

Merge branch 'fix/bounds_encode_2' of github.com:dcherian/xarray into…

b1dcf1d

… fix/bounds_encode_2 * 'fix/bounds_encode_2' of github.com:dcherian/xarray:

Update whats-new.rst

c63cf33

dcherian merged commit 76adf13 into pydata:master Jun 25, 2019

dcherian deleted the fix/bounds_encode_2 branch June 25, 2019 00:24

st-bender mentioned this pull request Apr 9, 2024

to_netcdf: Unexpected drop of "units" attribute of attached "bounds" #8368

Open

5 tasks

st-bender mentioned this pull request Apr 10, 2024

Keep attributes for "bounds" variables #8924

Open

3 tasks
