Commit: Merge remote-tracking branch 'upstream/master' into groupby-plot
* upstream/master:
  allow passing any iterable to drop when dropping variables (pydata#3693)
  Typo on DataSet/DataArray.to_dict documentation (pydata#3692)
  Fix mypy type checking tests failure in ds.merge (pydata#3690)
  Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688)
  ds.merge(da) bugfix (pydata#3677)
  fix docstring for combine_first: returns a Dataset (pydata#3683)
  Add option to choose mfdataset attributes source. (pydata#3498)
  How do I add a new variable to dataset. (pydata#3679)
  Add map_blocks example to whats-new (pydata#3682)
  Make dask names change when chunking Variables by different amounts. (pydata#3584)
  raise an error when renaming dimensions to existing names (pydata#3645)
  Support swap_dims to dimension names that are not existing variables (pydata#3636)
  Add map_blocks example to docs. (pydata#3667)
  add multiindex level name checking to .rename() (pydata#3658)
dcherian committed Jan 14, 2020
2 parents cb78770 + e0fd480 commit a834dde
Showing 14 changed files with 240 additions and 30 deletions.
2 changes: 2 additions & 0 deletions doc/data-structures.rst
@@ -353,6 +353,8 @@ setting) variables and attributes:
This is particularly useful in an exploratory context, because you can
tab-complete these variable names with tools like IPython.

.. _dictionary_like_methods:

Dictionary like methods
~~~~~~~~~~~~~~~~~~~~~~~

3 changes: 3 additions & 0 deletions doc/howdoi.rst
@@ -11,6 +11,8 @@ How do I ...

* - How do I...
- Solution
* - add a DataArray to my dataset as a new variable
- ``my_dataset[varname] = my_dataArray`` or :py:meth:`Dataset.assign` (see also :ref:`dictionary_like_methods`)
* - add variables from other datasets to my dataset
- :py:meth:`Dataset.merge`
* - add a new dimension and/or coordinate
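For illustration, a minimal sketch of the first new entry above; the dataset and variable names are made up:

```python
import xarray as xr

ds = xr.Dataset()
da = xr.DataArray([1, 2, 3], dims="x")
ds["my_var"] = da                 # dictionary-style assignment
ds = ds.assign(my_other_var=da)   # equivalent functional form
```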
@@ -57,3 +59,4 @@ How do I ...
- ``obj.dt.ceil``, ``obj.dt.floor``, ``obj.dt.round``. See :ref:`dt_accessor` for more.
* - make a mask that is ``True`` where an object contains any of the values in a array
- :py:meth:`Dataset.isin`, :py:meth:`DataArray.isin`

22 changes: 22 additions & 0 deletions doc/whats-new.rst
@@ -37,13 +37,20 @@ New Features
- Added the ``count`` reduction method to both :py:class:`~core.rolling.DatasetCoarsen`
and :py:class:`~core.rolling.DataArrayCoarsen` objects. (:pull:`3500`)
By `Deepak Cherian <https://github.com/dcherian>`_
- Add `attrs_file` option in :py:func:`~xarray.open_mfdataset` to choose the
source file for global attributes in a multi-file dataset (:issue:`2382`,
  :pull:`3498`) by `Julien Seguinot <https://github.com/juseg>`_.
- :py:meth:`Dataset.swap_dims` and :py:meth:`DataArray.swap_dims`
now allow swapping to dimension names that don't exist yet. (:pull:`3636`)
By `Justus Magin <https://github.com/keewis>`_.
- Extend :py:class:`core.accessor_dt.DatetimeAccessor` properties
and support `.dt` accessor for timedelta
via :py:class:`core.accessor_dt.TimedeltaAccessor` (:pull:`3612`)
By `Anderson Banihirwe <https://github.com/andersy005>`_.

Bug fixes
~~~~~~~~~

- Fix :py:meth:`xarray.combine_by_coords` to allow for combining incomplete
hypercubes of Datasets (:issue:`3648`). By `Ian Bolliger
<https://github.com/bolliger32>`_.
@@ -62,6 +69,16 @@ Bug fixes
By `Tom Augspurger <https://github.com/TomAugspurger>`_.
- Ensure :py:meth:`Dataset.quantile`, :py:meth:`DataArray.quantile` issue the correct error
when ``q`` is out of bounds (:issue:`3634`) by `Mathias Hauser <https://github.com/mathause>`_.
- Raise an error when trying to use :py:meth:`Dataset.rename_dims` to
rename to an existing name (:issue:`3438`, :pull:`3645`)
By `Justus Magin <https://github.com/keewis>`_.
- :py:meth:`Dataset.rename`, :py:meth:`DataArray.rename` now check for conflicts with
MultiIndex level names.
- :py:meth:`Dataset.merge` no longer fails when passed a ``DataArray`` instead of a ``Dataset`` object.
By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Fix a regression in :py:meth:`Dataset.drop`: allow passing any
iterable when dropping variables (:issue:`3552`, :pull:`3693`)
By `Justus Magin <https://github.com/keewis>`_.

Documentation
~~~~~~~~~~~~~
@@ -80,9 +97,14 @@ Documentation
- Added examples for :py:meth:`DataArray.quantile`, :py:meth:`Dataset.quantile` and
``GroupBy.quantile``. (:pull:`3576`)
By `Justus Magin <https://github.com/keewis>`_.
- Added example for :py:func:`~xarray.map_blocks`. (:pull:`3667`)
By `Riley X. Brady <https://github.com/bradyrx>`_.

Internal Changes
~~~~~~~~~~~~~~~~
- Make sure dask names change when rechunking by different chunk sizes. Conversely, make sure they
stay the same when rechunking by the same chunk size. (:issue:`3350`)
By `Deepak Cherian <https://github.com/dcherian>`_.
- 2x to 5x speed boost (on small arrays) for :py:meth:`Dataset.isel`,
:py:meth:`DataArray.isel`, and :py:meth:`DataArray.__getitem__` when indexing by int,
slice, list of int, scalar ndarray, or 1-dimensional ndarray.
19 changes: 16 additions & 3 deletions xarray/backends/api.py
@@ -718,6 +718,7 @@ def open_mfdataset(
autoclose=None,
parallel=False,
join="outer",
attrs_file=None,
**kwargs,
):
"""Open multiple files as a single dataset.
@@ -729,8 +730,8 @@
``combine_by_coords`` and ``combine_nested``. By default the old (now deprecated)
``auto_combine`` will be used, please specify either ``combine='by_coords'`` or
``combine='nested'`` in future. Requires dask to be installed. See documentation for
details on dask [1]_. Attributes from the first dataset file are used for the
combined dataset.
details on dask [1]_. Global attributes from the ``attrs_file`` are used
for the combined dataset.
Parameters
----------
@@ -827,6 +828,10 @@
- 'override': if indexes are of same size, rewrite indexes to be
those of the first object with that dimension. Indexes for the same
dimension must have the same size in all objects.
attrs_file : str or pathlib.Path, optional
Path of the file used to read global attributes from.
By default global attributes are read from the first file provided,
with wildcard matches sorted by filename.
**kwargs : optional
Additional arguments passed on to :py:func:`xarray.open_dataset`.
@@ -961,7 +966,15 @@ def open_mfdataset(
raise

combined._file_obj = _MultiFileCloser(file_objs)
combined.attrs = datasets[0].attrs

# read global attributes from the attrs_file or from the first dataset
if attrs_file is not None:
if isinstance(attrs_file, Path):
attrs_file = str(attrs_file)
combined.attrs = datasets[paths.index(attrs_file)].attrs
else:
combined.attrs = datasets[0].attrs

return combined
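For illustration, a minimal usage sketch of the new ``attrs_file`` option; file names are hypothetical, and ``attrs_file`` must be one of the paths passed in, since it is looked up with ``paths.index(attrs_file)``:

```python
import xarray as xr

# Take global attributes from the second file instead of the first.
ds = xr.open_mfdataset(
    ["obs_2000.nc", "obs_2001.nc"],
    combine="by_coords",
    attrs_file="obs_2001.nc",
)
```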


12 changes: 9 additions & 3 deletions xarray/core/dataarray.py
@@ -1480,8 +1480,7 @@ def swap_dims(self, dims_dict: Mapping[Hashable, Hashable]) -> "DataArray":
----------
dims_dict : dict-like
Dictionary whose keys are current dimension names and whose values
are new names. Each value must already be a coordinate on this
array.
are new names.
Returns
-------
@@ -1504,6 +1503,13 @@ def swap_dims(self, dims_dict: Mapping[Hashable, Hashable]) -> "DataArray":
Coordinates:
x (y) <U1 'a' 'b'
* y (y) int64 0 1
>>> arr.swap_dims({"x": "z"})
<xarray.DataArray (z: 2)>
array([0, 1])
Coordinates:
x (z) <U1 'a' 'b'
y (z) int64 0 1
Dimensions without coordinates: z
See Also
--------
@@ -2362,7 +2368,7 @@ def to_dict(self, data: bool = True) -> dict:
naming conventions.
Converts all variables and attributes to native Python objects.
Useful for coverting to json. To avoid datetime incompatibility
Useful for converting to json. To avoid datetime incompatibility
use decode_times=False kwarg in xarrray.open_dataset.
Parameters
51 changes: 38 additions & 13 deletions xarray/core/dataset.py
@@ -85,11 +85,16 @@
either_dict_or_kwargs,
hashable,
is_dict_like,
is_list_like,
is_scalar,
maybe_wrap_array,
)
from .variable import IndexVariable, Variable, as_variable, broadcast_variables
from .variable import (
IndexVariable,
Variable,
as_variable,
broadcast_variables,
assert_unique_multiindex_level_names,
)

if TYPE_CHECKING:
from ..backends import AbstractDataStore, ZarrStore
@@ -1748,7 +1753,10 @@ def maybe_chunk(name, var, chunks):
if not chunks:
chunks = None
if var.ndim > 0:
token2 = tokenize(name, token if token else var._data)
# when rechunking by different amounts, make sure dask names change
# by providing chunks as an input to tokenize.
# subtle bugs result otherwise. see GH3350
token2 = tokenize(name, token if token else var._data, chunks)
name2 = f"{name_prefix}{name}-{token2}"
return var.chunk(chunks, name=name2, lock=lock)
else:
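For illustration, a sketch of the guarantee this change provides, assuming dask is installed:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", np.arange(6))})
# Same chunk size -> same dask name; different chunk size -> different name.
assert ds.chunk(3)["a"].data.name == ds.chunk(3)["a"].data.name
assert ds.chunk(3)["a"].data.name != ds.chunk(2)["a"].data.name
```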
@@ -2780,6 +2788,7 @@ def rename(
variables, coord_names, dims, indexes = self._rename_all(
name_dict=name_dict, dims_dict=name_dict
)
assert_unique_multiindex_level_names(variables)
return self._replace(variables, coord_names, dims=dims, indexes=indexes)

def rename_dims(
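For illustration, a sketch of the conflict the ``assert_unique_multiindex_level_names`` call above now catches; the dataset is made up:

```python
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([[0, 1], ["a", "b"]], names=("one", "two"))
ds = xr.Dataset({"v": ("x", range(4))}, coords={"x": midx})
# Renaming "v" onto an existing MultiIndex level name now raises a
# ValueError instead of silently creating conflicting names.
ds.rename({"v": "one"})
```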
@@ -2791,7 +2800,8 @@ def rename_dims(
----------
dims_dict : dict-like, optional
Dictionary whose keys are current dimension names and
whose values are the desired names.
whose values are the desired names. The desired names must
not be the name of an existing dimension or Variable in the Dataset.
**dims, optional
Keyword form of ``dims_dict``.
One of dims_dict or dims must be provided.
@@ -2809,12 +2819,17 @@
DataArray.rename
"""
dims_dict = either_dict_or_kwargs(dims_dict, dims, "rename_dims")
for k in dims_dict:
for k, v in dims_dict.items():
if k not in self.dims:
raise ValueError(
"cannot rename %r because it is not a "
"dimension in this dataset" % k
)
if v in self.dims or v in self:
raise ValueError(
f"Cannot rename {k} to {v} because {v} already exists. "
"Try using swap_dims instead."
)

variables, coord_names, sizes, indexes = self._rename_all(
name_dict={}, dims_dict=dims_dict
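For illustration, a sketch of the new error path; the dataset is made up:

```python
import xarray as xr

ds = xr.Dataset({"a": ("x", [1, 2])}, coords={"y": ("x", [0, 1])})
# "y" already exists, so this raises a ValueError suggesting swap_dims.
ds.rename_dims({"x": "y"})
```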
@@ -2868,8 +2883,7 @@ def swap_dims(
----------
dims_dict : dict-like
Dictionary whose keys are current dimension names and whose values
are new names. Each value must already be a variable in the
dataset.
are new names.
Returns
-------
@@ -2898,6 +2912,16 @@
Data variables:
a (y) int64 5 7
b (y) float64 0.1 2.4
>>> ds.swap_dims({"x": "z"})
<xarray.Dataset>
Dimensions: (z: 2)
Coordinates:
x (z) <U1 'a' 'b'
y (z) int64 0 1
Dimensions without coordinates: z
Data variables:
a (z) int64 5 7
b (z) float64 0.1 2.4
See Also
--------
@@ -2914,7 +2938,7 @@ def swap_dims(
"cannot swap from dimension %r because it is "
"not an existing dimension" % k
)
if self.variables[v].dims != (k,):
if v in self.variables and self.variables[v].dims != (k,):
raise ValueError(
"replacement dimension %r is not a 1D "
"variable along the old dimension %r" % (v, k)
@@ -2923,7 +2947,7 @@
result_dims = {dims_dict.get(dim, dim) for dim in self.dims}

coord_names = self._coord_names.copy()
coord_names.update(dims_dict.values())
coord_names.update({dim for dim in dims_dict.values() if dim in self.variables})

variables: Dict[Hashable, Variable] = {}
indexes: Dict[Hashable, pd.Index] = {}
@@ -3525,7 +3549,7 @@ def update(self, other: "CoercibleMapping", inplace: bool = None) -> "Dataset":

def merge(
self,
other: "CoercibleMapping",
other: Union["CoercibleMapping", "DataArray"],
inplace: bool = None,
overwrite_vars: Union[Hashable, Iterable[Hashable]] = frozenset(),
compat: str = "no_conflicts",
@@ -3582,6 +3606,7 @@ def merge(
If any variables conflict (see ``compat``).
"""
_check_inplace(inplace)
other = other.to_dataset() if isinstance(other, xr.DataArray) else other
merge_result = dataset_merge_method(
self,
other,
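For illustration, a sketch of the fixed call; note the DataArray must be named so it can be promoted to a Dataset:

```python
import xarray as xr

ds = xr.Dataset({"a": ("x", [1, 2, 3])})
da = xr.DataArray([10, 20, 30], dims="x", name="b")
merged = ds.merge(da)  # previously raised an error
print(list(merged.data_vars))  # ['a', 'b']
```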
@@ -3664,7 +3689,7 @@ def drop(self, labels=None, dim=None, *, errors="raise", **labels_kwargs):
raise ValueError("cannot specify dim and dict-like arguments.")
labels = either_dict_or_kwargs(labels, labels_kwargs, "drop")

if dim is None and (is_list_like(labels) or is_scalar(labels)):
if dim is None and (is_scalar(labels) or isinstance(labels, Iterable)):
warnings.warn(
"dropping variables using `drop` will be deprecated; using drop_vars is encouraged.",
PendingDeprecationWarning,
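For illustration, a sketch of the restored behavior; ``drop`` still emits the ``PendingDeprecationWarning`` shown above in favor of ``drop_vars``:

```python
import xarray as xr

ds = xr.Dataset({"a": 0, "b": 1})
# Any iterable of names works again -- e.g. a dict's KeysView,
# which is neither a list nor a scalar.
dropped = ds.drop({"a": None}.keys())
print(list(dropped.data_vars))  # ['b']
```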
@@ -4127,7 +4152,7 @@ def combine_first(self, other: "Dataset") -> "Dataset":
Returns
-------
DataArray
Dataset
"""
out = ops.fillna(self, other, join="outer", dataset_join="outer")
return out
@@ -4641,7 +4666,7 @@ def to_dict(self, data=True):
conventions.
Converts all variables and attributes to native Python objects
Useful for coverting to json. To avoid datetime incompatibility
Useful for converting to json. To avoid datetime incompatibility
use decode_times=False kwarg in xarrray.open_dataset.
Parameters
42 changes: 42 additions & 0 deletions xarray/core/parallel.py
@@ -154,6 +154,48 @@ def map_blocks(
--------
dask.array.map_blocks, xarray.apply_ufunc, xarray.Dataset.map_blocks,
xarray.DataArray.map_blocks
Examples
--------
Calculate an anomaly from climatology using ``.groupby()``. Using
``xr.map_blocks()`` allows for parallel operations with knowledge of ``xarray``,
its indices, and its methods like ``.groupby()``.
>>> def calculate_anomaly(da, groupby_type='time.month'):
... # Necessary workaround to xarray's check with zero dimensions
... # https://github.com/pydata/xarray/issues/3575
... if sum(da.shape) == 0:
... return da
... gb = da.groupby(groupby_type)
... clim = gb.mean(dim='time')
... return gb - clim
>>> time = xr.cftime_range('1990-01', '1992-01', freq='M')
>>> np.random.seed(123)
>>> array = xr.DataArray(np.random.rand(len(time)),
... dims="time", coords=[time]).chunk()
>>> xr.map_blocks(calculate_anomaly, array).compute()
<xarray.DataArray (time: 24)>
array([ 0.12894847, 0.11323072, -0.0855964 , -0.09334032, 0.26848862,
0.12382735, 0.22460641, 0.07650108, -0.07673453, -0.22865714,
-0.19063865, 0.0590131 , -0.12894847, -0.11323072, 0.0855964 ,
0.09334032, -0.26848862, -0.12382735, -0.22460641, -0.07650108,
0.07673453, 0.22865714, 0.19063865, -0.0590131 ])
Coordinates:
* time (time) object 1990-01-31 00:00:00 ... 1991-12-31 00:00:00
Note that one must explicitly use ``args=[]`` and ``kwargs={}`` to pass arguments
to the function being applied in ``xr.map_blocks()``:
>>> xr.map_blocks(calculate_anomaly, array, kwargs={'groupby_type': 'time.year'})
<xarray.DataArray (time: 24)>
array([ 0.15361741, -0.25671244, -0.31600032, 0.008463 , 0.1766172 ,
-0.11974531, 0.43791243, 0.14197797, -0.06191987, -0.15073425,
-0.19967375, 0.18619794, -0.05100474, -0.42989909, -0.09153273,
0.24841842, -0.30708526, -0.31412523, 0.04197439, 0.0422506 ,
0.14482397, 0.35985481, 0.23487834, 0.12144652])
Coordinates:
* time (time) object 1990-01-31 00:00:00 ... 1991-12-31 00:00:00
"""

def _wrapper(func, obj, to_array, args, kwargs):
7 changes: 6 additions & 1 deletion xarray/core/utils.py
@@ -547,7 +547,12 @@ def __eq__(self, other) -> bool:
return False

def __hash__(self) -> int:
return hash((ReprObject, self._value))
return hash((type(self), self._value))

def __dask_tokenize__(self):
from dask.base import normalize_token

return normalize_token((type(self), self._value))


@contextlib.contextmanager
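For illustration, a sketch of what ``__dask_tokenize__`` buys, assuming dask is installed:

```python
from dask.base import tokenize

from xarray.core.utils import ReprObject

# Equal ReprObjects now produce identical, deterministic tokens, so
# dask graph keys derived from them are reproducible.
assert tokenize(ReprObject("<obj>")) == tokenize(ReprObject("<obj>"))
```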