Merge branch 'master' into map_blocks_2

* master:
  Fix whats-new date :/
  Revert to dev version
  Release v0.13.0
  auto_combine deprecation to 0.14 (pydata#3314)
  Deprecation: groupby, resample default dim. (pydata#3313)
  Raise error if cmap is list of colors (pydata#3310)
  Refactor concat to use merge for non-concatenated variables (pydata#3239)
  Honor `keep_attrs` in DataArray.quantile (pydata#3305)
  Fix DataArray api doc (pydata#3309)
  Accept int value in head, thin and tail (pydata#3298)
  ignore h5py 2.10.0 warnings and fix invalid_netcdf warning test. (pydata#3301)
  Update why-xarray.rst with clearer expression (pydata#3307)
  Compat and encoding deprecation to 0.14 (pydata#3294)
  Remove deprecated concat kwargs. (pydata#3288)
  allow np-array levels and colors in 2D plots (pydata#3295)
  Remove some deprecations (pydata#3292)
  Make argmin/max work lazy with dask (pydata#3244)
  Add head, tail and thin methods (pydata#3278)
  Updater to testing environment name (pydata#3253)
dcherian · Sep 19, 2019 · 599b70a · 599b70a
2 parents d0797f6 + 02e9661
commit 599b70a

Showing 28 changed files with 928 additions and 537 deletions.
diff --git a/doc/api.rst b/doc/api.rst
@@ -118,6 +118,9 @@ Indexing
    Dataset.loc
    Dataset.isel
    Dataset.sel
+   Dataset.head
+   Dataset.tail
+   Dataset.thin
    Dataset.squeeze
    Dataset.interp
    Dataset.interp_like
@@ -280,6 +283,9 @@ Indexing
    DataArray.loc
    DataArray.isel
    DataArray.sel
+   DataArray.head
+   DataArray.tail
+   DataArray.thin
    DataArray.squeeze
    DataArray.interp
    DataArray.interp_like
@@ -605,6 +611,7 @@ Plotting
 
    Dataset.plot
    DataArray.plot
+   Dataset.plot.scatter
    plot.plot
    plot.contourf
    plot.contour
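
A minimal sketch of the three indexing methods added to the API listing above, assuming a small in-memory `DataArray` with a single dimension named `x` (the array and the dimension name are illustrative and do not come from the commit). Each call is checked against the positional slice it is understood to correspond to.

```python
import numpy as np
import xarray as xr

# Illustrative 1-D array; the dimension name "x" is an assumption.
da = xr.DataArray(np.arange(10), dims="x", coords={"x": np.arange(10)})

first = da.head(x=3)        # first three elements along x
last = da.tail(x=3)         # last three elements along x
every_third = da.thin(x=3)  # every third element along x

# The same selections written as positional slices with isel.
assert first.identical(da.isel(x=slice(3)))
assert last.identical(da.isel(x=slice(-3, None)))
assert every_third.identical(da.isel(x=slice(None, None, 3)))
```

The keyword form (`x=3`) is used here because it names the dimension explicitly; the same methods are listed for `Dataset`.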

diff --git a/doc/dask.rst b/doc/dask.rst
@@ -75,13 +75,14 @@ entirely equivalent to opening a dataset using ``open_dataset`` and then
 chunking the data using the ``chunk`` method, e.g.,
 ``xr.open_dataset('example-data.nc').chunk({'time': 10})``.
 
-To open multiple files simultaneously, use :py:func:`~xarray.open_mfdataset`::
+To open multiple files simultaneously in parallel using Dask delayed,
+use :py:func:`~xarray.open_mfdataset`::
 
-    xr.open_mfdataset('my/files/*.nc')
+    xr.open_mfdataset('my/files/*.nc', parallel=True)
 
 This function will automatically concatenate and merge dataset into one in
 the simple cases that it understands (see :py:func:`~xarray.auto_combine`
-for the full disclaimer). By default, ``open_mfdataset`` will chunk each
+for the full disclaimer). By default, :py:func:`~xarray.open_mfdataset` will chunk each
 netCDF file into a single Dask array; again, supply the ``chunks`` argument to
 control the size of the resulting Dask arrays. In more complex cases, you can
 open each file individually using ``open_dataset`` and merge the result, as

diff --git a/doc/io.rst b/doc/io.rst
@@ -99,7 +99,9 @@ netCDF
 The recommended way to store xarray data structures is `netCDF`__, which
 is a binary file format for self-described datasets that originated
 in the geosciences. xarray is based on the netCDF data model, so netCDF files
-on disk directly correspond to :py:class:`~xarray.Dataset` objects.
+on disk directly correspond to :py:class:`~xarray.Dataset` objects (more accurately,
+a group in a netCDF file directly corresponds to a :py:class:`~xarray.Dataset` object.
+See :ref:`io.netcdf_groups` for more.)
 
 NetCDF is supported on almost all platforms, and parsers exist
 for the vast majority of scientific programming languages. Recent versions of
@@ -121,7 +123,7 @@ read/write netCDF V4 files and use the compression options described below).
 __ https://github.com/Unidata/netcdf4-python
 
 We can save a Dataset to disk using the
-:py:attr:`Dataset.to_netcdf <xarray.Dataset.to_netcdf>` method:
+:py:meth:`~Dataset.to_netcdf` method:
 
 .. ipython:: python
 
@@ -147,19 +149,6 @@ convert the ``DataArray`` to a ``Dataset`` before saving, and then convert back
 when loading, ensuring that the ``DataArray`` that is loaded is always exactly
 the same as the one that was saved.
 
-NetCDF groups are not supported as part of the
-:py:class:`~xarray.Dataset` data model.  Instead, groups can be loaded
-individually as Dataset objects.
-To do so, pass a ``group`` keyword argument to the
-``open_dataset`` function. The group can be specified as a path-like
-string, e.g., to access subgroup 'bar' within group 'foo' pass
-'/foo/bar' as the ``group`` argument.
-In a similar way, the ``group`` keyword argument can be given to the
-:py:meth:`~xarray.Dataset.to_netcdf` method to write to a group
-in a netCDF file.
-When writing multiple groups in one file, pass ``mode='a'`` to ``to_netcdf``
-to ensure that each call does not delete the file.
-
 Data is always loaded lazily from netCDF files. You can manipulate, slice and subset
 Dataset and DataArray objects, and no array values are loaded into memory until
 you try to perform some sort of actual computation. For an example of how these
@@ -195,6 +184,24 @@ It is possible to append or overwrite netCDF variables using the ``mode='a'``
 argument. When using this option, all variables in the dataset will be written
 to the original netCDF file, regardless if they exist in the original dataset.
 
+
+.. _io.netcdf_groups:
+
+Groups
+~~~~~~
+
+NetCDF groups are not supported as part of the :py:class:`~xarray.Dataset` data model.
+Instead, groups can be loaded individually as Dataset objects.
+To do so, pass a ``group`` keyword argument to the
+:py:func:`~xarray.open_dataset` function. The group can be specified as a path-like
+string, e.g., to access subgroup ``'bar'`` within group ``'foo'`` pass
+``'/foo/bar'`` as the ``group`` argument.
+In a similar way, the ``group`` keyword argument can be given to the
+:py:meth:`~xarray.Dataset.to_netcdf` method to write to a group
+in a netCDF file.
+When writing multiple groups in one file, pass ``mode='a'`` to
+:py:meth:`~xarray.Dataset.to_netcdf` to ensure that each call does not delete the file.
+
 .. _io.encoding:
 
 Reading encoded data
@@ -203,7 +210,7 @@ Reading encoded data
 NetCDF files follow some conventions for encoding datetime arrays (as numbers
 with a "units" attribute) and for packing and unpacking data (as
 described by the "scale_factor" and "add_offset" attributes). If the argument
-``decode_cf=True`` (default) is given to ``open_dataset``, xarray will attempt
+``decode_cf=True`` (default) is given to :py:func:`~xarray.open_dataset`, xarray will attempt
 to automatically decode the values in the netCDF objects according to
 `CF conventions`_. Sometimes this will fail, for example, if a variable
 has an invalid "units" or "calendar" attribute. For these cases, you can
@@ -247,6 +254,130 @@ will remove encoding information.
     import os
     os.remove('saved_on_disk.nc')
 
+
+.. _combining multiple files:
+
+Reading multi-file datasets
+...........................
+
+NetCDF files are often encountered in collections, e.g., with different files
+corresponding to different model runs or one file per timestamp.
+xarray can straightforwardly combine such files into a single Dataset by making use of
+:py:func:`~xarray.concat`, :py:func:`~xarray.merge`, :py:func:`~xarray.combine_nested` and
+:py:func:`~xarray.combine_by_coords`. For details on the difference between these
+functions see :ref:`combining data`.
+
+Xarray includes support for manipulating datasets that don't fit into memory
+with dask_. If you have dask installed, you can open multiple files
+simultaneously in parallel using :py:func:`~xarray.open_mfdataset`::
+
+    xr.open_mfdataset('my/files/*.nc', parallel=True)
+
+This function automatically concatenates and merges multiple files into a
+single xarray dataset.
+It is the recommended way to open multiple files with xarray.
+For more details on parallel reading, see :ref:`combining.multi`, :ref:`dask.io` and a
+`blog post`_ by Stephan Hoyer.
+:py:func:`~xarray.open_mfdataset` takes many kwargs that allow you to
+control its behaviour (e.g., ``parallel``, ``combine``, ``compat``, ``join``, ``concat_dim``).
+See its docstring for more details.
+
+
+.. note::
+
+    A common use-case involves a dataset distributed across a large number of files with
+    each file containing a large number of variables. Commonly a few of these variables
+    need to be concatenated along a dimension (say ``"time"``), while the rest are equal
+    across the datasets (ignoring floating point differences). The following command
+    with suitable modifications (such as ``parallel=True``) works well with such datasets::
+
+         xr.open_mfdataset('my/files/*.nc', concat_dim="time",
+                           data_vars='minimal', coords='minimal', compat='override')
+
+    This command concatenates variables along the ``"time"`` dimension, but only those that
+    already contain the ``"time"`` dimension (``data_vars='minimal', coords='minimal'``).
+    Variables that lack the ``"time"`` dimension are taken from the first dataset
+    (``compat='override'``).
+
+
+.. _dask: http://dask.pydata.org
+.. _blog post: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/
+
+Sometimes multi-file datasets are not conveniently organized for easy use of :py:func:`~xarray.open_mfdataset`.
+One can use the ``preprocess`` argument to provide a function that takes a dataset
+and returns a modified Dataset.
+:py:func:`~xarray.open_mfdataset` will call ``preprocess`` on every dataset
+(corresponding to each file) prior to combining them.
+
+
+If :py:func:`~xarray.open_mfdataset` does not meet your needs, other approaches are possible.
+The general pattern for parallel reading of multiple files
+using dask, modifying those datasets and then combining into a single ``Dataset`` is::
+
+     def modify(ds):
+         # modify ds here
+         return ds
+
+
+     # this is basically what open_mfdataset does
+     open_kwargs = dict(decode_cf=True, decode_times=False)
+     open_tasks = [dask.delayed(xr.open_dataset)(f, **open_kwargs) for f in file_names]
+     tasks = [dask.delayed(modify)(task) for task in open_tasks]
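+     # dask.compute returns a 1-tuple here; take its only element, the list of xarray.Datasets
+     datasets = dask.compute(tasks)[0]
+     # concat_dim is required; "time" is an assumed dimension name. Or use some combination of concat, merge
+     combined = xr.combine_nested(datasets, concat_dim="time")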
+
+
+As an example, here's how we could approximate ``MFDataset`` from the netCDF4
+library::
+
+    from glob import glob
+    import xarray as xr
+
+    def read_netcdfs(files, dim):
+        # glob expands paths with * to a list of files, like the unix shell
+        paths = sorted(glob(files))
+        datasets = [xr.open_dataset(p) for p in paths]
+        combined = xr.concat(datasets, dim)
+        return combined
+
+    combined = read_netcdfs('/all/my/files/*.nc', dim='time')
+
+This function will work in many cases, but it's not very robust. First, it
+never closes files, which means it will fail once you need to load more than
+a few thousand files. Second, it assumes that you want all the data from each
+file and that it can all fit into memory. In many situations, you only need
+a small subset or an aggregated summary of the data from each file.
+
+Here's a slightly more sophisticated example of how to remedy these
+deficiencies::
+
+    def read_netcdfs(files, dim, transform_func=None):
+        def process_one_path(path):
+            # use a context manager, to ensure the file gets closed after use
+            with xr.open_dataset(path) as ds:
+                # transform_func should do some sort of selection or
+                # aggregation
+                if transform_func is not None:
+                    ds = transform_func(ds)
+                # load all data from the transformed dataset, to ensure we can
+                # use it after closing each original file
+                ds.load()
+                return ds
+
+        paths = sorted(glob(files))
+        datasets = [process_one_path(p) for p in paths]
+        combined = xr.concat(datasets, dim)
+        return combined
+
+    # here we suppose we only care about the combined mean of each file;
+    # you might also use indexing operations like .sel to subset datasets
+    combined = read_netcdfs('/all/my/files/*.nc', dim='time',
+                            transform_func=lambda ds: ds.mean())
+
+This pattern works well and is very robust. We've used similar code to process
+tens of thousands of files constituting 100s of GB of data.
+
+
 .. _io.netcdf.writing_encoded:
 
 Writing encoded data
@@ -817,84 +948,3 @@ For CSV files, one might also consider `xarray_extras`_.
 .. _xarray_extras: https://xarray-extras.readthedocs.io/en/latest/api/csv.html
 
 .. _IO tools: http://pandas.pydata.org/pandas-docs/stable/io.html
-
-
-.. _combining multiple files:
-
-
-Combining multiple files
-------------------------
-
-NetCDF files are often encountered in collections, e.g., with different files
-corresponding to different model runs. xarray can straightforwardly combine such
-files into a single Dataset by making use of :py:func:`~xarray.concat`,
-:py:func:`~xarray.merge`, :py:func:`~xarray.combine_nested` and
-:py:func:`~xarray.combine_by_coords`. For details on the difference between these
-functions see :ref:`combining data`.
-
-.. note::
-
-    Xarray includes support for manipulating datasets that don't fit into memory
-    with dask_. If you have dask installed, you can open multiple files
-    simultaneously using :py:func:`~xarray.open_mfdataset`::
-
-        xr.open_mfdataset('my/files/*.nc')
-
-    This function automatically concatenates and merges multiple files into a
-    single xarray dataset.
-    It is the recommended way to open multiple files with xarray.
-    For more details, see :ref:`combining.multi`, :ref:`dask.io` and a
-    `blog post`_ by Stephan Hoyer.
-
-.. _dask: http://dask.pydata.org
-.. _blog post: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/
-
-For example, here's how we could approximate ``MFDataset`` from the netCDF4
-library::
-
-    from glob import glob
-    import xarray as xr
-
-    def read_netcdfs(files, dim):
-        # glob expands paths with * to a list of files, like the unix shell
-        paths = sorted(glob(files))
-        datasets = [xr.open_dataset(p) for p in paths]
-        combined = xr.concat(dataset, dim)
-        return combined
-
-    combined = read_netcdfs('/all/my/files/*.nc', dim='time')
-
-This function will work in many cases, but it's not very robust. First, it
-never closes files, which means it will fail one you need to load more than
-a few thousands file. Second, it assumes that you want all the data from each
-file and that it can all fit into memory. In many situations, you only need
-a small subset or an aggregated summary of the data from each file.
-
-Here's a slightly more sophisticated example of how to remedy these
-deficiencies::
-
-    def read_netcdfs(files, dim, transform_func=None):
-        def process_one_path(path):
-            # use a context manager, to ensure the file gets closed after use
-            with xr.open_dataset(path) as ds:
-                # transform_func should do some sort of selection or
-                # aggregation
-                if transform_func is not None:
-                    ds = transform_func(ds)
-                # load all data from the transformed dataset, to ensure we can
-                # use it after closing each original file
-                ds.load()
-                return ds
-
-        paths = sorted(glob(files))
-        datasets = [process_one_path(p) for p in paths]
-        combined = xr.concat(datasets, dim)
-        return combined
-
-    # here we suppose we only care about the combined mean of each file;
-    # you might also use indexing operations like .sel to subset datasets
-    combined = read_netcdfs('/all/my/files/*.nc', dim='time',
-                            transform_func=lambda ds: ds.mean())
-
-This pattern works well and is very robust. We've used similar code to process
-tens of thousands of files constituting 100s of GB of data.
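
The note added in this diff about `data_vars='minimal', coords='minimal', compat='override'` can be illustrated without any files. The sketch below applies the same three options to `xr.concat` on two in-memory datasets, as a stand-in for what `open_mfdataset` is described as doing when it combines files. The helper `make_dataset` and the variable names `temp` and `area` are hypothetical.

```python
import numpy as np
import xarray as xr


def make_dataset(start, noise):
    """Build a toy per-file dataset (hypothetical variable names)."""
    return xr.Dataset(
        {
            # varies along time, so it should be concatenated
            "temp": (("time", "x"), np.full((2, 3), float(start))),
            # has no time dimension; equal across files up to float noise
            "area": (("x",), np.array([1.0, 2.0, 3.0]) + noise),
        },
        coords={"time": [start, start + 1], "x": [0, 1, 2]},
    )


ds1 = make_dataset(0, 0.0)
ds2 = make_dataset(2, 1e-9)

# Concatenate only variables that already have "time"; take the rest
# from the first dataset without comparing values.
combined = xr.concat(
    [ds1, ds2],
    dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
)
```

Here `temp` ends up with four time steps, while `area` keeps its single `x` dimension and the values from `ds1`, even though `ds2` differs by floating-point noise.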