From 283b4feba601e9838ed538106b2091d0b8264c77 Mon Sep 17 00:00:00 2001
From: Deepak Cherian
Date: Fri, 4 Oct 2019 17:04:36 +0000
Subject: [PATCH] Docs/more fixes (#2934)

* Move netcdf to beginning of io.rst
* Better indexing example.
* Start de-emphasizing pandas
* misc.
* compute, load, persist docstrings + text.
* split-apply-combine.
* np.newaxis.
* misc.
* some dask stuff.
* Little more dask.
* undo index.rst changes.
* link to dask docs on chunks
* Fix io.rst.
* small changes.
* rollingupdate.
* joe's review
---
 doc/computation.rst    |   4 +-
 doc/dask.rst           |  65 +++++++++++----
 doc/faq.rst            |  46 +++++++----
 doc/index.rst          |   2 +-
 doc/io.rst             | 178 +++++++++++++++++++++++------------------
 doc/quick-overview.rst |  16 ++--
 doc/why-xarray.rst     |  17 ++--
 xarray/core/dataset.py |  19 ++---
 8 files changed, 212 insertions(+), 135 deletions(-)

diff --git a/doc/computation.rst b/doc/computation.rst
index 3d10774bcac..ae5f4bc5c66 100644
--- a/doc/computation.rst
+++ b/doc/computation.rst
@@ -179,7 +179,9 @@ a value when aggregating:
     r = arr.rolling(y=3, center=True, min_periods=2)
     r.mean()
 
-Note that rolling window aggregations are faster when bottleneck_ is installed.
+.. tip::
+
+  Rolling window aggregations are faster and use less memory when bottleneck_ is installed. This only applies to numpy-backed xarray objects.
 
 .. _bottleneck: https://github.com/kwgoodman/bottleneck/
 
diff --git a/doc/dask.rst b/doc/dask.rst
index 19cbc11292c..5bdbf779463 100644
--- a/doc/dask.rst
+++ b/doc/dask.rst
@@ -5,13 +5,14 @@ Parallel computing with Dask
 
 xarray integrates with `Dask `__ to support parallel
 computations and streaming computation on datasets that don't fit into memory.
-
 Currently, Dask is an entirely optional feature for xarray. However, the
 benefits of using Dask are sufficiently strong that Dask may become a required
 dependency in a future version of xarray.
 
 For a full example of how to use xarray's Dask integration, read the
-`blog post introducing xarray and Dask`_.
+`blog post introducing xarray and Dask`_. More up-to-date examples
+may be found at the `Pangeo project's use-cases `_
+and at the `Dask examples website `_.
 
 .. _blog post introducing xarray and Dask: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/
 
@@ -37,13 +38,14 @@ which allows Dask to take full advantage of multiple processors available on
 most modern computers.
 
 For more details on Dask, read `its documentation `__.
+Note that xarray only makes use of ``dask.array`` and ``dask.delayed``.
 
 .. _dask.io:
 
 Reading and writing data
 ------------------------
 
-The usual way to create a dataset filled with Dask arrays is to load the
+The usual way to create a ``Dataset`` filled with Dask arrays is to load the
 data from a netCDF file or files. You can do this by supplying a ``chunks``
 argument to :py:func:`~xarray.open_dataset` or using the
 :py:func:`~xarray.open_mfdataset` function.
@@ -71,8 +73,8 @@ argument to :py:func:`~xarray.open_dataset` or using the
 
 In this example ``latitude`` and ``longitude`` do not appear in the ``chunks``
 dict, so only one chunk will be used along those dimensions. It is also
-entirely equivalent to opening a dataset using ``open_dataset`` and then
-chunking the data using the ``chunk`` method, e.g.,
+entirely equivalent to opening a dataset using :py:func:`~xarray.open_dataset`
+and then chunking the data using the ``chunk`` method, e.g.,
 ``xr.open_dataset('example-data.nc').chunk({'time': 10})``.
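+
+For example, these two approaches produce the same lazy, dask-backed dataset
+(a minimal sketch reusing the ``example-data.nc`` file from above; the chunk
+size of 10 along ``time`` is purely illustrative)::
+
+    ds = xr.open_dataset('example-data.nc', chunks={'time': 10})
+    ds.chunks  # mapping from dimension names to chunk sizes
+
+    # equivalent, but chunked after opening:
+    ds = xr.open_dataset('example-data.nc').chunk({'time': 10})
+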
 To open multiple files simultaneously in parallel using Dask delayed,
@@ -80,13 +82,14 @@ use :py:func:`~xarray.open_mfdataset`::
 
     xr.open_mfdataset('my/files/*.nc', parallel=True)
 
-This function will automatically concatenate and merge dataset into one in
+This function will automatically concatenate and merge datasets into one in
 the simple cases that it understands (see :py:func:`~xarray.auto_combine`
 for the full disclaimer). By default, :py:func:`~xarray.open_mfdataset` will chunk each
 netCDF file into a single Dask array; again, supply the ``chunks`` argument to
 control the size of the resulting Dask arrays. In more complex cases, you can
-open each file individually using ``open_dataset`` and merge the result, as
-described in :ref:`combining data`.
+open each file individually using :py:func:`~xarray.open_dataset` and merge the result, as
+described in :ref:`combining data`. Passing the keyword argument ``parallel=True`` to :py:func:`~xarray.open_mfdataset` will speed up the reading of large multi-file datasets by
+executing those read tasks in parallel using ``dask.delayed``.
 
 You'll notice that printing a dataset still shows a preview of array values,
 even if they are actually Dask arrays. We can do this quickly with Dask because
@@ -106,7 +109,7 @@ usual way.
 
     ds.to_netcdf('manipulated-example-data.nc')
 
 By setting the ``compute`` argument to ``False``, :py:meth:`~xarray.Dataset.to_netcdf`
-will return a Dask delayed object that can be computed later.
+will return a ``dask.delayed`` object that can be computed later.
 
 .. ipython:: python
 
@@ -153,8 +156,14 @@ explicit conversion step. One notable exception is indexing operations: to
 enable label based indexing, xarray will automatically load coordinate labels
 into memory.
 
+.. tip::
+
+  By default, dask uses its multi-threaded scheduler, which distributes work across
+  multiple cores and allows for processing some datasets that do not fit into memory.
+  For running across a cluster, `set up the distributed scheduler `_.
+
 The easiest way to convert an xarray data structure from lazy Dask arrays into
-eager, in-memory NumPy arrays is to use the :py:meth:`~xarray.Dataset.load` method:
+*eager*, in-memory NumPy arrays is to use the :py:meth:`~xarray.Dataset.load` method:
 
 .. ipython:: python
 
@@ -191,11 +200,20 @@ Dask arrays using the :py:meth:`~xarray.Dataset.persist` method:
 
     ds = ds.persist()
 
-This is particularly useful when using a distributed cluster because the data
-will be loaded into distributed memory across your machines and be much faster
-to use than reading repeatedly from disk. Warning that on a single machine
-this operation will try to load all of your data into memory. You should make
-sure that your dataset is not larger than available memory.
+:py:meth:`~xarray.Dataset.persist` is particularly useful when using a
+distributed cluster because the data will be loaded into distributed memory
+across your machines and be much faster to use than reading repeatedly from
+disk.
+
+.. warning::
+
+    On a single machine, :py:meth:`~xarray.Dataset.persist` will try to load all of
+    your data into memory. You should make sure that your dataset is not larger than
+    available memory.
+
+.. note::
+    For more on the differences between :py:meth:`~xarray.Dataset.persist` and
+    :py:meth:`~xarray.Dataset.compute`, see this `Stack Overflow answer `_ and the `Dask documentation `_.
 
 For performance you may wish to consider chunk sizes.
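+
+For example, you can always inspect the current chunk sizes and rechunk after
+opening (a minimal sketch; the new chunk size of 100 along ``time`` is an
+arbitrary illustration)::
+
+    ds.chunks                     # mapping from dimension names to chunk sizes
+    ds = ds.chunk({'time': 100})  # rechunk along the time dimension
+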
 The correct choice of chunk size depends both on your data and on the
 operations you want to perform.
@@ -381,6 +399,11 @@ one million elements (e.g., a 1000x1000 matrix). With large arrays (10+ GB),
 the cost of queueing up Dask operations can be noticeable, and you may need
 even larger chunk sizes.
 
+.. tip::
+
+    Check out the dask documentation on `chunks `_.
+
+
 Optimization Tips
 -----------------
 
@@ -390,4 +413,12 @@ With analysis pipelines involving both spatial subsetting and temporal resampling
 
 2. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting. Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 `_)
 
-3. Specify smaller chunks across space when using ``open_mfdataset()`` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).
+3. Specify smaller chunks across space when using :py:func:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).
+
+4. Using the h5netcdf package by passing ``engine='h5netcdf'`` to :py:func:`~xarray.open_mfdataset`
+   can be quicker than the default ``engine='netcdf4'`` that uses the netCDF4 package.
+
+5. Some dask-specific tips may be found in the `dask documentation `_.
+
+6. The dask `diagnostics `_ can be
+   useful in identifying performance bottlenecks.
diff --git a/doc/faq.rst b/doc/faq.rst
index 22a4f6cf095..28a1f7395c3 100644
--- a/doc/faq.rst
+++ b/doc/faq.rst
@@ -11,6 +11,38 @@ Frequently Asked Questions
 
     import xarray as xr
     np.random.seed(123456)
 
+
+Your documentation keeps mentioning pandas. What is pandas?
+-----------------------------------------------------------
+
+pandas_ is a very popular data analysis package in Python
+with wide usage in many fields. Our API is heavily inspired by pandas;
+this is why there are so many references to pandas.
+
+.. _pandas: https://pandas.pydata.org
+
+
+Do I need to know pandas to use xarray?
+---------------------------------------
+
+No! Our API is heavily inspired by pandas, so while knowing pandas will let
+you become productive more quickly, knowledge of pandas is not necessary to
+use xarray.
+
+
+Should I use xarray instead of pandas?
+--------------------------------------
+
+It's not an either/or choice! xarray provides robust support for converting
+back and forth between the tabular data-structures of pandas and its own
+multi-dimensional data-structures.
+
+That said, you should only bother with xarray if some aspect of your data is
+fundamentally multi-dimensional. If your data is unstructured or
+one-dimensional, pandas is usually the right choice: it has better performance
+for common operations such as ``groupby`` and you'll find far more usage
+examples online.
+
+
 Why is pandas not enough?
 -------------------------
 
@@ -56,20 +88,6 @@ of the "time" dimension.
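+
+For example, arithmetic between two arrays aligns and broadcasts by dimension
+name (a minimal sketch with made-up dimension names and sizes):
+
+.. ipython:: python
+
+    a = xr.DataArray(np.zeros(3), dims='x')
+    b = xr.DataArray(np.ones(4), dims='y')
+    (a + b).dims  # broadcast against each other to dimensions ('x', 'y')
+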
 You never need to reshape arrays (e.g., with ``np.newaxis``) to align them
 for arithmetic operations in xarray.
 
-Should I use xarray instead of pandas?
---------------------------------------
-
-It's not an either/or choice! xarray provides robust support for converting
-back and forth between the tabular data-structures of pandas and its own
-multi-dimensional data-structures.
-
-That said, you should only bother with xarray if some aspect of data is
-fundamentally multi-dimensional. If your data is unstructured or
-one-dimensional, pandas is usually the right choice: it has better performance
-for common operations such as ``groupby`` and you'll find far more usage
-examples online.
-
-
 Why don't aggregations return Python scalars?
 ---------------------------------------------
 
diff --git a/doc/index.rst b/doc/index.rst
index e5bd03801ff..972eb0a732e 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -11,7 +11,7 @@ intuitive, more concise, and less error-prone developer experience.
 The package includes a large and growing library of domain-agnostic functions
 for advanced analytics and visualization with these data structures.
 
-Xarray was inspired by and borrows heavily from pandas_, the popular data
+Xarray is inspired by and borrows heavily from pandas_, the popular data
 analysis package focused on labelled tabular data. It is particularly
 tailored to working with netCDF_ files, which were the source of
 xarray's data model, and integrates tightly with dask_ for parallel
diff --git a/doc/io.rst b/doc/io.rst
index 0943b598a7f..7f0c2333ce5 100644
--- a/doc/io.rst
+++ b/doc/io.rst
@@ -15,82 +15,6 @@ format (recommended).
 
     import xarray as xr
     np.random.seed(123456)
 
-.. _io.pickle:
-
-Pickle
-------
-
-The simplest way to serialize an xarray object is to use Python's built-in pickle
-module:
-
-.. ipython:: python
-
-    import pickle
-
-    ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
-                    coords={'x': [10, 20, 30, 40],
-                            'y': pd.date_range('2000-01-01', periods=5),
-                            'z': ('x', list('abcd'))})
-
-    # use the highest protocol (-1) because it is way faster than the default
-    # text based pickle format
-    pkl = pickle.dumps(ds, protocol=-1)
-
-    pickle.loads(pkl)
-
-Pickling is important because it doesn't require any external libraries
-and lets you use xarray objects with Python modules like
-:py:mod:`multiprocessing` or :ref:`Dask `. However, pickling is
-**not recommended for long-term storage**.
-
-Restoring a pickle requires that the internal structure of the types for the
-pickled data remain unchanged. Because the internal design of xarray is still
-being refined, we make no guarantees (at this point) that objects pickled with
-this version of xarray will work in future versions.
-
-.. note::
-
-  When pickling an object opened from a NetCDF file, the pickle file will
-  contain a reference to the file on disk. If you want to store the actual
-  array values, load it into memory first with :py:meth:`~xarray.Dataset.load`
-  or :py:meth:`~xarray.Dataset.compute`.
-
-.. _dictionary io:
-
-Dictionary
-----------
-
-We can convert a ``Dataset`` (or a ``DataArray``) to a dict using
-:py:meth:`~xarray.Dataset.to_dict`:
-
-.. ipython:: python
-
-    d = ds.to_dict()
-    d
-
-We can create a new xarray object from a dict using
-:py:meth:`~xarray.Dataset.from_dict`:
-
-.. ipython:: python
-
-    ds_dict = xr.Dataset.from_dict(d)
-    ds_dict
-
-Dictionary support allows for flexible use of xarray objects. It doesn't
-require external libraries and dicts can easily be pickled, or converted to
-json, or geojson. All the values are converted to lists, so dicts might
-be quite large.
-
-To export just the dataset schema, without the data itself, use the
-``data=False`` option:
-
-.. ipython:: python
-
-    ds.to_dict(data=False)
-
-This can be useful for generating indices of dataset contents to expose to
-search indices or other automated data discovery tools.
-
 .. _io.netcdf:
 
 netCDF
 ------
@@ -127,12 +51,25 @@ We can save a Dataset to disk using the
 
 .. ipython:: python
 
+    ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
+                    coords={'x': [10, 20, 30, 40],
+                            'y': pd.date_range('2000-01-01', periods=5),
+                            'z': ('x', list('abcd'))})
+
     ds.to_netcdf('saved_on_disk.nc')
 
 By default, the file is saved as netCDF4 (assuming netCDF4-Python is installed).
 You can control the format and engine used to write the file with the
 ``format`` and ``engine`` arguments.
 
+.. tip::
+
+    Using the `h5netcdf `_ package
+    by passing ``engine='h5netcdf'`` to :py:func:`~xarray.open_dataset` can
+    sometimes be quicker than the default ``engine='netcdf4'`` that uses the
+    `netCDF4 `_ package.
+
+
 We can load netCDF files to create a new Dataset using
 :py:func:`~xarray.open_dataset`:
 
@@ -149,7 +86,15 @@ convert the ``DataArray`` to a ``Dataset`` before saving, and then convert back
 when loading, ensuring that the ``DataArray`` that is loaded is always exactly
 the same as the one that was saved.
 
-Data is always loaded lazily from netCDF files. You can manipulate, slice and subset
+A dataset can also be loaded from or written to a specific group within a netCDF
+file. To load from a group, pass a ``group`` keyword argument to the
+``open_dataset`` function. The group can be specified as a path-like
+string, e.g., to access subgroup 'bar' within group 'foo' pass
+'/foo/bar' as the ``group`` argument. When writing multiple groups in one file,
+pass ``mode='a'`` to ``to_netcdf`` to ensure that each call does not delete the
+file.
+
+Data is *always* loaded lazily from netCDF files. You can manipulate, slice and subset
 Dataset and DataArray objects, and no array values are loaded into memory
 until you try to perform some sort of actual computation. For an example of how
 these lazy arrays work, see the OPeNDAP section below.
@@ -251,8 +196,6 @@ will remove encoding information.
     :suppress:
 
     ds_disk.close()
-    import os
-    os.remove('saved_on_disk.nc')
 
 .. _combining multiple files:
 
@@ -681,6 +624,83 @@ that require NASA's URS authentication::
 
 __ http://docs.python-requests.org
 __ http://pydap.readthedocs.io/en/latest/client.html#authentication
 
+.. _io.pickle:
+
+Pickle
+------
+
+The simplest way to serialize an xarray object is to use Python's built-in pickle
+module:
+
+.. ipython:: python
+
+    import pickle
+
+    # use the highest protocol (-1) because it is way faster than the default
+    # text based pickle format
+    pkl = pickle.dumps(ds, protocol=-1)
+
+    pickle.loads(pkl)
+
+Pickling is important because it doesn't require any external libraries
+and lets you use xarray objects with Python modules like
+:py:mod:`multiprocessing` or :ref:`Dask `. However, pickling is
+**not recommended for long-term storage**.
+
+Restoring a pickle requires that the internal structure of the types for the
+pickled data remain unchanged. Because the internal design of xarray is still
+being refined, we make no guarantees (at this point) that objects pickled with
+this version of xarray will work in future versions.
+
+.. note::
+
+    When pickling an object opened from a NetCDF file, the pickle file will
+    contain a reference to the file on disk. If you want to store the actual
+    array values, load it into memory first with :py:meth:`~xarray.Dataset.load`
+    or :py:meth:`~xarray.Dataset.compute`.
+
+.. _dictionary io:
+
+Dictionary
+----------
+
+We can convert a ``Dataset`` (or a ``DataArray``) to a dict using
+:py:meth:`~xarray.Dataset.to_dict`:
+
+.. ipython:: python
+
+    d = ds.to_dict()
+    d
+
+We can create a new xarray object from a dict using
+:py:meth:`~xarray.Dataset.from_dict`:
+
+.. ipython:: python
+
+    ds_dict = xr.Dataset.from_dict(d)
+    ds_dict
+
+Dictionary support allows for flexible use of xarray objects. It doesn't
+require external libraries and dicts can easily be pickled, or converted to
+json, or geojson. All the values are converted to lists, so dicts might
+be quite large.
+
+To export just the dataset schema, without the data itself, use the
+``data=False`` option:
+
+.. ipython:: python
+
+    ds.to_dict(data=False)
+
+This can be useful for generating indices of dataset contents to expose to
+search indices or other automated data discovery tools.
+
+.. ipython:: python
+    :suppress:
+
+    import os
+    os.remove('saved_on_disk.nc')
+
 .. _io.rasterio:
 
 Rasterio
diff --git a/doc/quick-overview.rst b/doc/quick-overview.rst
index 1224f59515b..7d84199323d 100644
--- a/doc/quick-overview.rst
+++ b/doc/quick-overview.rst
@@ -48,21 +48,21 @@ Here are the key properties for a ``DataArray``:
 Indexing
 --------
 
-xarray supports four kind of indexing. Since we have assigned coordinate labels to the x dimension we can use label-based indexing along that dimension just like pandas. The four examples below all yield the same result but at varying levels of convenience and intuitiveness.
+xarray supports four kinds of indexing. Since we have assigned coordinate labels to the x dimension we can use label-based indexing along that dimension just like pandas. The four examples below all yield the same result (the value at ``x=10``) but at varying levels of convenience and intuitiveness.
 
 .. ipython:: python
 
     # positional and by integer label, like numpy
-    data[[0, 1]]
+    data[0, :]
 
-    # positional and by coordinate label, like pandas
-    data.loc[10:20]
+    # loc or "location": positional and coordinate label, like pandas
+    data.loc[10]
 
-    # by dimension name and integer label
-    data.isel(x=slice(2))
+    # isel or "integer select": by dimension name and integer label
+    data.isel(x=0)
 
-    # by dimension name and coordinate label
-    data.sel(x=[10, 20])
+    # sel or "select": by dimension name and coordinate label
+    data.sel(x=10)
 
 Unlike positional indexing, label-based indexing frees us from having to know how our array is organized. All we need to know are the dimension name and the label we wish to index, i.e., ``data.sel(x=10)`` works regardless of whether ``x`` is the first or second dimension of the array and regardless of whether ``10`` is the first or second element of ``x``. We have already told xarray that x is the first dimension when we created ``data``: xarray keeps track of this so we don't have to. For more, see :ref:`indexing`.
 
diff --git a/doc/why-xarray.rst b/doc/why-xarray.rst
index 25d558d99d5..be8284d88c2 100644
--- a/doc/why-xarray.rst
+++ b/doc/why-xarray.rst
@@ -1,6 +1,10 @@
 Overview: Why xarray?
 =====================
+
+Xarray introduces labels in the form of dimensions, coordinates and attributes on top of
+raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise,
+and less error-prone developer experience.
+
 What labels enable
 ------------------
 
@@ -18,13 +22,14 @@ Xarray doesn't just keep track of labels on arrays -- it uses them to provide a
 powerful and concise interface. For example:
 
 - Apply operations over dimensions by name: ``x.sum('time')``.
-- Select values by label instead of integer location:
+- Select values by label (or logical location) instead of integer location:
   ``x.loc['2014-01-01']`` or ``x.sel(time='2014-01-01')``.
 - Mathematical operations (e.g., ``x - y``) vectorize across multiple
   dimensions (array broadcasting) based on dimension names, not shape.
-- Flexible split-apply-combine operations with groupby:
+- Easily use the `split-apply-combine `_
+  paradigm with ``groupby``:
   ``x.groupby('time.dayofyear').mean()``.
-- Database like alignment based on coordinate labels that smoothly
+- Database-like alignment based on coordinate labels that smoothly
   handles missing values: ``x, y = xr.align(x, y, join='outer')``.
 - Keep track of arbitrary metadata in the form of a Python dictionary:
   ``x.attrs``.
 
@@ -33,8 +38,8 @@ The N-dimensional nature of xarray's data structures makes it suitable for dealing
 with multi-dimensional scientific data, and its use of dimension names
 instead of axis labels (``dim='time'`` instead of ``axis=0``) makes such
 arrays much more manageable than the raw numpy ndarray: with xarray, you don't
-need to keep track of the order of arrays dimensions or insert dummy dimensions
-(e.g., ``np.newaxis``) to align arrays.
+need to keep track of the order of an array's dimensions or insert dummy dimensions of
+size 1 to align arrays (e.g., using ``np.newaxis``).
 
 The immediate payoff of using xarray is that you'll write less code. The
 long-term payoff is that you'll understand what you were thinking when you come
@@ -44,7 +49,7 @@ Core data structures
 --------------------
 
 xarray has two core data structures, which build upon and extend the core
-strengths of NumPy_ and pandas_. Both are fundamentally N-dimensional:
+strengths of NumPy_ and pandas_. Both data structures are fundamentally N-dimensional:
 
 - :py:class:`~xarray.DataArray` is our implementation of a labeled,
   N-dimensional array. It is an N-D generalization of a
   :py:class:`pandas.Series`. The name
diff --git a/xarray/core/dataset.py b/xarray/core/dataset.py
index 03276d61cf0..71025cb3040 100644
--- a/xarray/core/dataset.py
+++ b/xarray/core/dataset.py
@@ -614,8 +614,9 @@ def sizes(self) -> Mapping[Hashable, int]:
         return self.dims
 
     def load(self, **kwargs) -> "Dataset":
-        """Manually trigger loading of this dataset's data from disk or a
-        remote source into memory and return this dataset.
+        """Manually trigger loading and/or computation of this dataset's data
+        from disk or a remote source into memory and return this dataset.
+        Unlike compute, the original dataset is modified and returned.
 
         Normally, it should not be necessary to call this method in user code,
         because all xarray functions should either work on deferred data or
@@ -771,9 +772,9 @@ def _dask_postpersist(dsk, info, *args):
         return Dataset._construct_direct(variables, *args)
 
     def compute(self, **kwargs) -> "Dataset":
-        """Manually trigger loading of this dataset's data from disk or a
-        remote source into memory and return a new dataset. The original is
-        left unaltered.
+        """Manually trigger loading and/or computation of this dataset's data
+        from disk or a remote source into memory and return a new dataset.
+        Unlike load, the original dataset is left unaltered.
 
         Normally, it should not be necessary to call this method in user code,
         because all xarray functions should either work on deferred data or
@@ -816,10 +817,10 @@ def persist(self, **kwargs) -> "Dataset":
         """ Trigger computation, keeping data as dask arrays
 
         This operation can be used to trigger computation on underlying dask
-        arrays, similar to ``.compute()``. However this operation keeps the
-        data as dask arrays. This is particularly useful when using the
-        dask.distributed scheduler and you want to load a large amount of data
-        into distributed memory.
+        arrays, similar to ``.compute()`` or ``.load()``. However, this
+        operation keeps the data as dask arrays. This is particularly useful
+        when using the dask.distributed scheduler and you want to load a large
+        amount of data into distributed memory.
 
         Parameters
         ----------
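
# Illustration of the load/compute/persist distinction drawn in the docstrings
# above (a minimal sketch, not part of the patch; 'example-data.nc' stands in
# for any chunked, dask-backed dataset):
import xarray as xr

ds = xr.open_dataset('example-data.nc', chunks={'time': 10})
new_ds = ds.compute()  # new dataset with values in memory; ds keeps its dask arrays
ds.persist()           # evaluates, but keeps the results as dask arrays
ds.load()              # loads into ds itself (in place); ds now wraps numpy arrays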