From d4c46829b283ab7e7b7db8b86dae77861ce68f3c Mon Sep 17 00:00:00 2001
From: Stephan Hoyer <shoyer@gmail.com>
Date: Thu, 10 Jan 2019 17:06:10 -0800
Subject: [PATCH] DOC: refresh "Why xarray" and shorten top-level description
 (#2657)

* DOC: refresh "Why xarray" and shorten top-level description

This documentation revamp builds upon rabernat's rewrite in GH2430.

The main change is that the three-paragraph description felt too long to me, so
I moved the background paragraph on multi-dimensional arrays into the next
section, on "Why xarray". I also ended up rewriting most of that page, and
made a few adjustments to the FAQ and related projects pages.

* 'why xarray' in setup.py, too

* Updates per review
---
 README.rst               | 84 ++++++++++++----------------------------
 doc/faq.rst              | 14 ++++---
 doc/index.rst            | 30 ++++++--------
 doc/related-projects.rst | 18 ++++++---
 doc/why-xarray.rst       | 76 ++++++++++++++++++++++--------------
 setup.py                 | 51 ++++++++++++++++--------
 6 files changed, 139 insertions(+), 134 deletions(-)

diff --git a/README.rst b/README.rst
index a4c8f6d200b..f30d9dde8bb 100644
--- a/README.rst
+++ b/README.rst
@@ -9,49 +9,47 @@ xarray: N-D labeled arrays and datasets
    :target: https://coveralls.io/r/pydata/xarray
 .. image:: https://readthedocs.org/projects/xray/badge/?version=latest
    :target: http://xarray.pydata.org/
-.. image:: https://img.shields.io/pypi/v/xarray.svg
-   :target: https://pypi.python.org/pypi/xarray/
-.. image:: https://zenodo.org/badge/13221727.svg
-  :target: https://zenodo.org/badge/latestdoi/13221727
 .. image:: http://img.shields.io/badge/benchmarked%20by-asv-green.svg?style=flat
   :target: http://pandas.pydata.org/speed/xarray/
-.. image:: https://img.shields.io/badge/powered%20by-NumFOCUS-orange.svg?style=flat&colorA=E1523D&colorB=007D8A
-  :target: http://numfocus.org
+.. image:: https://img.shields.io/pypi/v/xarray.svg
+   :target: https://pypi.python.org/pypi/xarray/
 
 **xarray** (formerly **xray**) is an open source project and Python package
 that makes working with labelled multi-dimensional arrays simple,
 efficient, and fun!
 
-Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
-"tensors") are an essential part of computational science.
-They are encountered in a wide range of fields, including physics, astronomy,
-geoscience, bioinformatics, engineering, finance, and deep learning.
-In Python, NumPy_ provides the fundamental data structure and API for
-working with raw ND arrays.
-However, real-world datasets are usually more than just raw numbers;
-they have labels which encode information about how the array values map
-to locations in space, time, etc.
+Xarray introduces labels in the form of dimensions, coordinates and
+attributes on top of raw NumPy_-like arrays, which allows for a more
+intuitive, more concise, and less error-prone developer experience.
+The package includes a large and growing library of domain-agnostic functions
+for advanced analytics and visualization with these data structures.
 
-By introducing *dimensions*, *coordinates*, and *attributes* on top of raw
-NumPy-like arrays, xarray is able to understand these labels and use them to
-provide a more intuitive, more concise, and less error-prone experience.
-Xarray also provides a large and growing library of functions for advanced
-analytics and visualization with these data structures.
 Xarray was inspired by and borrows heavily from pandas_, the popular data
 analysis package focused on labelled tabular data.
-Xarray can read and write data from most common labeled ND-array storage
-formats and is particularly tailored to working with netCDF_ files, which were
-the source of xarray's data model.
+It is particularly tailored to working with netCDF_ files, which were the
+source of xarray's data model, and integrates tightly with dask_ for parallel
+computing.
 
-.. _NumPy: http://www.numpy.org/
+.. _NumPy: http://www.numpy.org
 .. _pandas: http://pandas.pydata.org
+.. _dask: http://dask.org
 .. _netCDF: http://www.unidata.ucar.edu/software/netcdf
 
 Why xarray?
 -----------
 
-Adding dimensions names and coordinate indexes to numpy's ndarray_ makes many
-powerful array operations possible:
+Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
+"tensors") are an essential part of computational science.
+They are encountered in a wide range of fields, including physics, astronomy,
+geoscience, bioinformatics, engineering, finance, and deep learning.
+In Python, NumPy_ provides the fundamental data structure and API for
+working with raw ND arrays.
+However, real-world datasets are usually more than just raw numbers;
+they have labels which encode information about how the array values map
+to locations in space, time, etc.
+
+Xarray doesn't just keep track of labels on arrays -- it uses them to provide a
+powerful and concise interface. For example:
 
 -  Apply operations over dimensions by name: ``x.sum('time')``.
 -  Select values by label instead of integer location:
@@ -65,42 +63,10 @@ powerful array operations possible:
 -  Keep track of arbitrary metadata in the form of a Python dictionary:
    ``x.attrs``.
 
-pandas_ provides many of these features, but it does not make use of dimension
-names, and its core data structures are fixed dimensional arrays.
-
-Why isn't pandas enough?
-------------------------
-
-pandas_ excels at working with tabular data. That suffices for many statistical
-analyses, but physical scientists rely on N-dimensional arrays -- which is
-where xarray comes in.
-
-xarray aims to provide a data analysis toolkit as powerful as pandas_ but
-designed for working with homogeneous N-dimensional arrays
-instead of tabular data. When possible, we copy the pandas API and rely on
-pandas's highly optimized internals (in particular, for fast indexing).
-
-Why netCDF?
------------
-
-Because xarray implements the same data model as the netCDF_ file format,
-xarray datasets have a natural and portable serialization format. But it is also
-easy to robustly convert an xarray ``DataArray`` to and from a numpy ``ndarray``
-or a pandas ``DataFrame`` or ``Series``, providing compatibility with the full
-`PyData ecosystem <http://pydata.org/>`__.
-
-Our target audience is anyone who needs N-dimensional labeled arrays, but we
-are particularly focused on the data analysis needs of physical scientists --
-especially geoscientists who already know and love netCDF_.
-
-.. _ndarray: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html
-.. _pandas: http://pandas.pydata.org
-.. _netCDF: http://www.unidata.ucar.edu/software/netcdf
-
 Documentation
 -------------
 
-The official documentation is hosted on ReadTheDocs at http://xarray.pydata.org/
+Learn more about xarray in its official documentation at http://xarray.pydata.org/
 
 Contributing
 ------------
diff --git a/doc/faq.rst b/doc/faq.rst
index 44bc021024b..465a5a6d250 100644
--- a/doc/faq.rst
+++ b/doc/faq.rst
@@ -18,8 +18,9 @@ pandas is a fantastic library for analysis of low-dimensional labelled data -
 if it can be sensibly described as "rows and columns", pandas is probably the
 right choice.  However, sometimes we want to use higher dimensional arrays
 (`ndim > 2`), or arrays for which the order of dimensions (e.g., columns vs
-rows) shouldn't really matter. For example, climate and weather data is often
-natively expressed in 4 or more dimensions: time, x, y and z.
+rows) shouldn't really matter. For example, the images of a movie can be
+natively represented as an array with four dimensions: time, row, column and
+color.
 
 Pandas has historically supported N-dimensional panels, but deprecated them in
 version 0.20 in favor of Xarray data structures.  There are now built-in methods
@@ -39,9 +40,8 @@ if you were using Panels:
   xarray ``Dataset``.
 
 You can :ref:`read about switching from Panels to Xarray here <panel transition>`.
-Pandas gets a lot of things right, but scientific users need fully multi-
-dimensional data structures.
-
+Pandas gets a lot of things right, but many science, engineering and complex
+analytics use cases need fully multi-dimensional data structures.
 
 How do xarray data structures differ from those found in pandas?
 ----------------------------------------------------------------
@@ -65,7 +65,9 @@ multi-dimensional data-structures.
 
 That said, you should only bother with xarray if some aspect of data is
 fundamentally multi-dimensional. If your data is unstructured or
-one-dimensional, stick with pandas.
+one-dimensional, pandas is usually the right choice: it has better performance
+for common operations such as ``groupby`` and you'll find far more usage
+examples online.
 
 
 Why don't aggregations return Python scalars?
diff --git a/doc/index.rst b/doc/index.rst
index fe6d2874953..dbe911011cd 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -5,29 +5,21 @@ xarray: N-D labeled arrays and datasets in Python
 that makes working with labelled multi-dimensional arrays simple,
 efficient, and fun!
 
-Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
-"tensors") are an essential part of computational science.
-They are encountered in a wide range of fields, including physics, astronomy,
-geoscience, bioinformatics, engineering, finance, and deep learning.
-In Python, NumPy_ provides the fundamental data structure and API for
-working with raw ND arrays.
-However, real-world datasets are usually more than just raw numbers;
-they have labels which encode information about how the array values map
-to locations in space, time, etc.
-
-By introducing *dimensions*, *coordinates*, and *attributes* on top of raw
-NumPy-like arrays, xarray is able to understand these labels and use them to
-provide a more intuitive, more concise, and less error-prone experience.
-Xarray also provides a large and growing library of functions for advanced
-analytics and visualization with these data structures.
+Xarray introduces labels in the form of dimensions, coordinates and
+attributes on top of raw NumPy_-like arrays, which allows for a more
+intuitive, more concise, and less error-prone developer experience.
+The package includes a large and growing library of domain-agnostic functions
+for advanced analytics and visualization with these data structures.
+
 Xarray was inspired by and borrows heavily from pandas_, the popular data
 analysis package focused on labelled tabular data.
-Xarray can read and write data from most common labeled ND-array storage
-formats and is particularly tailored to working with netCDF_ files, which were
-the source of xarray's data model.
+It is particularly tailored to working with netCDF_ files, which were the
+source of xarray's data model, and integrates tightly with dask_ for parallel
+computing.
 
-.. _NumPy: http://www.numpy.org/
+.. _NumPy: http://www.numpy.org
 .. _pandas: http://pandas.pydata.org
+.. _dask: http://dask.org
 .. _netCDF: http://www.unidata.ucar.edu/software/netcdf
 
 Documentation
diff --git a/doc/related-projects.rst b/doc/related-projects.rst
index cf89c715bc7..c89e324ff7c 100644
--- a/doc/related-projects.rst
+++ b/doc/related-projects.rst
@@ -3,7 +3,7 @@
 Xarray related projects
 -----------------------
 
-Here below is a list of several existing libraries that build
+Below is a list of existing open source projects that build
 functionality upon xarray. See also section :ref:`internals` for more
 details on how to build xarray extensions.
 
@@ -39,11 +39,16 @@ Geosciences
 
 Machine Learning
 ~~~~~~~~~~~~~~~~
-- `cesium <http://cesium-ml.org/>`_: machine learning for time series analysis
+- `ArviZ <https://arviz-devs.github.io/arviz/>`_: Exploratory analysis of Bayesian models, built on top of xarray.
 - `Elm <https://ensemble-learning-models.readthedocs.io>`_: Parallel machine learning on xarray data structures
 - `sklearn-xarray (1) <https://phausamann.github.io/sklearn-xarray>`_: Combines scikit-learn and xarray (1).
 - `sklearn-xarray (2) <https://sklearn-xarray.readthedocs.io/en/latest/>`_: Combines scikit-learn and xarray (2).
 
+Other domains
+~~~~~~~~~~~~~
+- `ptsa <https://pennmem.github.io/ptsa_new/html/index.html>`_: EEG Time Series Analysis
+- `pycalphad <https://pycalphad.org/docs/latest/>`_: Computational Thermodynamics in Python
+
 Extend xarray capabilities
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 - `Collocate <https://github.com/cistools/collocate>`_: Collocate xarray trajectories in arbitrary physical dimensions
@@ -61,9 +66,10 @@ Visualization
 - `hvplot <https://hvplot.pyviz.org/>`_ : A high-level plotting API for the PyData ecosystem built on HoloViews.
 - `psyplot <https://psyplot.readthedocs.io>`_: Interactive data visualization with python.
 
-Other
-~~~~~
-- `ptsa <https://pennmem.github.io/ptsa_new/html/index.html>`_: EEG Time Series Analysis
-- `pycalphad <https://pycalphad.org/docs/latest/>`_: Computational Thermodynamics in Python
+Non-Python projects
+~~~~~~~~~~~~~~~~~~~
+- `xframe <https://github.com/QuantStack/xframe>`_: C++ data structures inspired by xarray.
+- `AxisArrays <https://github.com/JuliaArrays/AxisArrays.jl>`_ and
+  `NamedArrays <https://github.com/davidavdav/NamedArrays.jl>`_: similar data structures for Julia.
 
 More projects can be found at the `"xarray" Github topic <https://github.com/topics/xarray>`_.
diff --git a/doc/why-xarray.rst b/doc/why-xarray.rst
index e9f30fe25be..d0a6c591b29 100644
--- a/doc/why-xarray.rst
+++ b/doc/why-xarray.rst
@@ -1,11 +1,21 @@
 Overview: Why xarray?
 =====================
 
-Features
---------
-
-Adding dimensions names and coordinate indexes to numpy's ndarray_ makes many
-powerful array operations possible:
+What labels enable
+------------------
+
+Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
+"tensors") are an essential part of computational science.
+They are encountered in a wide range of fields, including physics, astronomy,
+geoscience, bioinformatics, engineering, finance, and deep learning.
+In Python, NumPy_ provides the fundamental data structure and API for
+working with raw ND arrays.
+However, real-world datasets are usually more than just raw numbers;
+they have labels which encode information about how the array values map
+to locations in space, time, etc.
+
+Xarray doesn't just keep track of labels on arrays -- it uses them to provide a
+powerful and concise interface. For example:
 
 -  Apply operations over dimensions by name: ``x.sum('time')``.
 -  Select values by label instead of integer location:
@@ -19,9 +29,6 @@ powerful array operations possible:
 -  Keep track of arbitrary metadata in the form of a Python dictionary:
    ``x.attrs``.
 
-pandas_ provides many of these features, but it does not make use of dimension
-names, and its core data structures are fixed dimensional arrays.
-
 The N-dimensional nature of xarray's data structures makes it suitable for dealing
 with multi-dimensional scientific data, and its use of dimension names
 instead of axis labels (``dim='time'`` instead of ``axis=0``) makes such
@@ -29,10 +36,15 @@ arrays much more manageable than the raw numpy ndarray: with xarray, you don't
 need to keep track of the order of arrays dimensions or insert dummy dimensions
 (e.g., ``np.newaxis``) to align arrays.
 
+The immediate payoff of using xarray is that you'll write less code. The
+long-term payoff is that you'll understand what you were thinking when you come
+back to look at it weeks or months later.
+
 Core data structures
 --------------------
 
-xarray has two core data structures. Both are fundamentally N-dimensional:
+xarray has two core data structures, which build upon and extend the core
+strengths of NumPy_ and pandas_. Both are fundamentally N-dimensional:
 
 - :py:class:`~xarray.DataArray` is our implementation of a labeled, N-dimensional
   array. It is an N-D generalization of a :py:class:`pandas.Series`. The name
@@ -43,8 +55,6 @@ xarray has two core data structures. Both are fundamentally N-dimensional:
   shared dimensions, and serves a similar purpose in xarray to the
   :py:class:`pandas.DataFrame`.
 
-.. _datarray: https://github.com/fperez/datarray
-
 The value of attaching labels to numpy's :py:class:`numpy.ndarray` may be
 fairly obvious, but the dataset may need more motivation.
 
@@ -69,23 +79,33 @@ metadata once, not every time you save a file.
 Goals and aspirations
 ---------------------
 
-pandas_ excels at working with tabular data. That suffices for many statistical
-analyses, but physical scientists rely on N-dimensional arrays -- which is
-where xarray comes in.
+Xarray contributes domain-agnostic data structures and tools for labeled
+multi-dimensional arrays to Python's SciPy_ ecosystem for numerical computing.
+In particular, xarray builds upon and integrates with NumPy_ and pandas_:
+
+- Our user-facing interfaces aim to be more explicit versions of those found in
+  NumPy/pandas.
+- Compatibility with the broader ecosystem is a major goal: it should be easy
+  to get your data in and out.
+- We try to keep a tight focus on functionality and interfaces related to
+  labeled data, and leverage other Python libraries for everything else, e.g.,
+  NumPy/pandas for fast arrays/indexing (xarray itself contains no compiled
+  code), Dask_ for parallel computing, matplotlib_ for plotting, etc.
+
+Xarray is a collaborative and community-driven project, run entirely on
+volunteer effort (see :ref:`contributing`).
+Our target audience is anyone who needs N-dimensional labeled arrays in Python.
+Originally, development was driven by the data analysis needs of physical
+scientists (especially geoscientists who already know and love
+netCDF_), but it has become a much more broadly useful tool, and is still
+under active development.
+See our technical :ref:`roadmap` for more details, and feel free to reach out
+with questions about whether xarray is the right tool for your needs.
 
-xarray aims to provide a data analysis toolkit as powerful as pandas_ but
-designed for working with homogeneous N-dimensional arrays
-instead of tabular data. When possible, we copy the pandas API and rely on
-pandas's highly optimized internals (in particular, for fast indexing).
-
-Importantly, xarray has robust support for converting its objects to and
-from a numpy ``ndarray`` or a pandas ``DataFrame`` or ``Series``, providing
-compatibility with the full `PyData ecosystem <http://pydata.org/>`__.
-
-Our target audience is anyone who needs N-dimensional labeled arrays, but we
-are particularly focused on the data analysis needs of physical scientists --
-especially geoscientists who already know and love netCDF_.
-
-.. _ndarray: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html
+.. _datarray: https://github.com/fperez/datarray
+.. _Dask: http://dask.org
+.. _matplotlib: http://matplotlib.org
 .. _netCDF: http://www.unidata.ucar.edu/software/netcdf
+.. _NumPy: http://www.numpy.org
 .. _pandas: http://pandas.pydata.org
+.. _SciPy: http://www.scipy.org
diff --git a/setup.py b/setup.py
index 8c0c98ab33d..ff667d7a113 100644
--- a/setup.py
+++ b/setup.py
@@ -38,6 +38,25 @@
 that makes working with labelled multi-dimensional arrays simple,
 efficient, and fun!
 
+Xarray introduces labels in the form of dimensions, coordinates and
+attributes on top of raw NumPy_-like arrays, which allows for a more
+intuitive, more concise, and less error-prone developer experience.
+The package includes a large and growing library of domain-agnostic functions
+for advanced analytics and visualization with these data structures.
+
+Xarray was inspired by and borrows heavily from pandas_, the popular data
+analysis package focused on labelled tabular data.
+It is particularly tailored to working with netCDF_ files, which were the
+source of xarray's data model, and integrates tightly with
+`dask <http://dask.org>`_ for parallel computing.
+
+.. _NumPy: http://www.numpy.org/
+.. _pandas: http://pandas.pydata.org
+.. _netCDF: http://www.unidata.ucar.edu/software/netcdf
+
+Why xarray?
+-----------
+
 Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
 "tensors") are an essential part of computational science.
 They are encountered in a wide range of fields, including physics, astronomy,
@@ -48,25 +67,25 @@
 they have labels which encode information about how the array values map
 to locations in space, time, etc.
 
-By introducing *dimensions*, *coordinates*, and *attributes* on top of raw
-NumPy-like arrays, xarray is able to understand these labels and use them to
-provide a more intuitive, more concise, and less error-prone experience.
-Xarray also provides a large and growing library of functions for advanced
-analytics and visualization with these data structures.
-Xarray was inspired by and borrows heavily from pandas_, the popular data
-analysis package focused on labelled tabular data.
-Xarray can read and write data from most common labeled ND-array storage
-formats and is particularly tailored to working with netCDF_ files, which were
-the source of xarray's data model.
+Xarray doesn't just keep track of labels on arrays -- it uses them to provide a
+powerful and concise interface. For example:
 
-.. _NumPy: http://www.numpy.org/
-.. _pandas: http://pandas.pydata.org
-.. _netCDF: http://www.unidata.ucar.edu/software/netcdf
+-  Apply operations over dimensions by name: ``x.sum('time')``.
+-  Select values by label instead of integer location:
+   ``x.loc['2014-01-01']`` or ``x.sel(time='2014-01-01')``.
+-  Mathematical operations (e.g., ``x - y``) vectorize across multiple
+   dimensions (array broadcasting) based on dimension names, not shape.
+-  Flexible split-apply-combine operations with groupby:
+   ``x.groupby('time.dayofyear').mean()``.
+-  Database-like alignment based on coordinate labels that smoothly
+   handles missing values: ``x, y = xr.align(x, y, join='outer')``.
+-  Keep track of arbitrary metadata in the form of a Python dictionary:
+   ``x.attrs``.
 
-Important links
----------------
+Learn more
+----------
 
-- HTML documentation: http://xarray.pydata.org
+- Documentation: http://xarray.pydata.org
 - Issue tracker: http://github.com/pydata/xarray/issues
 - Source code: http://github.com/pydata/xarray
 - SciPy2015 talk: https://www.youtube.com/watch?v=X0pAhJgySxk