From d4c46829b283ab7e7b7db8b86dae77861ce68f3c Mon Sep 17 00:00:00 2001 From: Stephan Hoyer Date: Thu, 10 Jan 2019 17:06:10 -0800 Subject: [PATCH] DOC: refresh "Why xarray" and shorten top-level description (#2657) * DOC: refresh "Why xarray" and shorten top-level description This documentation revamp builds upon rabernat's rewrite in GH2430. The main change is that the three paragraph description felt too long to me, so I moved the background paragraph on multi-dimensional arrays into the next section, on "Why xarray". I also ended up rewriting most of that page, and made a few adjustments to the FAQ and related projects pages. * 'why xarray' in setup.py, too * Updates per review --- README.rst | 84 ++++++++++++---------------------------- doc/faq.rst | 14 ++++--- doc/index.rst | 30 ++++++-------- doc/related-projects.rst | 18 ++++++--- doc/why-xarray.rst | 76 ++++++++++++++++++++++-------------- setup.py | 51 ++++++++++++++++-------- 6 files changed, 139 insertions(+), 134 deletions(-) diff --git a/README.rst b/README.rst index a4c8f6d200b..f30d9dde8bb 100644 --- a/README.rst +++ b/README.rst @@ -9,49 +9,47 @@ xarray: N-D labeled arrays and datasets :target: https://coveralls.io/r/pydata/xarray .. image:: https://readthedocs.org/projects/xray/badge/?version=latest :target: http://xarray.pydata.org/ -.. image:: https://img.shields.io/pypi/v/xarray.svg - :target: https://pypi.python.org/pypi/xarray/ -.. image:: https://zenodo.org/badge/13221727.svg - :target: https://zenodo.org/badge/latestdoi/13221727 .. image:: http://img.shields.io/badge/benchmarked%20by-asv-green.svg?style=flat :target: http://pandas.pydata.org/speed/xarray/ -.. image:: https://img.shields.io/badge/powered%20by-NumFOCUS-orange.svg?style=flat&colorA=E1523D&colorB=007D8A - :target: http://numfocus.org +.. 
image:: https://img.shields.io/pypi/v/xarray.svg + :target: https://pypi.python.org/pypi/xarray/ **xarray** (formerly **xray**) is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun! -Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called -"tensors") are an essential part of computational science. -They are encountered in a wide range of fields, including physics, astronomy, -geoscience, bioinformatics, engineering, finance, and deep learning. -In Python, NumPy_ provides the fundamental data structure and API for -working with raw ND arrays. -However, real-world datasets are usually more than just raw numbers; -they have labels which encode information about how the array values map -to locations in space, time, etc. +Xarray introduces labels in the form of dimensions, coordinates and +attributes on top of raw NumPy_-like arrays, which allows for a more +intuitive, more concise, and less error-prone developer experience. +The package includes a large and growing library of domain-agnostic functions +for advanced analytics and visualization with these data structures. -By introducing *dimensions*, *coordinates*, and *attributes* on top of raw -NumPy-like arrays, xarray is able to understand these labels and use them to -provide a more intuitive, more concise, and less error-prone experience. -Xarray also provides a large and growing library of functions for advanced -analytics and visualization with these data structures. Xarray was inspired by and borrows heavily from pandas_, the popular data analysis package focused on labelled tabular data. -Xarray can read and write data from most common labeled ND-array storage -formats and is particularly tailored to working with netCDF_ files, which were -the source of xarray's data model. 
+It is particularly tailored to working with netCDF_ files, which were the +source of xarray's data model, and integrates tightly with dask_ for parallel +computing. -.. _NumPy: http://www.numpy.org/ +.. _NumPy: http://www.numpy.org .. _pandas: http://pandas.pydata.org +.. _dask: http://dask.org .. _netCDF: http://www.unidata.ucar.edu/software/netcdf Why xarray? ----------- -Adding dimensions names and coordinate indexes to numpy's ndarray_ makes many -powerful array operations possible: +Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called +"tensors") are an essential part of computational science. +They are encountered in a wide range of fields, including physics, astronomy, +geoscience, bioinformatics, engineering, finance, and deep learning. +In Python, NumPy_ provides the fundamental data structure and API for +working with raw ND arrays. +However, real-world datasets are usually more than just raw numbers; +they have labels which encode information about how the array values map +to locations in space, time, etc. + +Xarray doesn't just keep track of labels on arrays -- it uses them to provide a +powerful and concise interface. For example: - Apply operations over dimensions by name: ``x.sum('time')``. - Select values by label instead of integer location: @@ -65,42 +63,10 @@ powerful array operations possible: - Keep track of arbitrary metadata in the form of a Python dictionary: ``x.attrs``. -pandas_ provides many of these features, but it does not make use of dimension -names, and its core data structures are fixed dimensional arrays. - -Why isn't pandas enough? ------------------------- - -pandas_ excels at working with tabular data. That suffices for many statistical -analyses, but physical scientists rely on N-dimensional arrays -- which is -where xarray comes in. - -xarray aims to provide a data analysis toolkit as powerful as pandas_ but -designed for working with homogeneous N-dimensional arrays -instead of tabular data. 
When possible, we copy the pandas API and rely on -pandas's highly optimized internals (in particular, for fast indexing). - -Why netCDF? ------------ - -Because xarray implements the same data model as the netCDF_ file format, -xarray datasets have a natural and portable serialization format. But it is also -easy to robustly convert an xarray ``DataArray`` to and from a numpy ``ndarray`` -or a pandas ``DataFrame`` or ``Series``, providing compatibility with the full -`PyData ecosystem `__. - -Our target audience is anyone who needs N-dimensional labeled arrays, but we -are particularly focused on the data analysis needs of physical scientists -- -especially geoscientists who already know and love netCDF_. - -.. _ndarray: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html -.. _pandas: http://pandas.pydata.org -.. _netCDF: http://www.unidata.ucar.edu/software/netcdf - Documentation ------------- -The official documentation is hosted on ReadTheDocs at http://xarray.pydata.org/ +Learn more about xarray in its official documentation at http://xarray.pydata.org/ Contributing ------------ diff --git a/doc/faq.rst b/doc/faq.rst index 44bc021024b..465a5a6d250 100644 --- a/doc/faq.rst +++ b/doc/faq.rst @@ -18,8 +18,9 @@ pandas is a fantastic library for analysis of low-dimensional labelled data - if it can be sensibly described as "rows and columns", pandas is probably the right choice. However, sometimes we want to use higher dimensional arrays (`ndim > 2`), or arrays for which the order of dimensions (e.g., columns vs -rows) shouldn't really matter. For example, climate and weather data is often -natively expressed in 4 or more dimensions: time, x, y and z. +rows) shouldn't really matter. For example, the images of a movie can be +natively represented as an array with four dimensions: time, row, column and +color. Pandas has historically supported N-dimensional panels, but deprecated them in version 0.20 in favor of Xarray data structures. 
There are now built-in methods @@ -39,9 +40,8 @@ if you were using Panels: xarray ``Dataset``. You can :ref:`read about switching from Panels to Xarray here `. -Pandas gets a lot of things right, but scientific users need fully multi- -dimensional data structures. - +Pandas gets a lot of things right, but many science, engineering and complex +analytics use cases need fully multi-dimensional data structures. How do xarray data structures differ from those found in pandas? ---------------------------------------------------------------- @@ -65,7 +65,9 @@ multi-dimensional data-structures. That said, you should only bother with xarray if some aspect of data is fundamentally multi-dimensional. If your data is unstructured or -one-dimensional, stick with pandas. +one-dimensional, pandas is usually the right choice: it has better performance +for common operations such as ``groupby`` and you'll find far more usage +examples online. Why don't aggregations return Python scalars? diff --git a/doc/index.rst b/doc/index.rst index fe6d2874953..dbe911011cd 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -5,29 +5,21 @@ xarray: N-D labeled arrays and datasets in Python that makes working with labelled multi-dimensional arrays simple, efficient, and fun! -Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called -"tensors") are an essential part of computational science. -They are encountered in a wide range of fields, including physics, astronomy, -geoscience, bioinformatics, engineering, finance, and deep learning. -In Python, NumPy_ provides the fundamental data structure and API for -working with raw ND arrays. -However, real-world datasets are usually more than just raw numbers; -they have labels which encode information about how the array values map -to locations in space, time, etc. 
- -By introducing *dimensions*, *coordinates*, and *attributes* on top of raw -NumPy-like arrays, xarray is able to understand these labels and use them to -provide a more intuitive, more concise, and less error-prone experience. -Xarray also provides a large and growing library of functions for advanced -analytics and visualization with these data structures. +Xarray introduces labels in the form of dimensions, coordinates and +attributes on top of raw NumPy_-like arrays, which allows for a more +intuitive, more concise, and less error-prone developer experience. +The package includes a large and growing library of domain-agnostic functions +for advanced analytics and visualization with these data structures. + Xarray was inspired by and borrows heavily from pandas_, the popular data analysis package focused on labelled tabular data. -Xarray can read and write data from most common labeled ND-array storage -formats and is particularly tailored to working with netCDF_ files, which were -the source of xarray's data model. +It is particularly tailored to working with netCDF_ files, which were the +source of xarray's data model, and integrates tightly with dask_ for parallel +computing. -.. _NumPy: http://www.numpy.org/ +.. _NumPy: http://www.numpy.org .. _pandas: http://pandas.pydata.org +.. _dask: http://dask.org .. _netCDF: http://www.unidata.ucar.edu/software/netcdf Documentation diff --git a/doc/related-projects.rst b/doc/related-projects.rst index cf89c715bc7..c89e324ff7c 100644 --- a/doc/related-projects.rst +++ b/doc/related-projects.rst @@ -3,7 +3,7 @@ Xarray related projects ----------------------- -Here below is a list of several existing libraries that build +Here below is a list of existing open source projects that build functionality upon xarray. See also section :ref:`internals` for more details on how to build xarray extensions. 
@@ -39,11 +39,16 @@ Geosciences Machine Learning ~~~~~~~~~~~~~~~~ -- `cesium `_: machine learning for time series analysis +- `ArviZ `_: Exploratory analysis of Bayesian models, built on top of xarray. - `Elm `_: Parallel machine learning on xarray data structures - `sklearn-xarray (1) `_: Combines scikit-learn and xarray (1). - `sklearn-xarray (2) `_: Combines scikit-learn and xarray (2). +Other domains +~~~~~~~~~~~~~ +- `ptsa `_: EEG Time Series Analysis +- `pycalphad `_: Computational Thermodynamics in Python + Extend xarray capabilities ~~~~~~~~~~~~~~~~~~~~~~~~~~ - `Collocate `_: Collocate xarray trajectories in arbitrary physical dimensions @@ -61,9 +66,10 @@ Visualization - `hvplot `_ : A high-level plotting API for the PyData ecosystem built on HoloViews. - `psyplot `_: Interactive data visualization with python. -Other -~~~~~ -- `ptsa `_: EEG Time Series Analysis -- `pycalphad `_: Computational Thermodynamics in Python +Non-Python projects +~~~~~~~~~~~~~~~~~~~ +- `xframe `_: C++ data structures inspired by xarray. +- `AxisArrays `_ and + `NamedArrays `_: similar data structures for Julia. More projects can be found at the `"xarray" Github topic `_. diff --git a/doc/why-xarray.rst b/doc/why-xarray.rst index e9f30fe25be..d0a6c591b29 100644 --- a/doc/why-xarray.rst +++ b/doc/why-xarray.rst @@ -1,11 +1,21 @@ Overview: Why xarray? ===================== -Features --------- - -Adding dimensions names and coordinate indexes to numpy's ndarray_ makes many -powerful array operations possible: +What labels enable +------------------ + +Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called +"tensors") are an essential part of computational science. +They are encountered in a wide range of fields, including physics, astronomy, +geoscience, bioinformatics, engineering, finance, and deep learning. +In Python, NumPy_ provides the fundamental data structure and API for +working with raw ND arrays. 
+However, real-world datasets are usually more than just raw numbers; +they have labels which encode information about how the array values map +to locations in space, time, etc. + +Xarray doesn't just keep track of labels on arrays -- it uses them to provide a +powerful and concise interface. For example: - Apply operations over dimensions by name: ``x.sum('time')``. - Select values by label instead of integer location: @@ -19,9 +29,6 @@ powerful array operations possible: - Keep track of arbitrary metadata in the form of a Python dictionary: ``x.attrs``. -pandas_ provides many of these features, but it does not make use of dimension -names, and its core data structures are fixed dimensional arrays. - The N-dimensional nature of xarray's data structures makes it suitable for dealing with multi-dimensional scientific data, and its use of dimension names instead of axis labels (``dim='time'`` instead of ``axis=0``) makes such @@ -29,10 +36,15 @@ arrays much more manageable than the raw numpy ndarray: with xarray, you don't need to keep track of the order of arrays dimensions or insert dummy dimensions (e.g., ``np.newaxis``) to align arrays. +The immediate payoff of using xarray is that you'll write less code. The +long-term payoff is that you'll understand what you were thinking when you come +back to look at it weeks or months later. + Core data structures -------------------- -xarray has two core data structures. Both are fundamentally N-dimensional: +xarray has two core data structures, which build upon and extend the core +strengths of NumPy_ and pandas_. Both are fundamentally N-dimensional: - :py:class:`~xarray.DataArray` is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a :py:class:`pandas.Series`. The name @@ -43,8 +55,6 @@ xarray has two core data structures. Both are fundamentally N-dimensional: shared dimensions, and serves a similar purpose in xarray to the :py:class:`pandas.DataFrame`. -.. 
_datarray: https://github.com/fperez/datarray - The value of attaching labels to numpy's :py:class:`numpy.ndarray` may be fairly obvious, but the dataset may need more motivation. @@ -69,23 +79,33 @@ metadata once, not every time you save a file. Goals and aspirations --------------------- -pandas_ excels at working with tabular data. That suffices for many statistical -analyses, but physical scientists rely on N-dimensional arrays -- which is -where xarray comes in. +Xarray contributes domain-agnostic data-structures and tools for labeled +multi-dimensional arrays to Python's SciPy_ ecosystem for numerical computing. +In particular, xarray builds upon and integrates with NumPy_ and pandas_: + +- Our user-facing interfaces aim to be more explicit versions of those found in + NumPy/pandas. +- Compatibility with the broader ecosystem is a major goal: it should be easy + to get your data in and out. +- We try to keep a tight focus on functionality and interfaces related to + labeled data, and leverage other Python libraries for everything else, e.g., + NumPy/pandas for fast arrays/indexing (xarray itself contains no compiled + code), Dask_ for parallel computing, matplotlib_ for plotting, etc. + +Xarray is a collaborative and community-driven project, run entirely on +volunteer effort (see :ref:`contributing`). +Our target audience is anyone who needs N-dimensional labeled arrays in Python. +Originally, development was driven by the data analysis needs of physical +scientists (especially geoscientists who already know and love +netCDF_), but it has become a much more broadly useful tool, and is still +under active development. +See our technical :ref:`roadmap` for more details, and feel free to reach out +with questions about whether xarray is the right tool for your needs. -xarray aims to provide a data analysis toolkit as powerful as pandas_ but -designed for working with homogeneous N-dimensional arrays -instead of tabular data. 
When possible, we copy the pandas API and rely on -pandas's highly optimized internals (in particular, for fast indexing). - -Importantly, xarray has robust support for converting its objects to and -from a numpy ``ndarray`` or a pandas ``DataFrame`` or ``Series``, providing -compatibility with the full `PyData ecosystem `__. - -Our target audience is anyone who needs N-dimensional labeled arrays, but we -are particularly focused on the data analysis needs of physical scientists -- -especially geoscientists who already know and love netCDF_. - -.. _ndarray: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html +.. _datarray: https://github.com/fperez/datarray +.. _Dask: http://dask.org +.. _matplotlib: http://matplotlib.org .. _netCDF: http://www.unidata.ucar.edu/software/netcdf +.. _NumPy: http://www.numpy.org .. _pandas: http://pandas.pydata.org +.. _SciPy: http://www.scipy.org diff --git a/setup.py b/setup.py index 8c0c98ab33d..ff667d7a113 100644 --- a/setup.py +++ b/setup.py @@ -38,6 +38,25 @@ that makes working with labelled multi-dimensional arrays simple, efficient, and fun! +Xarray introduces labels in the form of dimensions, coordinates and +attributes on top of raw NumPy_-like arrays, which allows for a more +intuitive, more concise, and less error-prone developer experience. +The package includes a large and growing library of domain-agnostic functions +for advanced analytics and visualization with these data structures. + +Xarray was inspired by and borrows heavily from pandas_, the popular data +analysis package focused on labelled tabular data. +It is particularly tailored to working with netCDF_ files, which were the +source of xarray's data model, and integrates tightly with dask_ for parallel +computing. + +.. _NumPy: http://www.numpy.org/ +.. _pandas: http://pandas.pydata.org +.. _netCDF: http://www.unidata.ucar.edu/software/netcdf + +Why xarray? +----------- + Multi-dimensional (a.k.a. 
N-dimensional, ND) arrays (sometimes called "tensors") are an essential part of computational science. They are encountered in a wide range of fields, including physics, astronomy, @@ -48,25 +67,25 @@ they have labels which encode information about how the array values map to locations in space, time, etc. -By introducing *dimensions*, *coordinates*, and *attributes* on top of raw -NumPy-like arrays, xarray is able to understand these labels and use them to -provide a more intuitive, more concise, and less error-prone experience. -Xarray also provides a large and growing library of functions for advanced -analytics and visualization with these data structures. -Xarray was inspired by and borrows heavily from pandas_, the popular data -analysis package focused on labelled tabular data. -Xarray can read and write data from most common labeled ND-array storage -formats and is particularly tailored to working with netCDF_ files, which were -the source of xarray's data model. +Xarray doesn't just keep track of labels on arrays -- it uses them to provide a +powerful and concise interface. For example: -.. _NumPy: http://www.numpy.org/ -.. _pandas: http://pandas.pydata.org -.. _netCDF: http://www.unidata.ucar.edu/software/netcdf +- Apply operations over dimensions by name: ``x.sum('time')``. +- Select values by label instead of integer location: + ``x.loc['2014-01-01']`` or ``x.sel(time='2014-01-01')``. +- Mathematical operations (e.g., ``x - y``) vectorize across multiple + dimensions (array broadcasting) based on dimension names, not shape. +- Flexible split-apply-combine operations with groupby: + ``x.groupby('time.dayofyear').mean()``. +- Database like alignment based on coordinate labels that smoothly + handles missing values: ``x, y = xr.align(x, y, join='outer')``. +- Keep track of arbitrary metadata in the form of a Python dictionary: + ``x.attrs``. 
-Important links ---------------- +Learn more +---------- -- HTML documentation: http://xarray.pydata.org +- Documentation: http://xarray.pydata.org - Issue tracker: http://github.com/pydata/xarray/issues - Source code: http://github.com/pydata/xarray - SciPy2015 talk: https://www.youtube.com/watch?v=X0pAhJgySxk
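
The label-based operations listed in the "Why xarray?" bullets of this patch can be sketched as follows. This is a minimal illustration, assuming xarray, NumPy, and pandas are installed; the array ``x``, its ``station`` dimension, the ``time_offset`` array, and all values are hypothetical and chosen only for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 4 daily time steps at 3 made-up stations.
times = pd.date_range("2014-01-01", periods=4)
x = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "station"),
    coords={"time": times, "station": ["a", "b", "c"]},
    attrs={"units": "degC"},  # arbitrary metadata kept in a plain dict
)

# Apply an operation over a dimension by name instead of by axis number.
total = x.sum("time")

# Select by label rather than by integer position.
first_day = x.sel(time="2014-01-01")

# Broadcasting matches dimensions by name, not by shape: plain NumPy would
# need time_offset[:, np.newaxis] to combine shapes (4, 3) and (4,).
time_offset = xr.DataArray([100.0, 200.0, 300.0, 400.0], dims="time")
shifted = x + time_offset

# Split-apply-combine with groupby on a datetime component.
daily_mean = x.groupby("time.dayofyear").mean()

# Outer-join alignment: labels missing from one array are filled with NaN.
y = x.isel(time=slice(1, 3))
x_aligned, y_aligned = xr.align(x, y, join="outer")
```

After alignment, ``y_aligned`` spans all four time steps, with NaN on the two days that ``y`` did not cover, so the two arrays can be combined without manual index bookkeeping.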