DOC: refresh "Why xarray" and shorten top-level description #2657

Merged: 4 commits, Jan 11, 2019 (this view shows changes from 2 commits)

84 changes: 25 additions & 59 deletions README.rst
@@ -9,49 +9,47 @@ xarray: N-D labeled arrays and datasets
:target: https://coveralls.io/r/pydata/xarray
.. image:: https://readthedocs.org/projects/xray/badge/?version=latest
:target: http://xarray.pydata.org/
.. image:: https://img.shields.io/pypi/v/xarray.svg
:target: https://pypi.python.org/pypi/xarray/
.. image:: https://zenodo.org/badge/13221727.svg
:target: https://zenodo.org/badge/latestdoi/13221727
.. image:: http://img.shields.io/badge/benchmarked%20by-asv-green.svg?style=flat
:target: http://pandas.pydata.org/speed/xarray/
.. image:: https://img.shields.io/badge/powered%20by-NumFOCUS-orange.svg?style=flat&colorA=E1523D&colorB=007D8A
:target: http://numfocus.org
.. image:: https://img.shields.io/pypi/v/xarray.svg
:target: https://pypi.python.org/pypi/xarray/

**xarray** (formerly **xray**) is an open source project and Python package
that makes working with labelled multi-dimensional arrays simple,
efficient, and fun!

Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
"tensors") are an essential part of computational science.
They are encountered in a wide range of fields, including physics, astronomy,
geoscience, bioinformatics, engineering, finance, and deep learning.
In Python, NumPy_ provides the fundamental data structure and API for
working with raw ND arrays.
However, real-world datasets are usually more than just raw numbers;
they have labels which encode information about how the array values map
to locations in space, time, etc.
Xarray introduces labels in the form of dimensions, coordinates and
attributes on top of raw NumPy_-like arrays, which allows for a more
intuitive, more concise, and less error-prone developer experience.
The package includes a large and growing library of domain-agnostic functions
for advanced analytics and visualization with these data structures.

By introducing *dimensions*, *coordinates*, and *attributes* on top of raw
NumPy-like arrays, xarray is able to understand these labels and use them to
provide a more intuitive, more concise, and less error-prone experience.
Xarray also provides a large and growing library of functions for advanced
analytics and visualization with these data structures.
Xarray was inspired by and borrows heavily from pandas_, the popular data
analysis package focused on labelled tabular data.
Xarray can read and write data from most common labeled ND-array storage
formats and is particularly tailored to working with netCDF_ files, which were
the source of xarray's data model.
It is particularly tailored to working with netCDF_ files, which were the
source of xarray's data model, and integrates tightly with dask_ for parallel
computing.

.. _NumPy: http://www.numpy.org/
.. _NumPy: http://www.numpy.org
.. _pandas: http://pandas.pydata.org
.. _dask: http://dask.org
.. _netCDF: http://www.unidata.ucar.edu/software/netcdf

Why xarray?
-----------

Adding dimension names and coordinate indexes to numpy's ndarray_ makes many
powerful array operations possible:
Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
"tensors") are an essential part of computational science.
They are encountered in a wide range of fields, including physics, astronomy,
geoscience, bioinformatics, engineering, finance, and deep learning.
In Python, NumPy_ provides the fundamental data structure and API for
working with raw ND arrays.
However, real-world datasets are usually more than just raw numbers;
they have labels which encode information about how the array values map
to locations in space, time, etc.

Xarray doesn't just keep track of labels on arrays: it uses them to provide a
powerful and concise interface. For example:

- Apply operations over dimensions by name: ``x.sum('time')``.
- Select values by label instead of integer location:
@@ -65,42 +63,10 @@ powerful array operations possible:
- Keep track of arbitrary metadata in the form of a Python dictionary:
``x.attrs``.

pandas_ provides many of these features, but it does not make use of dimension
names, and its core data structures are fixed dimensional arrays.

Why isn't pandas enough?
------------------------

pandas_ excels at working with tabular data. That suffices for many statistical
analyses, but physical scientists rely on N-dimensional arrays -- which is
where xarray comes in.

xarray aims to provide a data analysis toolkit as powerful as pandas_ but
designed for working with homogeneous N-dimensional arrays
instead of tabular data. When possible, we copy the pandas API and rely on
pandas's highly optimized internals (in particular, for fast indexing).

Why netCDF?
-----------

Because xarray implements the same data model as the netCDF_ file format,
xarray datasets have a natural and portable serialization format. But it is also
easy to robustly convert an xarray ``DataArray`` to and from a numpy ``ndarray``
or a pandas ``DataFrame`` or ``Series``, providing compatibility with the full
`PyData ecosystem <http://pydata.org/>`__.

Our target audience is anyone who needs N-dimensional labeled arrays, but we
are particularly focused on the data analysis needs of physical scientists --
especially geoscientists who already know and love netCDF_.

.. _ndarray: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html
.. _pandas: http://pandas.pydata.org
.. _netCDF: http://www.unidata.ucar.edu/software/netcdf

Documentation
-------------

The official documentation is hosted on ReadTheDocs at http://xarray.pydata.org/
Learn more about xarray in its official documentation at http://xarray.pydata.org/

Contributing
------------
9 changes: 5 additions & 4 deletions doc/faq.rst
@@ -39,9 +39,8 @@ if you were using Panels:
xarray ``Dataset``.

You can :ref:`read about switching from Panels to Xarray here <panel transition>`.
Pandas gets a lot of things right, but scientific users need fully multi-
dimensional data structures.

Pandas gets a lot of things right, but scientific users (and many others) need
fully multi-dimensional data structures.

Review comment (Contributor): Maybe a good place to explicitly mention deep learning libraries?


How do xarray data structures differ from those found in pandas?
----------------------------------------------------------------
@@ -65,7 +64,9 @@ multi-dimensional data-structures.

That said, you should only bother with xarray if some aspect of data is
fundamentally multi-dimensional. If your data is unstructured or
one-dimensional, stick with pandas.
one-dimensional, pandas is usually the right choice: it has better performance
for common operations such as ``groupby`` and you'll find far more usage
examples online.


Why don't aggregations return Python scalars?
30 changes: 11 additions & 19 deletions doc/index.rst
@@ -5,29 +5,21 @@ xarray: N-D labeled arrays and datasets in Python
that makes working with labelled multi-dimensional arrays simple,
efficient, and fun!

Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
"tensors") are an essential part of computational science.
They are encountered in a wide range of fields, including physics, astronomy,
geoscience, bioinformatics, engineering, finance, and deep learning.
In Python, NumPy_ provides the fundamental data structure and API for
working with raw ND arrays.
However, real-world datasets are usually more than just raw numbers;
they have labels which encode information about how the array values map
to locations in space, time, etc.

By introducing *dimensions*, *coordinates*, and *attributes* on top of raw
NumPy-like arrays, xarray is able to understand these labels and use them to
provide a more intuitive, more concise, and less error-prone experience.
Xarray also provides a large and growing library of functions for advanced
analytics and visualization with these data structures.
Xarray introduces labels in the form of dimensions, coordinates and
attributes on top of raw NumPy_-like arrays, which allows for a more
intuitive, more concise, and less error-prone developer experience.
The package includes a large and growing library of domain-agnostic functions
for advanced analytics and visualization with these data structures.

Xarray was inspired by and borrows heavily from pandas_, the popular data
analysis package focused on labelled tabular data.
Xarray can read and write data from most common labeled ND-array storage
formats and is particularly tailored to working with netCDF_ files, which were
the source of xarray's data model.
It is particularly tailored to working with netCDF_ files, which were the
source of xarray's data model, and integrates tightly with dask_ for parallel
computing.

.. _NumPy: http://www.numpy.org/
.. _NumPy: http://www.numpy.org
.. _pandas: http://pandas.pydata.org
.. _dask: http://dask.org
.. _netCDF: http://www.unidata.ucar.edu/software/netcdf

Documentation
18 changes: 12 additions & 6 deletions doc/related-projects.rst
@@ -3,7 +3,7 @@
Xarray related projects
-----------------------

Here below is a list of several existing libraries that build
Below is a list of existing open source projects that build
functionality upon xarray. See also section :ref:`internals` for more
details on how to build xarray extensions.

@@ -39,11 +39,16 @@ Geosciences

Machine Learning
~~~~~~~~~~~~~~~~
- `cesium <http://cesium-ml.org/>`_: machine learning for time series analysis
- `ArviZ <https://arviz-devs.github.io/arviz/>`_: Exploratory analysis of Bayesian models, built on top of xarray.
- `Elm <https://ensemble-learning-models.readthedocs.io>`_: Parallel machine learning on xarray data structures
- `sklearn-xarray (1) <https://phausamann.github.io/sklearn-xarray>`_: Combines scikit-learn and xarray (1).
- `sklearn-xarray (2) <https://sklearn-xarray.readthedocs.io/en/latest/>`_: Combines scikit-learn and xarray (2).

Other domains
~~~~~~~~~~~~~
- `ptsa <https://pennmem.github.io/ptsa_new/html/index.html>`_: EEG Time Series Analysis
- `pycalphad <https://pycalphad.org/docs/latest/>`_: Computational Thermodynamics in Python

Extend xarray capabilities
~~~~~~~~~~~~~~~~~~~~~~~~~~
- `Collocate <https://github.com/cistools/collocate>`_: Collocate xarray trajectories in arbitrary physical dimensions
@@ -61,9 +66,10 @@ Visualization
- `hvplot <https://hvplot.pyviz.org/>`_ : A high-level plotting API for the PyData ecosystem built on HoloViews.
- `psyplot <https://psyplot.readthedocs.io>`_: Interactive data visualization with python.

Other
~~~~~
- `ptsa <https://pennmem.github.io/ptsa_new/html/index.html>`_: EEG Time Series Analysis
- `pycalphad <https://pycalphad.org/docs/latest/>`_: Computational Thermodynamics in Python
Non-Python projects
~~~~~~~~~~~~~~~~~~~
- `xframe <https://github.com/QuantStack/xframe>`_: C++ data structures inspired by xarray.
- `AxisArrays <https://github.com/JuliaArrays/AxisArrays.jl>`_ and
`NamedArrays <https://github.com/davidavdav/NamedArrays.jl>`_: similar data structures for Julia.

More projects can be found at the `"xarray" GitHub topic <https://github.com/topics/xarray>`_.
76 changes: 48 additions & 28 deletions doc/why-xarray.rst
@@ -1,11 +1,21 @@
Overview: Why xarray?
=====================

Features
--------

Adding dimension names and coordinate indexes to numpy's ndarray_ makes many
powerful array operations possible:
What labels enable
------------------

Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
"tensors") are an essential part of computational science.
They are encountered in a wide range of fields, including physics, astronomy,
geoscience, bioinformatics, engineering, finance, and deep learning.
In Python, NumPy_ provides the fundamental data structure and API for
working with raw ND arrays.
However, real-world datasets are usually more than just raw numbers;
they have labels which encode information about how the array values map
to locations in space, time, etc.

Review comment (Contributor) on the sentence below: nit: use a dash instead of a colon

Xarray doesn't just keep track of labels on arrays: it uses them to provide a
powerful and concise interface. For example:

- Apply operations over dimensions by name: ``x.sum('time')``.
- Select values by label instead of integer location:
@@ -19,20 +29,22 @@ powerful array operations possible:
- Keep track of arbitrary metadata in the form of a Python dictionary:
``x.attrs``.
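
The operations listed above can be sketched on a small array. This is an illustrative example, not taken from the xarray documentation: the array contents, the coordinate values, and the ``units`` attribute are all made up for the demonstration.

```python
import numpy as np
import xarray as xr

# Hypothetical 2-D array: 3 time steps by 2 locations (values are made up).
x = xr.DataArray(
    np.arange(6.0).reshape(3, 2),
    dims=("time", "space"),
    coords={"time": [2000, 2001, 2002], "space": ["IA", "IL"]},
    attrs={"units": "degC"},
)

# Reduce over a dimension by name rather than by axis number.
total = x.sum("time")

# Select by coordinate label rather than by integer position.
row = x.sel(time=2001)

# Arbitrary metadata lives in a plain Python dictionary.
units = x.attrs["units"]
```

Here ``total`` keeps only the ``space`` dimension, and ``row`` is the slice whose ``time`` label equals 2001, whatever its integer position happens to be.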

pandas_ provides many of these features, but it does not make use of dimension
names, and its core data structures are fixed dimensional arrays.

The N-dimensional nature of xarray's data structures makes it suitable for dealing
with multi-dimensional scientific data, and its use of dimension names
instead of axis numbers (``dim='time'`` instead of ``axis=0``) makes such
arrays much more manageable than the raw numpy ndarray: with xarray, you don't
need to keep track of the order of an array's dimensions or insert dummy dimensions
(e.g., ``np.newaxis``) to align arrays.
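
As a rough sketch of the difference, the example below multiplies two 1-D arrays that have different dimensions. The array values and the dimension names ``time`` and ``space`` are arbitrary choices for illustration.

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.array([1.0, 2.0, 3.0]), dims=("time",))
b = xr.DataArray(np.array([10.0, 20.0]), dims=("space",))

# xarray broadcasts by dimension name, so no dummy axes are needed.
product = a * b

# The plain NumPy equivalent needs an explicit dummy axis, and the reader
# has to remember which axis is which.
product_np = a.values[:, np.newaxis] * b.values

# Reductions name the dimension instead of counting axes.
mean_over_time = product.mean(dim="time")
```

Both products hold the same numbers, but the xarray result also records that its dimensions are ``("time", "space")``.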

The immediate payoff of using xarray is that you'll write less code. The
long-term payoff is that you'll understand what you were thinking when you come
back to look at it weeks or months later.

Core data structures
--------------------

xarray has two core data structures. Both are fundamentally N-dimensional:
xarray has two core data structures, which build upon and extend the core
strengths of NumPy_ and pandas_. Both are fundamentally N-dimensional:

- :py:class:`~xarray.DataArray` is our implementation of a labeled, N-dimensional
array. It is an N-D generalization of a :py:class:`pandas.Series`. The name
@@ -43,8 +55,6 @@ xarray has two core data structures. Both are fundamentally N-dimensional:
shared dimensions, and serves a similar purpose in xarray to the
:py:class:`pandas.DataFrame`.
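
A minimal sketch of how the two structures relate, assuming made-up variable names (``temperature``, ``pressure``) and coordinate values: a ``Dataset`` holds several arrays that share dimensions, and each variable is available as a ``DataArray``.

```python
import numpy as np
import xarray as xr

# Two hypothetical variables that share the "time" dimension.
ds = xr.Dataset(
    {
        "temperature": (("time", "space"), np.zeros((3, 2))),
        "pressure": (("time",), np.array([1000.0, 1001.0, 1002.0])),
    },
    coords={"time": [0, 1, 2], "space": ["a", "b"]},
)

# Pulling out one variable gives a DataArray.
temperature = ds["temperature"]

# Label-based selection applies to every variable that has the dimension.
subset = ds.sel(time=1)
```

Selecting ``time=1`` on the dataset slices both variables at once, which is the kind of bookkeeping that would otherwise be repeated for each array.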

.. _datarray: https://github.com/fperez/datarray

The value of attaching labels to numpy's :py:class:`numpy.ndarray` may be
fairly obvious, but the dataset may need more motivation.

@@ -69,23 +79,33 @@ metadata once, not every time you save a file.
Goals and aspirations
---------------------

pandas_ excels at working with tabular data. That suffices for many statistical
analyses, but physical scientists rely on N-dimensional arrays -- which is
where xarray comes in.
Xarray contributes domain-agnostic data structures and tools for labeled
multi-dimensional arrays to Python's SciPy_ ecosystem for numerical computing.
In particular, xarray builds upon and integrates with NumPy_ and pandas_:

- Our user-facing interfaces aim to be more explicit versions of those found in
NumPy/pandas.
- Compatibility with the broader ecosystem is a major goal: it should be easy
to get your data in and out.
- We try to keep a tight focus on functionality and interfaces related to
labeled data, and leverage other Python libraries for everything else, e.g.,
NumPy/pandas for fast arrays/indexing (xarray itself contains no compiled
code), Dask_ for parallel computing, matplotlib_ for plotting, etc.
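
To illustrate getting data in and out, here is a small round trip between pandas, xarray, and NumPy. The series contents and the names ``signal`` and ``x`` are invented for the example; the conversion methods used are ``DataArray.from_series``, ``DataArray.to_series``, and ``DataArray.to_dataframe``.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A hypothetical labelled pandas Series.
series = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.Index([10, 20, 30], name="x"),
    name="signal",
)

# pandas -> xarray: the index name becomes the dimension name.
arr = xr.DataArray.from_series(series)

# xarray -> NumPy: the underlying array, without labels.
raw = arr.values

# xarray -> pandas: back to a Series or a DataFrame.
round_trip = arr.to_series()
frame = arr.to_dataframe()
```

The labels survive the trip through xarray, so ``round_trip`` carries the same index as the original series.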

Xarray is a collaborative and community-driven project, run entirely on
volunteer effort (see :ref:`contributing`).
Our target audience is anyone who needs N-dimensional labeled arrays in Python.
Originally, development was driven by the data analysis needs of physical
scientists (especially geoscientists who already know and love
netCDF_), but it has become a much more broadly useful tool, and is still
under active development.
See our technical :ref:`roadmap` for more details, and feel free to reach out
with questions about whether xarray is the right tool for your needs.

xarray aims to provide a data analysis toolkit as powerful as pandas_ but
designed for working with homogeneous N-dimensional arrays
instead of tabular data. When possible, we copy the pandas API and rely on
pandas's highly optimized internals (in particular, for fast indexing).

Importantly, xarray has robust support for converting its objects to and
from a numpy ``ndarray`` or a pandas ``DataFrame`` or ``Series``, providing
compatibility with the full `PyData ecosystem <http://pydata.org/>`__.

Our target audience is anyone who needs N-dimensional labeled arrays, but we
are particularly focused on the data analysis needs of physical scientists --
especially geoscientists who already know and love netCDF_.

.. _ndarray: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html
.. _datarray: https://github.com/fperez/datarray
.. _Dask: http://dask.org
.. _matplotlib: http://matplotlib.org
.. _netCDF: http://www.unidata.ucar.edu/software/netcdf
.. _NumPy: http://www.numpy.org
.. _pandas: http://pandas.pydata.org
.. _SciPy: http://www.scipy.org