Skip to content

Commit

Permalink
Expanded (and transposed) performance guide table (#961)
Browse files Browse the repository at this point in the history
  • Loading branch information
jbednar authored Nov 11, 2020
1 parent 1f54bcf commit 4be2e0e
Showing 1 changed file with 137 additions and 110 deletions.
247 changes: 137 additions & 110 deletions examples/user_guide/10_Performance.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@
"\n",
"Datashader performance will vary significantly depending on the library and specific data object type used to represent the data in Python, because different libraries and data objects have very different abilities to use the available processing power and memory. Moreover, different libraries and objects are appropriate for different types of data, due to how they organize and store the data internally as well as the operations they provide for working with the data. The various data container objects available from the supported libraries all fall into one of the following three types of data structures:\n",
"- **[Columnar (tabular) data](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html)**: Relational, table-like data consisting of arbitrarily many rows, each with data for a fixed number of columns. For example, if you track the location of a particular cell phone over time, each time sampled would be a row, and for each time there could be columns for the latitude and longitude for the location at that time.\n",
"- **[Multidimensional (nd-array) data](http://xarray.pydata.org/en/stable/why-xarray.html)**: Data laid out in _n_ dimensions, where _n_ is typically >1. For example, you might have the precipitation measured on a latitude and longitude grid covering the whole world, for every time at which precipitation was measured. Such data could be stored columnarly, but it would be very inefficient; instead it is stored as a three dimensional array of precipitation values, indexed with time, latitude, and longitude.\n",
"- **[n-D arrays (multidimensional data)](http://xarray.pydata.org/en/stable/why-xarray.html)**: Data laid out in _n_ dimensions, where _n_ is typically >1. For example, you might have the precipitation measured on a latitude and longitude grid covering the whole world, for every time at which precipitation was measured. Such data could be stored columnarly, but it would be very inefficient; instead it is stored as a three dimensional array of precipitation values, indexed with time, latitude, and longitude.\n",
"- **[Ragged arrays](https://en.wikipedia.org/wiki/Jagged_array)**: Relational/columnar data where the value of at least one column is a list of values that could vary in length for each row. For example, you may have a table with one row per US state and columns for population, land area, and the geographic shape of that state. Here the shape would be stored as a polygon consisting of an arbitrarily long list of latitude and longitude coordinates, which does not fit efficiently into a standard columnar data structure due to its ragged (variable length) nature.\n",
"\n",
"As you can see, all three examples include latitude and longitude values, but they are very different data structures that need to be stored differently for them to be processed efficiently. \n",
Expand All @@ -62,7 +62,7 @@
"\n",
"- **Single-core CPU**: All processing is done serially on a single core of a single CPU on one machine. This is the most common and straightforward implementation, but the slowest, as there are generally other processing resources available that could be used.\n",
"- **Multi-core CPU**: All processing is done on a single CPU, but using multiple threads running on multiple separate cores. This approach is able to make use of all of the CPU resources available on a given machine, but cannot use multiple machines.\n",
"- **Distributed CPU**: Processing is distributed across multiple cores that may be on multiple CPUs in a cluster or a set of cloud-based machines. This approach can be much faster than single-CPU approaches when many CPUs are available.\n",
"- **Distributed CPU**: Processing is distributed across multiple cores that may be on multiple CPUs in a cluster or a set of cloud-based machines. This approach can be much faster than single-CPU approaches when many CPUs are available. Distributed approaches also normally support multi-core usage, utilizing multiple cores on a single or on multiple CPUs.\n",
"- **GPU**: Processing is done not on the CPU but on a separate general-purpose graphics-processing unit (GP-GPU). The GPU typically has far more (though individually less powerful) cores available than a CPU does, and for highly parallelizable computations like those in Datashader a GPU can typically achieve much faster performance at a given price point than a CPU or distributed set of CPUs can. However, not all machines have a supported GPU, memory available on the GPU is often limited, and it takes special effort to support a GPU (e.g. to avoid expensive copying of data between CPU and GPU memory), and so not all CPU code has been rewritten appropriately. \n",
"- **Distributed GPUs**: When there are multiple GPUs available, processing can be distributed across all of the GPUs to further speed it up or to fit large problems into the larger total amount of GPU memory available across all of them.\n",
"\n",
Expand All @@ -73,140 +73,167 @@
"\n",
"Given all of these options, the data objects currently supported by Datashader are:\n",
"\n",
"- **[Pandas DataFrame](https://pandas.pydata.org)**: Basic single-core, single CPU, CPU-only, in-core, non-ragged columnar data support, typically lower performance than the other options where those are supported.\n",
"- **[Dask DataFrame (using Pandas DataFrame internally)](https://dask.org)**: Multi-core, multi-CPU, in-core or out-of-core, distributed CPU computation. \n",
"- **[cuDF](https://github.com/rapidsai/cudf)**: Single NVIDIA-hardware GPU DataFrame.\n",
"- **[Dask DataFrame (using cuDF DataFrame internally)](https://rapidsai.github.io/projects/cudf/en/0.10.0/10min.html)**: Distributed GPUs with a Dask API on multiple NVIDIA GPUs on the same or different machines.\n",
"- **[SpatialPandas DataFrame](https://github.com/holoviz/spatialpandas)**: Pandas DataFrame with support for ragged arrays and spatial indexing (efficient access of spatially distributed values), typically using one core of one CPU.\n",
"- **[Dask (using SpatialPandas DataFrame internally)](https://github.com/holoviz/spatialpandas)**: Distributed CPU processing, in-core or out-of-core, using Dask's DataFrame API built on SpatialPandas.\n",
"- **[Xarray+NumPy](http://xarray.pydata.org)**: Multidimensional data operation built on NumPy arrays.\n",
"- **[Xarray+DaskArray](https://dask.org/)**: Dask-based multidimensional array processing built on Dask arrays, with support for distributed (multi-CPU) operation.\n",
"- **[Xarray+CuPy](https://cupy.chainer.org)**: Multidimensional array processing built on CuPy arrays, with storage and processing on a single NVIDIA GPU.\n",
"\n",
"Datashader's current release supports these libraries for nearly all of the Canvas glyph types (points, lines, etc.) where they would naturally apply. Supported combinations of glyph and data library are listed in this table, where the entries mean:\n",
"\n",
"- **Yes**: Supported\n",
"- **No**: Not (yet) supported, but could be with sufficient development effort (feel free to contribute effort or funding!)\n",
"- **-**: Not supported because that combination is not normally appropriate or useful (e.g. columnar data libraries do not currently provide efficient multidimensional array support)\n",
"\n",
"<style type=\"text/css\">.arbit .trary a { color: inherit; }.arbit .trary\n",
".sL{text-align:center;padding:2px 2px 2px 2px;background-color:#ffffff;font-weight:bold;width:60px}.arbit .trary\n",
".sG{text-align:center;padding:2px 2px 2px 2px;background-color:#ffffff;font-weight:bold;font-family:monospace}.arbit .trary\n",
".sY{text-align:center;padding:2px 2px 2px 2px;background-color:#b7e1cd;}.arbit .trary \n",
".sN{text-align:center;padding:2px 2px 2px 2px;background-color:#f4c7c3;}.arbit .trary\n",
".sM{text-align:center;padding:2px 2px 2px 2px;background-color:#fce8b2;}.arbit .trary\n",
"</style>\n",
"\n",
"<div class=\"arbit\">\n",
"<table class=\"trary\" cellspacing=\"0\" cellpadding=\"0\">\n",
"<thead><tr>\n",
"<th class=\"sL\">Glyph</th>\n",
"<th class=\"sL\">PandasDF</th>\n",
"<th class=\"sL\">DaskDF + PandasDF</th>\n",
"<th class=\"sL\">cuDF</th>\n",
"<th class=\"sL\">DaskDF + cuDF</th>\n",
"<th class=\"sL\">SpatialPandasDF</th>\n",
"<th class=\"sL\">Dask + SpatialPandasDF</th>\n",
"<th class=\"sL\">Xarray + NumPy</th>\n",
"<th class=\"sL\">Xarray + DaskArray</th>\n",
"<th class=\"sL\">Xarray + CuPy</th>\n",
"</tr></thead>\n",
"<thead>\n",
"<tr>\n",
" <th class=\"sG\">Data object</th>\n",
" <th class=\"sG\">Structure</th>\n",
" <th class=\"sG\">Compute</th>\n",
" <th class=\"sG\">Memory</th>\n",
" <th class=\"sG\">Description</th>\n",
" <th class=\"sG\">points</th>\n",
" <th class=\"sG\">line</th>\n",
" <th class=\"sG\">area</th>\n",
" <th class=\"sG\">trimesh</th>\n",
" <th class=\"sG\">raster</th>\n",
" <th class=\"sG\">quadmesh</th>\n",
" <th class=\"sG\">polygons</th>\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"\n",
"<tr>\n",
"<td class=\"sG\">Canvas.points</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sN\">No</td>\n",
"<td class=\"sN\">No</td>\n",
"<td class=\"sN\">No</td>\n",
" <td class=\"sL\"><a href=\"https://pandas.pydata.org\">Pandas DF</a></td>\n",
" <td>columnar</td>\n",
" <td>1-core CPU</td>\n",
" <td>in-core</td>\n",
" <td>Standard dataframes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
"</tr>\n",
"\n",
"<tr>\n",
"<td class=\"sG\">Canvas.line</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sN\">No</td>\n",
"<td class=\"sN\">No</td>\n",
"<td class=\"sN\">No</td>\n",
" <td class=\"sL\"><a href=\"https://dask.org\">DaskDF + PandasDF</a></td>\n",
" <td>columnar</td>\n",
" <td>distributed CPU</td>\n",
" <td>out-of-core</td>\n",
" <td>Distributed DataFrames</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
"</tr>\n",
"\n",
"<tr>\n",
"<td class=\"sG\">Canvas.area</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sN\">No</td>\n",
"<td class=\"sN\">No</td>\n",
"<td class=\"sN\">No</td>\n",
" <td class=\"sL\"><a href=\"https://github.com/rapidsai/cudf\">cuDF</a></td>\n",
" <td>columnar</td>\n",
" <td>single GPU</td>\n",
" <td>in-core</td>\n",
" <td>NVIDIA GPU DataFrames</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
"</tr>\n",
"\n",
"<tr>\n",
"<td class=\"sG\">Canvas.trimesh</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sN\">No</td>\n",
"<td class=\"sN\">No</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sN\">No</td>\n",
"<td class=\"sN\">No</td>\n",
"<td class=\"sN\">No</td>\n",
" <td class=\"sL\"><a href=\"https://docs.rapids.ai/api/cudf/stable/10min.html\">DaskDF + cuDF</a></td>\n",
" <td>columnar</td>\n",
" <td>distributed GPU</td>\n",
" <td>out-of-core</td>\n",
" <td>Distributed NVIDIA GPUs</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
"</tr>\n",
"\n",
"<tr>\n",
"<td class=\"sG\">Canvas.raster</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sN\">No</td>\n",
" <td class=\"sL\"><a href=\"https://github.com/holoviz/spatialpandas\">SpatialPandasDF</a></td>\n",
" <td>ragged</td>\n",
" <td>1-core CPU</td>\n",
" <td>in-core</td>\n",
" <td>Ragged + spatial indexing</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sY\">Yes</td>\n",
"</tr>\n",
"\n",
"<tr>\n",
"<td class=\"sG\">Canvas.quadmesh</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
" <td class=\"sL\"><a href=\"https://github.com/holoviz/spatialpandas\">Dask + SpatialPandasDF</a></td>\n",
" <td>ragged</td>\n",
" <td>distributed CPU</td>\n",
" <td>out-of-core</td>\n",
" <td>Distributed ragged arrays</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sM\">-</td>\n",
" <td class=\"sY\">Yes</td>\n",
"</tr>\n",
"\n",
"<tr>\n",
"<td class=\"sG\">Canvas.polygons</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sY\">Yes</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
"<td class=\"sM\">-</td>\n",
" <td class=\"sL\"><a href=\"http://xarray.pydata.org\">Xarray + NumPy</a></td>\n",
" <td>n-D</td>\n",
" <td>1-core CPU</td>\n",
" <td>in-core</td>\n",
" <td>n-D CPU arrays</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sM\">-</td>\n",
"</tr>\n",
"<tr>\n",
" <td class=\"sL\"><a href=\"(https://dask.org\">Xarray+DaskArray</a></td>\n",
" <td>n-D</td>\n",
" <td>distributed CPU</td>\n",
" <td>out-of-core</td>\n",
" <td>Distributed n-D arrays</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sM\">-</td>\n",
"</tr>\n",
"<tr>\n",
" <td class=\"sL\"><a href=\"https://cupy.chainer.org\">Xarray+CuPy</a></td>\n",
" <td>n-D</td>\n",
" <td>single GPU</td>\n",
" <td>in-core</td>\n",
" <td>Single-GPU n-D arrays</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sN\">No</td>\n",
" <td class=\"sY\">Yes</td>\n",
" <td class=\"sM\">-</td>\n",
"</tr>\n",
"\n",
"</tbody></table></div>\n",
"\n",
"In general, all it takes to use the indicated data library for a particular glyph type is to instantiate a DataFrame (Pandas, Dask, cuDF, SpatialPandas) or DataArray/DataSet (Xarray), and then pass it to the appropriate `ds.Canvas` method call, as illustrated in the various examples in the user guide and topics. \n",
"The right half of this table shows which of Datashader's glyph types (`Canvas.points`, `Canvas.line`, etc.) are supported for that type of data object. Nearly all of glyphs are supported where they would naturally apply, listed using the key:\n",
"\n",
"- **Yes**: Supported\n",
"- **No**: Not (yet) supported, but could be with sufficient development effort (feel free to contribute effort or funding!)\n",
"- **-**: Not supported because that combination is not normally appropriate or useful (e.g. columnar data libraries do not currently provide efficient multidimensional array support)\n",
"\n",
"In general, all it takes to use the indicated data object for a particular glyph type is to instantiate a DataFrame (Pandas, Dask, cuDF, SpatialPandas) or DataArray/DataSet (Xarray), and then pass it to the appropriate `ds.Canvas` method call, as illustrated in the various examples in the user guide and topics. \n",
"\n",
"Be sure to consider whether a different `ds.Canvas` method might work for your data, if the obvious one does not support the features you want. For instance, the `Canvas.quadmesh` glyph for irregularly spaced raster-like data also accepts regular rasters as a special case, and it includes optimization for that special case so that it can process such data efficiently on CPUs and GPUs. So if the features provided by `quadmesh` are sufficient for your raster processing, you may be able to use that method to get faster results on GPUs, which are not currently supported by `Canvas.raster`."
]
Expand Down

0 comments on commit 4be2e0e

Please sign in to comment.