Adding a general interface for N-dimensional gridded data #542

philippjfr · 2016-03-07T22:01:29Z

In HoloViews we now have an interface to hold data in a columnar format. This provides a very powerful interface for some kinds of data; however, when exploring dense high-dimensional arrays it is wasteful because it expands the coordinates of the key dimensions. An alternative format, used by xarray, iris and in some limited ways pandas, stores the index values (or coordinates, as they are sometimes called) separately from the value dimension data, which is stored as an n-dimensional array.

Instead of storing the cartesian product of all key dimension values we store only the outer indices. Working in a columnar format is often easier because merging data or adding new or derived dimensions is straightforward; however, it is not only inefficient in terms of space but also considerably slower for various operations, particularly groupby, aggregation and reduce operations.
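To make the space and aggregation argument concrete, here is a minimal NumPy-only sketch. It does not use HoloViews; the array names and the 10 x 100 grid size are illustrative assumptions.

```python
import numpy as np

# Hypothetical 10 x 100 grid: 'x' and 'y' are key dimensions, 'z' is the value dimension.
x = np.arange(10)
y = np.arange(100)
z = np.random.rand(10, 100)

# Columnar layout: one row per sample, coordinates expanded to the full product.
xs, ys = np.meshgrid(x, y, indexing='ij')
columns = np.column_stack([xs.ravel(), ys.ravel(), z.ravel()])

# Gridded layout: only the outer coordinates plus the value array are stored.
gridded_items = x.size + y.size + z.size   # 10 + 100 + 1000
columnar_items = columns.size              # 3 * 1000

# Mean of z for each x: a single axis reduction on the grid ...
grid_mean = z.mean(axis=1)

# ... versus an explicit group-by over the columnar rows.
keys, inverse = np.unique(columns[:, 0], return_inverse=True)
col_mean = np.bincount(inverse, weights=columns[:, 2]) / np.bincount(inverse)
```

Both routes give the same per-x means, but the gridded version needs no grouping step and stores 1110 numbers instead of 3000.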

The proposal

The HoloViews Columns interface actually provides a very general interface to work with structured data and in theory it does not restrict the format of the data. In this notebook I will outline a suggestion to add additional interfaces for the Columns type, which work with N-D gridded data, from here on referred to as dense data. I will set out to show that not only is this format more efficient for various operations but the implementation is actually fairly simple and fits into our current system.

The datastructure

The current Columns interfaces already have different datastructures, which all fundamentally represent an array of shape Row x Column, where the rows represent the total number of samples and the columns the combined key and value dimensions. This is fundamentally no different to the COO (Coordinate) sparse matrix format (except the r, c indices are actually values). The new format would provide a dense equivalent and would differ from these existing implementations in the following ways:

The data would clearly distinguish between key dimensions, which provide the indices/coordinates to index into the value dimensions, and the value dimensions, which hold the actual data. Some example data would look like this:

hv.Table({'x': range(10), 'y': range(100), 'z': np.random.rand(10, 100)},
         kdims=['x', 'y'], vdims=['z'])

Here the x and y arrays provide the indexes along the first and second axis of the z-array. Using the current formats this would have to be specified as:

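# indexing='ij' keeps the (10, 100) axis order of z; each column is flattened to one row per sample
xs, ys = np.meshgrid(range(10), range(100), indexing='ij')
hv.Table({'x': xs.flatten(), 'y': ys.flatten(), 'z': np.random.rand(10, 100).flatten()}, kdims=['x', 'y'], vdims=['z'])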

Instead of storing the cartesian product as computed by meshgrid, the internal representation stores just the outer indices. The interface then expands these indices if required (which would generally be pretty rare).
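The on-demand expansion described here could look roughly like the following. This is a sketch only: expand_coords and its flat parameter are hypothetical names, not part of the proposed interface.

```python
import numpy as np

def expand_coords(coords, flat=True):
    """Expand per-axis coordinate arrays into their cartesian product.

    `coords` is a list of 1D arrays, one per key dimension, in axis order.
    (Hypothetical helper, for illustration only.)
    """
    # indexing='ij' keeps the expanded arrays aligned with the value array's axes.
    grids = np.meshgrid(*coords, indexing='ij')
    if flat:
        return [g.ravel() for g in grids]
    return grids

xs, ys = expand_coords([np.arange(10), np.arange(100)])
```

The stored data stays at 10 + 100 coordinate values; the 1000-element expanded arrays only exist while a caller needs them.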

There are two possible ways to represent value dimensions: either we can have one array for all value dimensions, which simply stacks the arrays, or we can expand the value dimensions out into separate arrays, i.e. multiple value dimensions could be specified like this:

hv.Table({'x': range(10), 'y': range(100), 'array': np.random.rand(10, 100, 2)}, kdims=['x', 'y'], vdims=['a', 'b'])

or like this:

hv.Table({'x': range(10), 'y': range(100), 'a': np.random.rand(10, 100), 'b': np.random.rand(10, 100)},
         kdims=['x', 'y'], vdims=['a', 'b'])

This comes down to whether the interface should support heterogeneous value dimension types. The current proposal works on the first suggestion but it would be trivial to automatically expand the first format into the second format and store the value dimensions separately internally.
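As a rough illustration of how cheap the conversion between the two layouts is, the sketch below splits a packed array along its last axis and stacks it back. The helper names unpack_vdims and pack_vdims are made up for this example.

```python
import numpy as np

def unpack_vdims(array, vdims):
    """Split a packed (..., n_vdims) array into one array per value dimension."""
    if array.shape[-1] != len(vdims):
        raise ValueError("last axis must have one entry per value dimension")
    return {vdim: array[..., i] for i, vdim in enumerate(vdims)}

def pack_vdims(data, vdims):
    """Inverse operation: stack equally shaped arrays along a new last axis."""
    return np.stack([data[vdim] for vdim in vdims], axis=-1)

packed = np.random.rand(10, 100, 2)
separate = unpack_vdims(packed, ['a', 'b'])
```

Note that packing forces all value dimensions into a single dtype, which is the heterogeneity limitation mentioned above; the unpacked form has no such restriction.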

Pros vs Cons

Pros:

More memory efficient
Considerably faster for gridded data
Columns and Raster types would share the same interface, the only other data structures required would be for Annotations and Paths.
Avoids having two completely separate implementations for gridded data and columnar data, which is important because even our current Chart types could sometimes benefit from a denser representation.
Provides a template for further interfaces based on xarray and iris Cubes.

Cons:

Requires some changes to existing interfaces to access both the expanded and compressed formats easily.
Columns becomes a misnomer (not a major obstacle since renaming baseclasses is easy).
Gridded data is more restricted than columnar data, so value dimension indexing and swapping key and value dimensions is not directly supported.

Obstacles/Problems

Need to establish clear interfaces to access the dimension values both in the expanded cartesian product format and in the dense format. I would propose that dimension_values accepts a product argument (or similar) that defaults to False, returning the full cartesian product by default. It would also support a flat argument, defaulting to True, to retain a consistent backward-compatible interface.
Consider how to deal with Raster and Image types where the key dimension values are implicit. My current suggestion would be either to generate the coordinates in the constructor or to introduce proxy objects, which lazily compute the indexes, e.g. to describe an Image you could do something like:

hv.Image({'x': BoundCoords(-0.5, 0.5, 100), 'y': BoundCoords(-0.5, 0.5, 100), 'z': np.random.rand(100, 100)})
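One way such a proxy object might work is sketched below. BoundCoords does not exist in HoloViews; the class body is an assumption, and in particular the choice of bin centres (rather than, say, evenly spaced end points) is a guess at the intended semantics.

```python
import numpy as np

class BoundCoords:
    """Hypothetical lazy coordinate proxy: stores bounds and a count, not the values."""

    def __init__(self, lower, upper, count):
        self.lower, self.upper, self.count = lower, upper, count

    def __len__(self):
        return self.count

    def __array__(self, dtype=None, copy=None):
        # Assumption: coordinates are the centres of `count` equal-width bins.
        step = (self.upper - self.lower) / self.count
        values = self.lower + step * (np.arange(self.count) + 0.5)
        return values if dtype is None else values.astype(dtype)

# The coordinate array is only materialised when NumPy asks for it.
xs = np.asarray(BoundCoords(-0.5, 0.5, 100))
```

With this design an Image would carry three numbers per axis until something actually needs the coordinate values.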

Gridded data would be more restricted than Columnar data, e.g. before slicing value dimensions or reindexing the Element it would first have to be converted to the expanded format.
Since the sparse and dense representations have potentially overlapping signatures, I believe it should be a parameter on the Columns object: you should explicitly declare that the data you are passing in is dense, and the dense backends will then try to parse that data.
We can have as_dense and as_sparse methods that convert between dense and sparse representations. The as_sparse representation is obviously very straightforward as it's just the cartesian product of the dense representation. The as_dense implementation requires that the data has been aggregated already. After that it's most easily implemented by combining the sparse columns with a cartesian product of the key dimensions inserting NaNs for all values, aggregating, sorting and reshaping.
Certain methods when applied to a dense Element return a sparse representation (currently only sample would do so).
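The two conversions can be sketched in plain NumPy as follows. The function names match the proposed as_sparse and as_dense methods but the bodies are assumptions, and the dense conversion uses a simpler technique than the one described above: instead of merging with a NaN-filled cartesian product and then aggregating, sorting and reshaping, it fills a NaN array by index. Like the description, it assumes the data has already been aggregated to at most one sample per key combination.

```python
import numpy as np

def as_sparse(coords, values):
    """Dense -> sparse: expand coordinates to their cartesian product."""
    grids = np.meshgrid(*coords, indexing='ij')
    return [g.ravel() for g in grids] + [values.ravel()]

def as_dense(key_columns, value_column):
    """Sparse -> dense, assuming at most one sample per key combination."""
    coords, indices = [], []
    for column in key_columns:
        # np.unique returns sorted coordinates plus each row's position among them.
        unique, inverse = np.unique(column, return_inverse=True)
        coords.append(unique)
        indices.append(inverse)
    dense = np.full([len(c) for c in coords], np.nan)  # NaN marks missing samples
    dense[tuple(indices)] = value_column
    return coords, dense

# Round trip on a small grid, with the sparse rows deliberately shuffled.
x, y, z = np.arange(3), np.arange(4), np.random.rand(3, 4)
xs, ys, zs = as_sparse([x, y], z)
order = np.random.permutation(zs.size)
coords, dense = as_dense([xs[order], ys[order]], zs[order])
```

Because the key columns are mapped to sorted unique coordinates, row order in the sparse input does not matter, and any key combination without a sample is left as NaN.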

To-do list:

Establish API for accessing data in column based and array based formats using dimension_values (interface proposed in Columns row/column based indexing API #541 postponed).
Complete the interface by deciding whether to support sorting or whether dense sorting should be fixed. (Require sorted key data in constructor for now)
Adapt HeatMap to support both formats (postponed)
Decide what to do about Raster and Image key dimensions values (postponed).
Decide whether to implement Histogram and QuadMesh using the interface (postponed).
Add unit tests
Update docstrings

Notebook with examples and profiling

jlstevens · 2016-03-07T23:11:42Z

I think this will be great!

Just to summarize the discussion I've just had with Philipp for future reference:

I think instead of product the new keyword argument in dimension_values could be called expanded or something similar.
I'm not so sure about the 'array' key for specifying vdims, mainly as numpy arrays do not allow heterogeneous types (ignoring record arrays or other fancier array implementations). We can come back to this bit later as it is optional but I can say I would prefer a key called 'vdims'.
I am happy with a new interface based on DictColumns which I would call GridColumns. I would want to make sure we always try the simple DictColumns interface first before GridColumns: this makes sure there is effectively one dictionary format to reason about even though there are actually two interface classes behind the scenes.
To keep the semantics easier to understand, the rule I would recommend is that DictColumns is used if all the array values have the same shape. Otherwise, you fall back to GridColumns as suggested in the previous point. This means if you want a Cartesian product for kdims only where all the arrays are the same shape, you would explicitly have to remove GridColumns from the interface list as a user.
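The "try the simple interface first, then fall back" rule from the last two points could be sketched like this. It is a standalone illustration, not the Columns dispatch code: choose_interface is a hypothetical function, and 'dictionary' and 'grid' are used here simply as labels for the DictColumns and GridColumns interfaces.

```python
import numpy as np

def choose_interface(data, kdims, vdims, interfaces=('dictionary', 'grid')):
    """Return the name of the first interface in `interfaces` that accepts `data`."""
    shapes = [np.asarray(data[d]).shape for d in kdims + vdims]
    for name in interfaces:
        # Columnar rule: every array has the same shape.
        if name == 'dictionary' and all(s == shapes[0] for s in shapes):
            return name
        # Gridded rule: each value array has one axis per key dimension.
        if name == 'grid':
            expected = tuple(len(data[k]) for k in kdims)
            if all(np.asarray(data[v]).shape == expected for v in vdims):
                return name
    raise ValueError("no interface accepts this data")

gridded = {'x': range(10), 'y': range(100), 'z': np.random.rand(10, 100)}
columnar = {'x': np.zeros(1000), 'y': np.zeros(1000), 'z': np.zeros(1000)}
```

Here choose_interface(gridded, ['x', 'y'], ['z']) falls through to the grid rule, while the equally shaped columnar arrays are claimed by the first interface in the list.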

Anyway, these are my general comments for now. I'm happy to review the code in more detail once you think it is ready.

jbednar · 2016-03-09T02:55:38Z

I'm very happy with this proposal. I think it really will help HoloViews work well in a broad range of other applications, and is worth taking the effort to work on now.

philippjfr · 2016-03-10T15:18:30Z

Okay as far as I can tell I'm now done with this PR. @jlstevens said he'd go through and document the class and methods so he can get a better idea about the implementation. So I'll wait on that to make any more changes, as I'm sure he'll find some further issues.

philippjfr · 2016-03-10T16:12:12Z

Note that a lot of the work to allow Raster, Image, Histogram and QuadMesh types to use dense interfaces has been postponed and is not part of this PR. Hopefully for version 1.5 we can unify all these types together leaving only Path and Annotation types with custom data formats.

jlstevens · 2016-03-14T11:07:54Z

I'm going to go through this PR carefully now, making sure I understand it, making comments and updating docstrings as necessary. Then once those issues are addressed I think it can be merged.

jlstevens · 2016-03-14T11:11:51Z

holoviews/core/data.py

@@ -469,6 +464,11 @@ def validate(cls, columns):


    @classmethod
+    def check_dense(cls, arrays):
+        return any(array.shape not in [arrays[0].shape, (1,)] for array in arrays[1:])


If I understand this code correctly check_compressed might be a better name...

Edit: How about inverting it and calling it expanded_format?
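To see what the check in the diff returns for each layout, here it is as a plain function next to the inverted form suggested above. The expanded_format body is an assumption about what "inverting it" would mean, not the code that was eventually committed.

```python
import numpy as np

def check_dense(arrays):
    # Same expression as in the diff above, written as a plain function.
    return any(array.shape not in [arrays[0].shape, (1,)] for array in arrays[1:])

def expanded_format(arrays):
    # Inverted form: True when every array matches the first one's shape
    # (or is a length-1 scalar column).
    return not check_dense(arrays)

columnar = [np.zeros(1000), np.zeros(1000), np.zeros(1000)]
gridded = [np.arange(10), np.arange(100), np.random.rand(10, 100)]
with_scalar = [np.zeros(1000), np.zeros(1)]
```

The columnar and scalar-column inputs count as expanded, while the gridded input, whose arrays differ in shape, does not.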

jlstevens · 2016-03-14T12:28:11Z

I think I'm done making comments for now and I only have a few docstrings to update as the API is consistent with what we had before. There are a few name changes and most of what I suggest should be quite quick to fix. Once these fixes are done, I'll run flakes and update the docstrings I mentioned.

philippjfr · 2016-03-14T16:04:45Z

Okay, I've gone through and made all the fixes you suggested and tests should pass in a minute. If you could go through it and add docstrings then I think this is ready to merge. The only other thing we should decide is whether to add 'grid' to the datatype list by default.

jlstevens · 2016-03-14T17:51:03Z

I've had a go updating the class docstring for GridColumns and I've made a few other small changes. In the end, I've decided not to do all the docstrings now as all the methods on DataColumns should be given docstrings and there are a lot of them! DataColumns is what defines the API so that should go in a separate PR.

Once the PR tests pass, I'm happy to merge.

jlstevens · 2016-03-14T19:01:31Z

Ok, the PR build is passing. Time to merge!


philippjfr added 3 commits March 4, 2016 14:38

Added validation method to DictColumns interface

6236444

Fix for scalar columns in DictColumns

f00c306

Added initial dense Columns interface

9589707

philippjfr added 3 commits March 8, 2016 11:32

Consistently added expanded keyword to dimension_values method

0645535

Renamed NdArrayColumns to GridColumns

ead988f

Cleanup and minor fixes in core.data module

f26d95f

philippjfr mentioned this pull request Mar 8, 2016

Added initial prototype of holocube package and notebook CubeBrowser/cube-explorer#5

Merged

5 tasks

philippjfr added 4 commits March 8, 2016 19:48

Added validation to detect dense formats in existing interfaces

03c1e22

Added default arguments to interface values method

81bb78f

Updated Column interface unit test

c5f4f61

Reverted change to DictColumns interface

3b5cada

philippjfr mentioned this pull request Mar 8, 2016

Implement data interface for Raster types #546

Open

2 tasks

Small fixes for NdElement interface

631e91c

philippjfr added 2 commits March 9, 2016 21:52

Changed GridColumns format to expand vdims

49e729a

Improved GridColumns validation

7d5dc26

philippjfr force-pushed the dense_interface branch from a5d134b to 7d5dc26 Compare March 9, 2016 23:41

philippjfr added 5 commits March 10, 2016 14:59

Fixed scalar return values from GridColumn slicing

232b372

Ensured GridColumns aggregate returns at least 1D array

d6e3440

Implemented GridColumns add_dimension and sort methods

e30bc53

Added unit tests for GridColumns interface

46c5e8a

Added missing import in core.util

034e659

philippjfr added this to the 1.4.4 milestone Mar 10, 2016

philippjfr added the "type: feature" (A major new feature) and "tag: API" labels Mar 10, 2016

jlstevens reviewed Mar 14, 2016

philippjfr added 15 commits March 14, 2016 13:30

Renamed Columns interface reshape method to init

84a3667

Removed stray GridColumns.add_dimension method

5780d1d

Renamed check_dense to expanded_format and improved validation

4d0e547

Renamed GridColumns coord_mask to key_select_mask

a38409b

Added comment for Image dimension_values method

487e667

Enforced samples have uniform length on GridColumns

5468271

Allowed returning non-flat key dimensions from gridded Elements

626dd07

Allowed dropping constant dimensions via GridColumns.reindex

b664343

Improved error message on GridColumns.sort

fb8a2f2

Disabled support for expanding vdims in GridColumns

0af0d1e

Small fixes for NdColumns and DFColumns constructors

4ff50a4

Updated GridColumns unit test

675dd4c

Updated GridColumns value slicing exception

b853d12

Fixed ArrayColumns init bug

0ffcbf2

Fixed inverted Image.dimension_values

c509a07

philippjfr force-pushed the dense_interface branch from 17afc6d to c509a07 Compare March 14, 2016 15:52

Renamed expanded_format method to expanded

2a640c6

jlstevens mentioned this pull request Mar 14, 2016

Support packed value dimensions in Grid interfaces #550

Closed

jlstevens added 2 commits March 14, 2016 17:43

Updated the class docstring for GridColumns

c109675

Added 'grid' interface to default datatype list

1b8d27a

jlstevens added a commit that referenced this pull request Mar 14, 2016

Merge pull request #542 from ioam/dense_interface

d4779cb

Adding a general interface for N-dimensional gridded data

jlstevens merged commit d4779cb into master Mar 14, 2016

philippjfr mentioned this pull request Mar 18, 2016

Various small fixes for GridColumns #559

Merged

philippjfr deleted the dense_interface branch April 1, 2016 14:27

philippjfr modified the milestones: v1.5.0, 1.4.4 Apr 20, 2016
