Adding a general interface for N-dimensional gridded data #542

Merged
merged 36 commits into from
Mar 14, 2016

philippjfr
Member

In HoloViews we now have an interface to hold data in a columnar format. This provides a very powerful interface for some kinds of data; however, when exploring dense high-dimensional arrays it is wasteful because it expands the coordinates of the key dimensions. An alternative format, used by xarray, iris and in some limited ways pandas, stores the index values (or coordinates, as they are sometimes called) separately from the value dimension data, which is stored as an n-dimensional array.

Instead of storing the cartesian product of all key dimension values we store only the outer indices. Working in a columnar format is often easier, because merging data or adding new or derived dimensions is straightforward. However, it is not only inefficient in terms of space but also considerably slower for various operations, particularly groupby, aggregation and reduce operations.
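
To make the space argument concrete, here is a minimal sketch in plain NumPy (not HoloViews code) comparing the two layouts for a hypothetical 50 x 60 x 70 grid with a single value dimension. The grid size and variable names are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical 3D grid with one value dimension.
shape = (50, 60, 70)
coords = [np.arange(n, dtype=np.float64) for n in shape]
values = np.random.rand(*shape)

# Dense (gridded) layout: one 1D coordinate array per key dimension
# plus the N-D value array.
dense_bytes = sum(c.nbytes for c in coords) + values.nbytes

# Columnar layout: each key dimension is expanded to one entry per sample,
# i.e. the full cartesian product of the coordinates is stored.
expanded = [g.ravel() for g in np.meshgrid(*coords, indexing='ij')]
columnar_bytes = sum(c.nbytes for c in expanded) + values.ravel().nbytes

print(dense_bytes, columnar_bytes)
```

With N key dimensions the columnar layout stores roughly N + 1 full-size arrays, while the dense layout stores one full-size array plus N short coordinate vectors.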

The proposal

The HoloViews Columns interface actually provides a very general interface to work with structured data, and in theory it does not restrict the format of the data. In this notebook I will outline a suggestion to add additional interfaces for the Columns type which work with N-D gridded data, from here on referred to as dense data. I will set out to show that not only is this format more efficient for various operations, but the implementation is actually fairly simple and fits into our current system.

The datastructure

The current Columns interfaces already have different datastructures, which all fundamentally represent an array of the shape Row x Column, where the rows represent the total number of samples and the columns the combined key and value dimensions. This is fundamentally no different from the COO (Coordinate) sparse matrix format (except that the r, c indices are actually values). The new format would provide a dense equivalent and would differ from these existing implementations in the following ways:

  1. The data would clearly distinguish between key dimensions, which provide the indices/coordinates to index into the value dimensions, and the value dimensions themselves, which hold the actual data. Some example data would look like this:
hv.Table({'x': range(10), 'y': range(100), 'z': np.random.rand(10, 100)},
         kdims=['x', 'y'], vdims=['z'])

Here the x and y arrays provide the indexes along the first and second axis of the z-array. Using the current formats this would have to be specified as:

xs, ys = np.meshgrid(range(10), range(100), indexing='ij')  # 'ij' so xs and ys match the (10, 100) shape of z
hv.Table({'x': xs, 'y': ys, 'z': np.random.rand(10, 100)}, kdims=['x', 'y'], vdims=['z'])

Instead of storing the cartesian product as computed by meshgrid, the internal representation stores just the outer indices. The interface then expands these indices if required (which would generally be pretty rare).
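
As a rough illustration of what "expanding the indices if required" could look like, here is a NumPy sketch. The helper name `expand_coords` is hypothetical and is not part of the HoloViews API.

```python
import numpy as np

def expand_coords(coords):
    """Expand 1D coordinate arrays into flat columns covering the
    cartesian product, ordered to match a C-ordered ravel of the values."""
    grids = np.meshgrid(*coords, indexing='ij')
    return [g.ravel() for g in grids]

x = np.arange(3)        # coordinates along the first axis of z
y = np.arange(4) * 10   # coordinates along the second axis of z
z = np.random.rand(3, 4)

xs, ys = expand_coords([x, y])
zs = z.ravel()
# Row k of (xs, ys, zs) now describes one sample: z[i, j] at (x[i], y[j]).
```

Using `indexing='ij'` keeps the expanded coordinate order aligned with the axis order of the value array, so no transpose is needed.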

  2. There are two possible ways to represent value dimensions: either we can have one array for all value dimensions, which simply stacks the arrays, or we can expand the value dimensions out into separate arrays, i.e. multiple value dimensions could be specified like this:
hv.Table({'x': range(10), 'y': range(100), 'array': np.random.rand(10, 100, 2)}, kdims=['x', 'y'], vdims=['a', 'b'])

or like this:

hv.Table({'x': range(10), 'y': range(100), 'a': np.random.rand(10, 100), 'b': np.random.rand(10, 100)},
         kdims=['x', 'y'], vdims=['a', 'b'])

This comes down to whether the interface should support heterogeneous value dimension types. The current proposal works on the first suggestion but it would be trivial to automatically expand the first format into the second format and store the value dimensions separately internally.
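
The conversion between the two layouts is a short operation in NumPy. The following is only a sketch of the idea, assuming the value dimensions are stacked along the last axis as in the first example above; it is not the code from this PR.

```python
import numpy as np

vdims = ['a', 'b']
stacked = np.random.rand(10, 100, 2)  # one array holding all value dimensions

# Expand the stacked array into one array per value dimension.
separate = {name: stacked[..., i] for i, name in enumerate(vdims)}

# The inverse: stack the separate arrays along a new last axis.
restacked = np.stack([separate[name] for name in vdims], axis=-1)
```

Storing the arrays separately is what would allow heterogeneous types, since each value dimension can then carry its own dtype.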

Pros vs Cons

Pros:

  • More memory efficient
  • Considerably faster for gridded data
  • Columns and Raster types would share the same interface; the only other data structures required would be for Annotations and Paths.
  • Avoids having two completely separate implementations for gridded data and columnar data, which is important because even our current Chart types could sometimes benefit from a denser representation.
  • Provides a template for further interfaces based on xarray and iris Cubes.

Cons:

  • Requires some changes to existing interfaces to access both the expanded and compressed formats easily.
  • Columns becomes a misnomer (not a major obstacle since renaming baseclasses is easy).
  • Gridded data is more restricted than columnar data, so value dimension indexing and swapping key and value dimensions are not directly supported.

Obstacles/Problems

  • Need to establish clear interfaces to access the dimension values both in the expanded cartesian product format and in the dense format. I would propose that dimension_values accepts a product argument (or similar) that defaults to False, so that the full cartesian product is returned by default. Additionally, it would support a flat argument defaulting to True, to retain a consistent, backward-compatible interface.
  • Consider how to deal with Raster and Image types where the key dimension values are implicit. My current suggestion would be either to generate the coordinates in the constructor or to introduce proxy objects, which lazily compute the indexes, e.g. to describe an Image you could do something like:
hv.Image({'x': BoundCoords(-0.5, 0.5, 100), 'y': BoundCoords(-0.5, 0.5, 100), 'z': np.random.rand(100, 100)})
  • Gridded data would be more restricted than Columnar data, e.g. before slicing value dimensions or reindexing the Element it would first have to be converted to the expanded format.
  • Since the sparse and dense representations have potentially overlapping signatures, I believe it should be a parameter on the Columns object: you should explicitly declare that the data you are passing in is dense, and the dense backends will then try to parse that data.
  • We can have as_dense and as_sparse methods that convert between dense and sparse representations. The as_sparse representation is obviously very straightforward as it's just the cartesian product of the dense representation. The as_dense implementation requires that the data has been aggregated already. After that it's most easily implemented by combining the sparse columns with a cartesian product of the key dimensions inserting NaNs for all values, aggregating, sorting and reshaping.
  • Certain methods when applied to a dense Element return a sparse representation (currently only sample would do so).
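
The conversion between the two representations described above can be sketched in NumPy. The function names follow the proposed `as_sparse` and `as_dense` methods, but these are hypothetical standalone functions, and `as_dense` here uses a simpler technique than the one outlined: it fills a NaN array and scatters the samples into it by index, assuming the data has already been aggregated so that each coordinate combination occurs at most once.

```python
import numpy as np

def as_sparse(coords, values):
    """Expand 1D coordinate arrays and an N-D value array into flat columns."""
    grids = np.meshgrid(*coords, indexing='ij')
    return [g.ravel() for g in grids], values.ravel()

def as_dense(columns, values):
    """Rebuild a grid from flat key columns and a flat value column.
    Coordinate combinations without a sample are left as NaN."""
    coords = [np.unique(col) for col in columns]  # sorted unique coordinates
    dense = np.full([len(c) for c in coords], np.nan)
    # Position of every sample along each axis of the grid.
    index = tuple(np.searchsorted(c, col) for c, col in zip(coords, columns))
    dense[index] = values
    return coords, dense

x = np.array([0, 1, 2])
y = np.array([10., 20.])
z = np.arange(6, dtype=float).reshape(3, 2)

cols, flat = as_sparse([x, y], z)
coords, dense = as_dense(cols, flat)                           # round trip
coords2, dense2 = as_dense([c[:-1] for c in cols], flat[:-1])  # one sample missing
```

This sketch only handles floating point values, since NaN is used as the fill value for missing samples.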

To-do list:

  • Establish API for accessing data in column based and array based formats using dimension_values (interface proposed in Columns row/column based indexing API #541 postponed).
  • Complete the interface by deciding whether to support sorting or whether dense sorting should be fixed. (Require sorted key data in constructor for now)
  • Adapt HeatMap to support both formats (postponed)
  • Decide what to do about Raster and Image key dimensions values (postponed).
  • Decide whether to implement Histogram and QuadMesh using the interface (postponed).
  • Add unit tests
  • Update docstrings
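
For the open question about Raster and Image key dimension values, the proxy-object option could look roughly like the sketch below. `BoundCoords` is the hypothetical name used in the example above; it does not exist in HoloViews, and the choice to return the centres of equal-width cells is an assumption.

```python
import numpy as np

class BoundCoords(object):
    """Hypothetical proxy that stores bounds and a sample count and only
    computes the coordinate array when it is requested."""

    def __init__(self, lower, upper, count):
        self.lower, self.upper, self.count = lower, upper, count

    def __len__(self):
        return self.count

    def __array__(self, dtype=None, copy=None):
        # Assumption: coordinates are the centres of `count` equal-width cells.
        step = (self.upper - self.lower) / float(self.count)
        centres = self.lower + step * (np.arange(self.count) + 0.5)
        return centres if dtype is None else centres.astype(dtype)

xs = np.asarray(BoundCoords(-0.5, 0.5, 100))  # coordinates computed here
```

Because the object implements `__array__`, any code that calls `np.asarray` on a key dimension would work unchanged, while the constructor itself stays cheap.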

Notebook with examples and profiling

@jlstevens
Contributor

I think this will be great!

Just to summarize the discussion I've just had with Philipp for future reference:

  • I think instead of product the new keyword argument in dimension_values could be called expanded or something similar.
  • I'm not so sure about the 'array' key for specifying vdims, mainly as numpy arrays do not allow heterogeneous types (ignoring record arrays or other fancier array implementations). We can come back to this bit later as it is optional but I can say I would prefer a key called 'vdims'.
  • I am happy with a new interface based on DictColumns which I would call GridColumns. I would want to make sure we always try the simple DictColumns interface first before GridColumns: this makes sure there is effectively one dictionary format to reason about even though there are actually two interface classes behind the scenes.
  • To keep the semantics easier to understand, the rule I would recommend is that DictColumns is used if all the array values have the same shape. Otherwise, you fall back to GridColumns as suggested in the previous point. This means if you want a Cartesian product for kdims only where all the arrays are the same shape, you would explicitly have to remove GridColumns from the interface list as a user.
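
The selection rule suggested in the last two points can be sketched as follows. The function `choose_interface` is hypothetical and simply returns the name of the interface that would be picked; it is not how the interface list is resolved in the PR.

```python
import numpy as np

def choose_interface(data, kdims, vdims):
    """Pick 'DictColumns' when all arrays share one shape, otherwise
    'GridColumns' when the value arrays match the key dimension lengths."""
    arrays = [np.asarray(data[d]) for d in kdims + vdims]
    if all(a.shape == arrays[0].shape for a in arrays[1:]):
        return 'DictColumns'
    expected = tuple(len(np.asarray(data[k])) for k in kdims)
    if all(np.asarray(data[v]).shape == expected for v in vdims):
        return 'GridColumns'
    raise ValueError('Data matches neither the columnar nor the gridded format')

columnar = {'x': np.arange(5), 'y': np.arange(5), 'z': np.random.rand(5)}
gridded = {'x': range(10), 'y': range(100), 'z': np.random.rand(10, 100)}

print(choose_interface(columnar, ['x', 'y'], ['z']))  # DictColumns
print(choose_interface(gridded, ['x', 'y'], ['z']))   # GridColumns
```

Trying the columnar check first means the ambiguous case, where all arrays happen to share a shape, is always read as columnar data.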

Anyway, these are my general comments for now. I'm happy to review the code in more detail once you think it is ready.

@jbednar
Member

jbednar commented Mar 9, 2016

I'm very happy with this proposal. I think it really will help HoloViews work well in a broad range of other applications, and is worth taking the effort to work on now.

@philippjfr
Member Author

Okay as far as I can tell I'm now done with this PR. @jlstevens said he'd go through and document the class and methods so he can get a better idea about the implementation. So I'll wait on that to make any more changes, as I'm sure he'll find some further issues.

@philippjfr philippjfr added this to the 1.4.4 milestone Mar 10, 2016
@philippjfr philippjfr added type: feature A major new feature tag: API labels Mar 10, 2016
@philippjfr
Member Author

Note that a lot of the work to allow Raster, Image, Histogram and QuadMesh types to use dense interfaces has been postponed and is not part of this PR. Hopefully for version 1.5 we can unify all these types together leaving only Path and Annotation types with custom data formats.

@jlstevens
Contributor

I'm going to go through this PR carefully now, making sure I understand it, making comments and updating docstrings as necessary. Then once those issues are addressed I think it can be merged.

@@ -469,6 +464,11 @@ def validate(cls, columns):


@classmethod
def check_dense(cls, arrays):
    return any(array.shape not in [arrays[0].shape, (1,)] for array in arrays[1:])
Contributor

If I understand this code correctly check_compressed might be a better name...

Edit: How about inverting it and calling it expanded_format?

Member Author

Done.
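
For reference, the inverted check suggested in this thread might look like the sketch below. It is written as a standalone function rather than a classmethod, and it is only derived from the `check_dense` snippet in the diff, so the merged code may differ.

```python
import numpy as np

def expanded_format(arrays):
    """Return True if every array has the shape of the first array
    (or is a length-1 array), i.e. the data is in the expanded format."""
    return all(array.shape in [arrays[0].shape, (1,)] for array in arrays[1:])

columnar = [np.zeros(5), np.zeros(5), np.zeros(1)]
gridded = [np.arange(3), np.arange(4), np.zeros((3, 4))]

print(expanded_format(columnar))  # True
print(expanded_format(gridded))   # False
```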

@jlstevens
Contributor

I think I'm done making comments for now and I only have a few docstrings to update as the API is consistent with what we had before. There are a few name changes and most of what I suggest should be quite quick to fix. Once these fixes are done, I'll run flakes and update the docstrings I mentioned.

@philippjfr
Member Author

Okay, I've gone through and made all the fixes you suggested and tests should pass in a minute. If you could go through it and add docstrings then I think this is ready to merge. The only other thing we should decide is whether to add 'grid' to the datatype list by default.

@jlstevens
Contributor

I've had a go at updating the class docstring for GridColumns and I've made a few other small changes. In the end, I've decided not to do all the docstrings now, as all the methods on DataColumns should be given docstrings and there are a lot of them! DataColumns is what defines the API, so that should go in a separate PR.

Once the PR tests pass, I'm happy to merge.

@jlstevens
Contributor

Ok, the PR build is passing. Time to merge!

jlstevens added a commit that referenced this pull request Mar 14, 2016
Adding a general interface for N-dimensional gridded data
@jlstevens jlstevens merged commit d4779cb into master Mar 14, 2016
@philippjfr philippjfr deleted the dense_interface branch April 1, 2016 14:27
@philippjfr philippjfr modified the milestones: v1.5.0, 1.4.4 Apr 20, 2016
3 participants