
Data API #269

Closed
philippjfr opened this issue Sep 14, 2015 · 11 comments
Labels
type: feature A major new feature
Milestone

Comments

@philippjfr
Member

Now that we've made HoloViews independent of the plotting backend, it's time to talk about generalizing the data backends. Reviewing the current situation, I have to say we haven't done quite as good a job as we could have, but all the foundations are there to unify things very nicely and iron out kinks like the lack of support for categorical or date types in various charts. As part of the dataframe refactor we've already begun looking at this, but it's been more of an ad-hoc way of patching in support for dataframes.

Today it occurred to me that we already have three different storage formats for table-like data and a lot of code built up to convert and operate on the different data types. They include:

  1. NdElement implements a table as an NdMapping, i.e. a fancy dictionary. For large tables this is very slow.
  2. Chart objects, while never displayed as such, are effectively multiple columns stored in an NxD NumPy array, where N is the number of samples and D the number of dimensions/columns.
  3. DFrame, as its name suggests, simply uses a pandas DataFrame as the underlying data source.

It would be nice if all subclasses of these Element types could be unified to use any of these data sources. Currently different Element types use different data formats, which is a bit of a mess and not very flexible. If we are successful, the following Elements will interchangeably support all three data formats: Curve, Points, VectorField, Bars, Table, ErrorBars, Spread, Scatter, Scatter3D, Distribution, TimeSeries and, with some minor adjustments, Histogram.

I'm not entirely certain about the mechanism yet, but I envision that we introduce a new wrapper class which dispatches any method calls to one of the three base classes depending on the type of the data. We would provide a common API for each of the backends so that custom Elements, plotting code and the user can access the data in well-defined ways, independent of the data backend. Luckily we've generally made sure that such a common API exists, even if we haven't used it consistently.

The current API for Element types is this:

  • dimension_values: Returns an array, list or pandas Series of one column
  • range: Returns the minimum and maximum values along a dimension, excluding NaNs.
  • select: Same as __getitem__ but works for one or multiple dimensions.
  • dframe: Converts the data to a pandas DataFrame
  • collapse_data: Applies a function across a list of Element.data attributes
  • reduce: Applies a reduce function across one or multiple axes
  • sample: Returns a Table of the samples specified as lists
  • groupby: Groups by the values along one or multiple dimensions, returning an NdMapping type indexed by the grouped dimensions with Elements of the same type as values. (not implemented for Charts)
  • reindex: Reorders specified dimensions, dropping unspecified dimensions. (not implemented for Charts)
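A minimal sketch of how such a common backend API might look. All names and classes here are purely illustrative, not the actual HoloViews implementation; only two of the methods listed above are sketched:

```python
class DataInterface:
    """Hypothetical common API that each storage backend would implement."""

    def dimension_values(self, dim):
        """Return the values of one column/dimension."""
        raise NotImplementedError

    def range(self, dim):
        """Minimum and maximum along a dimension, excluding NaNs."""
        vals = [v for v in self.dimension_values(dim)
                if v is not None and v == v]  # v != v filters out NaN
        return (min(vals), max(vals)) if vals else (None, None)


class DictBackend(DataInterface):
    """Pure-Python backend: a dict mapping column names to value lists."""

    def __init__(self, columns):
        self.columns = columns

    def dimension_values(self, dim):
        return self.columns[dim]
```

Any Element whose data is held by such a backend could then call `range` or `dimension_values` without knowing the storage format.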

First of all, I would like a better name for dimension_values; it's one of the most important methods, as it returns the data along one column/dimension. The obvious suggestion would be values, but that clashes with the dict interface of NdMapping types. Would it be terrible to let values accept an optional argument to specify the dimension, otherwise returning all value dimensions as it does currently?

Secondly, I would suggest a few additions to the API, mainly data-conversion methods, which simply clone the Element but convert between the three data formats, something like:

  • .as_array
  • .as_dframe
  • .as_mapping

Conversion between Element types would now be trivial since they support the same data formats and I would also deprecate the table and dframe methods.

Finally, I'd like to suggest that we also have a common NumPy-like indexing interface: for the array format it would just index into the data directly, the dataframe version can use the df.ix[idx] interface, and the implementation for the NdMapping-based tables is also trivial. For some operations this will be significantly faster than going via our value-based indexing, and it would be nice to do this in a backend-agnostic way.
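A sketch of what such backend-agnostic positional indexing could look like (df.ix was the pandas API at the time; this illustration simply dispatches on the data type and is not the actual implementation):

```python
def positional_index(data, idx):
    """Dispatch positional indexing to whatever the backend supports.
    Purely illustrative; the real version would live on the wrapper
    class described above."""
    if hasattr(data, 'ix'):                # pandas DataFrame (df.ix at the time)
        return data.ix[idx]
    if isinstance(data, dict):             # NdMapping-style dict of columns
        return {name: col[idx] for name, col in data.items()}
    return data[idx]                       # NumPy array or plain sequence
```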

The great thing about this is that most of the implementation for each of these data backends is already there in the Chart, NdElement and DFrame classes. I think unifying them in this way will really clean up our data API and allow us to do some more powerful things with different interfaces in future. Once we've figured this out we'll have to consider how to extend it to other Element types like Paths and Raster types.

In terms of timeline, I think this could be done in one or two days of concentrated work. I think it might even make sense to merge our intermediate attempt at integrating dataframe support as soon as we're satisfied it works and then carry this refactor out for a v1.5 release.

Edit:

The API is actually a little bit larger than what I've described above, as it will necessarily include the NdMapping API:

  • keys: Returns the values of the key dimensions as a list of tuples
  • values: Returns the values of value dimensions as a list of tuples
  • items: Key and values combined
  • get: Dictionary semantics
  • pop: Dictionary semantics
  • drop_dimension: Drops specified dimension(s)/column(s)
  • add_dimension: Adds named column at specified position with supplied value(s)

While keys won't usually be helpful, items will basically be the API to convert to an NdMapping-like representation, and both drop_dimension and add_dimension will be useful. Having get and pop on Chart types is somewhat annoying, though.

@jlstevens
Contributor

First, I should say that the general gist of this proposal seems like a very good idea to me. This is definitely something we want as it will make HoloViews more consistent and more powerful!

Would it be terrible to let values accept an optional argument to specify the dimension, otherwise returning all value dimensions like it does currently?

That sounds perfectly reasonable to me, although I would want to check that the return types stay consistent. We would add a deprecation warning to dimension_values for a while.

.as_array
.as_dframe
.as_mapping

Why not drop the as_ prefix and leave the .dframe and .table alone? Then we can just add .array.

If we use either of these suggestions, the return values had better be as stated and not Elements by default: the .array method should return an actual array, etc., unless explicitly specified otherwise. Maybe you could get an element back with a switch?

foo.as_array()  # Return a NumPy array
foo.as_array(element=True)  # Now same type as foo

These methods should redirect to some utility class where the functionality is implemented. This reflects how I think we should tackle this problem in general:

  1. Select the allowable formats we want for the .data attribute.
  2. Use a utility class to access this .data attribute (no matter what format it is) in a consistent way within the code. The idea is to abstract away the details of the underlying format.

I firmly believe in utilities for this sort of refactor, as they can be (1) understood in isolation, (2) tested in isolation, (3) used in isolation, and (4) extended in future (e.g. xray). Personally I would create a DataConversion class containing only classmethods. I suspect many of these classmethods would take a data format as a first argument (e.g. self.data in many cases) and return the data in a known format regardless of what the input format was.
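As a rough illustration of the classmethod-based utility described here (the class name, method name, and `c0`/`c1` column names are all hypothetical):

```python
class DataConversion:
    """Sketch of a conversion utility built only from classmethods.
    Each method takes raw data in any accepted format and returns it
    in one known format; names and formats are illustrative."""

    @classmethod
    def as_columns(cls, data):
        """Normalize input to a dict of column-name -> list of values."""
        if isinstance(data, dict):
            return data                          # already columnar: fast path
        if hasattr(data, 'to_dict'):             # duck-typed pandas DataFrame
            return {name: list(col.values())
                    for name, col in data.to_dict().items()}
        # Fall back: treat data as a sequence of row tuples with
        # positional (hypothetical) column names c0, c1, ...
        columns = list(zip(*data))
        return {'c%d' % i: list(col) for i, col in enumerate(columns)}
```

The isinstance fast path at the top reflects point 1 of the expectations below: when the data is already in the requested format, it is returned untouched.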

Some things I would expect from the conversion utility:

  1. Code paths using isinstance to keep things fast. E.g. if a dframe is required and the input is already in this format, use just the pandas API necessary to do the conversion efficiently and return the result.
  2. Support the NumPy indexing API you suggest above as I agree this would help make life easier.
  3. Support for taking in HoloViews elements as input and returning one of the acceptable .data formats. This would be used in the constructors of Elements to support casting between elements.
  4. We might want a classmethod that takes in the input to an Element constructor, the kdims and vdims and returns the appropriate Dimension objects. Again this would help with casting between elements.

If done properly, we could eventually get rid of the table conversion utilities (deprecating them gradually) because we would be able to immediately cast any element into any other type! In addition we would get the benefits of supporting new data formats (heterogeneous types, speed improvements, etc.).

@jlstevens jlstevens added the type: feature A major new feature label Sep 17, 2015
@philippjfr
Member Author

So I thought about the issue of inheriting the entire NdMapping API again, and I've come to the conclusion that the base class should not itself inherit from NdMapping and hold an OrderedDict. Instead, the data may optionally be an NdElement (the current Table base class) type. This means the data of this base class is either an array, a dataframe, or an NdElement. This would break old pickles containing Tables, but a small legacy wrapper in __setstate__ could fix this.
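The pickle shim could be as simple as the following sketch (class names are minimal stand-ins, not the real HoloViews types):

```python
class NdElement:
    """Minimal stand-in for the NdElement container (illustrative)."""

    def __init__(self, data):
        self.data = dict(data)


class Table:
    """Sketch of the legacy __setstate__ wrapper described above."""

    def __init__(self, data):
        self.data = NdElement(data) if isinstance(data, dict) else data

    def __setstate__(self, state):
        # Old pickles stored a plain (ordered) dict under 'data';
        # re-wrap it on unpickle so the rest of the code only ever
        # sees the new container types.
        if isinstance(state.get('data'), dict):
            state['data'] = NdElement(state['data'])
        self.__dict__.update(state)
```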

@jbednar
Member

jbednar commented Sep 20, 2015

This sounds like a great proposal to me. Can the data also be a Blaze object?


@philippjfr
Member Author

Yes, we've already tested that a little bit on the dataframe branch. This proposal just makes sure that we don't patch in the support all over the place, and that we have a general API to extend in future, e.g. with x-ray.

@jlstevens
Contributor

I've already committed some support for dask arrays to Raster on the dataframe branch. That work is related to this proposal but not part of it - as Philipp says, we want to generalize this sort of functionality correctly.

@jlstevens
Contributor

I've talked again to Philipp and I feel it is worth summarizing our thoughts here:

  • We agree on the idea of using a utility class to provide a consistent API regardless of the underlying data structure.
  • We can update the current base classes (i.e. Chart, NdElement, DFrame) to make use of these utilities.
  • Hopefully we can then eliminate some of these base classes by unifying them together which will help clean up the code.

One sticking point has been deciding what format to use as a pure-Python tabular format (i.e. if pandas is not available). The question has been whether Table or NdElement should be accepted as .data, but we've now decided on the following scheme:

  • The .data attribute can be (1) a NumPy array, (2) a pandas DataFrame, (3) an NdElement, or (4) the .data attribute of whatever other element is passed into the constructor (i.e. back to one of the first three options).
  • Table will no longer inherit from NdElement (but it will inherit from Element at some level). This is a generalized version of Table (that also replaces DFrame) and will support the three formats mentioned above.
  • NdElement will be given a rich representation just like Table (i.e. it will inherit from Tabular), but unlike Table it will not support DataFrames or NumPy arrays in the constructor. It will be the version of Table that does not require third-party libraries (i.e. pure Python) and will have no subclasses (Bars will be updated accordingly).
  • If NdElement is unified with the other base classes as proposed above, it should be renamed. We've proposed the name Columnar (or Columns?), as columns are the main feature of the data, no matter the details of the data structure used. I vote for Columns as it will be a proper element and should be a noun.
  • We may want to consider an analogous class (ArrayLike?) for Raster types that supports NumPy and dask arrays.

This approach avoids unnecessary nesting of the data held by the .data attribute and resolves the issue of how Table can work without pandas being available (which would be an issue if Table itself were allowed as .data).

There is a lot to think about, and we will need to make sure our API abstracts over the different data types while remaining rich enough to be useful in practice (e.g. for implementing operations).
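The constructor dispatch implied by this scheme might look roughly like this pure-Python sketch (the real Columns element would of course do far more):

```python
class Columns:
    """Sketch of the proposed unified element. Accepts a NumPy array,
    a pandas DataFrame, an NdElement-style container, or another
    Columns instance, whose .data is reused directly to avoid the
    unnecessary nesting mentioned above."""

    def __init__(self, data):
        if isinstance(data, Columns):
            data = data.data                 # unwrap: no nested .data
        self.data = data
```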

@jbednar
Member

jbednar commented Sep 22, 2015

Sounds good. I too prefer Columns to Columnar.

Will it be clear how to add other .data formats in the future, apart from the three above? And is option 1 really required to be a NumPy array, or can it be anything supporting a NumPy array's interface? E.g. Blaze tries to support that.

@jlstevens
Contributor

Will it be clear how to add other .data formats in the future, apart from the three above?

Yes, there will be an Interface base class that determines what needs to be implemented, e.g. for xray or maybe unit arrays (from astropy).
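Such an Interface base class might be sketched like this (hypothetical names; the eventual HoloViews interfaces will differ):

```python
class Interface:
    """Sketch of an extensible registry of data-format interfaces."""

    interfaces = []

    @classmethod
    def register(cls, interface):
        cls.interfaces.append(interface)

    @classmethod
    def applies(cls, data):
        """Subclasses declare which raw data formats they handle."""
        raise NotImplementedError

    @classmethod
    def select(cls, data):
        """Find the first registered interface supporting the data."""
        for interface in cls.interfaces:
            if interface.applies(data):
                return interface
        raise TypeError('No interface supports this data format')


class DictInterface(Interface):
    """Pure-Python columnar data held as a dict of lists."""

    @classmethod
    def applies(cls, data):
        return isinstance(data, dict)


Interface.register(DictInterface)
```

Adding support for a new format would then mean writing one new Interface subclass and registering it, without touching the Element classes themselves.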

can it be anything supporting a numpy array's interface? blaze tries to support that.

From my understanding, you need Blaze plus dask for out-of-core arrays. I have a prototype of this, but unfortunately it isn't quite a NumPy interface: if you have a dask array, you need to call the .compute method to load the array-like object from disk. See here for the prototype utility I was using to handle this.

In other words, we can support dask arrays as .data with the proposed system, but there would need to be some logic (hopefully very little!) to call compute in the appropriate interface class.
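The "call compute early" approach can be sketched in a few lines (dask is not required for the sketch; anything exposing a .compute method is treated as out-of-core):

```python
def ensure_concrete(data):
    """Materialize lazily evaluated arrays before handing them to code
    that expects an in-memory array. Illustrative only: dask arrays
    expose .compute(), which loads the array-like object from disk."""
    if hasattr(data, 'compute'):
        return data.compute()
    return data
```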

Finally, I am going to assign this to the 1.4.0 milestone as I think this refactor will greatly expand the generality and power of HoloViews. It should be possible to make all this work without breaking pickles (very important!) and I don't think we should delay getting this implemented.

Edit: Using dask arrays by calling compute early on in the interface class would be an easy thing to do, but it doesn't use dask very well. A proper Interface class that uses the dask API to exploit all the efficient out-of-core computation it offers would be a better idea.

@jlstevens jlstevens added this to the v1.4.0 milestone Sep 24, 2015
@jlstevens
Contributor

It is also worth mentioning that Philipp and I agree that these Interface classes would be a very natural place to add unit support at some point. We should keep this in mind as an additional goal for this system that we will want to think about later on.

@philippjfr philippjfr mentioned this issue Oct 18, 2015
@philippjfr
Member Author

This has just been merged, closing the issue.

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 25, 2024