Data API #269
Comments
First, I should say that the general gist of this proposal seems like a very good idea to me. This is definitely something we want as it will make HoloViews more consistent and more powerful!
That sounds perfectly reasonable to me, although I would want to check that the return types stay consistent. We would add a deprecation warning to
Why not drop the …? If we use either of these suggestions, the return values had better be as stated and not elements by default:

```python
foo.as_array()              # Return a NumPy array
foo.as_array(element=True)  # Now same type as foo
```

These methods should redirect to some utility class where the functionality is implemented. This reflects how I think we should tackle this problem in general:
I firmly believe in utilities for this sort of refactor as they can be:

1. understood in isolation
2. tested in isolation
3. used in isolation
4. extended in future (e.g. xray)

Personally I would create a … Some things I would expect from the conversion utility:
If done properly, we could eventually get rid of the table conversion utilities (deprecating them gradually) because we would be able to immediately cast any element into any other type! In addition, we would get the benefits of supporting new data formats (heterogeneous types, speed improvements, etc.).
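The utility idea above can be sketched as a small standalone class. This is a minimal illustration only: the class and method names are my invention, not the HoloViews API, and it assumes a dict-of-columns as the plain-Python format.

```python
import numpy as np

class DataConverter:
    """Sketch of a standalone conversion utility (names are illustrative
    guesses, not the actual HoloViews API)."""

    @staticmethod
    def to_array(columns):
        # columns: mapping of column name -> sequence of values
        return np.column_stack([np.asarray(v) for v in columns.values()])

    @staticmethod
    def to_columns(array, names):
        # inverse: NxD array plus column names -> dict of 1D columns
        return {name: array[:, i] for i, name in enumerate(names)}
```

Because the utility is free-standing, it can be unit-tested and reused without touching any Element class, which is the point being argued for here.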
So I thought about the issue of inheriting the entire …
This sounds like a great proposal to me. Can the data also be a Blaze …?
Yes, we've already tested that a little bit on the dataframe branch. This proposal just makes sure that we don't just patch in the support all over the place, and that we have a general API to extend in future, e.g. with x-ray.
I've already committed some support for dask arrays to Raster on the …
I've talked again to Philipp and I feel it is worth summarizing our thoughts here:
One sticking point has been deciding on what format to use as a pure Python tabular format (i.e. if pandas is not available). The question has been whether
This approach avoids unnecessary nesting of the data held by the …

There is a lot to think about, and we will need to make sure our API abstracts over the different data types while remaining rich enough to be useful in practice (e.g. for implementing operations).
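To make the "pure Python tabular format" question concrete, here is one candidate: a flat columnar store, one key per dimension and one list per column, which avoids the nesting that a key-tuple-to-value-tuple NdMapping representation implies. This is a sketch of the option under discussion, not a decided design.

```python
from collections import OrderedDict

# A flat columnar store: one key per dimension, one list per column.
columns = OrderedDict([
    ('x', [0, 1, 2]),
    ('y', [10, 11, 12]),
])

def iter_rows(cols):
    """Return the rows of a flat columnar store as a list of tuples."""
    return list(zip(*cols.values()))
```

Row iteration stays cheap (`zip` over the column lists), while column access is a single dict lookup.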
Sounds good. I too prefer Columns to Columnar. Will it be clear how to add other .data formats in the future, apart from the three above? And is option 1 really required to be a numpy array, or can it be anything supporting a numpy array's interface? E.g. blaze tries to support that.
Yes, there will be an …
From my understanding, you need blaze+dask for out-of-core arrays. I have a prototype of this, but unfortunately it isn't quite a numpy interface: if you have a dask array, you need to call the … In other words, we can support dask arrays as …

Finally, I am going to assign this to the 1.4.0 milestone, as I think this refactor will greatly expand the generality and power of HoloViews. It should be possible to make all this work without breaking pickles (very important!), and I don't think we should delay getting this implemented.

Edit: Using dask arrays by calling …
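The gap in the numpy interface mentioned above is that lazy arrays (such as dask arrays) must be explicitly materialized before plotting. One hedged way to paper over that is duck-typing on the `compute` method, so dask need not even be imported; the helper name here is invented for illustration.

```python
import numpy as np

def materialize(data):
    """Return a concrete NumPy array. Lazy arrays (e.g. dask arrays)
    expose a .compute() method, so we duck-type on that attribute
    rather than requiring dask itself."""
    if hasattr(data, 'compute'):
        data = data.compute()
    return np.asarray(data)
```

A real dask array passed through this helper would be computed once and then behave like any other numpy-backed data source.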
It is also worth mentioning that Philipp and I agree that these …
This has just been merged, closing the issue.
Now that we've made HoloViews plotting backend independent, it's time to talk about generalizing the data backends. Reviewing the current situation, I have to say we haven't done quite as good a job as we could have, but all the foundations are there to unify things very nicely and iron out kinks like the lack of support for categorical or date types in various charts. As part of the dataframe refactor we've already begun looking at this, but it has been more of an ad-hoc effort to patch in support for dataframes.
Today it occurred to me that we already have three different storage formats for table-like data and a lot of code built up to convert and operate on the different data types. They include:
- `NdElement` implements a table as an NdMapping, i.e. a fancy dictionary. For large tables this is very slow.
- `Chart` objects, while never displayed as such, are effectively multiple columns stored in an NxD numpy array, where N is the number of samples and D the number of dimensions/columns.
- `DFrame`, as its name suggests, simply uses a pandas dataframe as the underlying datasource.

It would be nice if all subclasses of these Element types could be unified to use any of these datasources. Currently different Element types use different data formats, and it's all a bit of a mess and not very flexible. If we are successful with this, the following Elements will interchangeably support all three data formats: Curve, Points, VectorField, Bars, Table, ErrorBars, Spread, Scatter, Scatter3D, Distribution, TimeSeries and, with some minor adjustments, Histogram.
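For a concrete picture, the same two-column table (x, y) might look like the following in each of the three storage styles. This is purely illustrative: the pandas case is shown as a plain dict of columns so the sketch runs without pandas installed.

```python
import numpy as np

# NdElement/NdMapping style: key tuples mapped to value tuples
nd_style = {(0,): (10,), (1,): (11,), (2,): (12,)}

# Chart style: an NxD numpy array (N samples, D dimensions)
array_style = np.array([[0, 10], [1, 11], [2, 12]])

# DFrame style: named columns (stand-in for a pandas DataFrame)
frame_style = {'x': [0, 1, 2], 'y': [10, 11, 12]}
```

All three hold identical information, which is what makes a unified dispatching API feasible.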
I'm not entirely certain about the mechanism yet, but I envision that we introduce a new wrapper class which dispatches any method calls to one of the three baseclasses depending on the type of the data. We would provide a common API for each of the backends so custom Elements, plotting code and the user can access the data in well-defined ways, independent of the data backend. Luckily we've generally made sure that such a common API exists, even if we haven't used it consistently.
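The dispatching wrapper could look roughly like this. All names are hypothetical, and only two toy backends are shown; the point is that the wrapper picks a backend from the type of `.data` and forwards a common method to it.

```python
import numpy as np

class ArrayBackend:
    @staticmethod
    def dimension_values(data, dim):
        return data[:, dim]            # dim is a column index here

class DictBackend:
    @staticmethod
    def dimension_values(data, dim):
        return np.asarray(data[dim])   # dim is a column name here

class Columns:
    """Sketch of the wrapper class: choose a backend by inspecting the
    data's type, then forward the common API to it."""
    def __init__(self, data):
        if isinstance(data, np.ndarray):
            self.backend = ArrayBackend
        elif isinstance(data, dict):
            self.backend = DictBackend
        else:
            raise TypeError('Unsupported data format: %r' % type(data))
        self.data = data

    def dimension_values(self, dim):
        return self.backend.dimension_values(self.data, dim)
```

Adding a new data format would then mean registering one more backend class rather than touching every Element.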
The current API for Element types is this:
- `dimension_values`: Returns an array, list or pandas series of one column.
- `range`: Returns the minimum and maximum values along a dimension, excluding NaNs.
- `select`: Same as `__getitem__` but works for one or multiple dimensions.
- `dframe`: Converts the data to a dataframe.
- `collapse_data`: Applies a function across a list of Element.data attributes.
- `reduce`: Applies a reduce function across one or multiple axes.
- `sample`: Returns a Table of the samples specified as lists.
- `groupby`: Groups by the values along one or multiple dimensions, returning an NdMapping type indexed by the grouped dimensions with Elements of the same type as values. (not implemented for Charts)
- `reindex`: Reorders specified dimensions, dropping unspecified dimensions. (not implemented for Charts)

First of all, I would like a better name for `dimension_values`; it's one of the most important methods, as it returns the data along one column/dimension. The obvious suggestion would be `values`, but that clashes with the dict interface of NdMapping types. Would it be terrible to let `values` accept an optional argument to specify the dimension, otherwise returning all value dimensions like it does currently?

Secondly, I would suggest a few additions to the API, mainly data conversion methods which simply clone the Element but convert between the three data formats, something like:
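The code block that originally followed "something like:" was lost from this page, so the following is a guess at the intent rather than the actual proposal: a conversion method that clones the Element with its data cast to another format. The class and method names are invented for illustration.

```python
import numpy as np

class Element:
    """Minimal sketch of an element that can clone itself into another
    data format (names are hypothetical, not the HoloViews API)."""
    def __init__(self, data):
        self.data = data

    def array(self):
        """Clone with the data converted to an NxD NumPy array."""
        if isinstance(self.data, np.ndarray):
            return Element(self.data)
        cols = [np.asarray(c) for c in self.data.values()]
        return Element(np.column_stack(cols))
```

Analogous `mapping()`/`dframe()`-style methods would cover the other two formats, making cross-format conversion a one-call operation.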
Conversion between Element types would now be trivial, since they support the same data formats, and I would also deprecate the `table` and `dframe` methods.

Finally, I'd like to suggest that we also have a common NumPy-like indexing interface: for the array format it would just index into the data directly, the dataframe version can use the `df.ix[idx]` interface, and the implementation for the NdMapping-based tables is also trivial. For some operations this is going to be significantly faster than going via our value-based indexing, and it would be nice to do this in a backend-agnostic way.
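A backend-agnostic positional index could be a single small helper along these lines. The helper name is invented; note also that pandas' `.ix` accessor mentioned above has since been removed in favour of `.iloc`, which this sketch duck-types on so pandas itself is not required to run it.

```python
import numpy as np

def row(data, idx):
    """Positional row access across storage backends (sketch)."""
    if isinstance(data, np.ndarray):
        return data[idx]                   # NxD array: plain indexing
    if hasattr(data, 'iloc'):
        return data.iloc[idx]              # pandas DataFrame
    return tuple(col[idx] for col in data.values())  # dict of columns
```

Positional access like this skips the value-based lookup machinery entirely, which is where the speed win mentioned above would come from.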
The great thing about this is that most of the implementation for each of these data backends is already there in the `Chart`, `NdElement` and `DFrame` classes. I think unifying them in this way will really clean up our data API and allow us to do some more powerful things with different interfaces in future. Once we've figured this out, we'll have to consider how to extend it to other Element types like Paths and Raster types.

In terms of timeline, I think this could be done in one or two days of concentrated work. It might even make sense to merge our intermediate attempt at integrating dataframe support as soon as we're satisfied it works, and then carry this refactor out for a v1.5 release.
Edit:
The API is actually a little bit larger than what I've described above, as it will necessarily include the NdMapping API:
While `keys` won't usually be helpful, `items` will basically be the API to convert to an `NdMapping`-like representation, and both `drop_dimension` and `add_dimension` will be useful. Having `get` and `pop` on Chart types is somewhat annoying, though.