
Support for multidimensional dtypes #3443

Closed · alexbw opened this issue Apr 24, 2013 · 30 comments
Labels: API Design · Dtype Conversions · Enhancement

Comments

alexbw commented Apr 24, 2013

With 0.11 out, Pandas supports more dtypes than before, which is very useful to us science folks. However, some data is intrinsically multi-dimensional, of high enough dimension that using labels on columns is impractical (for instance, images).

I understand DataFrames or Panels are the usual recommendation for this problem. That works when the data doesn't carry any annotation, but mine does: for each frame of a video, I have electrophysiology traces, timestamps, and environmental variables measured.

I have a working solution where I explicitly separate out the non-scalar data from the scalar data. I use Pandas exclusively for the scalar data, and then a dictionary of multi-D arrays for the array data.

What is the work and overhead involved in supporting multi-D data types? I would love to keep my entire ecosystem in Pandas, as it's much faster and richer than just NumPy data wrangling.

See below for the code that I hope is possible to run, with fixes.

If you can point me to a place in the codebase where I can tinker, that would also be much appreciated.

import numpy as np
import pandas as pd

mydtype = np.dtype('(3,3)f4')          # each element is a 3x3 float32 array
pd.Series(np.zeros(3), dtype=mydtype)
# raises: Exception: Data must be 1-dimensional
ghost commented Apr 24, 2013

My suspicion is that you're making things more complex rather than simplifying.
However, there is the NDpanel if you want it, and you can do:

In [26]: import numpy as np
    ...: import pandas as pd
    ...: f = lambda: np.random.random((3,3))
    ...: s = pd.Series([f() for i in range(10)], dtype='O')

In [27]: s.iloc[0]
Out[27]: 
array([[ 0.90986552,  0.30234529,  0.98927833],
       [ 0.40467537,  0.17912555,  0.06101674],
       [ 0.6623446 ,  0.69192764,  0.39398118]])

jreback (Contributor) commented Apr 24, 2013

@alexbw but keep in mind, this is really not efficient (in numpy or pandas), as it is now an object array, which cannot vectorize operations. You are much better off keeping your scalar data separate from your images. (I believe we went through an exercise in saving these to/from HDF a while ago.)
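
To make that inefficiency concrete, here is a minimal comparison (my own illustration, not from the thread): arithmetic on an object-dtype Series loops over the stored arrays in Python, while a single contiguous float array is vectorized at the C level.

import timeit
import numpy as np
import pandas as pd

frames = [np.random.random((3, 3)) for _ in range(1000)]
obj_series = pd.Series(frames, dtype='O')  # 1000 separate 3x3 arrays as objects
flat_array = np.asarray(frames)            # one contiguous (1000, 3, 3) float64 array

# The object-dtype version is typically orders of magnitude slower.
print(timeit.timeit(lambda: obj_series * 2, number=100))
print(timeit.timeit(lambda: flat_array * 2, number=100))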

alexbw (Author) commented Apr 24, 2013

Yep, your suggestions are now in production here, and it's working fine keeping scalars and higher-D arrays separate. Just checking in to see if any of the dtype improvements might make my use case a little more feasible. Doesn't seem so.


jreback (Contributor) commented Apr 24, 2013

I don't know if I pointed this out before, but this might work for you: http://pandas.pydata.org/pandas-docs/dev/dsintro.html#panel4d-experimental (again, the data doesn't have to be homogeneous in dtype, but it must be homogeneous in shape, which may or may not help)

alexbw (Author) commented Apr 24, 2013

Each data stream (image, velocity, temperature) is homogeneous within itself, but they're all different sizes. That's the clincher, and it seems like support for that is not on the horizon here. Blaze seems to be going in this direction, supporting heterogeneous data shapes.


jreback (Contributor) commented Apr 24, 2013

Heterogeneous data shapes are non-trivial. Blaze does seem headed in that direction, but I'm not sure when that will happen.

alexbw (Author) commented Apr 24, 2013

Ok. And here, by "non-trivial", do you mean that Pandas has no plans to support a feature like this?


jreback (Contributor) commented Apr 24, 2013

What I mean is that to be efficient about it, you would have to have a structure that is essentially a dictionary of 'stuff', where the stuff could be heterogeneously shaped. This is really best done in a class, and it's application-specific. You can do it with a frame/panel/whatever as @y-p shows above, but it is not 'efficient', in that numpy holds onto the dtype as 'object'.

When I say efficient I mean that you can move operations down to a lower level (C level) in order to do things. I am not even sure Blaze will do this; it's really quite specific (they are about supporting chunks and operating on those, but those chunks are actually the same shape, except for the one dimension where the chunking occurs).

There is a tradeoff between code complexity, runtime efficiency, and generality. You basically have to choose where you are on that 3-d surface. Pandas has moderate complexity and generality and high runtime efficiency. I would say numpy is lower complexity, lower generality, with similar runtime efficiency. I would guess that Blaze is going to be more complex, higher efficiency (in cases of out-of-core datasets), and about the same generality as numpy (as they are aiming to replace numpy).

So even if someone had the urge to create what you are describing, they would have to create a new structure to hold it.

It comes down to what your bottlenecks are; maybe getting more specific will help.
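
A minimal sketch of that "dictionary of stuff" structure (the class and names here are illustrative, not a pandas API): each stream keeps its own homogeneous dtype and shape, and the class only enforces a shared first (time) axis.

import numpy as np

class StreamSet:
    """Hold heterogeneously shaped arrays that share a first (time) axis."""
    def __init__(self, **streams):
        lengths = {len(v) for v in streams.values()}
        assert len(lengths) == 1, "all streams must share the same first-axis length"
        self._streams = {k: np.asarray(v) for k, v in streams.items()}

    def __getattr__(self, name):
        # Expose each stream as an attribute, e.g. data.images
        streams = self.__dict__.get('_streams', {})
        if name in streams:
            return streams[name]
        raise AttributeError(name)

data = StreamSet(images=np.zeros((100, 64, 64)),  # one 64x64 image per timepoint
                 velocity=np.zeros(100))          # one scalar per timepoint
print(data.images.shape, data.velocity.shape)     # (100, 64, 64) (100,)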

cpcloud (Member) commented Apr 28, 2013

@jreback Just out of curiosity, what is the long-term goal of pandas in this vein? If Blaze is to replace numpy, will pandas diverge from numpy altogether, or will it use Blaze as the backend? I see talk about making Series independent of ndarray in the near future, for pickle support and other reasons.

jreback (Contributor) commented Apr 28, 2013

I don't see pandas as incompatible with Blaze at all. My understanding (just from reading the blog) is that Blaze is supposed to be the next-gen numpy. I think their API will necessarily be very similar to what it is now, and thus be pretty transparent to pandas.

My concerns now are availability and especially compatibility of their product, as it has a fairly complicated build scheme. In addition, they seem to want to incorporate index-like features (kind of like labelled arrays). If they do, great; I am sure pandas might use some of that infrastructure.

I think pandas fills a somewhat higher-level view of data right now (and will continue to do so).

As for your specific comments, I have pushed a PR to decouple Series from ndarray (Index also needs this addressed). Thus it will be somewhat easier to modify pandas' backend without front-end (API) visibility, which is a good thing.

Supporting arbitrary dshapes within pandas' existing objects is, IMHO, not that useful right now.

jreback (Contributor) commented Apr 28, 2013

@wesm chime in?

alexbw (Author) commented Aug 15, 2013

Any chance at all of this seeing some love?

cpcloud (Member) commented Aug 15, 2013

I think something like a RelationalDataFrame or RelationalSomethingOrOther would be useful here. The idea would be to have a collection of NDFrames that share one or more common axes (Index objects). That way you could keep things separate without the complexity of n-d dtypes. Of course, you now have the complexity of an in-memory relational database.

In this case you could have an object where all members share the "video frame" axis, possibly more if needed.

Maybe it could have a query method, similar to the query method coming in 0.13, except the namespace would be expanded to include all the objects on the RelationalThingaMaBob.
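
A rough sketch of what such an object might look like (all names here are hypothetical, not an actual pandas API): a bag of pandas objects reindexed onto one shared Index, so a single label-based selection slices every member at once.

import numpy as np
import pandas as pd

class RelationalFrames:
    """A collection of pandas objects aligned on one shared Index."""
    def __init__(self, index, **members):
        self.index = pd.Index(index)
        # Align every member onto the shared axis up front.
        self.members = {k: v.reindex(self.index) for k, v in members.items()}

    def select(self, labels):
        """Slice every member along the shared axis by label."""
        return {k: v.loc[labels] for k, v in self.members.items()}

idx = pd.Index(range(5), name='frame')
rel = RelationalFrames(idx,
                       scalars=pd.DataFrame({'velocity': np.arange(5.0)}, index=idx),
                       temps=pd.Series(np.zeros(5), index=idx))
print(rel.select([1, 2]))  # both members sliced by the shared frame axis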

jreback (Contributor) commented Aug 15, 2013

@alexbw this is pretty non-trivial, mainly because numpy doesn't support it (ATM), though Blaze is supposed to.

@cpcloud has a nice idea: essentially an object to hold DataFrames with alignable axes (think of it as Panel-like), but you could have a mixture too, e.g. each object aligning only on certain axes.

cpcloud (Member) commented Aug 15, 2013

I could maybe see this being implemented by a generalization of BlockManager where the blocks themselves are pandas objects.

alexbw (Author) commented Aug 16, 2013

@cpcloud I really like this idea of a RelationalDataFrame. Then, hopefully, we'd be able to do HDFStore-type select operations. This would have incredible power for a lot of applications in the biological and physical sciences (my field).

wesm (Member) commented Aug 16, 2013

I have thought about the nested dtype problem and how pandas could offer a solution for that. It's tricky because it doesn't really fit with the DataFrame data model and implementation. In some sense what is needed is a more rigid table data structure that sits someplace in between NumPy structured arrays and DataFrame. I have actually been building something like this in recent months but I will not be able to release the source code for a while.

cpcloud (Member) commented Aug 16, 2013

@wesm torture!

alexbw (Author) commented Aug 16, 2013

@wesm Looking forward to it, when it's ready.


alexbw (Author) commented Oct 22, 2013

Any thoughts on this, @cpcloud ?

shoyer (Member) commented Aug 15, 2014

@alexbw You should check out our project xray, which has a Dataset object that is basically @cpcloud's RelationalDataFrame -- a bunch of multi-dimensional labeled arrays aligned along any number of common axes.

Our goal is pandas-like structures for N-dimensional data, though I should note that our approach avoids heterogeneous arrays and nested dtypes (way too complex in my opinion). Instead, you would make a bunch of homogeneous arrays with different sizes and put them in a Dataset.
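
For concreteness, a small sketch of what that looks like with the xray Dataset API as I understand it from the project's docs (the project was later renamed xarray; method names follow the later API): arrays of different shapes coexist, aligned on the shared "time" dimension.

import numpy as np
import xray  # in later releases: import xarray as xray

ds = xray.Dataset({
    'velocity': ('time', np.random.random(100)),            # 1-d stream
    'images': (('time', 'y', 'x'), np.zeros((100, 8, 8))),  # 3-d stream
})
print(ds.isel(time=0))       # one indexing call slices every variable
ds.to_netcdf('session.nc')   # netCDF4 output (a flavor of HDF5)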

tangobravo commented

I'm a little confused by this ticket, but I think it's the right one for my issue. I'd really like to have a column in my data frame that represents, say, a 2D position or an affine matrix (i.e. 2x2). I like Pandas for the nice joining and selection operations, but it seems weird to me that DataFrame can't simply wrap a numpy structured array and offer that functionality on top.

Obviously for low-dimensional stuff I could always split the elements into separate series, but then I would need to join them back together again for certain uses. I've played with h5py, which is able to represent the data how I'd like as a structured numpy array, but it's frustrating that I can't just construct a pandas DataFrame from that directly.

It seems to me that the pandas-level operations don't need to care that the dtype is not scalar; all of the indexing/slicing/joining etc. just needs to treat the entries as "values" in the series, but maybe I'm missing something fundamental. I haven't gotten very deep into pandas yet and am still reviewing the docs, so I'd appreciate a pointer if I'm missing something obvious.

shoyer (Member) commented Nov 24, 2014

The problem @alexbw ran into in the first post here is that numpy (as far as I can tell) does not maintain the distinction between multi-dimensional arrays and subarray dtypes, i.e., np.zeros(2, dtype=np.dtype('(2,2)f8')) produces the exact same array as np.zeros((2,2,2)).
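
A quick demonstration of that flattening (my own example):

import numpy as np

a = np.zeros(2, dtype=np.dtype('(2,2)f8'))  # subarray dtype...
b = np.zeros((2, 2, 2))                     # ...vs. a plain 3-d array
print(a.shape, a.dtype)                     # (2, 2, 2) float64 -- the subarray dtype is gone
print(np.array_equal(a, b), a.dtype == b.dtype)  # True True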

@tangobravo Pandas actually does allow you to put some structured dtypes in a series and do (at least some) basic alignment/indexing. For example:

>>> x = np.zeros(2, dtype='float, (2, 2)float')
>>> y = pd.Series(x, index=['a', 'b'])
>>> y.loc['a']
(0.0, [[0.0, 0.0], [0.0, 0.0]])

That said, you'll quickly run into lots of issues -- for example, repr(y) gives an error. Unfortunately, it's not so easy for pandas to be agnostic about the dtype of an ndarray. There are lots of operations where structured dtypes could really throw things off (e.g., handling missing values). If you really want to work on this, I expect patches would be accepted, but I don't think it would be a good idea for the pandas maintainers to take on responsibility for ensuring things with structured dtypes don't break.

So, I would suggest either (1) putting your sub-arrays in a 1-d array with dtype=object (this works with pandas) or (2) trying a package like my project xray, which has its own n-dimensional Series- and DataFrame-like types.

tangobravo commented

@shoyer Thanks for the reply, and thanks for the example actually getting a structured dtype into a Series. There is obviously more complication than I realised in supporting this directly. I've had a quick look at xray, and it certainly seems like a good solution for adding a bit more structure to n-d data.

Also, apologies if my post came across as harsh; I really appreciate all the work done on pandas, and it's a huge help in my work even without n-d "columns"!

shoyer (Member) commented Nov 24, 2014

@tangobravo You also might take a look at astropy, which has its own table type that apparently allows for multi-dimensional columns. But I haven't tested it myself.

alexbw (Author) commented Dec 23, 2014

Just wanted to give a follow-up on how I've dealt with this. I had two problems:

  1. Efficiently store and retrieve large, structured datasets. Some aspects of the dataset are scalar (e.g. velocity at some timepoint), others are inherently multi-dimensional (e.g. an image at some timepoint). All share the same time index.
  2. Manipulate large structured datasets in memory, for analysis and plotting purposes.

I ended up explicitly writing to an HDF5 file using h5py for issue 1. The code ended up being a lot tighter than I had expected. In hindsight, I should have ditched Pandas' to_hdf function early on when I realized my requirements were out of Pandas' scope. By being explicit about how my data is structured in the HDF5 file, I can also take better advantage of compression. My saved files are an order of magnitude smaller, and reading and writing is much faster as well.
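
For reference, a minimal sketch of that h5py pattern (file and dataset names are my own): each stream gets a natively shaped, chunked, compressed dataset, which is what makes the files so much smaller.

import h5py
import numpy as np

with h5py.File('session.h5', 'w') as f:
    f.create_dataset('velocity', data=np.random.random(10000),
                     compression='gzip')
    f.create_dataset('images', data=np.zeros((10000, 64, 64), dtype='f4'),
                     chunks=(100, 64, 64), compression='gzip')

with h5py.File('session.h5', 'r') as f:
    imgs = f['images'][:100]  # read back only the first 100 frames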

For issue 2, I ended up just using a dictionary of arrays. It sounds primitive, but I really didn't end up needing Pandas' powerful pivoting, imputation, and indexing features for this project. To get the convenience of the dot syntax (e.g. df.velocity as opposed to df['velocity'], which is a huge boon when working interactively in the IPython notebook), I cobbled together the class below, which just exposes dictionary elements as dot-gettable properties.

class Bunch(dict):
    """A dict whose keys are also accessible as attributes."""
    def __init__(self, *args, **kw):
        dict.__init__(self, kw)
        # Point the instance __dict__ at the dict itself, so attribute
        # access and item access share the same storage.
        self.__dict__ = self
        if len(args) > 0:
            assert len(args) == 1 and isinstance(args[0], dict), \
                "Can either pass in a dictionary, or keyword arguments"
            self.__dict__.update(args[0])

    # Keep pickling working despite the self-referential __dict__.
    def __getstate__(self):
        return self

    def __setstate__(self, state):
        self.update(state)
        self.__dict__ = self

I didn't write it, I took pieces from around the internet.

The biggest unfortunate thing right now is that I have to index the elements; I can't index the structure itself. So I cannot do df[index].images; I have to do df.images[index].

The former style comes in handy when you need to chop a dataset up whole-hog for train/test/validation splits.
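
One way to get the former style back is a small extension of Bunch (my own sketch, not from the thread) whose __getitem__ applies the same index to every member:

class SliceableBunch(Bunch):
    def __getitem__(self, key):
        # String keys keep plain dictionary access by name...
        if isinstance(key, str):
            return dict.__getitem__(self, key)
        # ...anything else (slice, index array) is applied to every member,
        # assuming all members share the same first axis.
        return SliceableBunch({k: v[key] for k, v in self.items()})

# train, test = df[:8000], df[8000:]  -- then train.images, train.velocity, ...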

alexbw (Author) commented Dec 23, 2014

Also, if nobody objects, I'll close this issue. I think my original issue is solved, in that Pandas will not support arbitrary dtypes in Series.

shoyer (Member) commented Feb 5, 2015

@alexbw I agree, I think this issue can be considered resolved -- this is not going to happen easily in pandas itself, and it is probably better left to third-party packages; pandas does not need more scope creep. That said, I might leave it open, if only so that something turns up when people search open GitHub issues for "multidimensional".

Thanks also for sharing your approach. I know I'm repeating myself in this issue, but I'd like to note again for the record that each of your problems is something that xray is designed to solve (though it also tries to do more). Its Dataset object acts like your Bunch (I recently added attribute-style access for variables) but it does have support for simultaneous indexing of all variables. It also supports direct output to multi-dimensional netCDF4 files with optional chunking/compression, similar to what you accomplished with h5py (netCDF4 is a subtype of HDF5 with particular metadata conventions).

alexbw (Author) commented Feb 5, 2015

I will check out xray. I'm currently a fan of the thinness of the Bunch approach, but the lack of a global index is annoying. I'm also enjoying the efficiency of hand-tuned HDF5 data structures; I can go way beyond what pandas can do out of the box by paying close attention to how the data is written. I'm excited to see whether xray helps automate that process (it definitely doesn't have to be as manual as I currently make it).

jreback (Contributor) commented Oct 5, 2016

Closing this, but feel free to comment on specific use cases here, for the pandas 2.0 designs.

@jreback jreback closed this as completed Oct 5, 2016
@jreback jreback added the Dtype Conversions label Oct 5, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Someday Oct 5, 2016