Handling pandas Series and single column DataFrames #2164

philippjfr · 2017-12-01T03:36:09Z

This PR adds support for handling pandas and dask Series and single column DataFrames by resetting their index, turning it into a dimension. It should not have any backwards compatibility issues, except that constructing a Dataset from a single column DataFrame will now use the index and make it a dimension, while previously it would have just had the single column as a dimension. In all other cases it would have previously errored.
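For readers unfamiliar with the pandas side of this, here is a minimal sketch of what "resetting the index" does to a Series or a single-column DataFrame. It uses only pandas and is not the PR's actual implementation; the names `prices`, `ticker` and `price` are made up for illustration.

```python
import pandas as pd

# A named Series with a named index, as many pandas operations return.
prices = pd.Series(
    [10.0, 10.5, 9.8],
    index=pd.Index(["a", "b", "c"], name="ticker"),
    name="price",
)

# reset_index() moves the index into an ordinary column, so the data
# becomes a two-column table: one candidate key column, one value column.
table = prices.reset_index()
print(list(table.columns))

# A single-column DataFrame behaves the same way.
frame_table = prices.to_frame().reset_index()
print(list(frame_table.columns))
```

After the reset, the former index is available as a regular column that an interface can treat as a dimension.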

jbednar · 2017-12-01T03:59:01Z

This is really helpful, thanks! Given that the output of so many Pandas operations is a single-column dataframe or series with an index, we can now plot these without having to reset the index or specify the dimensions explicitly.

philippjfr · 2017-12-04T00:41:49Z

So I've got some ideas on how to improve this a bit further. I'd suggest that if the user references an index as a key dimension by name we could also reset the index so it can be used by the interface. For dataframes with an index column with no name you should also simply be able to reference it by supplying 'index' as the key dimension.

For example, given a dataframe with three columns, of which you want to plot one, you might do something like this:

df = pd.DataFrame({'A': range(10), 'B': np.random.rand(10), 'C': np.random.randn(10)})
hv.Curve(df, 'index', 'A')
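As a rough pandas-only illustration of why 'index' is a natural name for an unnamed index: `reset_index()` on a dataframe whose index has no name produces a column called `index`. The sketch below only shows that pandas behavior; how HoloViews resolves the name internally is not shown here, and `flat` and `xy` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(10), "B": np.random.rand(10), "C": np.random.randn(10)})

# The default RangeIndex has no name.
print(df.index.name)

# Resetting it yields a column named 'index' alongside the original columns.
flat = df.reset_index()
print(list(flat.columns))

# The two columns that a curve of 'A' against 'index' would be drawn from.
xy = flat[["index", "A"]]
```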

This also got me thinking about "wide" datasets and what the best way would be to let the user get them into HoloViews. I feel fairly strongly that data should generally be tidy when working with HoloViews: the columns of a wide dataset do not each correspond to a separate value dimension; instead, the column names represent values along a secondary dimension. I'd therefore suggest we provide an easy way to "melt" a dataset. pd.melt provides very useful functionality to make datasets tidy, but it's always confused me. By adding a simple .melt method to datasets I think we could make things much more intuitive, e.g. take the example above:

df = pd.DataFrame({'A': range(10), 'B': np.random.rand(10), 'C': np.random.randn(10)})
hv.Curve(df, 'index').melt()

It's gone from a wide to a tidy format, given the value dimension the name 'Value', and the columns have become individual Curves indexed by the 'Group' dimension. Now let's look at a more realistic example; here's a mock stock timeseries dataset with three columns:

df = pd.DataFrame(np.random.randn(30, 3), columns=['MSFT', 'IBM', 'AAPL'],
                  index=pd.date_range('2017-01-01', '2017-01-30', name='Date')).cumsum()

hv.Curve(df, 'Date').melt('Close ($)', 'Stock')

This time we've given the index, value and group dimensions actual names.
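Since the .melt method above is only a proposal, here is a sketch of the same reshaping done with pandas alone, assuming the proposed arguments map onto `value_name` and `var_name` in `DataFrame.melt`. That mapping is an assumption for illustration, not something confirmed in this thread.

```python
import numpy as np
import pandas as pd

# Wide format: one column per stock, dates in the index.
wide = pd.DataFrame(
    np.random.randn(30, 3),
    columns=["MSFT", "IBM", "AAPL"],
    index=pd.date_range("2017-01-01", "2017-01-30", name="Date"),
).cumsum()

# Tidy format: one row per (Date, Stock) pair and a single value column.
tidy = wide.reset_index().melt(
    id_vars="Date",         # keep the former index as the key column
    var_name="Stock",       # old column names become values of 'Stock'
    value_name="Close ($)"  # all cell values go into one column
)
print(tidy.head())
```

The tidy table has 30 × 3 rows, with 'Date' and 'Stock' acting as keys and 'Close ($)' as the single value column.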

jbednar · 2017-12-07T20:44:43Z

> I'd suggest that if the user references an index as a key dimension by name we could also reset the index so it can be used by the interface. For dataframes with an index column with no name you should also simply be able to reference it by supplying 'index' as the key dimension.

Sounds good!

> I'd therefore suggest we provide an easy way to "melt" a dataset.

Your examples look good, and I think they will help address #2162. But it's hard for me to see what hv.Curve(df, 'Date') would do before the melting; is that a meaningful object? I was expecting to see you propose for .melt to be a method on hv.Dataset, not on hv.Curve.

philippjfr · 2017-12-07T20:53:01Z

> But it's hard for me to see what hv.Curve(df, 'Date') would do before the melting; is that a meaningful object?

It would display just the first column, so I do think it's meaningful. It's even useful for gridded data where you might have multiple value arrays in your xr.Dataset.

hv.Image(xrds).melt()

> I was expecting to see you propose for .melt to be a method on hv.Dataset, not on hv.Curve.

(Almost) all elements except annotations are Dataset subclasses, so yes it is a method on Dataset and Curve.

jbednar · 2017-12-07T21:13:26Z

Hmm. It seems more natural to me to do that sort of data manipulation before the object is visualizable, but I'm not sure. Can you make an example where you melt an hv.Dataset, then display it as a table, and only then make a Curve out of it, so that it's clear what the melting did?

philippjfr · 2017-12-07T23:53:12Z

These two would be equivalent:

hv.Dataset(df, 'index').melt().to(hv.Curve, 'index').overlay()

and:

hv.Curve(df, 'index').melt().overlay()
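To make the intermediate table visible, the following pandas-only sketch melts the three-column example and then splits it into one table per group, which is roughly what the overlay of curves would be drawn from. The 'Group' and 'Value' names follow the defaults described earlier in the thread; the `curves` dictionary is purely illustrative and is not a HoloViews object.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(10), "B": np.random.rand(10), "C": np.random.randn(10)})

# Step 1: melt into a tidy table with keys 'index' and 'Group'.
melted = df.reset_index().melt(id_vars="index", var_name="Group", value_name="Value")
print(melted.head())

# Step 2: split by 'Group'; each piece holds the x/y data for one curve.
curves = {
    name: group[["index", "Value"]].reset_index(drop=True)
    for name, group in melted.groupby("Group")
}
print(sorted(curves))
```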

jbednar · 2017-12-08T00:17:54Z

Makes sense, thanks! Might be good to introduce it that way in the docs so that it's clear what it's doing, then point out that the second way is more convenient and does the same thing. Shouldn't melt() just assume that it should use the index column as the kdim, as the default?

jlstevens · 2017-12-12T12:05:30Z

I definitely support this feature, though I would need to think carefully about how intuitive this behavior would be. The goal is to have things 'just work' without tripping up people who might be making certain assumptions about the index on their dataframes.

Whatever we do, we need a new user guide talking about tidy vs wide formats and the melt method when it is introduced.

philippjfr · 2017-12-14T21:51:28Z

> The goal is to have things 'just work' without messing around people who might be making certain assumptions about the index on their dataframes.

More complex and automatic handling of pandas indexes is going to have to wait a while longer as it has potential backward compatibility implications. For now this will allow using an index if requested as kdim and automatically using it only in cases where it would have errored before. I'd be happy to rework the existing user guides with material on working with different datasets.

philippjfr · 2018-02-03T17:49:20Z

Let's deal with my suggestion for a melt method in another PR. For now I think we should merge this immediately.

jlstevens · 2018-02-05T12:14:40Z

Looks good and I agree it should be merged ASAP, just to get some real world testing.

It would be good to update the docs about this in the PR implementing the melt method.

Merging.

github-actions · 2024-10-25T06:35:37Z

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

philippjfr added the "tag: component: data" and "type: feature" labels Dec 1, 2017

philippjfr added this to the v1.10 milestone Dec 1, 2017

philippjfr mentioned this pull request Dec 6, 2017

Document differences in dimension handling needed for tidy vs. wide tabular datasets #2162

Open

philippjfr force-pushed the pandas_index branch from fe57f1a to b5f67e1 December 15, 2017 13:37

philippjfr added 2 commits February 3, 2018 17:48

Handling pandas Series and single column DataFrames

b3c8c4c

Handle explicitly referenced pandas index in Dataset constructor

fd840b5

philippjfr force-pushed the pandas_index branch from b5f67e1 to fd840b5 February 3, 2018 17:49

jlstevens merged commit 9636446 into master Feb 5, 2018

philippjfr mentioned this pull request Feb 5, 2018

reading DataFrame index as kdim or vdim in DataSet #2000

Closed

philippjfr deleted the pandas_index branch February 11, 2018 17:02

github-actions bot locked as resolved and limited conversation to collaborators Oct 25, 2024
