Handling pandas Series and single column DataFrames #2164

Merged 2 commits on Feb 5, 2018

Conversation

philippjfr
Member

@philippjfr philippjfr commented Dec 1, 2017

This PR adds support for handling pandas and dask Series and single column DataFrames by resetting their index, turning it into a dimension. It should not have any backwards compatibility issues, except that constructing a Dataset from a single column DataFrame will now use the index and make it a dimension, while previously it would have just had the single column as a dimension. In all other cases it would have previously errored.
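As a rough pandas-only illustration of what "resetting the index" means here (a sketch of the idea, not the code in this PR; the variable names are made up for the example):

```python
import pandas as pd

# A Series with a named index, as returned by many pandas operations.
series = pd.Series([1.0, 4.0, 9.0], index=pd.Index([1, 2, 3], name="x"), name="y")

# reset_index() promotes the index to an ordinary column, so it can act as a dimension.
series_table = series.reset_index()
print(list(series_table.columns))  # ['x', 'y']

# A single-column DataFrame with an unnamed index: pandas calls the new column 'index'.
single = pd.DataFrame({"y": [1.0, 4.0, 9.0]})
single_table = single.reset_index()
print(list(single_table.columns))  # ['index', 'y']
```

In both cases the result has two columns, which is enough for a key dimension plus a value dimension.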

@philippjfr philippjfr added this to the v1.10 milestone Dec 1, 2017
@jbednar
Member

jbednar commented Dec 1, 2017

This is really helpful, thanks! Given that the output of so many Pandas operations is a single-column dataframe or series with an index, we can now plot these without having to reset the index or specify the dimensions explicitly.

@philippjfr
Member Author

philippjfr commented Dec 4, 2017

So I've got some ideas on how to improve this a bit further. I'd suggest that if the user references an index as a key dimension by name we could also reset the index so it can be used by the interface. For dataframes with an index column with no name you should also simply be able to reference it by supplying 'index' as the key dimension.

So you might do something like this, a dataframe with three columns of which you want to plot one:

df = pd.DataFrame({'A': range(10), 'B': np.random.rand(10), 'C': np.random.randn(10)})
hv.Curve(df, 'index', 'A')
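A minimal pandas-only sketch of the lookup rule described above. The helper `with_index_as_column` is hypothetical and is not part of HoloViews; it only illustrates the idea that a key dimension naming the index (or 'index' for an unnamed one) triggers a reset:

```python
import numpy as np
import pandas as pd


def with_index_as_column(df, kdim):
    """Hypothetical helper (not HoloViews API): make `kdim` available as a column."""
    if kdim in df.columns:
        return df  # already a regular column, nothing to do
    # An unnamed index is addressed as 'index', matching pandas' reset_index() default.
    index_name = df.index.name if df.index.name is not None else "index"
    if kdim == index_name:
        return df.reset_index()
    raise KeyError(f"{kdim!r} is neither a column nor the index")


df = pd.DataFrame({"A": range(10), "B": np.random.rand(10), "C": np.random.randn(10)})
table = with_index_as_column(df, "index")
print(list(table.columns))  # ['index', 'A', 'B', 'C']
```

Referencing an existing column leaves the dataframe untouched, so nothing changes for data that already worked.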

This also got me thinking about "wide" datasets and what the best way would be to let the user get them into HoloViews. I feel fairly strongly that data should generally be tidy when working with HoloViews: the columns of a wide dataset do not each correspond to a separate value dimension; instead, the column names represent values along a secondary dimension. I'd therefore suggest we provide an easy way to "melt" a dataset. pd.melt provides very useful functionality for making datasets tidy, but it has always confused me. By adding a simple .melt method to datasets I think we could make things much more intuitive, e.g. take the example above:

df = pd.DataFrame({'A': range(10), 'B': np.random.rand(10), 'C': np.random.randn(10)})
hv.Curve(df, 'index').melt()

image

It's gone from a wide to a tidy format and given the value dimension the name 'Value', and the columns have become individual Curves indexed by the 'Group' dimension. Now let's look at a more realistic example. Here's a mock stock timeseries dataset with three columns:

df = pd.DataFrame(np.random.randn(30, 3), columns=['MSFT', 'IBM', 'AAPL'],
                  index=pd.date_range('2017-01-01', '2017-01-30', name='Date')).cumsum()

hv.Curve(df, 'Date').melt('Close ($)', 'Stock')

[screenshot of the resulting plot]

This time we've given the index, value and group dimensions actual names.
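For comparison, roughly the same tidy table can be built with plain pandas. This is only a sketch of what the proposed .melt would correspond to, assuming it behaves like reset_index() followed by pd.melt; the proposed method itself is not implemented here:

```python
import numpy as np
import pandas as pd

# Same mock stock data as above: one column per stock, dates in the index.
df = pd.DataFrame(
    np.random.randn(30, 3),
    columns=["MSFT", "IBM", "AAPL"],
    index=pd.date_range("2017-01-01", "2017-01-30", name="Date"),
).cumsum()

# Wide -> tidy: the index becomes a 'Date' column, the old column names become
# values of 'Stock', and the numbers end up in a single 'Close ($)' column.
tidy = df.reset_index().melt(id_vars="Date", var_name="Stock", value_name="Close ($)")

print(tidy.shape)          # (90, 3): 30 dates x 3 stocks
print(list(tidy.columns))  # ['Date', 'Stock', 'Close ($)']
```

Each row of the tidy table is one observation (date, stock, price), which is the shape a key dimension, a grouping dimension and a value dimension map onto directly.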

@jbednar
Member

jbednar commented Dec 7, 2017

> I'd suggest that if the user references an index as a key dimension by name we could also reset the index so it can be used by the interface. For dataframes with an index column with no name you should also simply be able to reference it by supplying 'index' as the key dimension.

Sounds good!

> I'd therefore suggest we provide an easy way to "melt" a dataset.

Your examples look good, and I think they will help address #2162. But it's hard for me to see what hv.Curve(df, 'Date') would do before the melting; is that a meaningful object? I was expecting to see you propose for .melt to be a method on hv.Dataset, not on hv.Curve.

@philippjfr
Member Author

philippjfr commented Dec 7, 2017

> But it's hard for me to see what hv.Curve(df, 'Date') would do before the melting; is that a meaningful object?

It would display just the first column, so I do think it's meaningful. It's even useful for gridded data where you might have multiple value arrays in your xr.Dataset.

hv.Image(xrds).melt()

> I was expecting to see you propose for .melt to be a method on hv.Dataset, not on hv.Curve.

(Almost) all elements except annotations are Dataset subclasses, so yes it is a method on Dataset and Curve.

@jbednar
Member

jbednar commented Dec 7, 2017

Hmm. It seems more natural to me to do that sort of data manipulation before the object is visualizable, but I'm not sure. Can you make an example where you melt an hv.Dataset, then display it as a table, and only then make a Curve out of it, so that it's clear what the melting did?

@philippjfr
Member Author

These two would be equivalent:

hv.Dataset(df, 'index').melt().to(hv.Curve, 'index').overlay()

and:

hv.Curve(df, 'index').melt().overlay()
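To make it clearer what the melt step does on its own, here is a pandas-only sketch. It is an assumption about the intended behaviour, reusing the default 'Group' and 'Value' names from the earlier example: it first builds the tidy table and only then splits it into one table per group, which is roughly what each curve in the overlay would be drawn from:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(10), "B": np.random.rand(10), "C": np.random.randn(10)})

# Step 1: melt into a tidy table that can be inspected on its own.
tidy = df.reset_index().melt(id_vars="index", var_name="Group", value_name="Value")
print(tidy.head())

# Step 2: one (index, Value) table per original column, i.e. one curve each.
curves = {
    name: group[["index", "Value"]].reset_index(drop=True)
    for name, group in tidy.groupby("Group")
}
print(sorted(curves))  # ['A', 'B', 'C']
```

Splitting after melting recovers the original columns, which is why both spellings above can end up with the same overlay.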

@jbednar
Member

jbednar commented Dec 8, 2017

Makes sense, thanks! Might be good to introduce it that way in the docs so that it's clear what it's doing, then point out that the second way is more convenient and does the same thing. Shouldn't melt() just assume that it should use the index column as the kdim, as the default?

@jlstevens
Contributor

I definitely support this feature though I would need to think carefully about how intuitive this behavior would be. The goal is to have things 'just work' without messing around people who might be making certain assumptions about the index on their dataframes.

Whatever we do, we need a new user guide talking about tidy vs wide formats and the melt method when it is introduced.

@philippjfr
Member Author

philippjfr commented Dec 14, 2017

> The goal is to have things 'just work' without messing around people who might be making certain assumptions about the index on their dataframes.

More complex and automatic handling of pandas indexes is going to have to wait a while longer as it has potential backward compatibility implications. For now this will allow using an index if requested as kdim and automatically using it only in cases where it would have errored before. I'd be happy to rework the existing user guides with material on working with different datasets.

@philippjfr
Member Author

Let's deal with my suggestion for a melt method in another PR. For now I think we should merge this immediately.

@jlstevens
Contributor

Looks good and I agree it should be merged ASAP, just to get some real world testing.

It would be good to update the docs about this in the PR implementing the melt method.

Merging.

@jlstevens jlstevens merged commit 9636446 into master Feb 5, 2018
@philippjfr philippjfr deleted the pandas_index branch February 11, 2018 17:02

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 25, 2024