Handling pandas Series and single column DataFrames #2164
This is really helpful, thanks! Given that the output of so many Pandas operations is a single-column dataframe or series with an index, we can now plot these without having to reset the index or specify the dimensions explicitly. |
So I've got some ideas on how to improve this a bit further. I'd suggest that if the user references an index as a key dimension by name, we could also reset the index so it can be used by the interface. For dataframes whose index has no name, you should also be able to reference it simply by supplying 'index' as the key dimension. So you might do something like this, with a dataframe of three columns of which you want to plot one:

df = pd.DataFrame({'A': range(10), 'B': np.random.rand(10), 'C': np.random.randn(10)})
hv.Curve(df, 'index', 'A')

This also got me thinking about "wide" datasets and what the best way would be to let the user get them into HoloViews. I feel fairly strongly that data should generally be tidy when working with HoloViews, since each column in a wide dataset does not correspond to a value dimension; instead, the column names represent values along a secondary dimension. I'd therefore suggest we provide an easy way to "melt" a dataset:

df = pd.DataFrame({'A': range(10), 'B': np.random.rand(10), 'C': np.random.randn(10)})
hv.Curve(df, 'index').melt()

It's gone from a wide to a tidy format and given the value dimension the name
This time we've given the index, value and group dimensions actual names. |
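For readers unfamiliar with the wide versus tidy distinction discussed here, the reshaping can be sketched in plain pandas. This is only an illustration using `DataFrame.reset_index` and `DataFrame.melt`, not the proposed HoloViews `melt()` method, and the column names `variable` and `value` are the pandas defaults rather than anything chosen in this thread.

```python
import numpy as np
import pandas as pd

# Wide frame from the comment above: one column per series, unnamed index.
df = pd.DataFrame({'A': range(10), 'B': np.random.rand(10), 'C': np.random.randn(10)})

# reset_index() exposes the unnamed index as a regular column called 'index'.
# melt() then stacks columns A, B and C into a single value column, with a
# second column recording which original column each value came from.
tidy = df.reset_index().melt(id_vars='index', var_name='variable', value_name='value')

print(tidy.shape)             # 3 columns x 10 rows become 30 rows of 3 columns
print(tidy.columns.tolist())  # ['index', 'variable', 'value']
```

In the tidy result, the former column names A, B and C become values along a secondary dimension, which is the property the comment argues HoloViews data should generally have.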
Sounds good!
Your examples look good, and I think they will help address #2162. But it's hard for me to see what |
It would display just the first column, so I do think it's meaningful. It's even useful for gridded data where you might have multiple value arrays in your
hv.Image(xrds).melt()
(Almost) all elements except annotations are |
Hmm. It seems more natural to me to do that sort of data manipulation before the object is visualizable, but I'm not sure. Can you make an example where you melt an hv.Dataset, then display it as a table, and only then make a Curve out of it, so that it's clear what the melting did? |
These two would be equivalent:

hv.Dataset(df, 'index').melt().to(hv.Curve, 'index').overlay()

and:

hv.Curve(df, 'index').melt().overlay()
Makes sense, thanks! Might be good to introduce it that way in the docs so that it's clear what it's doing, then point out that the second way is more convenient and does the same thing. Shouldn't melt() just assume that it should use the index column as the kdim, as the default? |
I definitely support this feature, though I would need to think carefully about how intuitive this behavior would be. The goal is to have things 'just work' without tripping up people who might be making certain assumptions about the index on their dataframes. Whatever we do, we need a new user guide covering tidy vs wide formats and the melt method when it is introduced.
More complex and automatic handling of pandas indexes is going to have to wait a while longer as it has potential backward compatibility implications. For now this will allow using an index if requested as kdim and automatically using it only in cases where it would have errored before. I'd be happy to rework the existing user guides with material on working with different datasets. |
fe57f1a to b5f67e1
Let's deal with my suggestion for a |
b5f67e1 to fd840b5
Looks good and I agree it should be merged ASAP, just to get some real world testing. It would be good to update the docs about this in the PR implementing the Merging. |
This PR adds support for handling pandas and dask Series and single-column DataFrames by resetting their index, turning it into a dimension. It should not have any backwards compatibility issues, except that constructing a Dataset from a single-column DataFrame will now use the index and make it a dimension, whereas previously it would have had just the single column as a dimension. In all other cases it would previously have errored.
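What "resetting the index" produces can be seen with pandas alone. The sketch below is not the PR's implementation; it only shows the columns that `reset_index` yields for a Series and for a single-column DataFrame, which is presumably what then becomes available as dimensions. The names `x` and `y` are made up for the example.

```python
import pandas as pd

# A Series with a named index: reset_index() returns a two-column DataFrame,
# with the index as the first column and the Series values as the second.
s = pd.Series([1.0, 4.0, 9.0], index=pd.Index([10, 20, 30], name='x'), name='y')
frame = s.reset_index()
print(frame.columns.tolist())         # ['x', 'y']

# A single-column DataFrame with an unnamed default index: the index column
# produced by reset_index() is called 'index'.
single = pd.DataFrame({'y': [1.0, 4.0, 9.0]})
single_reset = single.reset_index()
print(single_reset.columns.tolist())  # ['index', 'y']
```

In both cases the result has two columns where there was previously one, so an element such as a curve has both a key and a value to work with.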