This repository has been archived by the owner on Jun 3, 2024. It is now read-only.

Plotly express should check if input data is tidy #141

Closed
matanox opened this issue Sep 18, 2019 · 12 comments

@matanox

matanox commented Sep 18, 2019

Currently, it's easy to run into internal errors like the following:

 --> 277  hover_lines = [k + "=" + v for k, v in mapping_labels.items()]
     278  result["hovertemplate"] = hover_header + "<br>".join(hover_lines)
     279  return result, fit_results

TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

It would be better to check that the input dataframe is fine before charting and issue a pinpointed error message, rather than fail internally in the dirty Python tradition.
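For readers trying to interpret the traceback: the message is what Python produces when the left operand `k` in `k + "=" + v` is `None`. The sketch below reproduces that message in isolation. The dictionary contents are an assumption made for illustration, not the values plotly express actually built.

```python
# Hypothetical stand-in for the labels mapping from the traceback.
# The None key is an assumption inferred from the error message.
mapping_labels = {None: "%{x}", "count": "%{y}"}


def build_hover_lines(labels):
    # Same expression as line 277 in the traceback.
    return [k + "=" + v for k, v in labels.items()]


try:
    build_hover_lines(mapping_labels)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Because `k + "="` is evaluated first, a `None` key fails before the value is ever looked at.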

@matanox
Author

matanox commented Sep 18, 2019

How to minimally reproduce:

conda install -c plotly plotly_express==0.4.0

import pandas as pd
import plotly.express as px

lengths = pd.DataFrame(list(range(1000)))
fig = px.histogram(lengths)
fig.show()

The root cause is an ill-typed dataframe column: when the dataframe is created this way, the column name is an int. Still, it would be good to verify inputs rather than fail internally. I fail to see how this would negatively impact rendering time, and it would certainly avoid developer usability issues. The verifications could be made conditional so that there is no performance penalty once user code is ready. This would make plotly express much more usable and boost productivity for users.

The solution to this particular case is to explicitly set the column name to a string, something like so:

lengths = pd.DataFrame(list(range(1000)), columns=['length'])
fig = px.histogram(lengths, x='length')
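A minimal sketch of the kind of conditional input check suggested in this comment. The names `find_non_string_columns` and `validate_columns` are hypothetical helpers, not part of plotly express, and the rule "column names must be strings" is the assumption under discussion here rather than a documented requirement of the library.

```python
def find_non_string_columns(columns):
    """Return the column labels that are not strings (hypothetical helper)."""
    return [c for c in columns if not isinstance(c, str)]


def validate_columns(columns, enabled=True):
    """Raise a pinpointed error for non-string column names.

    Pass enabled=False to skip the check once user code is known to be fine,
    so the verification costs nothing in that case.
    """
    if not enabled:
        return
    bad = find_non_string_columns(columns)
    if bad:
        raise TypeError(
            f"Column names must be strings; got non-string names: {bad!r}. "
            "Consider naming them explicitly, e.g. columns=['length']."
        )


# With a pandas DataFrame `df`, this would be called as validate_columns(df.columns).
validate_columns(["length"])           # passes silently
validate_columns([0], enabled=False)   # check disabled, passes silently
```

Calling `validate_columns([0])` with the check enabled raises a `TypeError` that names the offending column.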

@matanox
Author

matanox commented Sep 18, 2019

Speaking of tidy, isn't the tidy manifest you link from the docs talking about column names being values and not names?

@emmanuelle
Contributor

Thank you @matanster, this case should indeed either fail gracefully or be handled correctly. We're in the process of accepting a larger variety of input arguments in px (see plotly/plotly.py#1768), we'll see how to handle your case while working on this PR.

@nicolaskruchten
Contributor

Indeed, erroring out here is not the right thing.

I believe that code like the following should just result in an empty plot because nothing is being mapped:

import plotly.express as px
import pandas as pd

px.histogram(pd.DataFrame([1,2,3]))

@nicolaskruchten
Contributor

> Speaking of tidy, isn't the tidy manifest you link from the docs talking about column names being values and not names?

I'm not sure what this means :)

@matanox
Author

matanox commented Sep 19, 2019

Thanks for your positive attitude! It seems that, in defiance of Python tradition, an input-safe library would be even more awesome than what plotly express already is, especially given that charts are typically developed ad hoc by people who don't recall every caveat or limitation of the API at the moment they come to create, iterate on, and choose between visualizations.

@matanox
Author

matanox commented Sep 19, 2019

Here's where one of the current top results for plotly express on google mentions a concept they call tidy, or a tidy dataframe:

image

And here's the link for the "tidy dataframes manifesto", for what it's worth. It says in there that column names should be values and not names, which is conceptually quite the opposite of limiting the type allowed for column names in dataframe input to plotly express. Since it's a long write-up, here's a screenshot where they say and exemplify that:

image

Sorry for the repetitive nature of this follow-up, but I hope it clarifies where I got tidy from ... and how it is mildly related ...

@nicolaskruchten
Contributor

Oh ok I see what you're referring to. That section header "Column headers are values, not variable names" is actually part of a list of anti-patterns, i.e. what tidy data is not, if you look at the paragraph preceding:

image
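To make the anti-pattern concrete, here is a small sketch in plain Python that reshapes a table whose column headers are values into a tidy one. The data and the `to_long` helper are made up for illustration; with pandas, `DataFrame.melt` (with `id_vars`, `var_name`, `value_name`) does the equivalent reshaping.

```python
# Made-up illustrative data: the headers "2018" and "2019" are values of a
# "year" variable, which is the anti-pattern named by that section header.
wide_rows = [
    {"city": "A", "2018": 10, "2019": 12},
    {"city": "B", "2018": 7, "2019": 9},
]


def to_long(rows, id_key, var_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            long_rows.append({id_key: row[id_key], var_name: key, value_name: value})
    return long_rows


tidy_rows = to_long(wide_rows, id_key="city", var_name="year", value_name="count")
for row in tidy_rows:
    print(row)
```

In the tidy result each row is one observation, and "year" is an ordinary column that can be mapped to an axis or a color.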

@matanox
Author

matanox commented Sep 19, 2019

Mmmmm, sorry about that.
Either way, such a lengthy blog post can't be relied upon too much as a resource for how to use a given API, at least not the way that post from 2016 is.

@nicolaskruchten
Contributor

Fair enough.

In any case, we are working on more flexible input types, although the "tidy" philosophy will remain, i.e. you'll still need to concatenate your vectors to do this sort of thing: https://stackoverflow.com/questions/57988604. You just won't be required to stick them into a data frame first :)
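A rough sketch of what "concatenate your vectors" can look like without a data frame: two samples are joined into one value vector plus a parallel label vector. The sample values and variable names are made up, and the plotting call is left commented out because it assumes a plotly version that accepts array-like arguments directly.

```python
# Made-up sample data: two separate vectors to compare in one histogram.
first = [1, 2, 2, 3]
second = [2, 3, 3, 4, 5]

# Tidy form: one vector of values and one parallel vector saying which
# sample each value came from.
values = first + second
series = ["first"] * len(first) + ["second"] * len(second)

# Assumed usage, requires a plotly version with array-like inputs:
# import plotly.express as px
# fig = px.histogram(x=values, color=series)
# fig.show()

print(list(zip(values, series)))
```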

@nicolaskruchten
Contributor

Today, either of these will work:

import pandas as pd
import plotly.express as px

lengths = pd.DataFrame(list(range(1000)))
fig = px.histogram(lengths, x=0)
fig.show()

fig = px.histogram(x=range(1000))
fig.show()
