Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

more flexible type of input arguments for px functions #1768

Merged
merged 69 commits into from
Oct 4, 2019

Conversation

emmanuelle
Copy link
Contributor

This is a first a draft for accepting more flexible types of input arguments for px functions, eg numpy arrays, dataframe columns, mixed type etc. At the moment the transformation done by build_or_augment_dataframe is done argument by argument, so having different arguments with different types is ok.

Todo:

  • write more tests
  • decide what to do with arguments which are lists, such as hover_data, custom_data, dimensions. At the moment they are just skipped so they should be column names. I did this because I wanted to discuss the names they should have, should it be customdata_0, customdata_1 etc. ?

For the cases which are taken into account, please see the test file.

Closes #1767 , follow-up on plotly/plotly_express#87 (thank you @malmaud !)

@emmanuelle
Copy link
Contributor Author

One other thing to fix : px.scatter(tips, x="bla", y='wrong') does not produce a meaningful error message any more, have to fix this.

@emmanuelle
Copy link
Contributor Author

TODO
Additional cases to take into account: check that error messages make sense when:

  • having different lengths for different kw arguments (or dimensions not compatible with the given DataFrame)
  • what happens with 2D arrays passed as parameters

At the moment the initial DataFrame is modified in-place, probably we should not do that (unless the DataFrame is very big??).

@emmanuelle
Copy link
Contributor Author

next step: start from empty dataframe and add columns

@nicolaskruchten
Copy link
Contributor

We should also move the "non-existent column" check and error message into this new function, which I think will help with some of this logic.

@emmanuelle emmanuelle changed the title [WIP] more flexible type of input arguments for px functions more flexible type of input arguments for px functions Sep 16, 2019
@emmanuelle
Copy link
Contributor Author

@nicolaskruchten this is ready for a round of review. In the meantime I'll write a better docstring for the helper function.

packages/python/plotly/plotly/express/_core.py Outdated Show resolved Hide resolved
df = pd.DataFrame()

# Retrieve labels (to change column names)
labels = args.get("labels") # labels or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it important to move the (re-)labelling logic into this function? Right now it all mostly happens later with various helper functions. If we're moving this logic into this function then should we refactor it out of where it is right now...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's better to leave the labelling logic later as it was before, this should be resolved now.

packages/python/plotly/plotly/express/_core.py Outdated Show resolved Hide resolved
@emmanuelle
Copy link
Contributor Author

For the docstrings of px functions I chose to use the name array_like which is commonly used in numpy to refer to both arrays and lists, and in the broad sense dataframes as well. This is hopefully ready for another round of review.

# Initialize set of column names
# These are reserved names
if df_provided:
reserved_names = _initialize_argument_col_names(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consistency-wise, maybe we should rename this function _get_reserved_col_names or something?

(pandas series type).
"""
df = args["data_frame"]
used_col_names = set()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we call it used_col_names but elsewhere it's reserved_names

@@ -754,6 +754,215 @@ def apply_default_cascade(args):
args["marginal_x"] = None


def _name_heuristic(argument, field_name, used_col_names):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we call it used_col_names but elsewhere it's reserved_names

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, there were a few iterations, hence the different names. I will change this.

col_name = argument
if isinstance(argument, int):
col_name = _name_heuristic(argument, field, reserved_names)
if field_name not in array_attrables:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we remove this if/else block as well as the continue and let it "fall through" to the bottom of the loop where we have the same logic?

@@ -813,6 +813,12 @@ def build_dataframe(args, attrables, array_attrables):
array_attrables : list
argument names corresponding to iterables, such as `hover_data`, ...
"""
args = dict(input_args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this makes a deep copy...

>>> x= [1,2,3]
>>> y = dict(a=x)
>>> z = dict(y)
>>> z["a"] is x
True
>>> z["a"][0]=100
>>> z
{'a': [100, 2, 3]}
>>> x
[100, 2, 3]

But also think this is only an issue for array_attrables where we write into args[something][something] so if we just copy those that's enough:

>>> x = [1,2,3]
>>> y = list(x)
>>> y is x
False
>>> y[0] = 100
>>> x
[1, 2, 3]

Copy link
Contributor Author

@emmanuelle emmanuelle Sep 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh no! I knew it does a deep copy for a list so I had wrongly assumed I could transpose this pattern to a dict.

if field in array_attrables and isinstance(
args[field], pd.core.indexes.base.Index
):
args[field] = list(args[field])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

following from comment above... if we just always do this, instead of in an if then I think we're safe (for now!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea! Then we don't have to do a deepcopy of the whole args, it's better not to copy the dataframe in particular.

Copy link
Contributor

@jonmmease jonmmease left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @emmanuelle. I left some minor comments inline.

I did find the automatic naming behavior a little confusing. What is the simplest way to explain when the .name property of an input series is kept vs. when the arg name is used?

continue
if isinstance(arg, str):
reserved_names.add(arg)
if isinstance(arg, pd.core.series.Series):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd.core.series.Series -> pd.Series. I believe pd.core is considered private by pandas

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thanks, I got the names by asking type(df) but more readable names are better!

)
col_name = argument
if isinstance(argument, int):
col_name = _name_heuristic(argument, field, reserved_names)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the if here. Can we always wrap argument in _name_heuristic(...)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out we could write simpler code here. Thanks !

for arg in names:
if arg is None:
continue
if isinstance(arg, str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small thing. Since it looks like these three if branches are mutually exclusive, I'd prefer if elif elif to make it clear than there isn't any multi-layer logic going on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I agree except for cases where the first if results in a continue or errors out ;)

for arg in names:
if arg is None:
continue
if isinstance(arg, str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this logic skips over int column labels. Is that intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really... but this list of reserved names is tested against the names of keyword arguments which will never be ints (I think) so it should be ok. On the other hand it makes no harm to add int names to the list...

df_output[col_name] = df_input[argument]
# ----------------- argument is a column / array / list.... -------
else:
is_index = isinstance(argument, pd.core.indexes.range.RangeIndex)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd.core.indexes.range.RangeIndex -> pd.RangeIndex

if argument is None:
continue
# Case of multiindex
if isinstance(argument, pd.core.indexes.multi.MultiIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd.core.indexes.multi.MultiIndex -> pd.MultiIndex

@nicolaskruchten
Copy link
Contributor

What is the simplest way to explain when the .name property of an input series is kept vs. when the arg name is used?

I believe that with what we landed on, the simplest explanation is that .name is used only to check if the input is the same object as the correspondingly-named column in data_frame (if data_frame is provided) otherwise it's ignored. Doing more than that caused all sorts of weird(er) conflicts!

@emmanuelle
Copy link
Contributor Author

Thanks a lot for your review @jonmmease, we have iterated a lot on this piece of code and having a fresh eye is wonderful. I think I addressed all your comments, which helped to improve the readability of the code.
Regarding when to use name, as explained by @nicolaskruchten we only used names of series equal to columns of the dataframe argument. With the possibility to pass arguments coming from different dataframes, handling conflicts was very tricky and hard to explain. Another possibility would have been to take the name whenever it exists and raise an error when a conflict arises, but we preferred to have a strict approach in the beginning and it is still possible in future versions to see if we want to change this behaviour.

@nicolaskruchten
Copy link
Contributor

nicolaskruchten commented Sep 28, 2019

Let's do a pass where we ensure all error messages have correct spacing between words:

ValueError: All arguments should have the same length.The length of argument color is 3, whereas thelength of previous arguments ['x', 'y', 'facet_row', 'facet_col'] is 4 ... "length.The length" and "thelength".

Similar: String or int arguments are only possible when aDataFrame or an array is provided in the data_frameargument. No DataFrame was provided, but argument 'hover_data_0'is of type str or int.

keep_name = df_provided and argument is df_input.index
else:
# we use getattr/hasattr because of index
keep_name = hasattr(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df = px.data.tips()
px.scatter(df, x=df["size"], y=df.tip)

This has a weird behaviour: it doesn't pick up the name "size" even though it is a valid column... This is almost certainly because df.size doesn't return df["size"]. Thanks Pandas, again :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the solution is to look up columns via df_input["name"] and not getattr and just special-case the index thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it was already a special case, so the modification was easy

@nicolaskruchten
Copy link
Contributor

OK, other than the spaces in the error messages and the df.size thing I think we're good! and we finally have the start of a decent PX test suite: yay!

keep_name = hasattr(
df_input, col_name
) and argument is getattr(df_input, col_name)
keep_name = col_name in df_input and argument.equals(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intentional to move from is to .equals() here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also we should remove the comment above now ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it was intentional, I think it's enough that the values are all equal. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I reverted to is

@nicolaskruchten
Copy link
Contributor

💃 💃 💃 thanks so much for getting this thing over the finish line!

@emmanuelle
Copy link
Contributor Author

yeehaw ! Merging :-)

@emmanuelle emmanuelle merged commit 4a30dc6 into master Oct 4, 2019
@emmanuelle emmanuelle deleted the px-input-arguments branch October 4, 2019 01:50
@jason-curtis
Copy link
Contributor

Would love to consume this from PyPI! Is there a release planned?

@nicolaskruchten
Copy link
Contributor

Version 4.2 should come out this week, yes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Plotly Express column input formats
4 participants