fix: parse pandas pivot null values #29898
Not sure if this conversion shouldn't be optional.
superset/charts/post_processing.py
Outdated
@@ -150,6 +151,8 @@ def pivot_df( # pylint: disable=too-many-locals, too-many-arguments, too-many-s
if show_rows_total:
# add subtotal for each group and overall total; we start from the
# overall group, and iterate deeper into subgroups
# Ensure "NULL" strings are replaced with NaN
df.replace("NULL", np.nan, inplace=True)
I don't think we should do this automatically, ideally there should be an option when building the pivot table to have this conversion. Or people could do it as a derived column or virtual dataset.
Imagine I have a table of users and someone has the username 'NULL'. I don't think we should do this conversion in that case. This is not hypothetical, Instagram's data infra once went down because someone created a user called null.
(Not as bad as when an employee broke the Facebook intranet by using their initials as their username — www.)
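To make the concern concrete, here is a minimal sketch. The `users` frame and its column names are made up for illustration and are not Superset data; it only shows that a blanket `replace` cannot tell a genuine "NULL" string apart from a placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one user really is called "NULL".
users = pd.DataFrame(
    {"username": ["alice", "NULL", "bob"], "logins": [3, 7, 1]}
)

# Blanket replacement, in the spirit of the line under review.
converted = users.replace("NULL", np.nan)

# The real username is now indistinguishable from a missing value.
lost = int(converted["username"].isna().sum())
print(lost)
```

Nothing in the original frame was missing, yet the converted frame reports a null username.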
Speaking of breaking things, looks like the GitHub auto-linker is broken...
lol, Bobby Tables.
Are you suggesting this chart has a user-defined option to fill nulls with ? or maybe a drop-down of some value?
Hi, my name is NULL, and my last name is '); DROP TABLE users; --
@betodealmeida and I met and talked about using a different placeholder string that we thought would be an unlikely "real" value: SUPERSET_PANDAS_NAN
0e44172 to c34d0bb
Ah, I think I misunderstood this. The 'NULL' string is being introduced here:
superset/superset/charts/post_processing.py
Lines 88 to 89 in fb6efb9
# pivoting with null values will create an empty df | |
df = df.fillna("NULL") |
I wonder if (1) we should do the conversion back to nan in the same function (pivot_df), and if (2) we should use a different sentinel value?
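A rough sketch of what keeping both halves of the round trip in one function could look like. `pivot_with_nulls` is a hypothetical helper, not Superset's `pivot_df`, and the sentinel is the `SUPERSET_PANDAS_NAN` placeholder proposed earlier in this thread; the sketch only handles a single sum aggregation.

```python
import numpy as np
import pandas as pd

SENTINEL = "SUPERSET_PANDAS_NAN"  # placeholder proposed in this thread


def pivot_with_nulls(df: pd.DataFrame, rows: list[str], value: str) -> pd.DataFrame:
    """Hypothetical helper: fill, pivot, and restore in one place."""
    filled = df.copy()
    # Only the grouping columns need the sentinel; values stay untouched.
    filled[rows] = filled[rows].fillna(SENTINEL)
    pivoted = filled.pivot_table(index=rows, values=value, aggfunc="sum")
    # Map the sentinel label back to NaN before anything is displayed.
    return pivoted.rename(index={SENTINEL: np.nan})


df = pd.DataFrame({"region": ["east", None, "east"], "sales": [1, 2, 3]})
result = pivot_with_nulls(df, ["region"], "sales")
print(result)
```

With the fill and the restore side by side, the sentinel never leaves the function, so callers do not need to know which string was chosen.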
@@ -171,7 +174,7 @@ def pivot_df( # pylint: disable=too-many-locals, too-many-arguments, too-many-s
for subgroup in subgroups:
slice_ = df.index.get_loc(subgroup)
subtotal = pivot_v2_aggfunc_map[aggfunc](
- df.iloc[slice_, :].apply(pd.to_numeric), axis=0
+ df.iloc[slice_, :].apply(pd.to_numeric, errors="coerce"), axis=0
i think this might be the only change needed to deal with this issue: #27499
pandas is so opaque, especially when you haven't touched it for years - .iloc[]?, "coerce"!?!? might be worth adding a comment that explains what it's doing
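For readers in the same position, a small standalone sketch of the two pieces in that line. The frame below is invented for illustration; only `.iloc`, `DataFrame.apply`, and `pd.to_numeric(..., errors="coerce")` are taken from the diff.

```python
import pandas as pd

# Invented data: numeric strings mixed with values that cannot be parsed.
df = pd.DataFrame({"a": ["1", "2", "NULL"], "b": ["4", "x", "6"]})

# .iloc selects by integer position: here rows 0..2 and every column.
block = df.iloc[0:3, :]

# By default pd.to_numeric raises ValueError on an unparseable string...
try:
    block.apply(pd.to_numeric)
    raised = False
except ValueError:
    raised = True

# ...whereas errors="coerce" turns such values into NaN, which sum() skips.
coerced = block.apply(pd.to_numeric, errors="coerce")
totals = coerced.sum(axis=0)
print(raised, totals.to_dict())
```

The trade-off is that coercion is silent: any non-numeric value in the slice becomes NaN and simply drops out of the subtotal.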
it's locs and ilocs all the way down 🐢
aggfunc="Sum",
transpose_pivot=False,
combine_metrics=False,
show_rows_total=False,
weirdly, the case I have where i replicate this NULL issue only occurs when I have one of the column or row total set to True.
yah, I don't think it would fail in this case, but I added a test for all the combos.
test coverage for nulls here was well-needed! great
c34d0bb to 92206f2
92206f2 to 0ae7342
@@ -86,7 +87,8 @@ def pivot_df( # pylint: disable=too-many-locals, too-many-arguments, too-many-s
# pivot data; we'll compute totals and subtotals later
if rows or columns:
# pivoting with null values will create an empty df
df = df.fillna("NULL")
mmmh, seems the frontend should be doing this ...
@mistercrunch I'm not sure I got this..
@mistercrunch the problem here is that we do the pivot in Pandas (for reports and CSV download), and it will fail if the dataframe has NaNs.
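A minimal illustration of the underlying pandas behavior, on an invented frame. It does not reproduce Superset's exact failure; it only shows that a default `pivot_table` drops rows whose grouping key is NaN, and that filling the key first keeps them. The `"PLACEHOLDER"` string is an arbitrary stand-in.

```python
import pandas as pd

# Invented data: half of the grouping keys are missing.
df = pd.DataFrame(
    {"region": ["east", None, "east", None], "sales": [1, 2, 3, 4]}
)

# Default pivot: rows with a NaN key are dropped from the result.
plain = df.pivot_table(index="region", values="sales", aggfunc="sum")

# Filling the key column first keeps those rows as their own group.
filled = df.fillna({"region": "PLACEHOLDER"}).pivot_table(
    index="region", values="sales", aggfunc="sum"
)
print(plain["sales"].sum(), filled["sales"].sum())
```

If every key were NaN, the first pivot would come back empty, which matches the code comment "pivoting with null values will create an empty df".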
else:
# when we applied metrics on rows, we switched the columns and rows
# so checking column type doesn't apply. Replace everything with np.nan
df.replace("SUPERSET_PANDAS_NAN", np.nan, inplace=True)
@betodealmeida we need this section here (from line 156) when totaling so that we 1) can sum with numbers (by replacing the string "SUPERSET_PANDAS_NAN" with np.nan) or 2) can sum with a string value. I'm using "nan" so that we don't print "SUPERSET_PANDAS_NAN".
columns={"SUPERSET_PANDAS_NAN": np.nan},
inplace=True,
)
Converting the values back so that we don't print "SUPERSET_PANDAS_NAN"
if pd.api.types.is_numeric_dtype(df[col]):
df[col].replace("SUPERSET_PANDAS_NAN", np.nan, inplace=True)
else:
df[col].replace("SUPERSET_PANDAS_NAN", "nan", inplace=True)
I chose the string "nan" here because that is the default behavior when there is a null value when pivoting without sums.
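A standalone sketch of the same idea: restore the sentinel as `np.nan` where a column is otherwise numeric, and as the string "nan" elsewhere. `restore_nulls` is a hypothetical helper, and the "otherwise numeric" check below is an assumption of this sketch; the diff itself branches on `pd.api.types.is_numeric_dtype`. The sketch also assigns each result back to the column rather than calling `replace(..., inplace=True)` on a column selection.

```python
import numpy as np
import pandas as pd

SENTINEL = "SUPERSET_PANDAS_NAN"


def restore_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper, not Superset code."""
    out = df.copy()
    for col in out.columns:
        is_sentinel = out[col] == SENTINEL
        # Blank out the sentinel, then see whether the rest parses as numbers.
        without = out[col].mask(is_sentinel)
        as_numbers = pd.to_numeric(without, errors="coerce")
        if as_numbers.notna().sum() == without.notna().sum():
            out[col] = as_numbers  # numeric column: sentinel becomes NaN
        else:
            out[col] = out[col].mask(is_sentinel, "nan")  # text column
    return out


df = pd.DataFrame(
    {"metric": [1, SENTINEL, 3], "label": ["x", SENTINEL, "z"]}
)
restored = restore_nulls(df)
print(restored)
```

After restoring, the numeric column can be summed directly and the text column prints "nan" instead of the placeholder.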
Awesome job! ❤️
SUMMARY
If a result has a null value for a pivot table with totals, we are seeing this error:
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
ADDITIONAL INFORMATION