perf(postprocessing): improve pivot postprocessing operation
Executing a pivot with `drop_missing_columns=False` and many resulting columns can add seconds or even minutes of postprocessing time for large datasets.
The main culprit is the `df.drop(...)` call inside the for loop. A slight refactor, with no change to the results, pushes the postprocessing time
down from minutes to seconds for large datasets (millions of columns).
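The performance gap can be sketched outside Superset with plain pandas (hypothetical column names; the real code filters pivot columns against `series_set`). Dropping columns one per `df.drop` call reallocates the frame on every iteration, while a single drop over `Index.difference` does the work in one pass:

```python
import pandas as pd

# A wide single-row frame standing in for a pivot result with many columns.
df = pd.DataFrame({f"c{i}": [0] for i in range(1_000)})
keep = {f"c{i}" for i in range(0, 1_000, 2)}  # columns we want to retain

# Old pattern: one df.drop per unwanted column -> quadratic in column count.
slow = df
for col in list(slow.columns):
    if col not in keep:
        slow = slow.drop(col, axis=1)

# New pattern: compute the unwanted columns once, drop them in one call.
fast = df.drop(df.columns.difference(keep), axis=1)

assert list(slow.columns) == list(fast.columns)
```

`Index.difference` returns the columns not in `keep`, so a single `drop` removes them all at once; column order of the survivors is preserved in both variants.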

Fixes apache#23464
Usiel committed Mar 23, 2023
1 parent 7ef06b0 commit 20bc99a
Showing 1 changed file with 2 additions and 5 deletions.
7 changes: 2 additions & 5 deletions superset/utils/pandas_postprocessing/pivot.py
@@ -87,7 +87,7 @@ def pivot(  # pylint: disable=too-many-arguments,too-many-locals
     if not drop_missing_columns and columns:
         for row in df[columns].itertuples():
             for metric in aggfunc.keys():
-                series_set.add(str(tuple([metric]) + tuple(row[1:])))
+                series_set.add(tuple([metric]) + tuple(row[1:]))

     df = df.pivot_table(
         values=aggfunc.keys(),
@@ -101,10 +101,7 @@ def pivot(  # pylint: disable=too-many-arguments,too-many-locals
     )

     if not drop_missing_columns and len(series_set) > 0 and not df.empty:
-        for col in df.columns:
-            series = str(col)
-            if series not in series_set:
-                df = df.drop(col, axis=PandasAxis.COLUMN)
+        df = df.drop(df.columns.difference(series_set), axis=PandasAxis.COLUMN)

     if combine_value_with_metric:
         df = df.stack(0).unstack()
