perf(postprocessing): improve pivot postprocessing operation
Executing a pivot with `drop_missing_columns=False` and many resulting columns can add seconds or even minutes of postprocessing time for large datasets.
The main culprit is the `df.drop(...)` call inside the for loop. A slight refactor, with no change to the results, pushes the postprocessing time
down from minutes to seconds for large datasets (millions of columns).
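The performance gap can be sketched outside Superset with plain pandas (hypothetical column names; the real code filters pivot columns against `series_set`). Dropping columns one per `df.drop` call reallocates the frame on every iteration, while a single drop over `Index.difference` does the work in one pass:

```python
import pandas as pd

# A wide single-row frame standing in for a pivot result with many columns.
df = pd.DataFrame({f"c{i}": [0] for i in range(1_000)})
keep = {f"c{i}" for i in range(0, 1_000, 2)}  # columns we want to retain

# Old pattern: one df.drop per unwanted column -> quadratic in column count.
slow = df
for col in list(slow.columns):
    if col not in keep:
        slow = slow.drop(col, axis=1)

# New pattern: compute the unwanted columns once, drop them in one call.
fast = df.drop(df.columns.difference(keep), axis=1)

assert list(slow.columns) == list(fast.columns)
```

`Index.difference` returns the columns not in `keep`, so a single `drop` removes them all at once; column order of the survivors is preserved in both variants.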

Fixes apache#23464
Usiel committed Mar 23, 2023
1 parent 7ef06b0 commit 20bc99a
Showing 1 changed file with 2 additions and 5 deletions.
7 changes: 2 additions & 5 deletions superset/utils/pandas_postprocessing/pivot.py
@@ -87,7 +87,7 @@ def pivot(  # pylint: disable=too-many-arguments,too-many-locals
     if not drop_missing_columns and columns:
         for row in df[columns].itertuples():
             for metric in aggfunc.keys():
-                series_set.add(str(tuple([metric]) + tuple(row[1:])))
+                series_set.add(tuple([metric]) + tuple(row[1:]))

     df = df.pivot_table(
         values=aggfunc.keys(),
@@ -101,10 +101,7 @@ def pivot(  # pylint: disable=too-many-arguments,too-many-locals
     )

     if not drop_missing_columns and len(series_set) > 0 and not df.empty:
-        for col in df.columns:
-            series = str(col)
-            if series not in series_set:
-                df = df.drop(col, axis=PandasAxis.COLUMN)
+        df = df.drop(df.columns.difference(series_set), axis=PandasAxis.COLUMN)

     if combine_value_with_metric:
         df = df.stack(0).unstack()
