Improve groupby performance while keeping API compatibility #16

devin-petersohn · 2018-07-05T07:22:25Z

There were some completeness regressions:

head/tail
idxmax/idxmin

I ran out of time to completely implement these after the performance updates.

Before memoization, performance is on par with pandas. After memoization, it is ~2x faster on 4 cores.

devin-petersohn · 2018-07-05T08:04:39Z

modin/dataframe/test/test_groupby.py

@@ -207,7 +207,7 @@ def test_large_row_groupby():

    ray_df = from_pandas(pandas_df, 2)

-    by = pandas_df['A'].tolist()
+    by = [str(i) for i in pandas_df['A'].tolist()]


This change is necessary because the column names are integers and pandas_df['A'] also yields integers. Passed through unchanged, the keys would follow the codepath that tries to group by every column; converting them to strings avoids that.
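To illustrate the ambiguity described above, here is a minimal sketch in plain Python. The helper `keys_look_like_column_labels` and the sample values are hypothetical and are not taken from modin or pandas; they only show why integer keys can be mistaken for integer column labels while their string forms cannot.

```python
# Hypothetical sketch: NOT modin's or pandas' actual dispatch logic.
def keys_look_like_column_labels(by, columns):
    """Return True when every grouping key is also a column label."""
    column_set = set(columns)
    return all(key in column_set for key in by)


# Assumed sample data: integer column labels and integer grouping values.
columns = [0, 1, 2, 3]
int_keys = [0, 1, 1, 2]  # stands in for pandas_df['A'].tolist()
str_keys = [str(i) for i in int_keys]  # the conversion made in the diff

# Integer keys all match column labels, so they could be read as "group by these columns".
ambiguous = keys_look_like_column_labels(int_keys, columns)

# String keys never match integer labels, so they can only be read as per-row group values.
unambiguous = not keys_look_like_column_labels(str_keys, columns)
```

Under these assumptions, both `ambiguous` and `unambiguous` evaluate to `True`.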

simon-mo

Great PR!
Two minor places need fixing for code readability.

simon-mo · 2018-07-05T21:55:01Z

modin/dataframe/groupby.py

+        # It is expensive to put this multiple times, so let's just put it once
+        remote_by = ray.put(self._by)
+
+        if len(self) > 1:


Can you change len(self) to len(self._index_grouped)? It has the same effect but is more readable.
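A minimal sketch of why the two spellings would be interchangeable, assuming `__len__` simply delegates to the grouped index. `GroupBySketch` is a hypothetical stand-in, not the class from modin/dataframe/groupby.py.

```python
class GroupBySketch:
    """Hypothetical stand-in for a groupby object; not modin's class."""

    def __init__(self, index_grouped):
        # Mapping of group key -> row labels in that group (assumed shape).
        self._index_grouped = index_grouped

    def __len__(self):
        # Assumption: the number of groups is the length of the grouped index.
        return len(self._index_grouped)


grouped = GroupBySketch({"a": [0, 2], "b": [1]})

# Both expressions give the number of groups; the second says so explicitly.
implicit_count = len(grouped)
explicit_count = len(grouped._index_grouped)
```

Inside the class, the explicit form tells the reader that the value is a group count without requiring a lookup of `__len__`.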

simon-mo · 2018-07-05T21:59:06Z

modin/dataframe/groupby.py

+                                               self._group_keys,
+                                               self._squeeze)
+                                         + tuple(part.tolist()),
+                                         num_return_vals=len(self))


Same here: len(self._index_grouped) is more readable.

kunalgosar self-requested a review July 5, 2018 08:37

simon-mo requested changes Jul 5, 2018

devin-petersohn added 3 commits July 5, 2018 15:51

Improving groupby performance while keeping API

4708285

Finalizing update for performance of groupby

fff2b43

Addressing comments

0207335

devin-petersohn force-pushed the groupby_api_complete branch from 3552919 to 0207335 Compare July 5, 2018 22:55

kunalgosar added the Performance 🚀 Performance related issues and pull requests. label Jul 5, 2018

simon-mo approved these changes Jul 6, 2018

kunalgosar approved these changes Jul 6, 2018

kunalgosar merged commit 05811e7 into modin-project:master Jul 6, 2018
