-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve groupby performance while keeping API compatibility #16
Improve groupby performance while keeping API compatibility #16
Conversation
modin/dataframe/test/test_groupby.py
Outdated
@@ -207,7 +207,7 @@ def test_large_row_groupby(): | |||
|
|||
ray_df = from_pandas(pandas_df, 2) | |||
|
|||
by = pandas_df['A'].tolist() | |||
by = [str(i) for i in pandas_df['A'].tolist()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is necessary because we have column names that are integers, and the result of pandas_df['A']
will be integers. It will follow the codepath where it tries to group by every column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great PR!
There are two minor places need fixing concerning code readability.
modin/dataframe/groupby.py
Outdated
# It is expensive to put this multiple times, so let's just put it once | ||
remote_by = ray.put(self._by) | ||
|
||
if len(self) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change len(self)
to len(self._index_grouped)
, the same effect but more readable.
modin/dataframe/groupby.py
Outdated
self._group_keys, | ||
self._squeeze) | ||
+ tuple(part.tolist()), | ||
num_return_vals=len(self)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: len(self._index_grouped)
is more readable
3552919
to
0207335
Compare
There were some completeness regressions:
I ran out of time to completely implement these after the performance updates.
Performance pre-memoization is on-par with pandas. After memoization, it is ~2x faster on 4 cores.