[DataFrame] Refactor GroupBy Methods and Implement Reindex #2101

kunalgosar · 2018-05-19T09:27:08Z

Some of the changes in this PR are:

Creates a test suite for groupby methods
Fixes many of the groupby methods
Fixes the case where there is only one group after groupby
Fixes bugs where _block_partitions was 1D
Implements df.reindex
Fixes df.apply and df.agg

AmplabJenkins · 2018-05-19T10:34:54Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5497/

AmplabJenkins · 2018-05-19T10:53:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5498/

AmplabJenkins · 2018-05-20T21:32:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5517/

kunalgosar · 2018-05-20T23:13:15Z

python/ray/dataframe/dataframe.py

+                # Sometimes we only get a single column or row, which is
+                # problematic for building blocks from the partitions, so we
+                # add whatever dimension we're missing from the input.
+                if self._block_partitions.ndim < 2:


Make _block_partitions a property and move the check there.

AmplabJenkins · 2018-05-20T23:45:56Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5523/

AmplabJenkins · 2018-05-21T00:18:09Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5525/

devin-petersohn · 2018-05-20T23:14:46Z

python/ray/dataframe/dataframe.py

@@ -656,7 +657,10 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
        Returns:
            A new DataFrame resulting from the groupby.
        """
+        from .groupby import DataFrameGroupBy


Move to just before the DataFrameGroupBy object is used.

devin-petersohn · 2018-05-20T23:15:06Z

python/ray/dataframe/dataframe.py

@@ -656,7 +657,10 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
        Returns:
            A new DataFrame resulting from the groupby.
        """
+        from .groupby import DataFrameGroupBy
+


devin-petersohn · 2018-05-20T23:15:12Z

python/ray/dataframe/dataframe.py

        axis = pd.DataFrame()._get_axis_number(axis)
+


devin-petersohn · 2018-05-21T00:11:26Z

python/ray/dataframe/groupby.py

+                new_df.index = [k for k, v in self._iter]
+            else:
+                new_df = concat(result)
+                new_df = new_df.reindex(self._index, axis=0)


Prefer axis=self._axis here rather than hard-coding the axis.
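
A minimal sketch of this suggestion, using plain pandas rather than the ray.dataframe internals: passing the grouping axis through to reindex lets one code path handle both row-wise and column-wise results. The helper name `restore_order` and its arguments are hypothetical and only for illustration.

```python
import pandas as pd


def restore_order(df, labels, axis):
    """Reorder df along `axis` (0 = rows, 1 = columns) to match `labels`."""
    # Passing the axis through avoids a separate branch per orientation.
    return df.reindex(labels, axis=axis)


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

by_rows = restore_order(df, ["y", "x"], axis=0)  # reorder the rows
by_cols = restore_order(df, ["b", "a"], axis=1)  # reorder the columns
```

In the groupby code the axis would come from the stored self._axis instead of a literal.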

devin-petersohn · 2018-05-21T00:12:33Z

python/ray/dataframe/groupby.py

+                new_df.index = self._index
+            else:
+                new_df = concat(result, axis=1)
+                new_df = new_df.reindex(self._columns, axis=1)


You can use utils._reindex_helper to reorder the columns/rows more efficiently. Just make sure you reassign new_df.index or new_df.columns afterwards, depending on which axis was reordered.
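
One way to read this suggestion, sketched with plain pandas: look up the positions of the target labels once, reorder positionally, then reassign the labels on the reordered axis. This is not the actual utils._reindex_helper, whose signature is not shown in this thread; `reorder_axis` is a hypothetical stand-in that assumes unique labels and that every target label already exists.

```python
import pandas as pd


def reorder_axis(df, new_labels, axis):
    """Reorder rows (axis=0) or columns (axis=1) of df to follow new_labels."""
    old_labels = df.index if axis == 0 else df.columns
    # Position of each target label within the current labels (-1 if missing).
    positions = old_labels.get_indexer(new_labels)
    if (positions == -1).any():
        raise KeyError("new_labels contains labels not present in df")
    if axis == 0:
        result = df.iloc[positions, :]
        result.index = new_labels  # reassign the row labels
    else:
        result = df.iloc[:, positions]
        result.columns = new_labels  # reassign the column labels
    return result


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])
rows = reorder_axis(df, ["z", "x", "y"], axis=0)
cols = reorder_axis(df, ["b", "a"], axis=1)
```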

devin-petersohn · 2018-05-21T00:15:55Z

python/ray/dataframe/groupby.py


-        from .concat import concat
+        if self._axis == 0:
+            new_df = new_df.reindex(self._index, axis=0)


Same here for utils._reindex_helper

devin-petersohn · 2018-05-21T00:21:07Z

python/ray/dataframe/test/test_groupby.py

+@pytest.fixture
+def ray_df_equals_pandas(ray_df, pandas_df):
+    assert isinstance(ray_df, pd.DataFrame)
+    assert to_pandas(ray_df).sort_index().equals(pandas_df.sort_index())


Remove sort_index() from the checks in this file.
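
The thread does not state the reason, but presumably sorting both sides before comparing can hide row-order bugs. A small pandas-only illustration, with frames made up for the example:

```python
import pandas as pd

expected = pd.DataFrame({"v": [10, 20, 30]}, index=["a", "b", "c"])
# Same rows, but returned in the wrong order.
misordered = expected.loc[["c", "a", "b"]]

# Sorting both sides makes the comparison pass despite the ordering bug.
passes_when_sorted = misordered.sort_index().equals(expected.sort_index())

# A strict comparison exposes it.
passes_when_strict = misordered.equals(expected)
```

With the sort removed, a groupby result that comes back in the wrong order fails the check instead of slipping through.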

devin-petersohn · 2018-05-21T00:24:20Z

python/ray/dataframe/utils.py

-    return np.array(x) if axis == 0 else np.array(x).T
+    blocks = np.array(x) if axis == 0 else np.array(x).T
+
+    # Sometimes we only get a single column or row, which is


Move this next part to a utils function and call it from within the _block_partitions property.

kunalgosar · 2018-05-22

I can't move it to a property because it depends on axis, but I have moved it to a utils function.
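
A sketch of what such a utility might look like, using NumPy only. The name `fix_blocks_dimensions` and the choice of which dimension to add are assumptions for illustration: here a 1D array of row partitions (axis=0) becomes a single column of blocks, and a 1D array of column partitions (axis=1) becomes a single row of blocks.

```python
import numpy as np


def fix_blocks_dimensions(blocks, axis):
    """Return blocks as a 2D array, adding the missing dimension if needed."""
    blocks = np.asarray(blocks)
    if blocks.ndim < 2:
        # axis=0: shape (n,) -> (n, 1); axis=1: shape (n,) -> (1, n)
        return np.expand_dims(blocks, axis=1 - axis)
    return blocks


# Strings stand in for the remote block references.
row_parts = fix_blocks_dimensions(["r0", "r1", "r2"], axis=0)
col_parts = fix_blocks_dimensions(["c0", "c1"], axis=1)
already_2d = fix_blocks_dimensions([["b00", "b01"], ["b10", "b11"]], axis=0)
```

Because the result depends on axis, the function needs it as an argument, which is why the check cannot live in an argument-free property.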

AmplabJenkins · 2018-05-21T00:49:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5529/

AmplabJenkins · 2018-05-22T04:02:26Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5553/

AmplabJenkins · 2018-05-22T06:45:43Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5558/

AmplabJenkins · 2018-05-22T06:56:19Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5560/

AmplabJenkins · 2018-05-22T08:49:08Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5566/

devin-petersohn · 2018-05-22T15:53:41Z

python/ray/dataframe/utils.py

@@ -205,6 +205,20 @@ def _deploy_func(func, dataframe, *args):
        return func(dataframe, *args)


+@ray.remote
+def _deploy_generic_func(func, *args):


I don't know that we need this. I see how you're using it, but for now I would just prefer _deploy_func like everything else and pass in a row/column partition.

devin-petersohn · 2018-05-22T15:55:24Z

python/ray/dataframe/dataframe.py

+        if index is not None:
+            old_index = self.index
+            new_blocks = np.array([_deploy_generic_func._submit(
+                args=(tuple([reindex_helper, old_index, index, 1,


Instead of tuple([...] + block.tolist()), you can just write (...) + tuple(block.tolist()). I think that reads more clearly.
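
For reference, the two spellings build the same argument tuple. The values below are placeholders, not the real helper or index objects:

```python
import numpy as np

# Stand-ins for the fixed leading arguments and one row of blocks.
leading_args = ("helper", "old_index", "new_index", 1)
block = np.array(["blk0", "blk1", "blk2"])

# Original form: build one list, then convert the whole thing to a tuple.
args_via_list = tuple(["helper", "old_index", "new_index", 1] + block.tolist())

# Suggested form: keep the fixed arguments as a tuple and append the blocks.
args_via_tuple = leading_args + tuple(block.tolist())
```

The second form keeps the fixed arguments visually separate from the variable-length block list.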

AmplabJenkins · 2018-05-22T21:18:46Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5573/

devin-petersohn · 2018-05-22T23:34:13Z

Passes on private-travis. Thanks @kunalgosar!

* master:
  [DataFrame] Refactor GroupBy Methods and Implement Reindex (ray-project#2101)
  Initial Support for Airspeed Velocity (ray-project#2113)
  Use automatic memory management in Redis modules. (ray-project#1797)
  [DataFrame] Test bugfixes (ray-project#2111)
  [DataFrame] Update initializations of IndexMetadata which use outdated APIs (ray-project#2103)

* master:
  Prototype named actors. (ray-project#2129)
  Update arrow to latest master (ray-project#2100)
  [DataFrame] Speed up dtypes (ray-project#2118)
  do not fetch from dead Plasma Manager (ray-project#2116)
  [DataFrame] Refactor GroupBy Methods and Implement Reindex (ray-project#2101)
  Initial Support for Airspeed Velocity (ray-project#2113)
  Use automatic memory management in Redis modules. (ray-project#1797)
  [DataFrame] Test bugfixes (ray-project#2111)
  [DataFrame] Update initializations of IndexMetadata which use outdated APIs (ray-project#2103)

kunalgosar commented May 20, 2018

devin-petersohn reviewed May 21, 2018

kunalgosar added 20 commits May 21, 2018 22:36

fix 1D blocks case (f8fd93a)
Add groupby test code (bb2b761)
begin writing tests (88e0919)
Fix bug on groupby(axis=1, ...) (58ad669)
implement reindex (c3511cf)
fix index misalignment after groupby (d6337c9)
fix df.apply bug (f9db167)
fix groupby.apply (15cbef7)
fix agg funcs (f5cdf68)
finish groupby tests (5b6d1a3)
finish test suite for groupby (78fde91)
fixing lint (22ab723)
undo new line (21abecb)
fix python2 index copy bug (e5b1904)
Concat Series into ray.df (9a87026)
fixing python2 issues (54544cc)
resolving all python 2 tests (18645ed)
handle multiindex on apply (df5697e)
resolve comments (fa5b540)
updating docstring (ec18852)

kunalgosar force-pushed the groupby_methods branch from 1a05681 to ec18852 Compare May 22, 2018 05:36

fix lint (f211b46)
fix lint again (fed3294)

devin-petersohn reviewed May 22, 2018

address comments (a3722c6)

devin-petersohn approved these changes May 22, 2018

devin-petersohn merged commit 4584193 into ray-project:master May 22, 2018

kunalgosar deleted the groupby_methods branch May 23, 2018 05:24
