Groupby.apply works incorrectly in pandas backend #2511

Closed
dchigarev opened this issue Dec 4, 2020 · 3 comments
Labels: bug 🦗, P1, pandas.groupby

dchigarev commented Dec 4, 2020

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
  • Modin version (modin.__version__): 7458746
  • Python version: 3.7.5
  • Code we can use to reproduce:
import modin.pandas as pd
import pandas
pd.DEFAULT_NPARTITIONS = 4

from modin.pandas.test.utils import test_data_values, create_test_dfs, df_equals

data = test_data_values[0]
md_df, pd_df = create_test_dfs(data)

by = [pd_df.columns[0]]

md_res = md_df.groupby(by).apply(lambda df: pandas.Series([1, 2, 3, 4, 5]))
pd_res = pd_df.groupby(by).apply(lambda df: pandas.Series([1, 2, 3, 4, 5]))

print(f"md_res:\n{md_res}")
print(f"\npd_res:\n{pd_res}")

df_equals(md_res, pd_res) # AssertionError: DataFrame shape mismatch
Output
md_res:
       0  1  2  3  4  0  1  2  3  4
col33
0      1  2  3  4  5  1  2  3  4  5
1      1  2  3  4  5  1  2  3  4  5
2      1  2  3  4  5  1  2  3  4  5
3      1  2  3  4  5  1  2  3  4  5
4      1  2  3  4  5  1  2  3  4  5
...   .. .. .. .. .. .. .. .. .. ..
95     1  2  3  4  5  1  2  3  4  5
96     1  2  3  4  5  1  2  3  4  5
97     1  2  3  4  5  1  2  3  4  5
98     1  2  3  4  5  1  2  3  4  5
99     1  2  3  4  5  1  2  3  4  5

[94 rows x 10 columns]

pd_res:
       0  1  2  3  4
col33
0      1  2  3  4  5
1      1  2  3  4  5
2      1  2  3  4  5
3      1  2  3  4  5
4      1  2  3  4  5
...   .. .. .. .. ..
95     1  2  3  4  5
96     1  2  3  4  5
97     1  2  3  4  5
98     1  2  3  4  5
99     1  2  3  4  5

[94 rows x 5 columns]

AssertionError: DataFrame are different
DataFrame shape mismatch
[left]:  (94, 10)
[right]: (94, 5)

Describe the problem

In pandas, the function passed to groupby.apply receives the whole frame that belongs to each group:

>>> def fn(df):
...     print(f"applied fn got this:\n{df}")
...     return df.sum()
...
>>> df
  col0  col1  col2  col3  col4
0    a     0     0     0     0
1    a     1     1     1     1
2    b     2     2     2     2
3    a     3     3     3     3
4    b     4     4     4     4
>>> df.groupby("col0").apply(fn)
applied fn got this:
  col0  col1  col2  col3  col4
0    a     0     0     0     0
1    a     1     1     1     1
3    a     3     3     3     3
applied fn got this:
  col0  col1  col2  col3  col4
2    b     2     2     2     2
4    b     4     4     4     4
     col0  col1  col2  col3  col4
col0
a     aaa     4     4     4     4
b      bb     6     6     6     6

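However, in Modin, because of partitioning, the function receives only the part of that frame that belongs to the current column partition. That would explain the duplicated columns in md_res above: the function appears to run once per column partition, and each run contributes its own five columns to the result.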
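
The effect can be imitated in plain pandas, without Modin, to make the shape mismatch easier to see. The sketch below is only a simulation: the two-way column split, the `column_partitions` list, and the side-by-side `concat` are assumptions made for illustration and are not taken from Modin's internals.

```python
import pandas as pd

# Toy frame matching the example above.
df = pd.DataFrame(
    {
        "col0": ["a", "a", "b", "a", "b"],
        "col1": range(5),
        "col2": range(5),
        "col3": range(5),
        "col4": range(5),
    }
)


def fn(group):
    # Ignores its input and returns a fixed-length Series, as in the reproducer.
    return pd.Series([1, 2, 3, 4, 5])


# Expected behaviour: fn is called once per group on the whole group frame.
expected = df.groupby("col0").apply(fn)

# Hypothetical simulation of column partitioning: split the value columns
# into two blocks, apply fn to each block separately, then glue the
# per-block results together along the column axis.
by = df["col0"]
column_partitions = [df[["col1", "col2"]], df[["col3", "col4"]]]
partial_results = [part.groupby(by).apply(fn) for part in column_partitions]
simulated = pd.concat(partial_results, axis=1)

print(expected.shape)   # (2, 5)
print(simulated.shape)  # (2, 10)
```

With two simulated partitions the column labels 0 to 4 appear twice, which matches the pattern in the md_res output reported above.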

dchigarev added the bug 🦗 label Dec 4, 2020
@gshimansky

Yes, there are also long-standing issues with DataFrameGroupBy.apply and DataFrame.apply: #1682, #1686, #2234.

pyrito commented Aug 22, 2022

I am able to reproduce this issue on master.

YarShev commented Jan 10, 2024

Works on the latest master for me, so closing.

YarShev closed this as completed Jan 10, 2024