REGR: groupby.transform with a UDF performance #55256
Comments
Spent a while trying to profile this and had difficulty using various methods. It looks like concat is modifying the argument (the indexes, I'm guessing).
This might be completely expected (not sure), but wanted to mention since it might be helpful in case anyone else works on profiling. |
@phofl - I think I've tracked it down to an issue with the reference counting in |
Is this on current main? |
Yes - I'm behind by 1 commit. |
Also - the more I toy around with these groups, the more I'm seeing the reference counting go up. If I run a few different operations on them (e.g. |
@rhshadrach just weighing in here as we saw this issue really hit performance in our implementation. It is caused specifically by grouping over multiple columns (or levels). It takes quite a large DataFrame to see the difference: for a small df, O(100) rows, the performance is similar, but with a large df, O(1 million) rows, the time difference is more than an order of magnitude.

    df_dict = {}
    for col1, df_slice_1 in df.groupby("col1"):
        df_dict[col1] = {}
        for col2, df_slice_2 in df_slice_1.groupby("col2"):
            df_dict[col1][col2] = df_slice_2

    df_dict = {x: {} for x in df["col1"].unique()}
    for (col1, col2), df_slice in df.groupby(["col1", "col2"]):
        df_dict[col1][col2] = df_slice |
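For anyone who wants to reproduce the comparison above, here is a minimal timing sketch. The synthetic frame, the helper names (`make_frame`, `nested_groupby`, `multi_column_groupby`, `time_call`) and the row and key counts are all assumptions made for illustration, not the commenter's actual data or code. Timings will depend on the pandas version and on whether CoW is enabled, so no expected numbers are given.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_rows, n_keys=20, seed=0):
    """Build a synthetic frame (assumed shape: two integer key columns, one value)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            "col1": rng.integers(0, n_keys, size=n_rows),
            "col2": rng.integers(0, n_keys, size=n_rows),
            "value": rng.random(n_rows),
        }
    )


def nested_groupby(df):
    """Group by col1, then group each resulting slice by col2."""
    out = {}
    for col1, slice_1 in df.groupby("col1"):
        out[col1] = {}
        for col2, slice_2 in slice_1.groupby("col2"):
            out[col1][col2] = slice_2
    return out


def multi_column_groupby(df):
    """Group by both columns at once (the pattern reported as slow)."""
    out = {x: {} for x in df["col1"].unique()}
    for (col1, col2), df_slice in df.groupby(["col1", "col2"]):
        out[col1][col2] = df_slice
    return out


def time_call(func, df):
    """Return (elapsed seconds, result) for a single call of func(df)."""
    start = time.perf_counter()
    result = func(df)
    return time.perf_counter() - start, result


# Modest size so the sketch runs quickly; increase n_rows to stress the difference.
df = make_frame(100_000)
t_nested, _ = time_call(nested_groupby, df)
t_multi, _ = time_call(multi_column_groupby, df)
print(f"nested groupby:       {t_nested:.3f}s")
print(f"multi-column groupby: {t_multi:.3f}s")
```

Both helpers build the same nested dict of slices, so any gap in the printed times comes from how the groups are produced rather than from what is stored.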
Will need more attention to make sure is safe, but I got a 4x speedup in the OP example (with CoW disabled) by pinning |
Richard and I discussed the same solution on slack, that should be safe |
I also checked, this was picked up by our ASVs: https://asv-runner.github.io/asv-collection/pandas/#groupby.Apply.time_copy_function_multi_col |
I'm going to take this up - if anyone else has started working on this, happy to let you have it. |
I'm still seeing a 30% slowdown between 2.0 and 2.1 without CoW, and an additional 100% slowdown in 2.1 with CoW. It seems to be due to #32669 (comment) (cc @topper-123). This is hit whenever we copy the MultiIndex, which occurs in the |
We're down to a ~15% slowdown now, I think.
|
Encountered this trying to update some code to use CoW. The regression exists without CoW, but is also worse with it. Haven't done any investigation yet as to why.
cc @phofl @jorisvandenbossche
PS: This code has not been using transform with a UDF 😄