Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF/BUG: use masked algo in groupby cummin and cummax #40651

Merged
merged 44 commits into from
Apr 21, 2021

Conversation

mzeitlin11
Copy link
Member

@mzeitlin11 mzeitlin11 commented Mar 27, 2021

Potential first step towards #37493

ASV's (on N = 5_000_000 instead of N = 1_000_000 on the added benchmark to get more stable results):

       before           after         ratio
     [3a973b6d]       [2fa80ad0]
     <master>         <perf/masked_cummin/max>
-        297±10ms          241±9ms     0.81  groupby.CumminMax.time_frame_transform('Float64', 'cummax')
-         284±6ms          240±7ms     0.85  groupby.CumminMax.time_frame_transform('Float64', 'cummin')
-        761±60ms          393±4ms     0.52  groupby.CumminMax.time_frame_transform('Int64', 'cummax')
-        753±50ms          390±3ms     0.52  groupby.CumminMax.time_frame_transform('Int64', 'cummin')
         481±20ms         474±10ms     0.99  groupby.CumminMax.time_frame_transform('float64', 'cummax')
         461±10ms         459±10ms     1.00  groupby.CumminMax.time_frame_transform('float64', 'cummin')
         706±40ms         732±50ms     1.04  groupby.CumminMax.time_frame_transform('int64', 'cummax')
         730±20ms         674±20ms     0.92  groupby.CumminMax.time_frame_transform('int64', 'cummin')
-        685±30ms          214±5ms     0.31  groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummax')
-        698±20ms          225±4ms     0.32  groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummin')
-        811±50ms          204±6ms     0.25  groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummax')
-        804±50ms          207±6ms     0.26  groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummin')
         444±20ms          447±4ms     1.01  groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummax')
         447±20ms         459±20ms     1.03  groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummin')
        1.41±0.1s       1.33±0.01s     0.94  groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummax')
       1.41±0.09s          1.33±0s     0.94  groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummin')

@mzeitlin11 mzeitlin11 added Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance labels Mar 27, 2021
@jbrockmendel
Copy link
Member

on a 5x greater N than the benchmark below to get more stable results

im unclear on what this means. can you rephrase?

@mzeitlin11
Copy link
Member Author

on a 5x greater N than the benchmark below to get more stable results

im unclear on what this means. can you rephrase?

Sorry about that definitely confusing, updated comment

["cummin", "cummax"],
]

def setup(self, dtype, method):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this costly? worth using setup_cache?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks to take ~0.25s. So might be worth caching, but appears setup_cache can't be parameterized, so would have to ugly up the benchmark a bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just make N // 10

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep have shrunk benchmark

val_is_nan = True
out[i, j] = val

if not val_is_nan:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense to implement this as a separate function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion about this. The question would be the tradeoff between a bit more complexity/branching vs duplication/increased package size (if we end up adding masked support to a lot more of these grouped algos)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any guess what the impact on package size is?

potential duplication might be addressed by making e.g. refactoring L1340-1347 or L 1302-1308 into helper functions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on rough estimate, binaries generated from groupby.pyx take up ~5% of total _libs. So based on the figure of _libs taking up 17MB from #30741, cost of full duplication would be around 0.8-0.9 MB. But like you mentioned above, some duplication could be avoided, 1 MB should be an upper bound.

["cummin", "cummax"],
]

def setup(self, dtype, method):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just make N // 10

labels : np.ndarray[np.intp]
Labels to group by.
ngroups : int
Number of groups, larger than all entries of `labels`.
is_datetimelike : bool
True if `values` contains datetime-like entries.
use_mask : bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this actually worth it? we don't do this anywhere else. can you show head to head where you pass this always as True (vs your change method)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to allow mask to be None, so use_mask can be defined inside the function as use_mask = mask is not None (a similar approach is used hashtable_class_helper.pxi for the unique/factorize implementations)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, definitely a cleaner solution.

Addressing your comment @jreback, will take a look back at perf diff with forcing mask usage. IIRC the main reason for the optional mask was a slowdown in float with the cost of isna along with mask lookups in the algo ending up larger than existing null checking in cython.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem like as much of a slowdown as feared, ASV's with forced mask usage (from 09b73fc)

       before           after         ratio
     [e69df38c]       [293dc6ea]
     <master>         <not_optional_mask>
          324±8ms         277±10ms    ~0.86  groupby.CumminMax.time_frame_transform('Float64', 'cummax')
         326±10ms         294±10ms    ~0.90  groupby.CumminMax.time_frame_transform('Float64', 'cummin')
-         692±8ms         393±10ms     0.57  groupby.CumminMax.time_frame_transform('Int64', 'cummax')
-        688±20ms          387±7ms     0.56  groupby.CumminMax.time_frame_transform('Int64', 'cummin')
          368±7ms         372±20ms     1.01  groupby.CumminMax.time_frame_transform('float64', 'cummax')
          382±3ms         380±20ms     0.99  groupby.CumminMax.time_frame_transform('float64', 'cummin')
-         598±5ms          536±3ms     0.90  groupby.CumminMax.time_frame_transform('int64', 'cummax')
          606±4ms         575±30ms     0.95  groupby.CumminMax.time_frame_transform('int64', 'cummin')
-         645±6ms          266±6ms     0.41  groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummax')
-        643±10ms         275±10ms     0.43  groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummin')
-         734±7ms          259±6ms     0.35  groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummax')
-         752±5ms          251±9ms     0.33  groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummin')
          364±6ms         382±10ms     1.05  groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummax')
          362±7ms          378±8ms     1.04  groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummin')
       1.14±0.02s       1.13±0.02s     0.99  groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummax')
       1.14±0.02s       1.12±0.03s     0.98  groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummin')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this forced mask solution is cleaner because along with the simpler cython algo, it would eventually allow cleaning up some of the kludge around NaT/datetimelike. Which can lose information on edge cases with forced casts to float from

elif is_integer_dtype(dtype):
# we use iNaT for the missing value on ints
# so pre-convert to guard this condition
if (values == iNaT).any():
values = ensure_float64(values)

mask = isna(values).copy()
arr = values._data

if is_integer_dtype(values.dtype) or is_bool_dtype(values.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this an entirely different funtion? pls integrate to existing infra

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a different function because it's specific for MaskedArrays. Having this as a separate function is consistent with how it's currently implemented IMO (with a similar separate function for generic EAs).

It could also be another elif check in _ea_wrap_cython_operation, but it's not that it would result in less code or so, and since _ea_wrap_cython_operation already gets quite complicated, I think this separate function is good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an initial version I tried to fold this into _ea_wrap_cython_operation, but thought this smaller function was a cleaner solution since the conditionals here can remain much simpler. While adding a function for just supporting masked cummax, cummin seems wasteful, this infrastructure should extend to more masked groupby algos.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments. This MUST integrate with the existing infrastructure (or refactor that). Duplicating is -1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The whole point of this PR is to add a special handling for masked arrays, to ensure to pass through the mask to the groupby cython algo. This will always add some code (and the code in _masked_ea_wrap_cython_operation is specific to masked arrays).
  2. IMO this is integrated with the existing infrastructure: it integrates nicely into the existing _cython_operation and follows the same pattern as we already have for _ea_wrap_cython_operation

If you don't like how the added code is structured right now, please do a concrete suggestion of how you would do it differently.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mzeitlin11 thanks for looking into this!

labels : np.ndarray[np.intp]
Labels to group by.
ngroups : int
Number of groups, larger than all entries of `labels`.
is_datetimelike : bool
True if `values` contains datetime-like entries.
use_mask : bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to allow mask to be None, so use_mask can be defined inside the function as use_mask = mask is not None (a similar approach is used hashtable_class_helper.pxi for the unique/factorize implementations)

pandas/core/groupby/ops.py Show resolved Hide resolved
mask = isna(values).copy()
arr = values._data

if is_integer_dtype(values.dtype) or is_bool_dtype(values.dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a different function because it's specific for MaskedArrays. Having this as a separate function is consistent with how it's currently implemented IMO (with a similar separate function for generic EAs).

It could also be another elif check in _ea_wrap_cython_operation, but it's not that it would result in less code or so, and since _ea_wrap_cython_operation already gets quite complicated, I think this separate function is good.

@jorisvandenbossche
Copy link
Member

Not directly related to this PR, but something that I noticed in the implementation while reviewing this: the cummin/cummax "propagate" NAs:

In [13]: s = pd.Series([1, 3, np.nan, 2, 4])

In [14]: s
Out[14]: 
0    1.0
1    3.0
2    NaN
3    2.0
4    4.0
dtype: float64

In [15]: s.groupby([1] * 5).cummax()
Out[15]: 
0    1.0
1    3.0
2    NaN
3    3.0
4    4.0
dtype: float64

But since this is a cumulative maximum, that doesn't make much sense to me (the third value is the max of [1.0, 3.0, NaN], which should be 3.0 in our skipna=True logic).
So I would rather expect:

Out[15]: 
0    1.0
1    3.0
2    3.0
3    3.0
4    4.0
dtype: float64

Does somebody know if this is an explicit design? Or whether this was discussed before?

@jorisvandenbossche
Copy link
Member

@jreback can you take another look at this?

@jorisvandenbossche
Copy link
Member

if the goal is for nullable types to become the default in pandas

This decision has not been made.

@jbrockmendel we have the intention, though. Our roadmap says on that topic: "We want to eventually make the new semantics the default." (at the level of abstraction of the "semantics", but currently for numeric dtypes that in practice means using masks)

@mzeitlin11
Copy link
Member Author

@jreback can you take another look at this?

Don't want to take up any more reviewer time just yet - want to take a quick crack at seeing what #40651 (comment) might look like first. Wouldn't want everyone to spend a bunch of time reviewing this, then have it end up being made obsolete by a more general solution.

@mzeitlin11 mzeitlin11 marked this pull request as draft April 15, 2021 19:50
@jorisvandenbossche
Copy link
Member

(I personally think it makes more sense to merge this first, and then have a follow-up PR to refactor how things are called/passed in groupb/ops.py for masked arrays, focusing specifically on that without the cummin/cummax changes here. But leave it to @jbrockmendel and @jreback to state their preference)

@jreback
Copy link
Contributor

jreback commented Apr 16, 2021

yep onboard here to merge this (needs a rebase for some conflicts).

@jbrockmendel
Copy link
Member

ditto

@mzeitlin11 mzeitlin11 marked this pull request as ready for review April 18, 2021 00:55
@mzeitlin11
Copy link
Member Author

have rebased


if is_integer_dtype(arr.dtype) or is_bool_dtype(arr.dtype):
# IntegerArray or BooleanArray
arr = ensure_int_or_float(arr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW im planning to kill off this function; for EAs this is always just arr.to_numpy(dtype="float64", na_value=np.nan)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing up - realized this whole condition can be simplified since we actually have an ndarray at this point

@jreback
Copy link
Contributor

jreback commented Apr 20, 2021

can you merge master once more (small conflict), but want to have CI run.

@jreback jreback added this to the 1.3 milestone Apr 21, 2021
@jreback jreback merged commit 5eade6e into pandas-dev:master Apr 21, 2021
@jreback
Copy link
Contributor

jreback commented Apr 21, 2021

thanks @mzeitlin11

happy to take followups on cleaning

@mzeitlin11 mzeitlin11 deleted the perf/masked_cummin/max branch April 21, 2021 12:57
yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants