PERF/BUG: use masked algo in groupby cummin and cummax #40651

mzeitlin11 · 2021-03-27T00:28:20Z

Potential first step towards #37493

ASV's (on N = 5_000_000 instead of N = 1_000_000 on the added benchmark to get more stable results):

       before           after         ratio
     [3a973b6d]       [2fa80ad0]
     <master>         <perf/masked_cummin/max>
-        297±10ms          241±9ms     0.81  groupby.CumminMax.time_frame_transform('Float64', 'cummax')
-         284±6ms          240±7ms     0.85  groupby.CumminMax.time_frame_transform('Float64', 'cummin')
-        761±60ms          393±4ms     0.52  groupby.CumminMax.time_frame_transform('Int64', 'cummax')
-        753±50ms          390±3ms     0.52  groupby.CumminMax.time_frame_transform('Int64', 'cummin')
         481±20ms         474±10ms     0.99  groupby.CumminMax.time_frame_transform('float64', 'cummax')
         461±10ms         459±10ms     1.00  groupby.CumminMax.time_frame_transform('float64', 'cummin')
         706±40ms         732±50ms     1.04  groupby.CumminMax.time_frame_transform('int64', 'cummax')
         730±20ms         674±20ms     0.92  groupby.CumminMax.time_frame_transform('int64', 'cummin')
-        685±30ms          214±5ms     0.31  groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummax')
-        698±20ms          225±4ms     0.32  groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummin')
-        811±50ms          204±6ms     0.25  groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummax')
-        804±50ms          207±6ms     0.26  groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummin')
         444±20ms          447±4ms     1.01  groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummax')
         447±20ms         459±20ms     1.03  groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummin')
        1.41±0.1s       1.33±0.01s     0.94  groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummax')
       1.41±0.09s          1.33±0s     0.94  groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummin')

jbrockmendel · 2021-03-27T02:36:26Z

on a 5x greater N than the benchmark below to get more stable results

im unclear on what this means. can you rephrase?

mzeitlin11 · 2021-03-27T03:15:46Z

on a 5x greater N than the benchmark below to get more stable results

im unclear on what this means. can you rephrase?

Sorry about that definitely confusing, updated comment

jbrockmendel · 2021-03-28T18:30:32Z

asv_bench/benchmarks/groupby.py

+        ["cummin", "cummax"],
+    ]
+
+    def setup(self, dtype, method):


is this costly? worth using setup_cache?

Looks to take ~0.25s. So might be worth caching, but appears setup_cache can't be parameterized, so would have to ugly up the benchmark a bit.

can we just make N // 10

yep have shrunk benchmark

jbrockmendel · 2021-03-28T18:31:43Z

pandas/_libs/groupby.pyx

+                        val_is_nan = True
+                        out[i, j] = val
+
+                if not val_is_nan:


does it make sense to implement this as a separate function?

I don't have a strong opinion about this. The question would be the tradeoff between a bit more complexity/branching vs duplication/increased package size (if we end up adding masked support to a lot more of these grouped algos)

any guess what the impact on package size is?

potential duplication might be addressed by making e.g. refactoring L1340-1347 or L 1302-1308 into helper functions

Based on rough estimate, binaries generated from groupby.pyx take up ~5% of total _libs. So based on the figure of _libs taking up 17MB from #30741, cost of full duplication would be around 0.8-0.9 MB. But like you mentioned above, some duplication could be avoided, 1 MB should be an upper bound.

pandas/core/groupby/ops.py

jreback · 2021-03-29T13:57:23Z

asv_bench/benchmarks/groupby.py

+        ["cummin", "cummax"],
+    ]
+
+    def setup(self, dtype, method):


can we just make N // 10

jreback · 2021-03-29T13:58:37Z

pandas/_libs/groupby.pyx

    labels : np.ndarray[np.intp]
        Labels to group by.
    ngroups : int
        Number of groups, larger than all entries of `labels`.
    is_datetimelike : bool
        True if `values` contains datetime-like entries.
+    use_mask : bool


is this actually worth it? we don't do this anywhere else. can you show head to head where you pass this always as True (vs your change method)

Another option is to allow mask to be None, so use_mask can be defined inside the function as use_mask = mask is not None (a similar approach is used hashtable_class_helper.pxi for the unique/factorize implementations)

Good call, definitely a cleaner solution.

Addressing your comment @jreback, will take a look back at perf diff with forcing mask usage. IIRC the main reason for the optional mask was a slowdown in float with the cost of isna along with mask lookups in the algo ending up larger than existing null checking in cython.

Doesn't seem like as much of a slowdown as feared, ASV's with forced mask usage (from 09b73fc)

before after ratio [e69df38c] [293dc6ea] <master> <not_optional_mask> 324±8ms 277±10ms ~0.86 groupby.CumminMax.time_frame_transform('Float64', 'cummax') 326±10ms 294±10ms ~0.90 groupby.CumminMax.time_frame_transform('Float64', 'cummin') - 692±8ms 393±10ms 0.57 groupby.CumminMax.time_frame_transform('Int64', 'cummax') - 688±20ms 387±7ms 0.56 groupby.CumminMax.time_frame_transform('Int64', 'cummin') 368±7ms 372±20ms 1.01 groupby.CumminMax.time_frame_transform('float64', 'cummax') 382±3ms 380±20ms 0.99 groupby.CumminMax.time_frame_transform('float64', 'cummin') - 598±5ms 536±3ms 0.90 groupby.CumminMax.time_frame_transform('int64', 'cummax') 606±4ms 575±30ms 0.95 groupby.CumminMax.time_frame_transform('int64', 'cummin') - 645±6ms 266±6ms 0.41 groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummax') - 643±10ms 275±10ms 0.43 groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummin') - 734±7ms 259±6ms 0.35 groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummax') - 752±5ms 251±9ms 0.33 groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummin') 364±6ms 382±10ms 1.05 groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummax') 362±7ms 378±8ms 1.04 groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummin') 1.14±0.02s 1.13±0.02s 0.99 groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummax') 1.14±0.02s 1.12±0.03s 0.98 groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummin')

I think this forced mask solution is cleaner because along with the simpler cython algo, it would eventually allow cleaning up some of the kludge around NaT/datetimelike. Which can lose information on edge cases with forced casts to float from

pandas/pandas/core/groupby/ops.py

Lines 673 to 677 in 4554635

elif is_integer_dtype(dtype):

# we use iNaT for the missing value on ints

# so pre-convert to guard this condition

if (values == iNaT).any():

values = ensure_float64(values)

jreback · 2021-03-29T14:00:48Z

pandas/core/groupby/ops.py

+        mask = isna(values).copy()
+        arr = values._data
+
+        if is_integer_dtype(values.dtype) or is_bool_dtype(values.dtype):


why is this an entirely different funtion? pls integrate to existing infra

This is a different function because it's specific for MaskedArrays. Having this as a separate function is consistent with how it's currently implemented IMO (with a similar separate function for generic EAs).

It could also be another elif check in _ea_wrap_cython_operation, but it's not that it would result in less code or so, and since _ea_wrap_cython_operation already gets quite complicated, I think this separate function is good.

In an initial version I tried to fold this into _ea_wrap_cython_operation, but thought this smaller function was a cleaner solution since the conditionals here can remain much simpler. While adding a function for just supporting masked cummax, cummin seems wasteful, this infrastructure should extend to more masked groupby algos.

see my comments. This MUST integrate with the existing infrastructure (or refactor that). Duplicating is -1

The whole point of this PR is to add a special handling for masked arrays, to ensure to pass through the mask to the groupby cython algo. This will always add some code (and the code in _masked_ea_wrap_cython_operation is specific to masked arrays).

IMO this is integrated with the existing infrastructure: it integrates nicely into the existing _cython_operation and follows the same pattern as we already have for _ea_wrap_cython_operation

If you don't like how the added code is structured right now, please do a concrete suggestion of how you would do it differently.

jorisvandenbossche

@mzeitlin11 thanks for looking into this!

jorisvandenbossche · 2021-03-31T09:15:25Z

pandas/_libs/groupby.pyx

    labels : np.ndarray[np.intp]
        Labels to group by.
    ngroups : int
        Number of groups, larger than all entries of `labels`.
    is_datetimelike : bool
        True if `values` contains datetime-like entries.
+    use_mask : bool


Another option is to allow mask to be None, so use_mask can be defined inside the function as use_mask = mask is not None (a similar approach is used hashtable_class_helper.pxi for the unique/factorize implementations)

pandas/core/groupby/ops.py

jorisvandenbossche · 2021-03-31T19:15:25Z

pandas/core/groupby/ops.py

+        mask = isna(values).copy()
+        arr = values._data
+
+        if is_integer_dtype(values.dtype) or is_bool_dtype(values.dtype):


This is a different function because it's specific for MaskedArrays. Having this as a separate function is consistent with how it's currently implemented IMO (with a similar separate function for generic EAs).

It could also be another elif check in _ea_wrap_cython_operation, but it's not that it would result in less code or so, and since _ea_wrap_cython_operation already gets quite complicated, I think this separate function is good.

jorisvandenbossche · 2021-03-31T19:21:57Z

Not directly related to this PR, but something that I noticed in the implementation while reviewing this: the cummin/cummax "propagate" NAs:

In [13]: s = pd.Series([1, 3, np.nan, 2, 4])

In [14]: s
Out[14]: 
0    1.0
1    3.0
2    NaN
3    2.0
4    4.0
dtype: float64

In [15]: s.groupby([1] * 5).cummax()
Out[15]: 
0    1.0
1    3.0
2    NaN
3    3.0
4    4.0
dtype: float64

But since this is a cumulative maximum, that doesn't make much sense to me (the third value is the max of [1.0, 3.0, NaN], which should be 3.0 in our skipna=True logic).
So I would rather expect:

Out[15]: 
0    1.0
1    3.0
2    3.0
3    3.0
4    4.0
dtype: float64

Does somebody know if this is an explicit design? Or whether this was discussed before?

doc/source/whatsnew/v1.3.0.rst

jorisvandenbossche · 2021-04-15T19:10:58Z

@jreback can you take another look at this?

jorisvandenbossche · 2021-04-15T19:16:45Z

if the goal is for nullable types to become the default in pandas

This decision has not been made.

@jbrockmendel we have the intention, though. Our roadmap says on that topic: "We want to eventually make the new semantics the default." (at the level of abstraction of the "semantics", but currently for numeric dtypes that in practice means using masks)

mzeitlin11 · 2021-04-15T19:50:13Z

@jreback can you take another look at this?

Don't want to take up any more reviewer time just yet - want to take a quick crack at seeing what #40651 (comment) might look like first. Wouldn't want everyone to spend a bunch of time reviewing this, then have it end up being made obsolete by a more general solution.

jorisvandenbossche · 2021-04-16T20:35:03Z

(I personally think it makes more sense to merge this first, and then have a follow-up PR to refactor how things are called/passed in groupb/ops.py for masked arrays, focusing specifically on that without the cummin/cummax changes here. But leave it to @jbrockmendel and @jreback to state their preference)

jreback · 2021-04-16T21:28:15Z

yep onboard here to merge this (needs a rebase for some conflicts).

jbrockmendel · 2021-04-16T21:38:38Z

ditto

mzeitlin11 · 2021-04-18T00:56:24Z

have rebased

jbrockmendel · 2021-04-18T03:09:33Z

pandas/core/groupby/ops.py

+
+        if is_integer_dtype(arr.dtype) or is_bool_dtype(arr.dtype):
+            # IntegerArray or BooleanArray
+            arr = ensure_int_or_float(arr)


FWIW im planning to kill off this function; for EAs this is always just arr.to_numpy(dtype="float64", na_value=np.nan)

Thanks for bringing up - realized this whole condition can be simplified since we actually have an ndarray at this point

jreback · 2021-04-20T22:57:27Z

can you merge master once more (small conflict), but want to have CI run.

jreback · 2021-04-21T12:39:38Z

thanks @mzeitlin11

happy to take followups on cleaning

)

mzeitlin11 added 16 commits March 26, 2021 14:06

wip

7cd4dc3

wip

91984dc

wip

b371cc5

wip

69cce96

wip

dd7f324

wip

64680d4

wip

be16f65

wip

9442846

wip

f089175

wip

31409f8

wip

0c05f74

Merge remote-tracking branch 'origin/master' into perf/masked_cummin/max

18dcc94

wip

5c60a1f

PERF: use masked algo in groupby cummin and cummax

f0c27ce

Avoid mask copy

2fa80ad

Update whatsnew

280c7e5

mzeitlin11 added Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance labels Mar 27, 2021

jbrockmendel reviewed Mar 28, 2021

View reviewed changes

pandas/core/groupby/ops.py Show resolved Hide resolved

jreback requested changes Mar 29, 2021

View reviewed changes

jorisvandenbossche reviewed Mar 31, 2021

View reviewed changes

mzeitlin11 added 3 commits March 31, 2021 20:13

Merge remote-tracking branch 'origin/master' into perf/masked_cummin/max

dca28cf

Merge fixup

7e2fbe0

Follow transpose

0ebb97a

Change random generation style

482a209

jbrockmendel reviewed Apr 13, 2021

View reviewed changes

doc/source/whatsnew/v1.3.0.rst Show resolved Hide resolved

mzeitlin11 mentioned this pull request Apr 13, 2021

API: Should isna return a copy for nullable types? #40935

Closed

mzeitlin11 marked this pull request as draft April 15, 2021 19:50

mzeitlin11 added 2 commits April 17, 2021 20:40

Merge remote-tracking branch 'origin' into perf/masked_cummin/max

4e7404d

Use conditional instead of partial

251c02a

mzeitlin11 marked this pull request as ready for review April 18, 2021 00:55

jbrockmendel reviewed Apr 18, 2021

View reviewed changes

jbrockmendel mentioned this pull request Apr 18, 2021

BUG: groupby.rank with MaskedArray incorrect casting #41010

Merged

4 tasks

mzeitlin11 added 3 commits April 18, 2021 00:41

Remove ensure_int_or_float

3de7e5e

Remove unnecessary condition

237f86f

Merge remote-tracking branch 'origin' into perf/masked_cummin/max

a1b0c04

jbrockmendel mentioned this pull request Apr 20, 2021

DOC: more accurate wording in roadmap #41057

Merged

Merge remote-tracking branch 'origin' into perf/masked_cummin/max

5e1dac4

jreback added this to the 1.3 milestone Apr 21, 2021

jreback approved these changes Apr 21, 2021

View reviewed changes

jreback merged commit 5eade6e into pandas-dev:master Apr 21, 2021

mzeitlin11 deleted the perf/masked_cummin/max branch April 21, 2021 12:57

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021

PERF/BUG: use masked algo in groupby cummin and cummax (pandas-dev#40651

de19779

)

mzeitlin11 mentioned this pull request May 31, 2021

BUG: groupby().agg( ) with min/max on Int64 leads to incorrect results #41743

Closed

1 task

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

PERF/BUG: use masked algo in groupby cummin and cummax (pandas-dev#40651

88fdc75

)

	elif is_integer_dtype(dtype):
	# we use iNaT for the missing value on ints
	# so pre-convert to guard this condition
	if (values == iNaT).any():
	values = ensure_float64(values)

PERF/BUG: use masked algo in groupby cummin and cummax #40651

PERF/BUG: use masked algo in groupby cummin and cummax #40651

Conversation

mzeitlin11 commented Mar 27, 2021 • edited Loading

jbrockmendel commented Mar 27, 2021

mzeitlin11 commented Mar 27, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Mar 31, 2021

jorisvandenbossche commented Apr 15, 2021

jorisvandenbossche commented Apr 15, 2021

mzeitlin11 commented Apr 15, 2021

jorisvandenbossche commented Apr 16, 2021

jreback commented Apr 16, 2021

jbrockmendel commented Apr 16, 2021

mzeitlin11 commented Apr 18, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Apr 20, 2021

jreback commented Apr 21, 2021

mzeitlin11 commented Mar 27, 2021 •

edited

Loading