BUG: groupby().rolling(freq) with monotonic dates within groups #46065

mroeschke · 2022-02-18T23:56:21Z

closes BUG: groupby().rolling("freq") validation raises even if dates are sorted within each group #46061
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/v1.4.2.rst file if fixing a bug or adding a new feature.

rhshadrach

lgtm, does the iteration in _validate_datetimelike_monotonic have a big performance impact?

mroeschke · 2022-02-19T20:53:01Z

There will be a slight performance impact during the validation (that can be a 1 time hit) that scales with the number of groups. Here's validating 3 vs 100 groups

In [1]:
   ...:         dates = date_range(start="2016-01-01 09:30:00", periods=20, freq="s")
   ...:         df = DataFrame(
   ...:             {
   ...:                 "A": [1] * 20 + [2] * 12 + [3] * 8,
   ...:                 "B": np.concatenate((dates, dates)),
   ...:                 "C": np.arange(40),
   ...:             }
   ...:         )

In [2]: %timeit df.groupby("A").rolling("4s", on="B")
764 µs ± 38.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]:
   ...:         dates = date_range(start="2016-01-01 09:30:00", periods=50, freq="s")
   ...:         df = DataFrame(
   ...:             {
   ...:                 "A": np.arange(100),
   ...:                 "B": np.concatenate((dates, dates)),
   ...:                 "C": np.arange(100),
   ...:             }
   ...:         )

In [7]: %timeit df.groupby("A").rolling("4s", on="B")
2.26 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

…onic dates within groups

rhshadrach · 2022-02-19T21:05:56Z

@mroeschke - It looks like the call to groupby(...).rolling(...) went from O(1) to O(n), is that right? I would describe that as being something significant. Perhaps it can't be avoided, but I'd like to have another look. In any case, I don't think we should be merging PRs (especially our own) when there are outstanding questions.

jreback · 2022-02-19T21:08:57Z

pandas/core/window/rolling.py

+        if self._on.hasnans:
+            raise ValueError(f"{on} must not have any NaT values.")
+        for group_indices in self._grouper.indices.values():
+            group_on = self._on.take(group_indices)


yeah this is materializing things
prob not a great idea

self._on is already materialized in __init__. Is that what is meant by materialized?

jreback · 2022-02-19T21:09:34Z

@mroeschke there was no hurry on this one

…ps (pandas-dev#46065)" This reverts commit ad0f1bf.

mroeschke · 2022-02-19T23:20:13Z

Sorry I was too eager on the LGTM. #46078 reverts this PR.

This validation is O(number of groups). Happy to have a suggestion regarding a more performant solution.

…ps (#46065)" (#46078) This reverts commit ad0f1bf.

simonjayhawkins · 2022-03-18T13:41:20Z

This PR was reverted yet did fix a regression in 1.4.0.

In the absence of further PRs submitted to fix the reported regression, I propose undoing the revert of this PR to include this in 1.4.2 as performance is secondary to correctness?

this worked (for the sorted groups case) in 1.3.5 so maybe a fix that only resolves that scenario would be more performant?

If we merge this and are subsequently able to improve performance, we could maybe also consider backporting the performance improvement.

cpcloud · 2022-03-29T16:16:33Z

We're hitting this issue in ibis in a few of our tests cases and it would be great if the revert of this could be undone.

rhshadrach · 2022-03-29T19:44:16Z

I took another look, I'm not in love with the amount of compute done in .groupby.rolling (I think this objects should be lightweight), but I don't see any immediate approaches that would be better to fix the regression. +1 on applying this patch.

For better performance, can we leverage Index? Adapting the example from #46065 (comment)

def make_df(size):
    dates = pd.date_range(start="2016-01-01 09:30:00", periods=size, freq="s")
    df = pd.DataFrame(
        {
            "A": range(2 * size),
            "B": np.concatenate((dates, dates)),
            "C": np.arange(2 * size),
        }
    )
    return df

df = make_df(5000)
%timeit df.groupby("A").rolling("4s", on="B")
df = make_df(50000)
%timeit df.groupby("A").rolling("4s", on="B")

I get

179 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.82 s ± 48.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Profiling this shows almost all the time spent checking the monotonicity. However

df = make_df(50000)
%timeit df.set_index(['A', 'B']).index.is_monotonic_increasing

Results in True and I get 11.4 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

mroeschke · 2022-03-30T02:28:06Z

%timeit df.set_index(['A', 'B']).index.is_monotonic_increasing

Interestingly, I guess for a MultiIndex only one of the levels needs to be monotonically increasing for it to return True

In [3]: df.set_index(['A', 'B']).index
Out[3]:
MultiIndex([(   0, '2016-01-01 09:30:00'),
            (   1, '2016-01-01 09:30:01'),
            (   2, '2016-01-01 09:30:02'),
            (   3, '2016-01-01 09:30:03'),
            (   4, '2016-01-01 09:30:04'),
            (   5, '2016-01-01 09:30:05'),
            (   6, '2016-01-01 09:30:06'),
            (   7, '2016-01-01 09:30:07'),
            (   8, '2016-01-01 09:30:08'),
            (   9, '2016-01-01 09:30:09'),
            ...
            (9990, '2016-01-01 10:53:10'),
            (9991, '2016-01-01 10:53:11'),
            (9992, '2016-01-01 10:53:12'),
            (9993, '2016-01-01 10:53:13'),
            (9994, '2016-01-01 10:53:14'),
            (9995, '2016-01-01 10:53:15'),
            (9996, '2016-01-01 10:53:16'),
            (9997, '2016-01-01 10:53:17'),
            (9998, '2016-01-01 10:53:18'),
            (9999, '2016-01-01 10:53:19')],
           names=['A', 'B'], length=10000)

In [4]: df.set_index(['A', 'B']).index.get_level_values("A").is_monotonic_increasing
Out[4]: True

# This is the rolling axis (on="B")
In [5]: df.set_index(['A', 'B']).index.get_level_values("B").is_monotonic_increasing
Out[5]: False

Unfortunately we can't use .is_monotonic_* outright because it's the monotonicity of the entire axis while we want to know if each group in the axis is monotonic. e.g.

In [6]: print(all(group.B.is_monotonic_increasing for key, group in df.groupby("A")))
True

#46567)

…s-dev#46065 (pandas-dev#46567) (cherry picked from commit d2aa44f)

… monotonic dates within groups #46065 ) (#46592) * ENH: Improve error message when rolling over NaT value (#46087) * BUG: groupby().rolling(freq) with monotonic dates within groups #46065 (#46567) (cherry picked from commit d2aa44f) Co-authored-by: Matthew Roeschke <[email protected]>

…as-dev#46065)

…ps (pandas-dev#46065)" (pandas-dev#46078) This reverts commit ad0f1bf.

…s-dev#46065 (pandas-dev#46567)

BUG: groupby().rolling(freq) with monotonic dates within groups

25a09f4

mroeschke added Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding labels Feb 18, 2022

mroeschke added this to the 1.4.2 milestone Feb 19, 2022

rhshadrach reviewed Feb 19, 2022

View reviewed changes

mroeschke merged commit ad0f1bf into pandas-dev:main Feb 19, 2022

meeseeksmachine mentioned this pull request Feb 19, 2022

Backport PR #46065 on branch 1.4.x (BUG: groupby().rolling(freq) with monotonic dates within groups) #46076

Closed

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Feb 19, 2022

Backport PR pandas-dev#46065: BUG: groupby().rolling(freq) with monot…

7e68fe6

…onic dates within groups

mroeschke deleted the bug/gb_roll_mon branch February 19, 2022 20:53

jreback reviewed Feb 19, 2022

View reviewed changes

mroeschke added a commit to mroeschke/pandas that referenced this pull request Feb 19, 2022

Revert "BUG: groupby().rolling(freq) with monotonic dates within grou…

feae8d1

…ps (pandas-dev#46065)" This reverts commit ad0f1bf.

simonjayhawkins removed this from the 1.4.2 milestone Feb 20, 2022

simonjayhawkins pushed a commit that referenced this pull request Feb 20, 2022

Revert "BUG: groupby().rolling(freq) with monotonic dates within grou…

afc06e7

…ps (#46065)" (#46078) This reverts commit ad0f1bf.

This was referenced Feb 20, 2022

Revert "BUG: groupby().rolling(freq) with monotonic dates within groups (#46065)" #46078

Merged

BUG: groupby().rolling("freq") validation raises even if dates are sorted within each group #46061

Closed

mroeschke restored the bug/gb_roll_mon branch March 30, 2022 01:52

mroeschke mentioned this pull request Mar 30, 2022

BUG: groupby().rolling(freq) with monotonic dates within groups #46065 #46567

Merged

4 tasks

jreback pushed a commit that referenced this pull request Mar 31, 2022

BUG: groupby().rolling(freq) with monotonic dates within groups #46065 (

d2aa44f

#46567)

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Apr 1, 2022

BUG: groupby().rolling(freq) with monotonic dates within groups panda…

8469157

…s-dev#46065 (pandas-dev#46567) (cherry picked from commit d2aa44f)

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

BUG: groupby().rolling(freq) with monotonic dates within groups (pand…

3388bb0

…as-dev#46065)

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

Revert "BUG: groupby().rolling(freq) with monotonic dates within grou…

f698dab

…ps (pandas-dev#46065)" (pandas-dev#46078) This reverts commit ad0f1bf.

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

BUG: groupby().rolling(freq) with monotonic dates within groups panda…

7537078

…s-dev#46065 (pandas-dev#46567)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: groupby().rolling(freq) with monotonic dates within groups #46065

BUG: groupby().rolling(freq) with monotonic dates within groups #46065

mroeschke commented Feb 18, 2022

rhshadrach left a comment

mroeschke commented Feb 19, 2022

rhshadrach commented Feb 19, 2022

jreback Feb 19, 2022

mroeschke Feb 19, 2022 •

edited

Loading

jreback commented Feb 19, 2022

mroeschke commented Feb 19, 2022 •

edited

Loading

simonjayhawkins commented Mar 18, 2022

cpcloud commented Mar 29, 2022

rhshadrach commented Mar 29, 2022 •

edited

Loading

mroeschke commented Mar 30, 2022

BUG: groupby().rolling(freq) with monotonic dates within groups #46065

BUG: groupby().rolling(freq) with monotonic dates within groups #46065

Conversation

mroeschke commented Feb 18, 2022

rhshadrach left a comment

Choose a reason for hiding this comment

mroeschke commented Feb 19, 2022

rhshadrach commented Feb 19, 2022

jreback Feb 19, 2022

Choose a reason for hiding this comment

mroeschke Feb 19, 2022 • edited Loading

Choose a reason for hiding this comment

jreback commented Feb 19, 2022

mroeschke commented Feb 19, 2022 • edited Loading

simonjayhawkins commented Mar 18, 2022

cpcloud commented Mar 29, 2022

rhshadrach commented Mar 29, 2022 • edited Loading

mroeschke commented Mar 30, 2022

mroeschke Feb 19, 2022 •

edited

Loading

mroeschke commented Feb 19, 2022 •

edited

Loading

rhshadrach commented Mar 29, 2022 •

edited

Loading