Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: groupby().rolling(freq) with monotonic dates within groups #46065

Merged
merged 1 commit into from
Feb 19, 2022

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke added Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding labels Feb 18, 2022
@mroeschke mroeschke added this to the 1.4.2 milestone Feb 19, 2022
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, does the iteration in _validate_datetimelike_monotonic have a big performance impact?

@mroeschke
Copy link
Member Author

There will be a slight performance impact during the validation (that can be a 1 time hit) that scales with the number of groups. Here's validating 3 vs 100 groups

In [1]:
   ...:         dates = date_range(start="2016-01-01 09:30:00", periods=20, freq="s")
   ...:         df = DataFrame(
   ...:             {
   ...:                 "A": [1] * 20 + [2] * 12 + [3] * 8,
   ...:                 "B": np.concatenate((dates, dates)),
   ...:                 "C": np.arange(40),
   ...:             }
   ...:         )

In [2]: %timeit df.groupby("A").rolling("4s", on="B")
764 µs ± 38.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]:
   ...:         dates = date_range(start="2016-01-01 09:30:00", periods=50, freq="s")
   ...:         df = DataFrame(
   ...:             {
   ...:                 "A": np.arange(100),
   ...:                 "B": np.concatenate((dates, dates)),
   ...:                 "C": np.arange(100),
   ...:             }
   ...:         )

In [7]: %timeit df.groupby("A").rolling("4s", on="B")
2.26 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@mroeschke mroeschke merged commit ad0f1bf into pandas-dev:main Feb 19, 2022
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Feb 19, 2022
@mroeschke mroeschke deleted the bug/gb_roll_mon branch February 19, 2022 20:53
@rhshadrach
Copy link
Member

@mroeschke - It looks like the call to groupby(...).rolling(...) went from O(1) to O(n), is that right? I would describe that as being something significant. Perhaps it can't be avoided, but I'd like to have another look. In any case, I don't think we should be merging PRs (especially our own) when there are outstanding questions.

if self._on.hasnans:
raise ValueError(f"{on} must not have any NaT values.")
for group_indices in self._grouper.indices.values():
group_on = self._on.take(group_indices)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is materializing things
prob not a great idea

Copy link
Member Author

@mroeschke mroeschke Feb 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._on is already materialized in __init__. Is that what is meant by materialized?

@jreback
Copy link
Contributor

jreback commented Feb 19, 2022

@mroeschke there was no hurry on this one

mroeschke added a commit to mroeschke/pandas that referenced this pull request Feb 19, 2022
@mroeschke
Copy link
Member Author

mroeschke commented Feb 19, 2022

Sorry I was too eager on the LGTM. #46078 reverts this PR.

This validation is O(number of groups). Happy to have a suggestion regarding a more performant solution.

@simonjayhawkins
Copy link
Member

This PR was reverted yet did fix a regression in 1.4.0.

In the absence of further PRs submitted to fix the reported regression, I propose undoing the revert of this PR to include this in 1.4.2 as performance is secondary to correctness?

this worked (for the sorted groups case) in 1.3.5 so maybe a fix that only resolves that scenario would be more performant?

If we merge this and are subsequently able to improve performance, we could maybe also consider backporting the performance improvement.

@cpcloud
Copy link
Member

cpcloud commented Mar 29, 2022

We're hitting this issue in ibis in a few of our tests cases and it would be great if the revert of this could be undone.

@rhshadrach
Copy link
Member

rhshadrach commented Mar 29, 2022

I took another look, I'm not in love with the amount of compute done in .groupby.rolling (I think this objects should be lightweight), but I don't see any immediate approaches that would be better to fix the regression. +1 on applying this patch.

For better performance, can we leverage Index? Adapting the example from #46065 (comment)

def make_df(size):
    dates = pd.date_range(start="2016-01-01 09:30:00", periods=size, freq="s")
    df = pd.DataFrame(
        {
            "A": range(2 * size),
            "B": np.concatenate((dates, dates)),
            "C": np.arange(2 * size),
        }
    )
    return df

df = make_df(5000)
%timeit df.groupby("A").rolling("4s", on="B")
df = make_df(50000)
%timeit df.groupby("A").rolling("4s", on="B")

I get

179 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.82 s ± 48.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Profiling this shows almost all the time spent checking the monotonicity. However

df = make_df(50000)
%timeit df.set_index(['A', 'B']).index.is_monotonic_increasing

Results in True and I get 11.4 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@mroeschke
Copy link
Member Author

%timeit df.set_index(['A', 'B']).index.is_monotonic_increasing

Interestingly, I guess for a MultiIndex only one of the levels needs to be monotonically increasing for it to return True

In [3]: df.set_index(['A', 'B']).index
Out[3]:
MultiIndex([(   0, '2016-01-01 09:30:00'),
            (   1, '2016-01-01 09:30:01'),
            (   2, '2016-01-01 09:30:02'),
            (   3, '2016-01-01 09:30:03'),
            (   4, '2016-01-01 09:30:04'),
            (   5, '2016-01-01 09:30:05'),
            (   6, '2016-01-01 09:30:06'),
            (   7, '2016-01-01 09:30:07'),
            (   8, '2016-01-01 09:30:08'),
            (   9, '2016-01-01 09:30:09'),
            ...
            (9990, '2016-01-01 10:53:10'),
            (9991, '2016-01-01 10:53:11'),
            (9992, '2016-01-01 10:53:12'),
            (9993, '2016-01-01 10:53:13'),
            (9994, '2016-01-01 10:53:14'),
            (9995, '2016-01-01 10:53:15'),
            (9996, '2016-01-01 10:53:16'),
            (9997, '2016-01-01 10:53:17'),
            (9998, '2016-01-01 10:53:18'),
            (9999, '2016-01-01 10:53:19')],
           names=['A', 'B'], length=10000)

In [4]: df.set_index(['A', 'B']).index.get_level_values("A").is_monotonic_increasing
Out[4]: True

# This is the rolling axis (on="B")
In [5]: df.set_index(['A', 'B']).index.get_level_values("B").is_monotonic_increasing
Out[5]: False

Unfortunately we can't use .is_monotonic_* outright because it's the monotonicity of the entire axis while we want to know if each group in the axis is monotonic. e.g.

In [6]: print(all(group.B.is_monotonic_increasing for key, group in df.groupby("A")))
True

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Apr 1, 2022
simonjayhawkins added a commit that referenced this pull request Apr 1, 2022
… monotonic dates within groups #46065 ) (#46592)

* ENH: Improve error message when rolling over NaT value (#46087)

* BUG: groupby().rolling(freq) with monotonic dates within groups #46065  (#46567)

(cherry picked from commit d2aa44f)

Co-authored-by: Matthew Roeschke <[email protected]>
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: groupby().rolling("freq") validation raises even if dates are sorted within each group
5 participants