Wrong subgroup labels using `groupby_rolling` with `by` set and positive `offset` #9973
Closed
Comments
Yeah, I'm actually getting a segfault when I run it via Python:

    from datetime import date

    import polars as pl
    from polars import col

    dt1 = date(2001, 1, 1)
    dt2 = date(2001, 1, 2)

    data = pl.DataFrame({
        "id": ["A", "A", "B", "B", "C", "C"],
        "date": [dt1, dt2, dt1, dt2, dt1, dt2],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }).sort(by=["id", "date"])
    print(data)

    result = data.groupby_rolling(
        index_column="date",
        by="id",
        period="2d",
        offset="1d",
        closed="left",
        check_sorted=True,
    ).agg(col("value"))
    print(result)
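
For comparison, here is a plain-Python cross-check of the same data that does not call Polars at all. It is only a sketch: the helper `rolling_windows` is hypothetical, and it assumes that with `closed="left"` each window covers `[date + offset, date + offset + period)` within the same `id`. The point it illustrates is that every row should keep its own `id`, even when its window is empty.

```python
from datetime import date, timedelta


def rolling_windows(rows, period, offset):
    """Hypothetical reference helper, not part of Polars.

    rows: list of (id, date, value) tuples.
    Assumes a left-closed window [date + offset, date + offset + period).
    """
    out = []
    for gid, d, _ in rows:
        lo = d + offset
        hi = lo + period
        # Only values from the same id that fall inside the window.
        vals = [v for (g, t, v) in rows if g == gid and lo <= t < hi]
        out.append((gid, d, vals))
    return out


dt1 = date(2001, 1, 1)
dt2 = date(2001, 1, 2)
rows = [
    ("A", dt1, 1.0), ("A", dt2, 2.0),
    ("B", dt1, 3.0), ("B", dt2, 4.0),
    ("C", dt1, 5.0), ("C", dt2, 6.0),
]

result = rolling_windows(rows, period=timedelta(days=2), offset=timedelta(days=1))
for row in result:
    print(row)
```

Under that assumption, the second row of each `id` has an empty window, which is exactly the case where the labels go wrong.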
I just encountered this issue. Interestingly, whether I get the segfault depends on whether I print the original dataframe first. For example, running the script below

    import polars as pl
    from datetime import datetime, timedelta

    df = pl.DataFrame({
        'subject_id': [1, 1, 1, 1, 1, 2, 2, 2, 2],
        'timestamp': [
            datetime(2020, 1, 1),
            datetime(2020, 1, 2),
            datetime(2020, 1, 5),
            datetime(2020, 1, 16),
            datetime(2020, 3, 1),
            datetime(2020, 1, 3),
            datetime(2020, 1, 4),
            datetime(2020, 1, 9),
            datetime(2020, 1, 11),
        ],
        'event_A': [True, False, False, True, False, False, True, True, False],
        'event_B': [False, True, True, False, True, False, False, False, True],
        'event_C': [True, True, True, False, False, False, False, False, True],
    })

    print("df:")
    print(df)
    print("""
    df
    .groupby_rolling('timestamp', period='7d', offset=timedelta(days=0), by='subject_id', closed='right')
    .agg(pl.col('event_A').sum())
    """)
    print(
        df
        .groupby_rolling('timestamp', period='7d', offset=timedelta(days=0), by='subject_id', closed='right')
        .agg(pl.col('event_A').sum())
    )

Returns
But running this script

    import polars as pl
    from datetime import datetime, timedelta

    df = pl.DataFrame({
        'subject_id': [1, 1, 1, 1, 1, 2, 2, 2, 2],
        'timestamp': [
            datetime(2020, 1, 1),
            datetime(2020, 1, 2),
            datetime(2020, 1, 5),
            datetime(2020, 1, 16),
            datetime(2020, 3, 1),
            datetime(2020, 1, 3),
            datetime(2020, 1, 4),
            datetime(2020, 1, 9),
            datetime(2020, 1, 11),
        ],
        'event_A': [True, False, False, True, False, False, True, True, False],
        'event_B': [False, True, True, False, True, False, False, False, True],
        'event_C': [True, True, True, False, False, False, False, False, True],
    })

    print("""
    df
    .groupby_rolling('timestamp', period='7d', offset=timedelta(days=0), by='subject_id', closed='right')
    .agg(pl.col('event_A').sum())
    """)
    print(
        df
        .groupby_rolling('timestamp', period='7d', offset=timedelta(days=0), by='subject_id', closed='right')
        .agg(pl.col('event_A').sum())
    )

Returns
In case it helps with debugging, this is with the Python package, polars version 0.18.4.
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Output:
Issue description
The subgroup labels (`id` column) are incorrectly assigned if `groupby_rolling` is used with a positive `offset`. However, this only happens if the subgroup is empty.

Hints:
After stepping through the code, I ended up in `update_subgroups_idx()` (link), which does an unchecked `get` to obtain the first index of a subgroup. The problem manifests when the subgroup is empty (`len` is 0) and an out-of-bounds access into `base_g` is made. Replacing the line
with:
solved the problem. However, I'm not sure whether this change has other implications or whether it only fixes the symptoms.
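
To make the failure mode concrete, here is a minimal Python sketch of the kind of guard described above. It is not the Polars source: `remap_subgroups`, its arguments, and the choice to label an empty subgroup with the base group's own first index are all assumptions for illustration. In Python an out-of-range read raises `IndexError`; an unchecked read in Rust has no such protection, which would be consistent with the wrong labels and segfaults reported here.

```python
def remap_subgroups(sub_groups, base_first, base_idx):
    """Hypothetical sketch, not the Polars implementation.

    sub_groups: (start, length) windows, relative to the base group.
    base_first: first row index of the base group.
    base_idx:   all row indices of the base group.
    Returns (first, indices) pairs in absolute row indices.
    """
    remapped = []
    for start, length in sub_groups:
        if length == 0:
            # Empty window: `start` may point past the end of base_idx,
            # so do not read from it. Assumed fallback: keep the base
            # group's own first index so the label stays correct.
            remapped.append((base_first, []))
        else:
            window = base_idx[start:start + length]
            remapped.append((window[0], window))
    return remapped


# One group occupying rows 2 and 3. The second window is empty and its
# start (2) is already out of range for a group of length 2.
base_idx = [2, 3]
sub_groups = [(1, 1), (2, 0)]
remapped = remap_subgroups(sub_groups, base_first=2, base_idx=base_idx)
print(remapped)  # [(3, [3]), (2, [])]
```

Whether falling back to the group's first index is the right label for an empty window is the open question raised above; the sketch only shows that the empty case needs its own branch.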
Expected behavior
The expected output for the example above would be:
Installed versions
[dependencies]
polars = { version = "0.31.1", features = ["lazy", "dynamic_groupby"] }