
Wrong subgroup labels using groupby_rolling with by set and positive offset #9973

Closed
2 tasks done
ronaldrichter opened this issue Jul 19, 2023 · 2 comments · Fixed by #10109
Labels
A-temporal Area: date/time functionality bug Something isn't working rust Related to Rust Polars

Comments

@ronaldrichter

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

use polars::export::chrono::{NaiveDate, Days};
use polars::{prelude::*, lazy::dsl::col};

fn main() {
    let dt1 = NaiveDate::from_ymd_opt(2001, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap();
    let dt2 = dt1 + Days::new(1);

    let data = DataFrame::new(vec![
        Series::new("id",    &["A", "A", "B", "B", "C", "C"]),
        Series::new("date",  &[dt1, dt2, dt1, dt2, dt1, dt2]),
        Series::new("value", &[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
    ]).unwrap();

    println!("data: {data}");

    let result = data.clone().lazy()
        .groupby_rolling(col("date"), [col("id")], RollingGroupOptions { 
            index_column: "date".into(), 
            period: Duration::parse("2d"),
            offset: Duration::parse("1d"), 
            closed_window: ClosedWindow::Left, 
            check_sorted: true,
        })
        .agg([col("value")])
        .collect()
        .unwrap();

    println!("result: {result}");
}

Output:

data: shape: (6, 3)
┌─────┬─────────────────────┬───────┐
│ id  ┆ date                ┆ value │
│ --- ┆ ---                 ┆ ---   │
│ str ┆ datetime[ms]        ┆ f64   │
╞═════╪═════════════════════╪═══════╡
│ A   ┆ 2001-01-01 00:00:00 ┆ 1.0   │
│ A   ┆ 2001-01-02 00:00:00 ┆ 2.0   │
│ B   ┆ 2001-01-01 00:00:00 ┆ 3.0   │
│ B   ┆ 2001-01-02 00:00:00 ┆ 4.0   │
│ C   ┆ 2001-01-01 00:00:00 ┆ 5.0   │
│ C   ┆ 2001-01-02 00:00:00 ┆ 6.0   │
└─────┴─────────────────────┴───────┘
result: shape: (6, 3)
┌─────┬─────────────────────┬───────────┐
│ id  ┆ date                ┆ value     │
│ --- ┆ ---                 ┆ ---       │
│ str ┆ datetime[ms]        ┆ list[f64] │
╞═════╪═════════════════════╪═══════════╡
│ A   ┆ 2001-01-01 00:00:00 ┆ [2.0]     │
│ A   ┆ 2001-01-02 00:00:00 ┆ []        │
│ B   ┆ 2001-01-01 00:00:00 ┆ [4.0]     │
│ A   ┆ 2001-01-02 00:00:00 ┆ []        │
│ C   ┆ 2001-01-01 00:00:00 ┆ [6.0]     │
│ A   ┆ 2001-01-02 00:00:00 ┆ []        │
└─────┴─────────────────────┴───────────┘

Issue description

The subgroup labels (id column) are incorrectly assigned when groupby_rolling is used with a positive offset. However, this only happens if the subgroup is empty.
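For context, here is my understanding of why the subgroups end up empty in the example above (a simplified model of the window bounds, not the actual Polars internals; `window_bounds` is a hypothetical helper): with closed_window: Left, the window for a row at time t covers [t + offset, t + offset + period), so with period = 2d and offset = 1d the window for 2001-01-02 starts at 2001-01-03 and reaches past all the data.

```python
from datetime import datetime, timedelta

# Hypothetical helper modeling the rolling window bounds for closed="left":
# the window for a row at time t covers [t + offset, t + offset + period).
def window_bounds(t, offset, period):
    start = t + offset
    return start, start + period

dates = [datetime(2001, 1, 1), datetime(2001, 1, 2)]
offset = timedelta(days=1)
period = timedelta(days=2)

for t in dates:
    lo, hi = window_bounds(t, offset, period)
    members = [d for d in dates if lo <= d < hi]
    print(t.date(), "->", [d.date() for d in members])
# The window for 2001-01-01 contains only 2001-01-02; the window for
# 2001-01-02 lies entirely past the data, so that subgroup is empty.
```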

Hints:

After stepping through the code, I ended up in update_subgroups_idx() (link), which does an unchecked get to obtain the first index of a subgroup. The problem manifests when the subgroup is empty (len is 0): an out-of-bounds access into base_g is made. Replacing the line

let new_first = unsafe { *base_g.1.get_unchecked(first as usize) };

with:

let new_first = unsafe { *base_g.1.get_unchecked((first as usize).min(base_g.1.len() - 1)) };

solved the problem. However, I'm not sure whether this change has other implications or whether it just fixes the symptom.
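To illustrate the failure mode in isolation, here is a small Python analogue of the remapping (hypothetical names and a deliberately simplified shape; the real update_subgroups_idx does an unchecked slice read in Rust): each subgroup carries a relative first index into the base group's row indices, and when the subgroup is empty that first index can equal the base group's length, so an unchecked read would be out of bounds. The clamp mirrors the proposed min(len - 1) fix.

```python
# Simplified Python model of the subgroup index remapping (hypothetical
# analogue of update_subgroups_idx, not the actual Rust code).
def remap_subgroup(base_indices, first, members):
    # Buggy equivalent of `*base_g.1.get_unchecked(first)`: for an empty
    # subgroup, `first` may equal len(base_indices), so an unchecked read
    # would go out of bounds. Clamp like the proposed fix instead:
    safe_first = min(first, len(base_indices) - 1)
    new_first = base_indices[safe_first]
    return new_first, [base_indices[i] for i in members]

base = [2, 3]                          # global row indices of group "B"
print(remap_subgroup(base, 0, [1]))    # non-empty subgroup -> (2, [3])
print(remap_subgroup(base, 2, []))     # empty subgroup: first == len(base),
                                       # clamped read -> (3, [])
```

Without the clamp, the empty-subgroup case reads one element past the base group's indices, which in the Rust code means whatever happens to sit there becomes the group label's row index; that matches the wrong id values in the output above.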

Expected behavior

The expected output for the example above would be:

result: shape: (6, 3)
┌─────┬─────────────────────┬───────────┐
│ id  ┆ date                ┆ value     │
│ --- ┆ ---                 ┆ ---       │
│ str ┆ datetime[ms]        ┆ list[f64] │
╞═════╪═════════════════════╪═══════════╡
│ A   ┆ 2001-01-01 00:00:00 ┆ [2.0]     │
│ A   ┆ 2001-01-02 00:00:00 ┆ []        │
│ B   ┆ 2001-01-01 00:00:00 ┆ [4.0]     │
│ B   ┆ 2001-01-02 00:00:00 ┆ []        │
│ C   ┆ 2001-01-01 00:00:00 ┆ [6.0]     │
│ C   ┆ 2001-01-02 00:00:00 ┆ []        │
└─────┴─────────────────────┴───────────┘

Installed versions

[dependencies]
polars = { version = "0.31.1", features = ["lazy", "dynamic_groupby"] }

@ronaldrichter ronaldrichter added bug Something isn't working rust Related to Rust Polars labels Jul 19, 2023
@MarcoGorelli MarcoGorelli added the A-temporal Area: date/time functionality label Jul 19, 2023
@mcrumiller
Contributor

Yeah, I'm actually getting a segfault when I run it via Python:

from datetime import date

import polars as pl
from polars import col

dt1 = date(2001, 1, 1)
dt2 = date(2001, 1, 2)

data = pl.DataFrame({
    "id": ["A", "A", "B", "B", "C", "C"],
    "date": [dt1, dt2, dt1, dt2, dt1, dt2],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
}).sort(by=["id", "date"])

print(data)

result = data.groupby_rolling(
    index_column="date",
    by="id",
    period="2d",
    offset="1d",
    closed="left",
    check_sorted=True,
).agg(col("value"))

print(result)
shape: (6, 3)
┌─────┬────────────┬───────┐
│ id  ┆ date       ┆ value │
│ --- ┆ ---        ┆ ---   │
│ str ┆ date       ┆ f64   │
╞═════╪════════════╪═══════╡
│ A   ┆ 2001-01-01 ┆ 1.0   │
│ A   ┆ 2001-01-02 ┆ 2.0   │
│ B   ┆ 2001-01-01 ┆ 3.0   │
│ B   ┆ 2001-01-02 ┆ 4.0   │
│ C   ┆ 2001-01-01 ┆ 5.0   │
│ C   ┆ 2001-01-02 ┆ 6.0   │
└─────┴────────────┴───────┘
Segmentation fault

@mmcdermott

I just encountered this issue. Interestingly, whether I get a segfault depends on whether I print the original dataframe first. For example, running the script below

import polars as pl
from datetime import datetime, timedelta

df = pl.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 2, 2, 2, 2],
    'timestamp': [
        datetime(2020, 1, 1),
        datetime(2020, 1, 2),
        datetime(2020, 1, 5),
        datetime(2020, 1, 16),
        datetime(2020, 3, 1),
        datetime(2020, 1, 3),
        datetime(2020, 1, 4),
        datetime(2020, 1, 9),
        datetime(2020, 1, 11),
    ],
    'event_A': [True,  False, False,  True, False, False,  True,  True, False],
    'event_B': [False,  True,  True, False,  True, False, False, False, True],
    'event_C': [True,   True,  True, False, False, False, False, False, True],
})

print("df:")
print(df)


print("""
    df
    .groupby_rolling('timestamp', period='7d', offset=timedelta(days=0), by='subject_id', closed='right')
    .agg(pl.col('event_A').sum())
""")
print(
    df
    .groupby_rolling('timestamp', period='7d', offset=timedelta(days=0), by='subject_id', closed='right')
    .agg(pl.col('event_A').sum())
)

Returns

df:
shape: (9, 5)
┌────────────┬─────────────────────┬─────────┬─────────┬─────────┐
│ subject_id ┆ timestamp           ┆ event_A ┆ event_B ┆ event_C │
│ ---        ┆ ---                 ┆ ---     ┆ ---     ┆ ---     │
│ i64        ┆ datetime[μs]        ┆ bool    ┆ bool    ┆ bool    │
╞════════════╪═════════════════════╪═════════╪═════════╪═════════╡
│ 1          ┆ 2020-01-01 00:00:00 ┆ true    ┆ false   ┆ true    │
│ 1          ┆ 2020-01-02 00:00:00 ┆ false   ┆ true    ┆ true    │
│ 1          ┆ 2020-01-05 00:00:00 ┆ false   ┆ true    ┆ true    │
│ 1          ┆ 2020-01-16 00:00:00 ┆ true    ┆ false   ┆ false   │
│ 1          ┆ 2020-03-01 00:00:00 ┆ false   ┆ true    ┆ false   │
│ 2          ┆ 2020-01-03 00:00:00 ┆ false   ┆ false   ┆ false   │
│ 2          ┆ 2020-01-04 00:00:00 ┆ true    ┆ false   ┆ false   │
│ 2          ┆ 2020-01-09 00:00:00 ┆ true    ┆ false   ┆ false   │
│ 2          ┆ 2020-01-11 00:00:00 ┆ false   ┆ true    ┆ true    │
└────────────┴─────────────────────┴─────────┴─────────┴─────────┘

    df
    .groupby_rolling('timestamp', period='7d', offset=timedelta(days=0), by='subject_id', closed='right')
    .agg(pl.col('event_A').sum())

Segmentation fault (core dumped)

But running this script

import polars as pl
from datetime import datetime, timedelta

df = pl.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 2, 2, 2, 2],
    'timestamp': [
        datetime(2020, 1, 1),
        datetime(2020, 1, 2),
        datetime(2020, 1, 5),
        datetime(2020, 1, 16),
        datetime(2020, 3, 1),
        datetime(2020, 1, 3),
        datetime(2020, 1, 4),
        datetime(2020, 1, 9),
        datetime(2020, 1, 11),
    ],
    'event_A': [True,  False, False,  True, False, False,  True,  True, False],
    'event_B': [False,  True,  True, False,  True, False, False, False, True],
    'event_C': [True,   True,  True, False, False, False, False, False, True],
})

print("""
    df
    .groupby_rolling('timestamp', period='7d', offset=timedelta(days=0), by='subject_id', closed='right')
    .agg(pl.col('event_A').sum())
""")
print(
    df
    .groupby_rolling('timestamp', period='7d', offset=timedelta(days=0), by='subject_id', closed='right')
    .agg(pl.col('event_A').sum())
)

Returns

    df
    .groupby_rolling('timestamp', period='7d', offset=timedelta(days=0), by='subject_id', closed='right')
    .agg(pl.col('event_A').sum())

shape: (9, 3)
┌────────────┬─────────────────────┬─────────┐
│ subject_id ┆ timestamp           ┆ event_A │
│ ---        ┆ ---                 ┆ ---     │
│ i64        ┆ datetime[μs]        ┆ u32     │
╞════════════╪═════════════════════╪═════════╡
│ 1          ┆ 2020-01-01 00:00:00 ┆ 0       │
│ 1          ┆ 2020-01-02 00:00:00 ┆ 0       │
│ 1          ┆ 2020-01-05 00:00:00 ┆ null    │
│ 1          ┆ 2020-01-16 00:00:00 ┆ null    │
│ 1          ┆ 2020-03-01 00:00:00 ┆ null    │
│ 2          ┆ 2020-01-03 00:00:00 ┆ 2       │
│ 2          ┆ 2020-01-04 00:00:00 ┆ 1       │
│ 2          ┆ 2020-01-09 00:00:00 ┆ 0       │
│ 1          ┆ 2020-01-11 00:00:00 ┆ null    │
└────────────┴─────────────────────┴─────────┘

In case it helps with debugging: this is Polars Python version 0.18.4.
