Add support for discretizing temporal types, providing "tumbling window groupby" functionality #2107

Closed
jaychia opened this issue Apr 11, 2024 · 1 comment · Fixed by #2158
Labels
p1 Important to tackle soon, but preemptable by p0

Comments

@jaychia
Contributor

jaychia commented Apr 11, 2024

Is your feature request related to a problem? Please describe.

The overall goal of this issue is to enable grouping temporal rows by discrete non-overlapping "windows" (e.g. every minute starting from epoch time).

This can be enabled by providing a function that assigns each datetime value to a discrete bucket ("window"), where each window is represented by its starting datetime.

df["datetime"].dt.truncate("1 day", start=<defaults to epoch time>)

The above expression should return the timestamp of the start of the window that each row falls into.
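To make the intended semantics concrete, here is a minimal pure-Python sketch of fixed-width truncation. It is not the library's implementation: the name `window_start`, its parameters, and the use of UTC-aware `datetime` values are assumptions made only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Default origin for the windows: the Unix epoch (an assumption of this sketch).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime, interval: timedelta, start: datetime = EPOCH) -> datetime:
    """Return the start of the fixed-width window that contains `ts`."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    # timedelta // timedelta returns an int and rounds toward negative infinity,
    # so timestamps earlier than `start` still land in the correct window.
    n_windows = (ts - start) // interval
    return start + n_windows * interval


ts = datetime(2024, 4, 11, 15, 42, 7, tzinfo=timezone.utc)
print(window_start(ts, timedelta(days=1)))     # 2024-04-11 00:00:00+00:00
print(window_start(ts, timedelta(minutes=1)))  # 2024-04-11 15:42:00+00:00
```

Because every timestamp maps to exactly one window start, grouping by the returned value gives non-overlapping ("tumbling") windows.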

>>> df
+-----------+
| date      |
+-----------+
| 01-01-24  |
| 02-01-24  |
| 03-02-24  |
| 04-02-24  |
| 05-02-24  |
+-----------+

>>> df.with_column("window", df["date"].dt.window(1, unit="month")).show()
+-----------+----------+
| date      | window   |
+-----------+----------+
| 01-01-24  | 01-01-24 |
| 02-01-24  | 01-01-24 |
| 03-02-24  | 01-02-24 |
| 04-02-24  | 01-02-24 |
| 05-02-24  | 01-02-24 |
+-----------+----------+

>>> df.groupby(df["date"].dt.window(1, unit="month")).count()
+-----------+----------+
| date      | count    |
+-----------+----------+
| 01-01-24  | 2        |
| 01-02-24  | 3        |
+-----------+----------+
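The month-based grouping in this example can be reproduced with the standard library alone. The sketch below assumes the dates are written as DD-MM-YY and uses a hypothetical helper, `month_window`; calendar months do not have a fixed width, so the helper snaps to the first day of the month rather than dividing by a fixed interval.

```python
from collections import Counter
from datetime import date


def month_window(d: date) -> date:
    """Map a date to the first day of its calendar month."""
    return d.replace(day=1)


# The five example dates, read as DD-MM-YY (an assumption of this sketch).
dates = [
    date(2024, 1, 1),
    date(2024, 1, 2),
    date(2024, 2, 3),
    date(2024, 2, 4),
    date(2024, 2, 5),
]

# Group by window start and count the rows in each window.
counts = Counter(month_window(d) for d in dates)

for window, n in sorted(counts.items()):
    print(window.strftime("%d-%m-%y"), n)
# 01-01-24 2
# 01-02-24 3
```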

See also:

jaychia added the p0 (Priority 0 - to be addressed immediately) and p1 (Important to tackle soon, but preemptable by p0) labels and removed the p0 label on Apr 11, 2024
jaychia changed the title from 'Add support for more sophisticated "bucketing" groupby mechanisms for temporal types' to 'Add support for "tumbling windows" of temporal types during groupbys' on Apr 11, 2024
jaychia changed the title from 'Add support for "tumbling windows" of temporal types during groupbys' to 'Add support for "tumbling windows" of temporal types' on Apr 11, 2024
jaychia changed the title from 'Add support for "tumbling windows" of temporal types' to 'Add support for discretizing temporal types' on Apr 11, 2024
jaychia changed the title from 'Add support for discretizing temporal types' to 'Add support for discretizing temporal types, providing "tumbling window groupby" functionality' on Apr 11, 2024
@samster25
Member

I think we may also want to add explicit windowing syntactic sugar here, instead of relying on the group-by method.

colin-ho added a commit that referenced this issue Apr 26, 2024
Closes #2107

This PR adds a method to truncate the datetime column to the specified
interval e.g. "1 day".

Valid time units are: 'microsecond', 'millisecond', 'second', 'minute',
'hour', 'day', 'week'.
An optional start time for truncation can also be provided. If it is,
truncation is done relative to that start time; otherwise it is done
relative to the beginning of the epoch.
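As a rough illustration of the behaviour described in this commit message, the sketch below parses an interval string such as "1 day" and truncates relative to an optional start time. It is a standalone approximation, not the code from the PR: the function names `parse_interval` and `truncate`, the `start` parameter name, and the "<count> <unit>" string format are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# The units listed in the commit message, each mapped to a fixed-width timedelta.
UNITS = {
    "microsecond": timedelta(microseconds=1),
    "millisecond": timedelta(milliseconds=1),
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
}


def parse_interval(interval: str) -> timedelta:
    """Parse a string like "1 day" into a timedelta (format assumed for this sketch)."""
    parts = interval.split()
    if len(parts) != 2 or parts[1] not in UNITS:
        raise ValueError(f"invalid interval: {interval!r}")
    count = int(parts[0])
    if count <= 0:
        raise ValueError("interval count must be positive")
    return count * UNITS[parts[1]]


def truncate(ts: datetime, interval: str, start: Optional[datetime] = None) -> datetime:
    """Truncate `ts` to the start of its window, measured from `start` or the epoch."""
    origin = EPOCH if start is None else start
    width = parse_interval(interval)
    # Floor division counts whole windows between the origin and the timestamp.
    return origin + ((ts - origin) // width) * width


ts = datetime(2024, 4, 26, 13, 45, tzinfo=timezone.utc)
six_am = datetime(2024, 4, 26, 6, 0, tzinfo=timezone.utc)
print(truncate(ts, "1 day"))                # 2024-04-26 00:00:00+00:00
print(truncate(ts, "1 day", start=six_am))  # 2024-04-26 06:00:00+00:00
```

In this sketch, epoch-aligned weekly windows start on Thursdays, because 1 January 1970 was a Thursday; passing a different start time shifts that alignment.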