-
Notifications
You must be signed in to change notification settings - Fork 113
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
doc: windowing fixed and sliding (#439)
Signed-off-by: Vigith Maurice <[email protected]> Signed-off-by: Yashash H L <[email protected]> Co-authored-by: Yashash H L <[email protected]>
- Loading branch information
Showing
7 changed files
with
161 additions
and
1 deletion.
There are no files selected for viewing
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,14 @@ | ||
# Reduce UDF | ||
|
||
Work In Progress | ||
Reduce is one of the most commonly used abstractions in a stream processing pipeline | ||
to define aggregation functions on a stream of data. It is the reduce feature | ||
that helps us solve problems like "performs a summary operation (such as | ||
counting the number of occurrence of a key, yielding user login frequencies), etc." | ||
Since the input an unbounded stream (with infinite entries), we need an | ||
additional parameter to convert the unbounded problem to a bounded problem | ||
and provide results on that. That bounding condition is "time", eg, "number | ||
of users logged in per minute". | ||
So while processing an unbounded stream of data, we need a way to group elements | ||
into finite chunks using time. To build these chunks the reduce function is applied to the | ||
set of records produced using the concept of [windowing](./windowing/windowing.md). | ||
|
37 changes: 37 additions & 0 deletions
37
docs/user-guide/user-defined-functions/reduce/windowing/fixed.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
# Fixed | ||
|
||
Fixed windows (sometimes called tumbling windows) are defined by a static window size, e.g. 30 second | ||
windows, one minute windows, etc. They are generally aligned, i.e. every window applies across all | ||
the data for the corresponding period of time. It has a fixed size measured in time and does not | ||
overlap. The element which belongs to one window will not belong to any other tumbling window. | ||
For example, a window size of 20 seconds will include all entities of the stream which came in a | ||
certain 20-second interval. | ||
|
||
![plot](../../../../assets/fixed.png) | ||
|
||
To enable Fixed widow, we use `fixed` under `window` section. | ||
|
||
```yaml | ||
groupBy: | ||
window: | ||
fixed: | ||
length: duration | ||
``` | ||
NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction | ||
and a unit suffix, such as "300ms", "-1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h". | ||
## Length | ||
The `length` is the window size of the fixed window. | ||
|
||
## Example | ||
|
||
A 10-second window size can be defined as follows. | ||
```yaml | ||
groupBy: | ||
window: | ||
fixed: | ||
length: 60s | ||
``` | ||
|
42 changes: 42 additions & 0 deletions
42
docs/user-guide/user-defined-functions/reduce/windowing/sliding.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
# Sliding | ||
|
||
Sliding window is similar to a Fixed windows, the size of the windows is measured in time and is fixed. | ||
The important difference from the Fixed window is the fact that it allows an element to be present in | ||
more than one window. The additional window slide parameter controls how frequently a sliding window | ||
is started. Hence, sliding windows will be overlapping and the slide should be smaller than the window | ||
length. | ||
|
||
![plot](../../../../assets/sliding.png) | ||
|
||
```yaml | ||
groupBy: | ||
window: | ||
sliding: | ||
length: duration | ||
slide: duration | ||
``` | ||
NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction | ||
and a unit suffix, such as "300ms", "-1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h". | ||
## Length | ||
The `length` is the window size of the fixed window. | ||
|
||
## Slide | ||
|
||
`slide` is the slide parameter that controls the frequency at which the sliding window is created. | ||
|
||
## Example | ||
|
||
To create a sliding window of length 1-minute which slides every 10-seconds, we can use the following | ||
snippet. | ||
|
||
```yaml | ||
groupBy: | ||
window: | ||
sliding: | ||
length: 60s | ||
slide: 30s | ||
``` | ||
|
66 changes: 66 additions & 0 deletions
66
docs/user-guide/user-defined-functions/reduce/windowing/windowing.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
# Windowing | ||
|
||
In the world of data processing on an unbounded stream, Windowing is a concept | ||
of grouping data using temporal boundaries. We use event-time to discover | ||
temporal boundaries on an unbounded, infinite stream and [Watermark](../../../watermarks.md) to ensure | ||
the datasets within the boundaries are complete. The [reduce](../reduce.md) is | ||
applied on these grouped datasets. | ||
Example, when we say, we want to find number of users online per minute, we use | ||
windowing to group the users into one minute buckets. | ||
|
||
The entirety of windowing is under the `groupBy` section. | ||
```yaml | ||
groupBy: | ||
window: | ||
... | ||
keyed: ... | ||
``` | ||
Since a window can be [Non-Keyed v/s Keyed](#non-keyed-vs-keyed-windows), | ||
we have an explicit field called `keyed`to differentiate between both (see below). | ||
|
||
Under the `window` section we will define different types of windows. | ||
|
||
## Window Types | ||
Numaflow supports the following types of windows | ||
* [Fixed](fixed.md) | ||
* [Sliding](sliding.md) | ||
|
||
## Non-Keyed v/s Keyed Windows | ||
|
||
#### Non-Keyed | ||
A non-keyed partition is a partition where the window is the boundary condition. | ||
Data processing on a non-keyed partition cannot be scaled horizontally because | ||
only one partition exists. | ||
|
||
A non-keyed partition is usually used after aggregation and is hardly seen at | ||
the head section of any data processing pipeline. | ||
(There is a concept called Global Window where there is no windowing, but | ||
let us table that for later). | ||
|
||
### Keyed | ||
A keyed partition is a partition where the partition boundary is a composite | ||
key of both the window and the key from the payload (e.g., GROUP BY country, | ||
where country names are the keys). Each smaller partition now has a complete | ||
set of datasets for that key and boundary. The subdivision of dividing a huge | ||
window-based partition into smaller partitions by adding keys along with the | ||
window will help us horizontally scale the distribution. | ||
|
||
Keyed partitions are heavily used to aggregate data and are frequently seen | ||
throughout the processing pipeline. We could also convert and non-keyed problem | ||
to a set of keyed problems and apply a non-keyed function at the end. This will | ||
help solve the original problem in a scalable manner without affecting the | ||
result's completeness and/or accuracy. | ||
|
||
### Usage | ||
|
||
Numaflow support both Keyed and Non-Keyed windows. We set `keyed` to either | ||
`true` (keyed) or `false` (non-keyed). Please note that the non-keyed windows | ||
are not horizontally scalable as mentioned above. | ||
|
||
```yaml | ||
groupBy: | ||
window: | ||
... | ||
keyed: false # or true | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters