doc: windowing fixed and sliding #439

Merged
merged 11 commits into from
Jan 4, 2023
Binary file added docs/assets/fixed.png
Binary file added docs/assets/sliding.png
13 changes: 12 additions & 1 deletion docs/user-guide/user-defined-functions/reduce/reduce.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,14 @@
# Reduce UDF

Work In Progress
Reduce is one of the most commonly used abstractions in a stream processing pipeline
for defining aggregation functions on a stream of data. It is the reduce feature
that lets us perform summary operations, such as counting the number of
occurrences of a key to yield user login frequencies.
Since the input is an unbounded stream (with infinite entries), we need an
additional parameter to convert the unbounded problem into a bounded one
and provide results on that. That bounding condition is "time", e.g., "number
of users logged in per minute".
So while processing an unbounded stream of data, we need a way to group elements
into finite chunks using time. These chunks are built using the concept of
[windowing](./windowing/windowing.md), and the reduce function is applied to the
set of records in each chunk.
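
As a rough illustration of the idea (not Numaflow's implementation or SDK), the sketch below groups timestamped records into one-minute chunks and applies a reduce function to each chunk. The helper name `reduce_per_window`, the `(event_time, payload)` tuple shape, and the alignment of windows to time zero are all assumptions made for this example.

```python
from collections import defaultdict


def reduce_per_window(events, window_seconds, reduce_fn):
    """Group (event_time, payload) pairs into time chunks, then reduce each chunk.

    Hypothetical helper: windows are assumed to be aligned to time zero.
    """
    chunks = defaultdict(list)
    for event_time, payload in events:
        # Start of the window that this event falls into.
        window_start = event_time - (event_time % window_seconds)
        chunks[window_start].append(payload)
    # Apply the reduce function to the bounded set of records in each chunk.
    return {start: reduce_fn(chunk) for start, chunk in sorted(chunks.items())}


# Event times are seconds; payloads are user names (made-up sample data).
logins = [(5, "alice"), (42, "bob"), (61, "alice"), (75, "carol"), (130, "bob")]
logins_per_minute = reduce_per_window(logins, 60, len)
print(logins_per_minute)  # {0: 2, 60: 2, 120: 1}
```

Passing a different `reduce_fn` (for example, one that counts distinct users) changes the aggregation without changing how the stream is bounded.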

@@ -0,0 +1,35 @@
# Fixed

Fixed windows (sometimes called tumbling windows) are defined by a static window size, e.g. 30-second
windows, one-minute windows, etc. They are generally aligned, i.e. every window applies across all
the data for the corresponding period of time. A fixed window has a fixed size measured in time and does not
overlap with other windows: an element that belongs to one window does not belong to any other fixed window.
For example, a window size of 20 seconds includes all elements of the stream that arrived in a
given 20-second interval.
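
The non-overlap property can be sketched in a few lines. This is only an illustration, assuming half-open windows `[start, end)` aligned to time zero; the function name `fixed_window` is hypothetical and not part of Numaflow.

```python
def fixed_window(event_time, length):
    """Return the single [start, end) fixed window containing event_time.

    Assumes windows are aligned to time zero and are half-open intervals.
    """
    start = event_time - (event_time % length)
    return (start, start + length)


# With a 20-second window, every event maps to exactly one window.
print(fixed_window(45, 20))  # (40, 60)
print(fixed_window(60, 20))  # (60, 80)
```

Because the interval is half-open, an event at exactly 60 falls into the window starting at 60, not the one ending there.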

![plot](../../../../assets/fixed.png)

To enable a Fixed window, we use `fixed` under the `window` section.

```yaml
window:
  fixed:
    length: duration
```

NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction
and a unit suffix, such as "300ms", "-1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
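
To make the format concrete, here is a minimal sketch of a parser for duration strings of the shape described in the note. It is an illustration only: `parse_duration` is a hypothetical helper written for this example, it returns seconds as a float, and it does not claim to match the exact behavior of the parser Numaflow uses (for instance, it rejects a bare `0` without a unit).

```python
import re

# Seconds per unit, for the units listed in the note above.
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0,
}
# One term: a decimal number with optional fraction, then a unit suffix.
_TERM = re.compile(r"(\d+(?:\.\d*)?|\.\d+)(ns|us|µs|ms|s|m|h)")


def parse_duration(text):
    """Parse strings such as "300ms", "-1.5h" or "2h45m" into seconds."""
    sign, body = 1.0, text
    if body.startswith(("+", "-")):
        sign = -1.0 if body[0] == "-" else 1.0
        body = body[1:]
    if not body:
        raise ValueError(f"invalid duration: {text!r}")
    total, pos = 0.0, 0
    while pos < len(body):
        match = _TERM.match(body, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    return sign * total


print(parse_duration("2h45m"))  # 9900.0
print(parse_duration("-1.5h"))  # -5400.0
```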

## Length

The `length` is the window size of the fixed window.

## Example

A 60-second window size can be defined as follows.
```yaml
window:
  fixed:
    length: 60s
```

@@ -0,0 +1,40 @@
# Sliding

A Sliding window is similar to a Fixed window: the size of the window is measured in time and is fixed.
The important difference from the Fixed window is that a sliding window allows an element to be present in
more than one window. The additional window slide parameter controls how frequently a sliding window
is started. Hence, sliding windows overlap, and the slide should be smaller than the window
length.
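
The overlap can be illustrated with a short sketch that lists every window containing a given event. As with any simplified model, this rests on assumptions: windows are half-open `[start, end)` intervals, window starts are multiples of `slide` counted from time zero, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(event_time, length, slide):
    """Return every [start, end) sliding window that contains event_time.

    Assumes window starts are multiples of `slide`, counted from time zero.
    """
    # Latest window start that is not after the event.
    start = event_time - (event_time % slide)
    windows = []
    # Walk backwards while the window still covers the event.
    while start > event_time - length:
        windows.append((start, start + length))
        start -= slide
    return sorted(windows)


# A 60-second window sliding every 10 seconds: the event lands in 6 windows.
for window in sliding_windows(65, 60, 10):
    print(window)
```

With `slide` equal to `length`, the function returns a single window, which is the fixed-window case.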

![plot](../../../../assets/sliding.png)

```yaml
window:
  sliding:
    length: duration
    slide: duration
```

NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction
and a unit suffix, such as "300ms", "-1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".

## Length

The `length` is the window size of the sliding window.

## Slide

`slide` is the parameter that controls how frequently a new sliding window is started.

## Example

To create a sliding window of length 1 minute which slides every 30 seconds, we can use the following
snippet.

```yaml
window:
  sliding:
    length: 60s
    slide: 30s
```

@@ -0,0 +1,63 @@
# Windowing

In the world of data processing on an unbounded stream, Windowing is the concept
of grouping data using temporal boundaries. We use event-time to discover
temporal boundaries on an unbounded, infinite stream and [Watermark](../../../watermarks.md) to ensure
the datasets within the boundaries are complete. The [reduce](../reduce.md) function is
applied to these grouped datasets.
For example, to find the number of users online per minute, we use
windowing to group the users into one-minute buckets.

All windowing configuration lives under the `groupBy` section.
```yaml
groupBy:
  window:
    ...
  keyed: ...
```

Under the `window` section we will define different types of windows.

## Window Types
Numaflow supports the following types of windows:
* [Fixed](fixed.md)
* [Sliding](sliding.md)

## Non-Keyed vs. Keyed Windows

### Non-Keyed
A non-keyed partition is a partition where the window is the boundary condition.
Data processing on a non-keyed partition cannot be scaled horizontally because
only one partition exists.

A non-keyed partition is usually used after aggregation and is rarely seen at
the head of a data processing pipeline.
(There is also a concept called Global Window, where there is no windowing, but
we will leave that for later.)

### Keyed
A keyed partition is a partition where the partition boundary is a composite
key of both the window and the key from the payload (e.g., GROUP BY country,
where country names are the keys). Each smaller partition now has the complete
dataset for that key and boundary. Dividing a huge
window-based partition into smaller partitions by adding keys along with the
window helps us scale the processing horizontally.

Keyed partitions are heavily used to aggregate data and are frequently seen
throughout the processing pipeline. We could also convert a non-keyed problem
into a set of keyed problems and apply a non-keyed function at the end. This
helps solve the original problem in a scalable manner without affecting the
completeness or accuracy of the result.
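
The following sketch illustrates both ideas under simplifying assumptions: partial counts are kept per composite `(window, key)` partition, and a final non-keyed step combines them into one total per window. The function names, the `(event_time, key)` tuple shape, and the zero-aligned windows are hypothetical choices for this example, not Numaflow APIs.

```python
from collections import Counter


def keyed_counts(events, window_seconds):
    """Count events per (window_start, key) partition.

    Each composite partition is independent, so in a real system the
    partitions could be processed in parallel.
    """
    counts = Counter()
    for event_time, key in events:
        window_start = event_time - (event_time % window_seconds)
        counts[(window_start, key)] += 1
    return counts


def total_per_window(partial_counts):
    """Non-keyed final step: combine the keyed partial results per window."""
    totals = Counter()
    for (window_start, _key), count in partial_counts.items():
        totals[window_start] += count
    return totals


# Made-up sample data: (event time in seconds, country key).
events = [(3, "US"), (10, "IN"), (59, "US"), (61, "IN")]
partials = keyed_counts(events, 60)
print(dict(partials))                    # {(0, 'US'): 2, (0, 'IN'): 1, (60, 'IN'): 1}
print(dict(total_per_window(partials)))  # {0: 3, 60: 1}
```

The final totals match what a single non-keyed count per window would produce, which is the sense in which the result stays complete and accurate.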

### Usage

Numaflow supports both Keyed and Non-Keyed windows. We set `keyed` to either
`true` (keyed) or `false` (non-keyed). Please note that non-keyed windows
are not horizontally scalable, as mentioned above.

```yaml
groupBy:
  window:
    ...
  keyed: false # or true
```
4 changes: 4 additions & 0 deletions mkdocs.yml
@@ -64,6 +64,10 @@ nav:
- Filter: "user-guide/user-defined-functions/map/builtin-functions/filter.md"
- Reduce:
- Overview: "user-guide/user-defined-functions/reduce/reduce.md"
- Windowing:
- Overview: "user-guide/user-defined-functions/reduce/windowing/windowing.md"
- Fixed: "user-guide/user-defined-functions/reduce/windowing/fixed.md"
- Sliding: "user-guide/user-defined-functions/reduce/windowing/sliding.md"
- user-guide/pipeline-tuning.md
- user-guide/conditional-forwarding.md
- user-guide/autoscaling.md