doc: windowing fixed and sliding (#439)

Signed-off-by: Vigith Maurice <[email protected]> Signed-off-by: Yashash H L <[email protected]> Co-authored-by: Yashash H L <[email protected]>
numaproj · Jan 4, 2023 · f7f712b
1 parent ae7d21f
commit f7f712b

Showing 7 changed files with 161 additions and 1 deletion.
diff --git a/docs/assets/fixed.png b/docs/assets/fixed.png
diff --git a/docs/assets/sliding.png b/docs/assets/sliding.png
diff --git a/docs/user-guide/user-defined-functions/reduce/reduce.md b/docs/user-guide/user-defined-functions/reduce/reduce.md
@@ -1,3 +1,14 @@
 # Reduce UDF
 
-Work In Progress
+Reduce is one of the most commonly used abstractions in a stream processing pipeline
+for defining aggregation functions on a stream of data. The reduce feature
+helps us solve problems that call for a summary operation, such as
+counting the number of occurrences of a key (e.g., user login frequencies).
+Since the input is an unbounded stream (with infinite entries), we need an
+additional parameter to convert the unbounded problem into a bounded one
+and provide results on that. That bounding condition is "time", e.g., "number
+of users logged in per minute".
+So while processing an unbounded stream of data, we need a way to group elements
+into finite chunks using time. These chunks are built using the concept of
+[windowing](./windowing/windowing.md), and the reduce function is applied to each set of records.
+
diff --git a/docs/user-guide/user-defined-functions/reduce/windowing/fixed.md b/docs/user-guide/user-defined-functions/reduce/windowing/fixed.md
@@ -0,0 +1,37 @@
+# Fixed
+
+Fixed windows (sometimes called tumbling windows) are defined by a static window size, e.g. 30-second
+windows, one-minute windows, etc. They are generally aligned, i.e. every window applies across all
+the data for the corresponding period of time. A fixed window has a fixed size measured in time and does not
+overlap with other windows. An element that belongs to one window will not belong to any other tumbling window.
+For example, a window size of 20 seconds will include all entities of the stream which came in a
+certain 20-second interval.
+
+![plot](../../../../assets/fixed.png)
+
+To enable a Fixed window, we use `fixed` under the `window` section.
+
+```yaml
+groupBy:
+  window:
+    fixed:
+      length:  duration
+```
+
+NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction
+and a unit suffix, such as "300ms", "-1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
+
+## Length
+
+The `length` is the window size of the fixed window. 
+
+## Example
+
+A 60-second window size can be defined as follows.
+```yaml
+groupBy:
+  window:
+    fixed:
+      length:  60s
+```
+
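To illustrate the fixed-window semantics described in `fixed.md` above, here is a minimal Python sketch of how an event timestamp could be mapped to its single window. It is not Numaflow's implementation: the function name `fixed_window`, the use of integer seconds, and the alignment of window boundaries to the epoch are all assumptions made for the example.

```python
from typing import Tuple


def fixed_window(event_time_s: int, length_s: int) -> Tuple[int, int]:
    """Return the half-open [start, end) fixed window containing the event.

    Hypothetical helper: assumes windows are aligned to time 0 (the epoch).
    """
    if length_s <= 0:
        raise ValueError("window length must be positive")
    # Round the event time down to the nearest multiple of the window length.
    start = event_time_s - (event_time_s % length_s)
    return start, start + length_s


if __name__ == "__main__":
    # An event at t=125s with a 60s window.
    print(fixed_window(125, 60))
```

Because the windows do not overlap, every timestamp maps to exactly one `(start, end)` pair.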
diff --git a/docs/user-guide/user-defined-functions/reduce/windowing/sliding.md b/docs/user-guide/user-defined-functions/reduce/windowing/sliding.md
@@ -0,0 +1,42 @@
+# Sliding
+
+Sliding windows are similar to Fixed windows: the size of the windows is measured in time and is fixed.
+The important difference from the Fixed window is that an element can be present in
+more than one window. The additional window slide parameter controls how frequently a sliding window
+is started. Hence, sliding windows overlap, and the slide should be smaller than the window
+length.
+
+![plot](../../../../assets/sliding.png)
+
+```yaml
+groupBy:
+  window:
+    sliding:
+      length: duration
+      slide: duration
+```
+
+NOTE: A duration string is a possibly signed sequence of decimal numbers, each with optional fraction
+and a unit suffix, such as "300ms", "-1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
+
+## Length
+
+The `length` is the window size of the sliding window.
+
+## Slide
+
+`slide` is the slide parameter that controls the frequency at which the sliding window is created.
+
+## Example
+
+To create a sliding window of length 1 minute which slides every 10 seconds, we can use the following
+snippet.
+
+```yaml
+groupBy:
+  window:
+    sliding:
+      length: 60s
+      slide: 10s
+```
+
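To illustrate how one element can fall into several sliding windows, as described in `sliding.md` above, here is a minimal Python sketch. It is not Numaflow's implementation: the function name `sliding_windows`, the use of integer seconds, and the alignment of window starts to multiples of the slide from time 0 are assumptions made for the example.

```python
from typing import List, Tuple


def sliding_windows(event_time_s: int, length_s: int, slide_s: int) -> List[Tuple[int, int]]:
    """Return every half-open [start, end) sliding window containing the event.

    Hypothetical helper: assumes window starts are multiples of slide_s.
    """
    if not 0 < slide_s <= length_s:
        raise ValueError("slide must be positive and no larger than length")
    # Latest window start at or before the event time.
    start = event_time_s - (event_time_s % slide_s)
    windows = []
    # Walk backwards one slide at a time while the window still covers the event.
    while start > event_time_s - length_s:
        windows.append((start, start + length_s))
        start -= slide_s
    return sorted(windows)


if __name__ == "__main__":
    # An event at t=65s with a 60s window sliding every 10s.
    for window in sliding_windows(65, 60, 10):
        print(window)
```

Under these assumptions an element belongs to `length / slide` windows when the length is a multiple of the slide, and to exactly one window when the slide equals the length.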
diff --git a/docs/user-guide/user-defined-functions/reduce/windowing/windowing.md b/docs/user-guide/user-defined-functions/reduce/windowing/windowing.md
@@ -0,0 +1,66 @@
+# Windowing
+
+In the world of data processing on an unbounded stream, Windowing is a concept
+of grouping data using temporal boundaries. We use event-time to discover
+temporal boundaries on an unbounded, infinite stream and [Watermark](../../../watermarks.md) to ensure
+the datasets within the boundaries are complete. The [reduce](../reduce.md) function is
+applied on these grouped datasets.
+For example, when we want to find the number of users online per minute, we use
+windowing to group the users into one-minute buckets.
+
+The entirety of windowing is under the `groupBy` section.
+```yaml
+groupBy:
+  window:
+    ...
+  keyed: ...
+```
+
+Since a window can be [Non-Keyed v/s Keyed](#non-keyed-vs-keyed-windows),
+we have an explicit field called `keyed` to differentiate between the two (see below).
+
+Under the `window` section we will define different types of windows.
+
+## Window Types
+Numaflow supports the following types of windows:
+  * [Fixed](fixed.md)
+  * [Sliding](sliding.md)
+
+## Non-Keyed v/s Keyed Windows
+
+### Non-Keyed
+A non-keyed partition is a partition where the window is the boundary condition.  
+Data processing on a non-keyed partition cannot be scaled horizontally because 
+only one partition exists.
+
+A non-keyed partition is usually used after aggregation and is hardly seen at
+the head section of any data processing pipeline.
+(There is a concept called Global Window where there is no windowing, but 
+let us table that for later).
+
+### Keyed
+A keyed partition is a partition where the partition boundary is a composite
+key of both the window and the key from the payload (e.g., GROUP BY country,
+where country names are the keys). Each smaller partition now has the complete
+dataset for that key and boundary. Subdividing a huge
+window-based partition into smaller partitions by adding keys along with the
+window helps us horizontally scale the distribution.
+
+Keyed partitions are heavily used to aggregate data and are frequently seen
+throughout the processing pipeline. We could also convert a non-keyed problem
+to a set of keyed problems and apply a non-keyed function at the end. This will
+help solve the original problem in a scalable manner without affecting the
+result's completeness and/or accuracy.
+
+### Usage
+
+Numaflow supports both Keyed and Non-Keyed windows. We set `keyed` to either
+`true` (keyed) or `false` (non-keyed). Please note that non-keyed windows
+are not horizontally scalable, as mentioned above.
+
+```yaml
+groupBy:
+  window:
+    ...
+  keyed: false # or true
+```
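To illustrate the keyed versus non-keyed distinction described in `windowing.md` above, here is a minimal Python sketch that counts events per partition. It is not Numaflow's implementation: the function name `count_per_partition`, the `(event_time_s, key)` event shape, and the epoch-aligned fixed windows are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (event time in seconds, key from the payload).
Event = Tuple[int, str]


def count_per_partition(events: Iterable[Event], length_s: int, keyed: bool) -> Dict[tuple, int]:
    """Count events per partition using fixed windows of length_s seconds.

    keyed=True  -> partition boundary is (window start, key)
    keyed=False -> partition boundary is the window alone
    """
    counts: Dict[tuple, int] = defaultdict(int)
    for event_time_s, key in events:
        window_start = event_time_s - (event_time_s % length_s)
        partition = (window_start, key) if keyed else (window_start,)
        counts[partition] += 1
    return dict(counts)


if __name__ == "__main__":
    events = [(5, "US"), (20, "IN"), (59, "US"), (61, "IN")]
    print(count_per_partition(events, 60, keyed=True))
    print(count_per_partition(events, 60, keyed=False))
```

In the keyed case each `(window, key)` partition can be processed independently, which is what allows horizontal scaling. In the non-keyed case there is a single partition per window.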
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -64,6 +64,10 @@ nav:
                   - Filter: "user-guide/user-defined-functions/map/builtin-functions/filter.md"
           - Reduce:
               - Overview: "user-guide/user-defined-functions/reduce/reduce.md"
+              - Windowing:
+                  - Overview: "user-guide/user-defined-functions/reduce/windowing/windowing.md"
+                  - Fixed: "user-guide/user-defined-functions/reduce/windowing/fixed.md"
+                  - Sliding: "user-guide/user-defined-functions/reduce/windowing/sliding.md"
       - user-guide/pipeline-tuning.md
       - user-guide/conditional-forwarding.md
       - user-guide/autoscaling.md