[ENV-95] Add decision step type (#97)

* Add decision step
* Update docs
* Simplify pruning code
cloudera-labs · Jul 14, 2017 · 0764a72
1 parent 9cbe79c

Showing 9 changed files with 752 additions and 54 deletions.
diff --git a/README.md b/README.md
@@ -51,4 +51,5 @@ If you are ready for more, dive in:
 * [Derivers Guide](docs/derivers.adoc) - detailed information on each provided deriver, and how to write custom derivers
 * [Planners Guide](docs/planners.adoc) - directions and details on when, why, and how to use planners and associated outputs
 * [Looping Guide](docs/looping.adoc) - information and an example for defining loops in an Envelope pipeline
+* [Decisions Guide](docs/decisions.adoc) - information on using decisions to dynamically choose which parts of the pipeline to run
 * [Contributing to Envelope](docs/contributing.adoc) - guidelines and best practices for both developing and sharing Envelope components and applications
diff --git a/docs/configurations.adoc b/docs/configurations.adoc
@@ -91,7 +91,7 @@ Step configurations have the `steps.[stepname].` prefix. All steps can have the
 |Configuration suffix|Description
 
 |type
-|The step type. Envelope supports `data` and `loop`. Default `data`.
+|The step type. Envelope supports `data`, `loop`, and `decision`. Default `data`.
 
 |dependencies
 |The list of step names that Envelope will submit before submitting this step.
@@ -100,7 +100,7 @@ Step configurations have the `steps.[stepname].` prefix. All steps can have the
 
 === Data steps
 
-Data steps can additionally have the below configurations.
+In addition to the general step configurations, data steps can have the configurations below.
 
 [cols="2,8", options="header"]
 |===
@@ -125,7 +125,7 @@ Data steps can additionally have the below configurations.
 
 === Loop steps
 
-Loop steps can additionally have the below configurations.
+In addition to the general step configurations, loop steps can have the configurations below. For more information on loop steps, see the link:looping.adoc[looping guide].
 
 [cols="2,8", options="header"]
 |===
@@ -154,9 +154,34 @@ Loop steps can additionally have the below configurations.
 
 |===
 
+=== Decision steps
+
+In addition to the general step configurations, decision steps can have the configurations below. For more information on decision steps, see the link:decisions.adoc[decisions guide].
+
+[cols="2,8", options="header"]
+|===
+|Configuration suffix|Description
+
+|if-true-steps
+|Required. The list of dependent step names that will be kept if the decision result is true. The steps listed must directly depend on the decision step. The remaining directly dependent steps of the decision step will be kept if the decision result is false. Any steps subsequently dependent on the removed steps will also be removed.
+
+|method
+|Required. The method by which the decision step will make the decision. Envelope supports `literal`, `step_by_key`, and `step_by_value`.
+
+|result
+|Required if `method` is `literal`. The true or false result for the decision.
+
+|step
+|Required if `method` is `step_by_key` or `step_by_value`. The name of the previous step from which to extract the decision result.
+
+|key
+|Required if `method` is `step_by_key`. The specific key of the previous step to look up the boolean result by.
+
+|===
+
 === Inputs
 
-Input configurations belong to data steps, and have the `steps.[stepname].input.` prefix.
+Input configurations belong to data steps, and have the `steps.[stepname].input.` prefix. For more information on inputs see the link:inputs.adoc[inputs guide].
 
 [cols="2,8", options="header"]
 |===
@@ -442,7 +467,7 @@ Translator configurations belong to data steps, and have the `steps.[stepname].i
 
 === Derivers
 
-Deriver configurations belong to data steps, and have the `steps.[stepname].deriver.` prefix.
+Deriver configurations belong to data steps, and have the `steps.[stepname].deriver.` prefix. For more information on derivers see the link:derivers.adoc[derivers guide].
 
 [cols="2,8", options="header"]
 |===
@@ -629,7 +654,7 @@ Partitioner configurations belong to data steps, and have the `steps.[stepname].
 
 === Planners
 
-Planner configurations belong to data steps, and have the `steps.[stepname].planner.` prefix.
+Planner configurations belong to data steps, and have the `steps.[stepname].planner.` prefix. For more information on planners see the link:planners.adoc[planners guide].
 
 [cols="2,8", options="header"]
 |===
@@ -952,6 +977,8 @@ cell sizes you may want to reduce this number or increase the relevant client bu
 
 === Repetitions
 
+For more information on repetitions see the link:repetitions.adoc[repetitions guide].
+
 The general configuration parameters for repetitions are:
 
 [cols="2,8a", options="header"]

diff --git a/docs/decisions.adoc b/docs/decisions.adoc
@@ -0,0 +1,189 @@
+= Decisions guide
+
+Envelope allows a pipeline to make a decision that determines which steps will be run. This is achieved by including a decision step that, when it runs, decides which subsequent steps of the pipeline to remove and which to keep.
+
+A decision step makes a decision that returns a true or false result. The `if-true-steps` configuration of a decision step specifies which of its dependent steps will be kept if the result is true. The remaining dependent steps will be kept if the result is false.
+
+A decision step can make a decision using one of three methods, which are outlined with examples in the section below.
+
+== Decision methods
+
+=== Literal
+
+The `literal` decision method takes the true or false result directly from the configuration of the decision step. This method would be useful if the result is provided by a parameter, which in turn can be populated by a `spark2-submit` argument or an environment variable.
+
+In this self-contained example the value of the `${result}` parameter will determine whether `run_if_true` and `run_after_run_if_true`, or `run_if_false` and `run_after_run_if_false`, are run:
+
+----
+application.name = Decision step by literal
+steps {
+  decide {
+    type = decision
+    if-true-steps = [run_if_true]
+    method = literal
+    result = ${result}
+  }
+  run_if_true {
+    dependencies = [decide]
+    deriver {
+      type = sql
+      query.literal = "SELECT true"
+    }
+    print.data.enabled = true
+  }
+  run_after_run_if_true {
+    dependencies = [run_if_true]
+    deriver {
+      type = sql
+      query.literal = "SELECT 'No, really, it was true!'"
+    }
+    print.data.enabled = true
+  }
+  run_if_false {
+    dependencies = [decide]
+    deriver {
+      type = sql
+      query.literal = "SELECT false"
+    }
+    print.data.enabled = true
+  }
+  run_after_run_if_false {
+    dependencies = [run_if_false]
+    deriver {
+      type = sql
+      query.literal = "SELECT 'No, really, it was false!'"
+    }
+    print.data.enabled = true
+  }
+}
+----
+
+This pipeline could be run with `${result}` populated by using an argument after the configuration file:
+
+  spark2-submit envelope-*.jar pipeline.conf result=true
+
+=== Step by key
+
+The `step_by_key` decision method takes the result from the data of a previous step, where the result is looked up in that data by a specific key. This method would be useful for making decisions on data quality results that provide a true or false result for each dataset-scoped check.
+
+The data of the step must contain only two columns: first a string (the key), and second a boolean (the result).
+
+In this self-contained example the value of the `test1` key in the `generate` step will determine whether `run_if_true` and `run_after_true`, or `run_if_false` and `run_after_false`, are run:
+
+----
+application.name = Decision step by step by key
+steps {
+  generate {
+    deriver {
+      type = sql
+      query.literal = "SELECT 'test1', true UNION ALL SELECT 'test2', false"
+    }
+  }
+  decide {
+    dependencies = [generate]
+    type = decision
+    if-true-steps = [run_if_true]
+    method = step_by_key
+    step = generate
+    key = test1
+  }
+  run_if_true {
+    dependencies = [decide]
+    deriver {
+      type = sql
+      query.literal = "SELECT true"
+    }
+    print.data.enabled = true
+  }
+  run_after_true {
+    dependencies = [run_if_true]
+    deriver {
+      type = sql
+      query.literal = "SELECT 'No, really, it was true!'"
+    }
+    print.data.enabled = true
+  }
+  run_if_false {
+    dependencies = [decide]
+    deriver {
+      type = sql
+      query.literal = "SELECT false"
+    }
+    print.data.enabled = true
+  }
+  run_after_false {
+    dependencies = [run_if_false]
+    deriver {
+      type = sql
+      query.literal = "SELECT 'No, really, it was false!'"
+    }
+    print.data.enabled = true
+  }
+}
+----
+
+=== Step by value
+
+The `step_by_value` decision method takes the result from the single boolean value of a previous step. This method would be useful when a previous step has a deriver that aggregates into a single result.
+
+The data of the step must contain a single boolean column and only a single row.
+
+In this self-contained example the sole value of the `aggregate` step will determine whether `run_if_true` and `run_after_true`, or `run_if_false` and `run_after_false`, are run:
+
+----
+application.name = Decision step by step by value
+steps {
+  generate {
+    deriver {
+      type = sql
+      query.literal = "SELECT 'test1' AS key, true AS result UNION ALL SELECT 'test2' AS key, false AS result"
+    }
+  }
+  aggregate {
+    dependencies = [generate]
+    deriver {
+      type = sql
+      query.literal = "SELECT MIN(result) = true AS result FROM generate"
+    }
+  }
+  decide {
+    dependencies = [aggregate]
+    type = decision
+    if-true-steps = [run_if_true]
+    method = step_by_value
+    step = aggregate
+  }
+  run_if_true {
+    dependencies = [decide]
+    deriver {
+      type = sql
+      query.literal = "SELECT true"
+    }
+    print.data.enabled = true
+  }
+  run_after_true {
+    dependencies = [run_if_true]
+    deriver {
+      type = sql
+      query.literal = "SELECT 'No, really, it was true!'"
+    }
+    print.data.enabled = true
+  }
+  run_if_false {
+    dependencies = [decide]
+    deriver {
+      type = sql
+      query.literal = "SELECT false"
+    }
+    print.data.enabled = true
+  }
+  run_after_false {
+    dependencies = [run_if_false]
+    deriver {
+      type = sql
+      query.literal = "SELECT 'No, really, it was false!'"
+    }
+    print.data.enabled = true
+  }
+}
+----
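
The keep-or-remove rule described in the decisions guide can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Envelope's Java implementation: the function names `evaluate_decision` and `prune_steps`, and the dictionary that maps each step name to its dependencies, are assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple


def evaluate_decision(method: str,
                      result: Optional[bool] = None,
                      rows: Sequence[Tuple] = (),
                      key: Optional[str] = None) -> bool:
    """Hypothetical helper: resolve a decision to True or False."""
    if method == "literal":
        if result is None:
            raise ValueError("'result' is required for the literal method")
        return result
    if method == "step_by_key":
        # Rows are (key, result) pairs; return the result stored under the key.
        for row_key, row_result in rows:
            if row_key == key:
                return bool(row_result)
        raise KeyError(f"key {key!r} not found in step data")
    if method == "step_by_value":
        # Exactly one row holding one boolean value is expected.
        if len(rows) != 1 or len(rows[0]) != 1:
            raise ValueError("expected a single row with a single boolean column")
        return bool(rows[0][0])
    raise ValueError(f"unknown decision method: {method}")


def prune_steps(dependencies: Dict[str, List[str]],
                decision_step: str,
                if_true_steps: List[str],
                result: bool) -> Set[str]:
    """Hypothetical helper: return the step names that remain after the decision."""
    # Steps that directly depend on the decision step.
    direct = {s for s, deps in dependencies.items() if decision_step in deps}
    not_direct = set(if_true_steps) - direct
    if not_direct:
        raise ValueError(f"not direct dependents of {decision_step}: {sorted(not_direct)}")

    # Keep the listed steps on true, the remaining direct dependents on false.
    kept_direct = set(if_true_steps) if result else direct - set(if_true_steps)
    removed = direct - kept_direct

    # Remove every step that depends, directly or indirectly, on a removed step.
    frontier = list(removed)
    while frontier:
        current = frontier.pop()
        for step, deps in dependencies.items():
            if current in deps and step not in removed:
                removed.add(step)
                frontier.append(step)
    return set(dependencies) - removed


# The pipeline from the literal example, as step name -> dependencies.
pipeline = {
    "decide": [],
    "run_if_true": ["decide"],
    "run_after_run_if_true": ["run_if_true"],
    "run_if_false": ["decide"],
    "run_after_run_if_false": ["run_if_false"],
}
kept = prune_steps(pipeline, "decide", ["run_if_true"],
                   evaluate_decision("literal", result=True))
```

With a true result, `kept` contains `decide`, `run_if_true`, and `run_after_run_if_true`. With a false result the two `run_if_false` steps survive instead. In this sketch a step that depends on both a kept and a removed step is removed, which follows the documented rule that steps dependent on removed steps are also removed.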