[ENV-95] Add decision step type (#97)
* Add decision step

* Update docs

* Simplify pruning code
Jeremy Beard authored and Ian Buss committed Jul 14, 2017
1 parent 9cbe79c commit 0764a72
Showing 9 changed files with 752 additions and 54 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -51,4 +51,5 @@ If you are ready for more, dive in:
* [Derivers Guide](docs/derivers.adoc) - detailed information on each provided deriver, and how to write custom derivers
* [Planners Guide](docs/planners.adoc) - directions and details on when, why, and how to use planners and associated outputs
* [Looping Guide](docs/looping.adoc) - information and an example for defining loops in an Envelope pipeline
* [Decisions Guide](docs/decisions.adoc) - information on using decisions to dynamically choose which parts of the pipeline to run
* [Contributing to Envelope](docs/contributing.adoc) - guidelines and best practices for both developing and sharing Envelope components and applications
39 changes: 33 additions & 6 deletions docs/configurations.adoc
@@ -91,7 +91,7 @@ Step configurations have the `steps.[stepname].` prefix. All steps can have the
|Configuration suffix|Description

|type
|The step type. Envelope supports `data` and `loop`. Default `data`.
|The step type. Envelope supports `data`, `loop`, and `decision`. Default `data`.

|dependencies
|The list of step names that Envelope will submit before submitting this step.
@@ -100,7 +100,7 @@ Step configurations have the `steps.[stepname].` prefix. All steps can have the

=== Data steps

Data steps can additionally have the below configurations.
In addition to the common step configurations, data steps can have the configurations below.

[cols="2,8", options="header"]
|===
@@ -125,7 +125,7 @@ Data steps can additionally have the below configurations.

=== Loop steps

Loop steps can additionally have the below configurations.
In addition to the common step configurations, loop steps can have the configurations below. For more information on loop steps, see the link:looping.adoc[looping guide].

[cols="2,8", options="header"]
|===
@@ -154,9 +154,34 @@ Loop steps can additionally have the below configurations.

|===

=== Decision steps

In addition to the common step configurations, decision steps can have the configurations below. For more information on decision steps, see the link:decisions.adoc[decisions guide].

[cols="2,8", options="header"]
|===
|Configuration suffix|Description

|if-true-steps
|Required. The list of dependent step names that will be kept if the decision result is true. The steps listed must directly depend on the decision step. The remaining directly dependent steps of the decision step will be kept if the decision result is false. Any steps subsequently dependent on the removed steps will also be removed.

|method
|Required. The method by which the decision step will make the decision. Envelope supports `literal`, `step_by_key`, and `step_by_value`.

|result
|Required if `method` is `literal`. The true or false result for the decision.

|step
|Required if `method` is `step_by_key` or `step_by_value`. The name of the previous step from which to extract the decision result.

|key
|Required if `method` is `step_by_key`. The specific key of the previous step to look up the boolean result by.

|===

=== Inputs

Input configurations belong to data steps, and have the `steps.[stepname].input.` prefix.
Input configurations belong to data steps, and have the `steps.[stepname].input.` prefix. For more information on inputs see the link:inputs.adoc[inputs guide].

[cols="2,8", options="header"]
|===
@@ -442,7 +467,7 @@ Translator configurations belong to data steps, and have the `steps.[stepname].i

=== Derivers

Deriver configurations belong to data steps, and have the `steps.[stepname].deriver.` prefix.
Deriver configurations belong to data steps, and have the `steps.[stepname].deriver.` prefix. For more information on derivers see the link:derivers.adoc[derivers guide].

[cols="2,8", options="header"]
|===
@@ -629,7 +654,7 @@ Partitioner configurations belong to data steps, and have the `steps.[stepname].

=== Planners

Planner configurations belong to data steps, and have the `steps.[stepname].planner.` prefix.
Planner configurations belong to data steps, and have the `steps.[stepname].planner.` prefix. For more information on planners see the link:planners.adoc[planners guide].

[cols="2,8", options="header"]
|===
@@ -952,6 +977,8 @@ cell sizes you may want to reduce this number or increase the relevant client bu

=== Repetitions

For more information on repetitions see the link:repetitions.adoc[repetitions guide].

The general configuration parameters for repetitions are:

[cols="2,8a", options="header"]
189 changes: 189 additions & 0 deletions docs/decisions.adoc
@@ -0,0 +1,189 @@
= Decisions guide

Envelope allows a pipeline to make a decision that determines which steps will be run. This is achieved by including a decision step that, when it is run, decides which subsequent steps of the pipeline to remove and which to keep.

A decision step makes a decision that returns a true or false result. The `if-true-steps` configuration of a decision step specifies which of its dependent steps will be kept if the result is true. The remaining dependent steps will be kept if the result is false.
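The keep-or-remove rule described above can be illustrated with a short Python sketch. This is not Envelope's implementation: the `prune_steps` function, its arguments, and the dictionary representation of the pipeline are all hypothetical, chosen only to show how the direct dependents are split by the result and how the removal cascades to later steps.

```python
from collections import defaultdict


def prune_steps(dependencies, decision_step, if_true_steps, result):
    """Return the names of the steps that remain after the decision.

    `dependencies` maps each step name to the list of steps it depends on.
    All names here are hypothetical and only illustrate the documented rule.
    """
    # Invert the map: step -> steps that directly depend on it.
    dependents = defaultdict(set)
    for step, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(step)

    direct = dependents[decision_step]
    not_direct = set(if_true_steps) - direct
    if not_direct:
        raise ValueError(f"Not direct dependents of {decision_step}: {sorted(not_direct)}")

    # True keeps the listed steps; false keeps the other direct dependents.
    removed = (direct - set(if_true_steps)) if result else set(if_true_steps)

    # Any step that depends on a removed step is removed as well.
    stack = list(removed)
    while stack:
        for child in dependents[stack.pop()]:
            if child not in removed:
                removed.add(child)
                stack.append(child)

    return set(dependencies) - removed


# The same step layout as the examples in this guide.
pipeline = {
    "decide": [],
    "run_if_true": ["decide"],
    "run_after_run_if_true": ["run_if_true"],
    "run_if_false": ["decide"],
    "run_after_run_if_false": ["run_if_false"],
}

kept_when_true = prune_steps(pipeline, "decide", ["run_if_true"], True)
kept_when_false = prune_steps(pipeline, "decide", ["run_if_true"], False)
```

With a true result only the decision step and the `run_if_true` branch remain. With a false result only the decision step and the `run_if_false` branch remain.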

A decision step can make a decision using one of three methods, which are outlined with examples in the section below.

== Decision methods

=== Literal

The `literal` decision method takes the true or false result directly from the configuration of the decision step. This method is useful when the result is provided by a parameter, which in turn can be populated by a `spark2-submit` argument or an environment variable.

In this self-contained example the value of the `${result}` parameter will determine whether `run_if_true` and `run_after_run_if_true`, or `run_if_false` and `run_after_run_if_false`, are run:

----
application.name = Decision step by literal
steps {
  decide {
    type = decision
    if-true-steps = [run_if_true]
    method = literal
    result = ${result}
  }
  run_if_true {
    dependencies = [decide]
    deriver {
      type = sql
      query.literal = "SELECT true"
    }
    print.data.enabled = true
  }
  run_after_run_if_true {
    dependencies = [run_if_true]
    deriver {
      type = sql
      query.literal = "SELECT 'No, really, it was true!'"
    }
    print.data.enabled = true
  }
  run_if_false {
    dependencies = [decide]
    deriver {
      type = sql
      query.literal = "SELECT false"
    }
    print.data.enabled = true
  }
  run_after_run_if_false {
    dependencies = [run_if_false]
    deriver {
      type = sql
      query.literal = "SELECT 'No, really, it was false!'"
    }
    print.data.enabled = true
  }
}
----

This pipeline could be run with `${result}` populated by using an argument after the configuration file:

    spark2-submit envelope-*.jar pipeline.conf result=true

=== Step by key

The `step_by_key` decision method takes the result from the data of a previous step, where the result is looked up in that data by a specific key. This method would be useful for making decisions on data quality results that provide a true or false result for each dataset-scoped check.

The data of the step must contain only two columns: first a string (the key), and second a boolean (the result).

In this self-contained example the value for the `test1` key in the `generate` step will determine whether `run_if_true` and `run_after_true`, or `run_if_false` and `run_after_false`, are run:

----
application.name = Decision step by step by key
steps {
  generate {
    deriver {
      type = sql
      query.literal = "SELECT 'test1', true UNION ALL SELECT 'test2', false"
    }
  }
  decide {
    dependencies = [generate]
    type = decision
    if-true-steps = [run_if_true]
    method = step_by_key
    step = generate
    key = test1
  }
  run_if_true {
    dependencies = [decide]
    deriver {
      type = sql
      query.literal = "SELECT true"
    }
    print.data.enabled = true
  }
  run_after_true {
    dependencies = [run_if_true]
    deriver {
      type = sql
      query.literal = "SELECT 'No, really, it was true!'"
    }
    print.data.enabled = true
  }
  run_if_false {
    dependencies = [decide]
    deriver {
      type = sql
      query.literal = "SELECT false"
    }
    print.data.enabled = true
  }
  run_after_false {
    dependencies = [run_if_false]
    deriver {
      type = sql
      query.literal = "SELECT 'No, really, it was false!'"
    }
    print.data.enabled = true
  }
}
----
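The lookup that `step_by_key` performs on the step's data can be sketched in Python as follows. This is only an illustration under assumptions: the `decide_by_key` function is hypothetical, rows are modelled as plain tuples, and raising an error for a missing or duplicated key is an assumed behavior that this guide does not specify.

```python
def decide_by_key(rows, key):
    """Look up the boolean result for `key` in (key, result) rows.

    Hypothetical helper: rows stand in for the two-column data of the
    referenced step, first a string key and second a boolean result.
    """
    matches = []
    for row in rows:
        if len(row) != 2:
            raise ValueError("Expected exactly two columns: key and result")
        row_key, row_result = row
        if not isinstance(row_key, str) or not isinstance(row_result, bool):
            raise ValueError("Expected a string key and a boolean result")
        if row_key == key:
            matches.append(row_result)

    # Assumption: exactly one row must match the requested key.
    if len(matches) != 1:
        raise ValueError(f"Expected one row for key '{key}', found {len(matches)}")
    return matches[0]


# Rows equivalent to the output of the `generate` step in the example above.
generate_rows = [("test1", True), ("test2", False)]
decision = decide_by_key(generate_rows, "test1")
```

Here the decision for `test1` is true, so the `run_if_true` branch of the example would be kept.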

=== Step by value

The `step_by_value` decision method takes the result from the single boolean value of a previous step. This method would be useful when a previous step has a deriver that aggregates into a single result.

The data of the step must contain a single boolean column and only a single row.

In this self-contained example the sole value of the `aggregate` step will determine whether `run_if_true` and `run_after_true`, or `run_if_false` and `run_after_false`, are run:

----
application.name = Decision step by step by value
steps {
  generate {
    deriver {
      type = sql
      query.literal = "SELECT 'test1' AS key, true AS result UNION ALL SELECT 'test2' AS key, false AS result"
    }
  }
  aggregate {
    dependencies = [generate]
    deriver {
      type = sql
      query.literal = "SELECT MIN(result) = true AS result FROM generate"
    }
  }
  decide {
    dependencies = [aggregate]
    type = decision
    if-true-steps = [run_if_true]
    method = step_by_value
    step = aggregate
  }
  run_if_true {
    dependencies = [decide]
    deriver {
      type = sql
      query.literal = "SELECT true"
    }
    print.data.enabled = true
  }
  run_after_true {
    dependencies = [run_if_true]
    deriver {
      type = sql
      query.literal = "SELECT 'No, really, it was true!'"
    }
    print.data.enabled = true
  }
  run_if_false {
    dependencies = [decide]
    deriver {
      type = sql
      query.literal = "SELECT false"
    }
    print.data.enabled = true
  }
  run_after_false {
    dependencies = [run_if_false]
    deriver {
      type = sql
      query.literal = "SELECT 'No, really, it was false!'"
    }
    print.data.enabled = true
  }
}
----
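The shape check that `step_by_value` implies (one boolean column, one row) can be sketched in Python as follows. The `decide_by_value` function is hypothetical and rows are modelled as plain tuples. The sketch is not Envelope's code, and the error handling is an assumption.

```python
def decide_by_value(rows):
    """Return the sole boolean value of a one-row, one-column result.

    Hypothetical helper: rows stand in for the data of the referenced step.
    """
    rows = list(rows)
    if len(rows) != 1 or len(rows[0]) != 1:
        raise ValueError("Expected a single row with a single column")
    (value,) = rows[0]
    if not isinstance(value, bool):
        raise ValueError("Expected a boolean value")
    return value


# Rows equivalent to the `generate` step in the example above.
generate_rows = [("test1", True), ("test2", False)]

# Mirrors the intent of the aggregation query: true only if every check passed.
aggregate_rows = [(all(result for _, result in generate_rows),)]

decision = decide_by_value(aggregate_rows)
```

Because `test2` is false, the aggregated value is false and the `run_if_false` branch of the example would be kept.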