[ENV-126] Data Quality Deriver (#96)
Ian Buss authored and Jeremy Beard committed Jul 14, 2017
1 parent 0c9d565 commit 9cbe79c
Showing 24 changed files with 2,184 additions and 21 deletions.
78 changes: 77 additions & 1 deletion docs/configurations.adoc
@@ -449,7 +449,7 @@
|Configuration suffix|Description

|type
|The deriver type to be used. Envelope provides `morphline`, `nest`, `passthrough`, `sql`, `pivot`, and `exclude`. To use a custom deriver, specify the fully qualified name of the `Deriver` implementation class.
|The deriver type to be used. Envelope provides `morphline`, `nest`, `passthrough`, `sql`, `pivot`, `exclude`, and `dq`. To use a custom deriver, specify the fully qualified name of the `Deriver` implementation class.

|repartition.partitions
|The number of DataFrame partitions to repartition the deriver results by. In Spark this will run `DataFrame#repartition`. If this configuration is not provided then Envelope will not repartition the deriver results.
@@ -536,6 +536,82 @@
|field.names
|The names of the fields used to match rows between the two datasets. The fields must be identical in name and type across the datasets. A row is excluded if all of the fields are equal between the datasets.

||
|`_dq_`|

|scope
|Required. The scope at which to apply the DQ deriver. `dataset` or `row`.

|rules
|Required. A nested object of rules. Each rule object must contain a field `type`, which defines the type of the DQ rule: either a built-in rule name or a fully-qualified class name. Type-specific configurations are listed below.

||
|_checknulls_|

|fields
|Required. The list of field names (strings) to check for null values.

||
|_enum_|

|fields
|Required. String list of field names.

|fieldtype
|Optional. The type of the fields to check against the defined values: must be `string`, `long`, `int`, or `decimal`. Defaults to `string`.

|values
|Required. The list of allowed values. For strings and decimals, define the values using string literals; for integral types, use number literals.

|case-sensitive
|Optional. For string values, whether value matching should be case-sensitive. Defaults to `true`.

||
|_range_|

|fields
|Required. List of field names on which to apply the range checks.

|fieldtype
|Optional. The field type to use when doing range checks. Range values will be interpreted as this type. Must be numeric: allowed values are
`int`, `long`, `double`, `float`, and `decimal`. Take care when using floating-point values, as exact boundary matches may not behave as expected; use
`decimal` if exact boundaries are required. Defaults to `long`.

|range
|Required. Two element list of numeric literals, e.g. `[1,10]` or `[1.5,10.45]`. Both boundaries are inclusive.

||
|_regex_|

|fields
|Required. String list of field names, which should all have type `string`.

|regex
|Required. The regular expression with which to match field values. Note that extra escape characters are not required. For example, to match any number up to 999 you could use `\d{1,3}`.

||
|_count_|

|expected.literal
|Either this or `expected.dependency` required. A `long` literal with the expected number of rows in the dataset.

|expected.dependency
|Either this or `expected.literal` required. A string indicating the dependency in which the expected
count is defined. The dependency must be a DataFrame with a single row containing a single field of type `long`.

||
|_checkschema_|

|fields
|Required. A list of fields and types that are required to be in the dataset. List elements should be objects with
two fields: `name` and `type`. Valid types are: `string`, `byte`, `short`, `int`, `long`, `float`, `decimal`,
`boolean`, `binary`, `date`, `timestamp`. For `decimal`, two additional int fields are required: `scale` and `precision`.

|exactmatch
|Optional. Whether the schema of the Rows must exactly match the specified schema. If false the actual row can contain
other fields not specified in the `fields` configuration. Those that are specified must match both name and type. Defaults
to false.

|===

=== Partitioners
153 changes: 153 additions & 0 deletions docs/derivers.adoc
@@ -219,6 +219,159 @@ The equivalent SQL statement would read:
SELECT Left.* FROM Left LEFT ANTI JOIN Right USING (field1, field2)
----

=== Data Quality

The `dq` deriver can be used to perform data quality checks on a dataset using a set of user-defined
rules. Rules can be applied at one of two scopes: dataset level or row level. At dataset scope, the rules are
evaluated against the dataset as a whole, and the derived result is a dataset containing one row per rule, indicating a pass or fail.
The schema of the result dataset is `name: String, result: Boolean`. For
example, the result might be:

[options="header", width="30%"]
|===
|name|result
|namecheck|true
|agerange|false
|===
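
Because the dataset-scope result is itself a regular derived dataset, downstream steps can act on it. As a sketch (step and dependency names are hypothetical, and this assumes the `sql` deriver's `query.literal` parameter), a follow-on `sql` deriver could surface only the failing rules:

```
failedrules {
  dependencies = [checkmydata]
  deriver {
    type = sql
    query.literal = "SELECT name FROM checkmydata WHERE result = false"
  }
}
```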

Row-level scope applies the list of rules to every row of the defined input dependency.
The results of the checks are appended to each row as a field of type `map<string, boolean>`, called
`results` by default. The results would look something like:

[options="header", width="50%"]
|===
|name|age|results
|Ian|null|{"namenotnull":true,"agerange":false}
|Webster|21|{"namenotnull":true,"agerange":true}
|===

Envelope has a number of built-in rules (see below) but allows for custom user-defined rules via fully-qualified
class name. See the link:configurations.adoc[config guide] for specific configuration parameters.

==== Row Scope Rules

The following row-level rules are provided:

* `checknulls` - check for null values in one or more fields in a row
* `enum` - check one or more fields against a list of allowed values (non-floating-point numerics and strings)
* `range` - check that one or more numeric fields fall between upper and lower bounds (inclusive)
* `regex` - check one or more string fields against an allowed pattern
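
For instance, an `enum` rule constraining a hypothetical `status` field to a fixed set of string values, matched case-insensitively, might be configured as:

```
rules {
  statuscheck {
    type = enum
    fields = ["status"]
    fieldtype = string
    values = ["ACTIVE", "INACTIVE", "PENDING"]
    case-sensitive = false
  }
}
```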

==== Dataset Scope Rules

The following rules are defined at the dataset scope:

* `count` - ensure the dataset has an expected count. The count may either be statically defined or
loaded as a dependency from another step. If the latter, the dependency must contain a single row with
a single field of type `long`.
* `checkschema` - ensure the dataset matches the specified schema. Currently only primitive types are supported.

In addition, any defined row-level rule can be applied at the dataset scope. In this case, the deriver simply logically
ANDs the individual results from each row check into a single boolean result for the rule.

If specifying multiple dependencies, the user must specify to which dependency the dataset-level rules
should be applied using the `dataset` configuration parameter.

If using multiple dataset-level checks on the same dataset, it is recommended to employ the `cache` hint
on the dependency containing the data to be checked.

==== Example Configuration

An example configuration containing both dataset and row-level DQ derivers is as follows:

```
...

steps {
dqparams {
input {
type = filesystem
format = json
path = "hdfs:///tmp/dqparams"
}
}

mydata {
input {
type = filesystem
format = json
path = "hdfs:///tmp/data"
}
}

checkmydata {
dependencies = [mydata,dqparams]
deriver {
type = dq
scope = dataset
dataset = mydata
rules {
r1 {
type = count
expected.dependency = dqparams
}
r2 {
type = checkschema
fields = [
{ name = "name", type = "string" },
{ name = "address", type = "string" },
{ name = "age", type = "int" }
]
}
r3 {
// row-level rule being run in dataset scope
type = regex
fields = ["name"]
regex = "[a-zA-Z' ]{1,}"
}
r4 {
// row-level rule being run in dataset scope
type = enum
fields = ["name"]
values = ["Ian","Jeremy","Webster"]
fieldtype = string
case-sensitive = false
}
}
}
}

checkrows {
dependencies = [mydata]
deriver {
type = dq
scope = row
rules {
r1 {
type = checknulls
fields = [ "name", "address", "age" ]
}
r2 {
type = regex
fields = ["name"]
regex = "[a-zA-Z' ]{1,}"
}
r3 {
type = range
fields = ["age"]
fieldtype = "int"
range = [0,150]
}
}
}
}
}
...
```

==== Developing Custom Rules

Users wishing to define custom rules can implement either the `RowRule` or `DatasetRule` interface. Row-level
rules should implement a `check(Row row)` method returning a boolean result. Dataset-scope
rules should implement a `check(Dataset<Row> dataset, Map<String, Dataset<Row>> stepDependencies)`
method, which returns a Dataset with one row per rule with the schema `name: String, result: Boolean`.
Row-level rules are automatically wrapped in a `DatasetRowRuleWrapper` when used at dataset scope.
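
As a sketch of what a custom row-scope rule might look like: the `RowRule` interface stub below stands in for Envelope's actual interface (which may carry additional configuration methods), and the bounds check itself is hypothetical.

```java
import org.apache.spark.sql.Row;

// Stand-in for Envelope's RowRule interface, which (per the text above)
// exposes a boolean check(Row) method. The real interface lives in the
// Envelope codebase and may differ in package and shape.
interface RowRule {
  boolean check(Row row);
}

// Hypothetical custom rule: passes when the "age" field is non-null
// and falls within [0, 150].
public class AgeBoundsRule implements RowRule {
  @Override
  public boolean check(Row row) {
    int idx = row.fieldIndex("age");
    if (row.isNullAt(idx)) {
      return false;
    }
    long age = ((Number) row.get(idx)).longValue();
    return age >= 0 && age <= 150;
  }
}
```

The class would then be referenced from configuration by supplying its fully qualified name in the rule's `type` field.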

== Custom derivers

In cases that Envelope does not provide a deriver that meets the requirements for a particular derivation a custom deriver can be developed and provided instead.