pysparkdq

pysparkdq is a lightweight columnar validation framework for PySpark DataFrames.

The framework is largely based on Amazon's Deequ package; it is, in essence, a highly simplified Deequ translated to Python, minus the stateful statistical modelling.

The framework is built around three core objects:

  • The CheckOperator collects a DataFrame and one or more Check objects, and instructs a DataFrameValidator to run the checks on the DataFrame.
  • The DataFrameValidator runs the checks on the DataFrame and returns two DataFrames: one containing the valid rows, the other the invalid rows.
  • The Check objects define columnar validations to apply to a DataFrame.

Usage

The framework is designed for batch processing. A typical usage pattern looks like this:

  1. Read data from storage into a Spark DataFrame (not covered by the framework)
  2. Create a CheckOperator with this DataFrame as parameter
  3. Add any number of Check objects to the CheckOperator
  4. Run the checks
  5. Write invalid rows to a retention area; make valid rows available for further processing (not covered by the framework)

In code:

# 1. Read data from storage
df = spark.read.parquet("s3a://path/to/input")

# 2-3. Create a CheckOperator and add checks to it
check_operator = CheckOperator(
    dataframe=df
).add_check(
    ColumnIsNotNullCheck("id")
).add_check(
    ColumnIsNotNegativeCheck("age")
).add_check(
    ColumnIsInValuesCheck(
        "country", ["DE", "GB"]
    )
).add_check(
    ColumnSetIsUniqueCheck(
        ["id", "country"]
    )
)

# 4. Run the checks
valid_df, invalid_df = check_operator.run()

# 5. Write valid and invalid rows to their destinations
valid_df.write.parquet("s3a://path/to/output")
invalid_df.write.parquet("s3a://path/to/retention/area")

See example.py for a complete working application.

How it works

The output of each validation is appended to the DataFrame as a boolean column. For example, given the following DataFrame:

| id   | name |
|------|------|
| 1    | Foo  |
| 2    |      |

And a validation rule stating that name should not be null (i.e. ColumnIsNotNullCheck), the output of the validation would be:

| id   | name | name_is_not_null |
|------|------|------------------|
| 1    | Foo  | True             |
| 2    |      | False            |

The name of the additional column is defined in the corresponding Check object. The additional columns are included in the invalid rows DataFrame, but not in the valid rows DataFrame returned by the DataFrameValidator.
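
For illustration, the mechanism boils down to something like the following plain PySpark (a minimal sketch, not the framework's actual implementation; in pysparkdq the Check objects produce the flag columns and the DataFrameValidator handles the splitting):

from pyspark.sql import functions as F

# ColumnIsNotNullCheck("name") conceptually appends a boolean flag column
flagged = df.withColumn("name_is_not_null", F.col("name").isNotNull())

# Valid rows pass the check; the flag column is dropped again
valid_df = flagged.filter(F.col("name_is_not_null")).drop("name_is_not_null")

# Invalid rows keep the flag column, so the failure reason stays visible
invalid_df = flagged.filter(~F.col("name_is_not_null"))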

Next steps and possible improvements

  • Make the framework capable of reading and writing data; the expected gain is to enable config-only usage (sketched below)
  • Add logging
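
To make the first point concrete, config-only usage might be driven by something like the following (purely hypothetical; neither this format nor a loader for it exists in the framework today):

# Hypothetical pipeline spec a future config-only mode could consume;
# the keys and check descriptors below are illustrative, not an existing API.
pipeline_config = {
    "input": "s3a://path/to/input",
    "output": "s3a://path/to/output",
    "retention": "s3a://path/to/retention/area",
    "checks": [
        {"type": "ColumnIsNotNullCheck", "column": "id"},
        {"type": "ColumnIsNotNegativeCheck", "column": "age"},
        {"type": "ColumnIsInValuesCheck", "column": "country", "values": ["DE", "GB"]},
        {"type": "ColumnSetIsUniqueCheck", "columns": ["id", "country"]},
    ],
}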
