Feature: Row Level Results #451

Merged: 14 commits, Feb 22, 2023

@mentekid mentekid commented Feb 9, 2023

Issue #, if available:
N/A

Description of changes:

This is an early version of the row-level results feature. It enables row-level results for three constraints: IsComplete, HasCompleteness, and MaxLength.

Supporting this requires defining new classes and changing existing ones in Deequ, both to expose information that was previously unavailable and to distinguish between the results of different analyzers.
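To illustrate what "row-level results" means for these constraints, here is a minimal conceptual sketch. It uses plain Scala collections in place of a Spark DataFrame, and every name in it (`RowLevelSketch`, `Record`, `isCompleteRow`, `lengthRow`) is hypothetical and not part of the Deequ API. It also assumes that a missing value contributes a length of 0.0, which this PR does not specify.

```scala
// Conceptual sketch only: plain collections stand in for a Spark DataFrame,
// and none of these names are Deequ APIs.
object RowLevelSketch {
  // Hypothetical record type; None models a NULL value in the column.
  final case class Record(id: Int, name: Option[String])

  // Row-level completeness: true when the value is present.
  def isCompleteRow(r: Record): Boolean = r.name.isDefined

  // Row-level length: the length of the value.
  // Assumption of this sketch: a missing value counts as length 0.0.
  def lengthRow(r: Record): Double = r.name.map(_.length.toDouble).getOrElse(0.0)

  // Aggregate metric: fraction of rows whose value is present.
  def completeness(rows: Seq[Record]): Double =
    if (rows.isEmpty) 0.0 else rows.count(isCompleteRow).toDouble / rows.size

  // Aggregate metric: largest row-level length (0.0 for an empty input).
  def maxLength(rows: Seq[Record]): Double =
    rows.map(lengthRow).foldLeft(0.0)((a, b) => math.max(a, b))

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Record(1, Some("alice")),
      Record(2, None),
      Record(3, Some("bob")),
      Record(4, None)
    )
    // The row-level view shows which records drive the aggregate metrics.
    rows.foreach { r =>
      println(s"id=${r.id} complete=${isCompleteRow(r)} length=${lengthRow(r)}")
    }
    println(s"completeness=${completeness(rows)} maxLength=${maxLength(rows)}")
  }
}
```

The aggregate metrics are derived from the per-row values, so a failing check can be traced back to the individual records responsible for it.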

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@@ -41,4 +41,6 @@ case class MaxLength(column: String, where: Option[String] = None)
}

override def filterCondition: Option[String] = where

private def criterion: Column = length(conditionalSelection(column, where)).cast(DoubleType)
Contributor:

This is a private method here and a public method in Completeness. If it is required in each analyzer, should it be part of a base class?

Contributor Author:

The other one is used by a test that constructs an expected object and compares the result of Completeness to that object. It shouldn't be public, because I don't think anything else in the code should invoke it. I'll make it package-private inside deequ and mark it as @VisibleForTesting to signal that it's an internal detail of the analyzer.
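A minimal sketch of the scoped-visibility idea, under simplifying assumptions: an enclosing object named `deequ` stands in for the real package, `criterion` returns a `String` rather than a Spark `Column`, and `CompletenessSpec` is a hypothetical test helper. The annotation is only mentioned in a comment so the sketch has no external dependencies.

```scala
// Sketch only: the enclosing object `deequ` stands in for the real package.
object deequ {
  class Completeness(column: String) {
    // private[deequ] makes the member visible anywhere inside the enclosing
    // `deequ` scope and hidden outside it. In real code, an annotation such as
    // @VisibleForTesting could additionally document why it is not fully private.
    private[deequ] def criterion: String = s"$column IS NOT NULL"
  }

  // Hypothetical test helper living in the same scope: it may call `criterion`
  // to build the expected value that a test compares against.
  object CompletenessSpec {
    def expectedCriterion(column: String): String =
      new Completeness(column).criterion
  }
}
```

Code outside the `deequ` scope cannot call `criterion` directly, so the method stays an internal detail while tests in the same scope can still reach it.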

@mentekid (Contributor Author):

Opening this against master instead: #452

@mentekid mentekid changed the base branch from release/2.0.0-spark-3.1 to master February 22, 2023 18:05
@rdsharma26 (Contributor):

Thanks for updating the branch to master. LGTM.

@mentekid mentekid merged commit 63b567b into awslabs:master Feb 22, 2023
rdsharma26 pushed a commit that referenced this pull request Feb 28, 2023
* Demo implementation of returning row-level results from metrics

* Row-level results from VerificationResult

* Row-level results from VerificationResult

* Fix some tests by expecting a full result column

* Fix Deequ tests to expect full Completeness result

* Checks can return row-level result column names, if any

* Make Analyzer and Constraint classes serializable explicitly

* Refactor tests

* Move row-level management to trait

* MaxLength analyzer returns length of each record

* Refactor VerificationResult to correctly match Metrics to Analyzers

* VerificationResult aggregates all columns for a check

* Return row-level results for two constraints

* Improve naming and comments

---------

Co-authored-by: Yannis Mentekidis <[email protected]>
rdsharma26 pushed three more commits that referenced this pull request on Apr 16, 2024, each with the same commit message as above.
3 participants