It's possible for one check in a suite to fail because a different check in the suite encounters an exception while executing its constraints. I ran into this issue when working with column names that don't exist in the data, and I'm unsure whether it's intended behavior.
This unit test reproduces the issue:
```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession
import org.scalatest.flatspec.AnyFlatSpec

class RuleFailureCrosstalkTest extends AnyFlatSpec {

  case class MyData(value: String)

  private val data = Seq(MyData("foo"), MyData("bar"))

  "A well-defined check" should "pass even if an ill-defined check is also configured" in {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.createDataFrame(data)

    val checkThatShouldSucceed = Check(CheckLevel.Error, "shouldSucceed").isComplete("value")

    val verificationResult = VerificationSuite()
      .onData(df)
      .addCheck(checkThatShouldSucceed)
      .addCheck(
        Check(CheckLevel.Error, "shouldError")
          .isContainedIn("fakeColumn", 1, 3)
      )
      .run()

    val checkResult = verificationResult.checkResults(checkThatShouldSucceed)
    System.out.println(checkResult.constraintResults.map(_.message))
    assert(checkResult.status == CheckStatus.Success)
  }
}
```
Expected outcome: Unit test passes because the column `value` is present and complete, so the corresponding check `shouldSucceed` should succeed.
Actual outcome: Unit test fails. The following message is output from the constraint:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'fakeColumn' given input columns: [value]; line 1 pos 0;
'Aggregate [sum(cast(isnotnull(value#0) as int)) AS sum(CAST((value IS NOT NULL) AS INT))#15L, count(1) AS count(1)#16L, sum(cast((isnull('fakeColumn) OR (('fakeColumn >= 1.0) AND ('fakeColumn <= 3.0))) as int)) AS sum(CAST(((fakeColumn IS NULL) OR ((fakeColumn >= 1.0) AND (fakeColumn <= 3.0))) AS INT))#17, count(1) AS count(1)#18L]
+- LocalRelation [value#0]
```
Although the `shouldSucceed` check doesn't have a constraint on the column `fakeColumn`, it fails because that column is not present in the data.

I noticed the above behavior with this combination of `isComplete` and `isContainedIn`, but I didn't check which other combinations might also cause it. Notably, the test actually succeeds for some constraint choices on the `shouldError` check. (For example, if you replace the `isContainedIn` constraint on `fakeColumn` with `isComplete`, the `shouldSucceed` check then succeeds.)
`agg` is a Spark function, and when it throws we cannot tell which of the aggregation expressions failed.
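To make the failure mode concrete: the constraints from all checks are evaluated together in one shared aggregation pass, so when any single expression fails to resolve, the whole pass throws and every constraint in the batch loses its result. Here is a minimal plain-Scala sketch of that dynamic (the `goodAgg`/`badAgg` helpers are hypothetical stand-ins for per-constraint aggregation expressions, not Deequ's actual internals):

```scala
import scala.util.Try

object SharedAggSketch {
  // Hypothetical stand-ins for two constraint aggregations:
  // one that evaluates cleanly, one that throws (like the
  // unresolved-column AnalysisException above).
  val goodAgg: () => Int = () => 42
  val badAgg: () => Int =
    () => throw new IllegalArgumentException("cannot resolve 'fakeColumn'")

  // Batched evaluation, analogous to one shared Spark agg over all
  // constraints: a single failure poisons every result in the batch,
  // and we cannot tell from the outside which expression failed.
  def runBatched(aggs: Seq[() => Int]): Try[Seq[Int]] =
    Try(aggs.map(_.apply()))

  // Per-constraint evaluation keeps failures isolated: the good
  // constraint still produces its value.
  def runIsolated(aggs: Seq[() => Int]): Seq[Try[Int]] =
    aggs.map(a => Try(a.apply()))

  def main(args: Array[String]): Unit = {
    println(runBatched(Seq(goodAgg, badAgg)).isFailure)
    println(runIsolated(Seq(goodAgg, badAgg)).map(_.isSuccess))
  }
}
```

The trade-off is that the batched form needs only one scan over the data, which is presumably why the constraints share an `agg` call in the first place.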
A quick fix would be to give every Check that works on one or more columns a precondition that verifies the column exists. That way we don't have to react to a Spark exception; instead, the Check fails because one of its preconditions is invalid.
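A minimal sketch of such a precondition, validating against the schema's column names before any Spark job runs (the `hasColumn` helper and its error message are illustrative, not Deequ's actual API):

```scala
object ColumnPreconditionSketch {
  // Hypothetical precondition: check the target column against the
  // known schema columns up front, returning a descriptive error
  // instead of letting Spark throw an AnalysisException mid-scan.
  def hasColumn(schemaColumns: Seq[String], column: String): Either[String, Unit] =
    if (schemaColumns.contains(column)) Right(())
    else Left(s"Input data does not include column $column!")

  def main(args: Array[String]): Unit = {
    // With the reproducer's schema [value]:
    println(hasColumn(Seq("value"), "value"))      // precondition holds
    println(hasColumn(Seq("value"), "fakeColumn")) // precondition fails
  }
}
```

With this in place, the ill-defined check could be failed (or skipped) on its own, and the well-defined check's aggregation would run without the unresolvable expression.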
Deequ version: 2.0.3-spark-3.3
Java version: Corretto 19.0.2