
Execution failure crosstalk between different checks in a suite #467

Closed
marcantony opened this issue Apr 13, 2023 · 2 comments · Fixed by samarth-c1/deequ#1 or #478
Labels: bug, good first issue

Comments

@marcantony (Contributor)

It's possible for one check in a suite to fail because a different check in the same suite encounters an exception while executing its constraints. I ran into this while working with column names that don't exist in the data, and I'm unsure whether it's intended behavior.

This unit test reproduces the issue:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession
import org.scalatest.flatspec.AnyFlatSpec

class RuleFailureCrosstalkTest extends AnyFlatSpec {

  case class MyData(value: String)
  private val data = Seq(MyData("foo"), MyData("bar"))

  "A well-defined check" should "pass even if an ill-defined check is also configured" in {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.createDataFrame(data)

    val checkThatShouldSucceed =
      Check(CheckLevel.Error, "shouldSucceed").isComplete("value")
    val verificationResult = VerificationSuite()
      .onData(df)
      .addCheck(checkThatShouldSucceed)
      .addCheck(
        Check(CheckLevel.Error, "shouldError")
          .isContainedIn("fakeColumn", 1, 3)
      )
      .run()

    val checkResult = verificationResult.checkResults(checkThatShouldSucceed)
    System.out.println(checkResult.constraintResults.map(_.message))
    assert(checkResult.status == CheckStatus.Success)
  }
}

Expected outcome: The unit test passes, because the value column is present and complete, so the corresponding check shouldSucceed should succeed.

Actual outcome: The unit test fails. The following message is output for the constraint:

org.apache.spark.sql.AnalysisException: cannot resolve 'fakeColumn' given input columns: [value]; line 1 pos 0;
'Aggregate [sum(cast(isnotnull(value#0) as int)) AS sum(CAST((value IS NOT NULL) AS INT))#15L, count(1) AS count(1)#16L, sum(cast((isnull('fakeColumn) OR (('fakeColumn >= 1.0) AND ('fakeColumn <= 3.0))) as int)) AS sum(CAST(((fakeColumn IS NULL) OR ((fakeColumn >= 1.0) AND (fakeColumn <= 3.0))) AS INT))#17, count(1) AS count(1)#18L]
+- LocalRelation [value#0]

Although the shouldSucceed check doesn't have any constraint on the column fakeColumn, it fails because that column isn't present in the data.

I noticed the above behavior with this combination of isComplete and isContainedIn, but I didn't check what other combinations might also cause it. Notably though, I noticed that the test actually succeeds for some constraint choices on the shouldError check. (For example, if you replace the isContainedIn constraint on fakeColumn with isComplete, the shouldSucceed check then succeeds.)
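
For illustration, here is the variant of the shouldError check that, per the observation above, does not cause the shouldSucceed check to fail (a sketch against the same test setup; other constraint combinations may behave differently):

      .addCheck(
        Check(CheckLevel.Error, "shouldError")
          .isComplete("fakeColumn") // still references a missing column, yet shouldSucceed passes
      )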

Deequ version: 2.0.3-spark-3.3
Java version: Corretto 19.0.2

@mentekid (Contributor)

Thanks for reporting this. We will review and let you know if it's a bug or intended behavior.

@mentekid (Contributor)

Looked into this a bit more. This is a bug in our execution logic.

We collect all scans required for analyzers and run them all at once here:
https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/analyzers/runners/AnalysisRunner.scala#L325

Spark throws because one of the required aggregations involves fakeColumn, which doesn't exist. We catch that exception and fail all Analyzers with it:

try {
  ...
  val results = data.agg(aggregations.head, aggregations.tail: _*).collect().head
  ...
} catch {
  case error: Exception =>
    shareableAnalyzers.map { analyzer => analyzer -> analyzer.toFailureMetric(error) }
}

agg is a Spark function, and when it throws we cannot tell which of the aggregation expressions failed.
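
To illustrate why (a minimal sketch reusing the df from the report above, not code from Deequ): a single agg call that mixes a resolvable and an unresolvable column fails as a whole, so there is no per-expression result left to attribute the error to.

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.functions.{col, sum}

try {
  // One shared scan over both aggregation expressions, as AnalysisRunner does.
  df.agg(
      sum(col("value").isNotNull.cast("int")), // resolvable on its own
      sum(col("fakeColumn"))                   // unresolvable -> AnalysisException
    )
    .collect()
} catch {
  case e: AnalysisException =>
    // The exception describes the whole plan, not a single failed expression.
    println(e.getMessage)
}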

A quick fix would be to give every Check that works on one or more columns a precondition that verifies those columns exist. That way we don't have to react to a Spark exception; instead, the Check fails because its precondition is invalid.
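
Roughly, such a precondition could be a schema-level check of the form StructType => Unit that runs before the shared scan is built (a sketch of the idea only, not the actual fix; the helper shown here is hypothetical):

import org.apache.spark.sql.types.StructType

// Hypothetical column-existence precondition: only the check/analyzer that
// references a missing column fails, before any Spark aggregation is attempted.
def hasColumn(column: String): StructType => Unit = { schema =>
  if (!schema.fieldNames.contains(column)) {
    throw new IllegalArgumentException(s"Input data does not include column $column!")
  }
}

An analyzer whose precondition throws would then be failed individually with that error, while the remaining analyzers still run their shared aggregation.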
