CustomAggregator #572

joshuazexter · 2024-07-29T15:51:32Z

This pull request introduces the CustomAggregator, an analyzer for dynamic data aggregation based on user-specified conditions within Apache Spark DataFrames. It can perform customized metric calculations and aggregations, making it applicable wherever conditional data aggregation is required.

Core Features:

Custom Aggregation Logic: Users can pass a lambda function that specifies how data should be aggregated. This function is applied to a DataFrame to compute a state representing the aggregation result.
Generic Metric Computation: After aggregation, the analyzer computes metrics from the aggregated data state.

How It Can Be Used:
To use the CustomAggregator, developers will need to:

Define a lambda function that describes the aggregation logic specific to their data and requirements on a specific column.
Instantiate the analyzer with this function, specifying the relevant metric names and instances.
Apply the analyzer to a DataFrame within a Spark session to compute and retrieve metrics.
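
The three steps above can be sketched without Spark, using plain Scala collections in place of a DataFrame. Everything in this sketch is a stand-in: `Row`, `SketchState`, and `SketchAggregator` are hypothetical names, not the Deequ classes, and the share-of-total metric is an assumption inferred from the sample outputs in this description rather than taken from the implementation.

```scala
object CustomAggregatorSketch {
  // Hypothetical stand-in for one row of the input data.
  final case class Row(contentType: String, views: Int, likes: Int, shares: Int)

  // Hypothetical stand-in for the aggregation state: per-key counts plus a grand total.
  final case class SketchState(counts: Map[String, Int], total: Int)

  // Hypothetical stand-in for the analyzer: wraps a user-supplied aggregation function.
  final case class SketchAggregator(aggregatorFunc: Seq[Row] => SketchState, metricName: String) {
    // Run the user's function to obtain the state.
    def computeStateFrom(data: Seq[Row]): SketchState = aggregatorFunc(data)

    // Assumed metric: each key's share of the total (empty when the total is zero).
    def computeMetricFrom(state: SketchState): Map[String, Double] =
      if (state.total == 0) Map.empty
      else state.counts.map { case (key, count) => key -> (count.toDouble / state.total) }
  }

  // Step 1: aggregation logic, summing engagements per content type.
  val engagementFunc: Seq[Row] => SketchState = rows => {
    val counts = rows
      .groupBy(_.contentType)
      .map { case (key, group) => key -> group.map(r => r.views + r.likes + r.shares).sum }
    SketchState(counts, counts.values.sum)
  }

  // Step 2: instantiate the analyzer with the function and a metric name.
  val analyzer: SketchAggregator = SketchAggregator(engagementFunc, "ContentEngagement")

  // Step 3: apply the analyzer to data and read back the metric.
  def main(args: Array[String]): Unit = {
    val data = Seq(Row("Video", 60, 15, 5), Row("Article", 15, 4, 1))
    val metric = analyzer.computeMetricFrom(analyzer.computeStateFrom(data))
    println(metric)
  }
}
```

Keeping the state (counts and total) separate from the metric (the derived shares) mirrors the two-phase `computeStateFrom` / `computeMetricFrom` calls used in the examples below.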

Usage Examples:
Included in the pull request are unit tests that demonstrate potential use cases:

Content Engagement Metrics:

Scenario Description: A media company wants to assess how different types of content perform across various social media platforms to guide content strategy and investment.
Data: Assume the company has data in the form of a DataFrame that includes columns for content_type, platform, views, likes, and shares.
Analysis Logic: The company uses the CustomAggregator to aggregate engagement metrics (views, likes, shares) for each content type across platforms.
Implementation Example:

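import com.amazon.deequ.analyzers.{AggregatedMetricState, CustomAggregator}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

val contentEngagementLambda: DataFrame => AggregatedMetricState = df => {
  // Sum views, likes, and shares per content type.
  val counts = df
    .groupBy("content_type")
    .agg(
      (sum("views") + sum("likes") + sum("shares")).cast("int").alias("totalEngagements")
    )
    .collect()
    .map(row =>
      row.getString(0) -> row.getInt(1)
    )
    .toMap
  // State: per-content-type totals plus the grand total.
  val totalEngagements = counts.values.sum
  AggregatedMetricState(counts, totalEngagements)
}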

val analyzer = CustomAggregator(contentEngagementLambda, "ContentEngagement", "AllPlatforms")

val data = session.read.format("csv").option("header", "true").load("path_to_data_file")
val state = analyzer.computeStateFrom(data)
val metric = analyzer.computeMetricFrom(state)

println("Content Engagement Metrics: " + metric.value.get)
//  Content Engagement Metrics: Map(Video -> 0.81, Article -> 0.18)

Resource Utilization in Cloud Services:

Scenario Description: An IT administrator needs to monitor and analyze resource utilization across different cloud services to ensure efficient usage and cost management.
Data: The organization collects usage data for each cloud service, including CPU hours, memory GBs used, and storage GBs used, stored in a DataFrame.
Analysis Logic: The analyzer is used to aggregate and compute the total and percentage utilization of each resource type across services.
Implementation Example:

val resourceUtilizationLambda: DataFrame => AggregatedMetricState = df => {
  // Sum CPU hours, memory GBs, and storage GBs per service type.
  val counts = df.groupBy("service_type")
    .agg(
      (sum("cpu_hours") + sum("memory_gbs") + sum("storage_gbs")).cast("int").alias("totalResources")
    )
    .collect()
    .map(row =>
      row.getString(0) -> row.getInt(1)
    )
    .toMap
  // State: per-service totals plus the grand total.
  val totalResources = counts.values.sum
  AggregatedMetricState(counts, totalResources)
}

val analyzer = CustomAggregator(resourceUtilizationLambda, "ResourceUtilization", "CloudServices")

val data = session.read.format("csv").option("header", "true").load("path_to_usage_data_file")
val state = analyzer.computeStateFrom(data)
val metric = analyzer.computeMetricFrom(state)

println("Resource Utilization Metrics: " + metric.value.get)
//  Resource Utilization Metrics: Map(Compute -> 0.51, Database -> 0.27, Storage -> 0.21)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


rdsharma26 · 2024-07-29T19:17:35Z

Can we add a unit test that shows the usage of this analyzer along with other analyzers? See ColumnProfilerRunner and this readme

rdsharma26 · 2024-07-29T19:16:18Z

src/main/scala/com/amazon/deequ/analyzers/ConditionalAggregationAnalyzer.scala

+                                          instance: String)
+  extends Analyzer[AggregatedMetricState, AttributeDoubleMetric] {
+
+  def computeStateFrom(data: DataFrame, filterCondition: Option[String] = None)


Can we add the override keyword here and in front of computeMetricFrom?
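
For context, a small plain-Scala sketch of what the `override` modifier buys here. `SketchAnalyzer` and `CountingAnalyzer` are hypothetical names, not Deequ types. The point is only that `override` makes the compiler check that each method really matches a member of the parent, so a later rename or signature change in the parent surfaces as a compile error instead of a silently unrelated method.

```scala
object OverrideSketch {
  // Hypothetical parent trait with two abstract members.
  trait SketchAnalyzer[S, M] {
    def computeStateFrom(data: Seq[Int]): Option[S]
    def computeMetricFrom(state: Option[S]): M
  }

  // With `override`, the compiler rejects these definitions if the parent
  // has no matching member.
  final case class CountingAnalyzer(name: String) extends SketchAnalyzer[Int, Double] {
    override def computeStateFrom(data: Seq[Int]): Option[Int] =
      if (data.isEmpty) None else Some(data.size)

    override def computeMetricFrom(state: Option[Int]): Double =
      state.map(_.toDouble).getOrElse(0.0)
  }
}
```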

rdsharma26 · 2024-07-29T19:19:45Z

Great PR description! Can you also add the output of the println statements?

rdsharma26 · 2024-07-29T19:21:06Z

src/main/scala/com/amazon/deequ/analyzers/ConditionalAggregationAnalyzer.scala

+// Define the analyzer
+case class ConditionalAggregationAnalyzer(aggregatorFunc: DataFrame => AggregatedMetricState,
+                                          metricName: String,
+                                          instance: String)


Since we are running the aggregator on the entire dataframe, we can probably use Dataset for the instance (like how we do in other analyzers like rowcount). That way, we do not need to ask for this parameter from the user. We should keep the public facing API as simple as possible.
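
One way to act on this suggestion, sketched in plain Scala, is to give `instance` a default value so callers can omit it. The types below are hypothetical stand-ins rather than the Deequ classes, and the default value "Dataset" is taken from the comment above, not from the merged code.

```scala
object DefaultInstanceSketch {
  // Hypothetical stand-in for the aggregation state.
  final case class SketchState(counts: Map[String, Int], total: Int)

  // `instance` defaults to "Dataset", so the common call site needs only two arguments.
  final case class SketchAggregator(
      aggregatorFunc: Seq[String] => SketchState,
      metricName: String,
      instance: String = "Dataset")

  // Counts occurrences of each distinct value.
  val countByValue: Seq[String] => SketchState = values => {
    val counts = values.groupBy(identity).map { case (key, group) => key -> group.size }
    SketchState(counts, values.size)
  }

  // Simple call site: the instance is filled in by the default.
  val simple: SketchAggregator = SketchAggregator(countByValue, "ValueCounts")

  // Callers who need a specific instance name can still pass one.
  val explicit: SketchAggregator = SketchAggregator(countByValue, "ValueCounts", "AllPlatforms")
}
```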


eycho-am

Great PR on both the implementation and description


Joshua Zexter added 5 commits July 29, 2024 10:07

Add support for EntityTypes dqdl rule

0ba95ac

Add support for Conditional Aggregation Analyzer

4118c50

Add support for ConditionalAggregationAnalyzer

cbd2a06

Add support for ConditionalAggregationAnalyzer

e120ec5

Merge branch 'entityTypes' of https://github.com/joshuazexter/deequ i…

7cc655c

…nto entityTypes

joshuazexter changed the title ~~Entity types~~ ConditionalAggregationAnalyzer Jul 29, 2024

rdsharma26 reviewed Jul 29, 2024


Joshua Zexter added 2 commits July 29, 2024 17:29

Add support for CustomAggregator analyzer

25a8705

Merge branch 'entityTypes' of https://github.com/joshuazexter/deequ i…

d336fe4

…nto entityTypes

joshuazexter changed the title ~~ConditionalAggregationAnalyzer~~ CustomAggregator Jul 29, 2024

eycho-am approved these changes Jul 31, 2024


eycho-am merged commit d45db61 into awslabs:master Jul 31, 2024
1 check passed

joshuazexter deleted the entityTypes branch July 31, 2024 21:14

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024

CustomAggregator (awslabs#572)

c9cf0b2

* Add support for EntityTypes dqdl rule * Add support for Conditional Aggregation Analyzer --------- Co-authored-by: Joshua Zexter <[email protected]>

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024

CustomAggregator (awslabs#572)

38ce1c4

* Add support for EntityTypes dqdl rule * Add support for Conditional Aggregation Analyzer --------- Co-authored-by: Joshua Zexter <[email protected]>

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024

CustomAggregator (awslabs#572)

6270006

* Add support for EntityTypes dqdl rule * Add support for Conditional Aggregation Analyzer --------- Co-authored-by: Joshua Zexter <[email protected]>

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024

CustomAggregator (awslabs#572)

2889f7f

* Add support for EntityTypes dqdl rule * Add support for Conditional Aggregation Analyzer --------- Co-authored-by: Joshua Zexter <[email protected]>
