Add support for Conditional Aggregation Analyzer #571

joshuazexter · 2024-07-29T14:10:01Z

This pull request introduces the ConditionalAggregationAnalyzer, an analyzer that aggregates data in Apache Spark DataFrames according to user-specified conditions. It extends the Deequ library's support for custom metric calculations to use cases that require conditional aggregation.

Core Features:

Custom Aggregation Logic: Users pass a lambda function that specifies how the data is aggregated. The analyzer applies this function to a DataFrame to compute a state that represents the aggregation result.
Generic Metric Computation: After aggregation, the analyzer computes metrics from the aggregated state, so the results can feed into existing monitoring or reporting systems.
Versatility in Use Cases: Because the aggregation logic is supplied by the user, the same analyzer can be applied to sales data, customer feedback, or operational metrics.

Usage Examples:
Included in the pull request are unit tests that demonstrate potential use cases:

Content Engagement Metrics:

Use Case: Media companies and content providers often need to measure the engagement levels of various content types across different platforms to optimize their offerings.
Example: Use the analyzer to aggregate views, likes, and shares of articles or videos across different content categories (e.g., sports, news, entertainment) to calculate engagement percentages that help in identifying the most popular content types.
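The engagement calculation described above can be sketched without Spark or Deequ. This is a minimal, library-independent illustration in plain Python, not the analyzer's API: the sample rows, the field names (`category`, `views`, `likes`, `shares`), and both function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample rows; the field names are assumptions for illustration.
rows = [
    {"category": "sports", "views": 100, "likes": 20, "shares": 5},
    {"category": "news", "views": 200, "likes": 10, "shares": 15},
    {"category": "sports", "views": 50, "likes": 0, "shares": 0},
]


def aggregate_engagement(rows):
    """Sum views, likes, and shares per content category (the aggregated state)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["category"]] += row["views"] + row["likes"] + row["shares"]
    return dict(totals)


def engagement_percentages(state):
    """Convert per-category totals into each category's share of all engagement."""
    grand_total = sum(state.values())
    if grand_total == 0:
        # Avoid division by zero when there is no engagement at all.
        return {category: 0.0 for category in state}
    return {
        category: 100.0 * count / grand_total
        for category, count in state.items()
    }


state = aggregate_engagement(rows)
percentages = engagement_percentages(state)
```

Keeping the aggregation step separate from the percentage step mirrors the state-then-metric split described under Core Features.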

Operational Efficiency Monitoring:

Use Case: In manufacturing or IT operations, monitoring the efficiency of processes or systems is crucial. This analyzer can aggregate operational data to track efficiency metrics like downtime, throughput, or error rates.
Example: Aggregate and compute the frequency of downtime incidents across different machines or systems to identify patterns or potential areas for maintenance improvements.
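As a rough illustration of the downtime example, the sketch below counts only the events that meet a condition (status is "down") and reports each machine's share of those incidents. It is plain Python rather than the analyzer itself; the event records, the `machine` and `status` fields, and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical event log; machine names and status values are assumptions.
events = [
    {"machine": "press-1", "status": "down"},
    {"machine": "press-1", "status": "ok"},
    {"machine": "lathe-2", "status": "down"},
    {"machine": "press-1", "status": "down"},
]


def downtime_frequency(events):
    """Return each machine's fraction of all downtime incidents."""
    # Conditional aggregation: only events with status "down" are counted.
    down_counts = Counter(
        event["machine"] for event in events if event["status"] == "down"
    )
    total_down = sum(down_counts.values())
    if total_down == 0:
        return {}
    return {machine: count / total_down for machine, count in down_counts.items()}


frequencies = downtime_frequency(events)
```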

How It Can Be Used:
To use the ConditionalAggregationAnalyzer, developers will need to:

Define a lambda function that describes the aggregation logic for a specific column, according to their data and requirements.
Instantiate the analyzer with this function, specifying the relevant metric names and instances.
Apply the analyzer to a DataFrame within a Spark session to compute and retrieve metrics.
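The three steps above might look roughly like the following. This is a plain-Python stand-in that only mirrors the shape of the workflow: the class `AggregationSketch`, its fields, and its methods are hypothetical and are not the analyzer's actual API (which this description does not show), and a list of dictionaries stands in for a Spark DataFrame.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Mapping

Row = Mapping[str, object]


@dataclass(frozen=True)
class AggregationSketch:
    """Hypothetical stand-in for an analyzer driven by a user-supplied function."""

    aggregate: Callable[[Iterable[Row]], Dict[str, int]]
    metric_name: str
    instance: str

    def compute_state(self, rows: Iterable[Row]) -> Dict[str, int]:
        # Apply the user-supplied aggregation to the data to obtain a state.
        return self.aggregate(rows)

    def compute_metric(self, state: Dict[str, int]) -> dict:
        # Derive a metric (here: fractions of the total) from the state.
        total = sum(state.values())
        values = {
            key: (count / total if total else 0.0)
            for key, count in state.items()
        }
        return {"name": self.metric_name, "instance": self.instance, "values": values}


# Step 1: define the aggregation logic as a lambda over a specific column.
# Only rows with a positive amount are counted (the condition).
count_by_product = lambda rows: dict(
    Counter(row["product"] for row in rows if row["amount"] > 0)
)

# Step 2: instantiate with the function, a metric name, and an instance.
sketch = AggregationSketch(count_by_product, "ProductShare", "product")

# Step 3: apply it to the data (a list of dicts stands in for a DataFrame).
data = [
    {"product": "A", "amount": 10},
    {"product": "B", "amount": 0},
    {"product": "A", "amount": 5},
    {"product": "C", "amount": 7},
]
state = sketch.compute_state(data)
metric = sketch.compute_metric(state)
```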

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Joshua Zexter added 2 commits July 29, 2024 10:07

Add support for EntityTypes dqdl rule

0ba95ac

Add support for Conditional Aggregation Analyzer

4118c50

joshuazexter closed this Jul 29, 2024

Add support for ConditionalAggregationAnalyzer

cbd2a06

joshuazexter reopened this Jul 29, 2024

joshuazexter closed this Jul 29, 2024
