
Fix performance of building row-level results #577

Merged
2 commits merged into awslabs:master from marcantony:fix-row-level-results-performance on Aug 31, 2024

Conversation

Contributor

@marcantony commented Aug 31, 2024

Fixes #576

Iteratively calling withColumn (singular) on a DataFrame causes performance issues when adding a large number of columns (see the issue for more details). The code was looping over a map to add each (name, column) pair to the DataFrame one at a time, so withColumns, which takes a Map[String, Column] as its parameter, works as a drop-in replacement.
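For illustration, here is a minimal sketch of the pattern being replaced and its replacement (a hypothetical helper, not the actual Deequ code; withColumns with a Map[String, Column] requires Spark 3.3+):

```scala
import org.apache.spark.sql.{Column, DataFrame}

// Hypothetical helper showing the change; names are illustrative only.
def addRowLevelResultColumns(data: DataFrame, results: Map[String, Column]): DataFrame = {
  // Before: one withColumn call per entry. Each call produces a new analyzed
  // plan, so the cost grows quickly with the number of result columns.
  // results.foldLeft(data) { case (df, (name, column)) => df.withColumn(name, column) }

  // After: a single withColumns call adds every column in one plan update.
  data.withColumns(results)
}
```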

After running the performance test from the bug ticket, the results are much better:

| Columns in row-level results | Duration |
| --- | --- |
| 51 | 67 ms |
| 101 | 53 ms |
| 151 | 49 ms |
| 201 | 45 ms |
| 251 | 48 ms |
| 301 | 48 ms |
| 351 | 41 ms |
| 401 | 41 ms |

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.
@marcantony changed the title from "Replace iterative…" to "Fix performance of building row-level results" on Aug 31, 2024
@marcantony marked this pull request as ready for review August 31, 2024 15:35
Contributor

@mentekid left a comment

This looks great - thank you for the in-depth analysis and the fix!

@mentekid merged commit 3b1a3ec into awslabs:master Aug 31, 2024
1 check passed
@marcantony deleted the fix-row-level-results-performance branch August 31, 2024 16:49
@marcantony
Contributor Author

Thanks @mentekid! By the way, any idea how soon a version with this can get published? We're trying to support some users with huge numbers of checks (hundreds to thousands) so I'm hoping to incorporate this in our application soon.

@mentekid
Contributor

mentekid commented Aug 31, 2024 via email

@marcantony
Contributor Author

Hey @mentekid, just wanted to ask about the release again. Could we maybe get something out early next week?

@mentekid
Contributor

mentekid commented Sep 6, 2024

Hey - what version of Spark are you interested in? I think I can kick off the release for 3.5 today; the rest take time as they need separate branches and testing.

@marcantony
Contributor Author

Ah, we're actually using the 3.4 branch. No rush to kick things off today anyway because we're not quite ready on our end to fully take advantage of the performance fix yet.

@marcantony
Contributor Author

Hey @mentekid, I just realized that withColumns was added in Spark 3.3.0, so this is going to cause problems for the lower-version builds. I'll make a PR today to replace it with a select instead.
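For context, a rough sketch of what a select-based stand-in for withColumns could look like on Spark versions before 3.3 (a hypothetical helper, not the follow-up PR itself; it assumes the new column names do not already exist in the DataFrame):

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical replacement for withColumns on Spark < 3.3: build one projection
// that keeps all existing columns and appends the new ones.
def withColumnsCompat(data: DataFrame, newColumns: Map[String, Column]): DataFrame = {
  val existing = data.columns.map(col)                               // current columns, unchanged
  val added = newColumns.map { case (name, c) => c.as(name) }.toSeq  // new columns with their names
  data.select(existing ++ added: _*)
}
```

Unlike withColumns, this simple version appends rather than replaces, so a name collision with an existing column would produce duplicate columns.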

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024
* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <[email protected]>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <[email protected]>

* Updated version in pom.xml to 2.0.8-spark-3.4

---------

Co-authored-by: zeotuan <[email protected]>
Co-authored-by: tylermcdaniel0 <[email protected]>
Co-authored-by: Tyler Mcdaniel <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: bojackli <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <[email protected]>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <[email protected]>

* Match Breeze version with spark 3.3 (#562)

* Updated version in pom.xml to 2.0.8-spark-3.3

---------

Co-authored-by: zeotuan <[email protected]>
Co-authored-by: tylermcdaniel0 <[email protected]>
Co-authored-by: Tyler Mcdaniel <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: bojackli <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <[email protected]>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <[email protected]>

* Updated version in pom.xml to 2.0.8-spark-3.2

---------

Co-authored-by: zeotuan <[email protected]>
Co-authored-by: tylermcdaniel0 <[email protected]>
Co-authored-by: Tyler Mcdaniel <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: bojackli <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <[email protected]>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <[email protected]>

* Updated version in pom.xml to 2.0.8-spark-3.1

---------

Co-authored-by: zeotuan <[email protected]>
Co-authored-by: tylermcdaniel0 <[email protected]>
Co-authored-by: Tyler Mcdaniel <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: bojackli <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
arsenalgunnershubert777 pushed a commit to arsenalgunnershubert777/deequ that referenced this pull request Nov 8, 2024
* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID
Successfully merging this pull request may close these issues: [BUG] Performance for building row-level results scales poorly with number of checks (#576)