
Fix performance of building row-level results #577

Merged
2 commits merged into awslabs:master from marcantony:fix-row-level-results-performance on Aug 31, 2024

Conversation

Contributor

@marcantony commented Aug 31, 2024

Fixes #576

Iteratively calling withColumn (singular) on a DataFrame causes performance issues when adding a large number of columns (see the issue for more details). The code was looping over a map to add each (name, column) pair to the DataFrame one at a time, so withColumns, which takes a Map[String, Column] as its parameter, works as a drop-in replacement.
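For illustration, here is a minimal sketch of the pattern being replaced and its replacement (a hypothetical helper, not the actual Deequ code; withColumns with a Map[String, Column] requires Spark 3.3+):

```scala
import org.apache.spark.sql.{Column, DataFrame}

// Hypothetical helper showing the change; names are illustrative only.
def addRowLevelResultColumns(data: DataFrame, results: Map[String, Column]): DataFrame = {
  // Before: one withColumn call per entry. Each call produces a new analyzed
  // plan, so the cost grows quickly with the number of result columns.
  // results.foldLeft(data) { case (df, (name, column)) => df.withColumn(name, column) }

  // After: a single withColumns call adds every column in one plan update.
  data.withColumns(results)
}
```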

After running the performance test from the bug ticket, the results are much better:

| Columns in row-level results | Duration |
| --- | --- |
| 51 | 67 ms |
| 101 | 53 ms |
| 151 | 49 ms |
| 201 | 45 ms |
| 251 | 48 ms |
| 301 | 48 ms |
| 351 | 41 ms |
| 401 | 41 ms |

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.
@marcantony changed the title from "Replace iterative…" to "Fix performance of building row-level results" on Aug 31, 2024
@marcantony marked this pull request as ready for review August 31, 2024 15:35
Contributor

@mentekid left a comment

This looks great - thank you for the in-depth analysis and the fix!

@mentekid merged commit 3b1a3ec into awslabs:master Aug 31, 2024
1 check passed
@marcantony deleted the fix-row-level-results-performance branch August 31, 2024 16:49
@marcantony
Contributor Author

Thanks @mentekid! By the way, any idea how soon a version with this can get published? We're trying to support some users with huge numbers of checks (hundreds to thousands) so I'm hoping to incorporate this in our application soon.

@mentekid
Contributor

mentekid commented Aug 31, 2024 via email

@marcantony
Contributor Author

Hey @mentekid, just wanted to ask about the release again. Could we maybe get something out early next week?

@mentekid
Contributor

mentekid commented Sep 6, 2024

Hey - what version of Spark are you interested in? I think I can kick off the release for 3.5 today; the rest take time as they need separate branches and testing.

@marcantony
Contributor Author

Ah, we're actually using the 3.4 branch. No rush to kick things off today anyway because we're not quite ready on our end to fully take advantage of the performance fix yet.

@marcantony
Contributor Author

Hey @mentekid, I just realized that withColumns was added in Spark 3.3.0, so this is going to cause problems for the lower-version builds. I'll make a PR today to replace it with a select instead.
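For context, a rough sketch of what a select-based stand-in for withColumns could look like on Spark versions before 3.3 (a hypothetical helper, not the follow-up PR itself; it assumes the new column names do not already exist in the DataFrame):

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical replacement for withColumns on Spark < 3.3: build one projection
// that keeps all existing columns and appends the new ones.
def withColumnsCompat(data: DataFrame, newColumns: Map[String, Column]): DataFrame = {
  val existing = data.columns.map(col)                               // current columns, unchanged
  val added = newColumns.map { case (name, c) => c.as(name) }.toSeq  // new columns with their names
  data.select(existing ++ added: _*)
}
```

Unlike withColumns, this simple version appends rather than replaces, so a name collision with an existing column would produce duplicate columns.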

eycho-am pushed a commit to eycho-am/deequ that referenced this pull request Oct 9, 2024
* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <[email protected]>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <[email protected]>

* Updated version in pom.xml to 2.0.8-spark-3.4

---------

Co-authored-by: zeotuan <[email protected]>
Co-authored-by: tylermcdaniel0 <[email protected]>
Co-authored-by: Tyler Mcdaniel <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: bojackli <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <[email protected]>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <[email protected]>

* Match Breeze version with spark 3.3 (#562)

* Updated version in pom.xml to 2.0.8-spark-3.3

---------

Co-authored-by: zeotuan <[email protected]>
Co-authored-by: tylermcdaniel0 <[email protected]>
Co-authored-by: Tyler Mcdaniel <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: bojackli <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <[email protected]>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <[email protected]>

* Updated version in pom.xml to 2.0.8-spark-3.2

---------

Co-authored-by: zeotuan <[email protected]>
Co-authored-by: tylermcdaniel0 <[email protected]>
Co-authored-by: Tyler Mcdaniel <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: bojackli <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
mentekid pushed a commit that referenced this pull request Oct 9, 2024
* Configurable RetainCompletenessRule (#564)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Optional specification of instance name in CustomSQL analyzer metric. (#569)

Co-authored-by: Tyler Mcdaniel <[email protected]>

* Adding Wilson Score Confidence Interval Strategy (#567)

* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import

* CustomAggregator (#572)

* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>

* fix typo (#574)

* Fix performance of building row-level results (#577)

* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID

* Replace 'withColumns' with 'select' (#582)

'withColumns' was introduced in Spark 3.3, so it won't
work for Deequ's <3.3 builds.

* Replace rdd with dataframe functions in Histogram analyzer (#586)

Co-authored-by: Shriya Vanvari <[email protected]>

* Updated version in pom.xml to 2.0.8-spark-3.1

---------

Co-authored-by: zeotuan <[email protected]>
Co-authored-by: tylermcdaniel0 <[email protected]>
Co-authored-by: Tyler Mcdaniel <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: Joshua Zexter <[email protected]>
Co-authored-by: bojackli <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
Co-authored-by: Shriya Vanvari <[email protected]>
arsenalgunnershubert777 pushed a commit to arsenalgunnershubert777/deequ that referenced this pull request Nov 8, 2024
* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID
Successfully merging this pull request may close these issues: [BUG] Performance for building row-level results scales poorly with number of checks (#576)