-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-14432][SQL] Add API to calculate the approximate quantiles for multiple columns #12207
Conversation
Test build #55116 has finished for PR 12207 at commit
|
retest this please. |
Test build #55117 has finished for PR 12207 at commit
|
cc @jkbradley |
@@ -71,6 +71,28 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { | |||
} | |||
|
|||
/** | |||
* Calculates the approximate quantiles of numerical columns of a DataFrame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't have the full doc from the above method, we should perhaps provide an @see
link to the full info about the algorithm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Updated it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the @see
link work (as in links to the method with full doc)? Can you build the docs on your PR and check it? I'm not totally sure whether it will point to the doc of the other method or just to itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated it with specified parameter types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this will actually show up in the generated Scaladoc HTML.
@jkbradley @mengxr do you prefer to actually make links show up in the HTML API doc? If so, then it often doesn't look good in an IDE. But to do that something like this is needed:
@see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for detailed description.
Thanks @viirya - any chance to update the PySpark API at the same time? :) |
@MLnick Thanks for review. I've updated PySpark API. |
Test build #55216 has finished for PR 12207 at commit
|
ping @jkbradley @MLnick any further comments for this? |
@@ -702,6 +702,14 @@ def test_approxQuantile(self): | |||
self.assertEqual(len(aq), 3) | |||
self.assertTrue(all(isinstance(q, float) for q in aq)) | |||
|
|||
aqs = df.stat.approxQuantile(["a", "a"], [0.1, 0.5, 0.9], 0.1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we add an assert that len(aqs)
is 2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
Test build #55349 has finished for PR 12207 at commit
|
Test build #55350 has finished for PR 12207 at commit
|
@jkbradley Can you take a look too? Thanks! |
ping @jkbradley @MLnick |
@@ -1181,18 +1181,26 @@ def approxQuantile(self, col, probabilities, relativeError): | |||
Space-efficient Online Computation of Quantile Summaries]] | |||
by Greenwald and Khanna. | |||
|
|||
:param col: the name of the numerical column | |||
:param cols: str, list. | |||
The name(s) of the numerical column(s). Can be a string of the name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can simplify this comment to: Can be a single column name, or a list of names for multiple columns
. I think it's clear from the specified types that it's a string name or a list of string names.
(we mention in the method doc that it operates on numerical columns, we don't need to repeat that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. updated.
Test build #55512 has finished for PR 12207 at commit
|
LGTM, pending the discussion on the |
if isinstance(cols, tuple): | ||
cols = list(cols) | ||
if isinstance(cols, list): | ||
cols = _to_list(self._sc, cols) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could consider verifying the contents of the list as done for probabilities right bellow (but just a minor point and probably not as important - just if people pass in a list of expressions rather than strings would be nice to have a useful error message).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Test build #55568 has finished for PR 12207 at commit
|
ping @jkbradley @mengxr |
Let me close this due to an earlier duplicate one. |
What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14432
As we have the underlying implementation to calculate the approximate quantiles for multiple columns, I think we have no reason only providing API to calculate the approximate quantiles for just one column at a time. We should add API to do multiple columns too.
How was this patch tested?
Add tests to
DataFrameStatSuite
.