Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-14432][SQL] Add API to calculate the approximate quantiles for multiple columns #12207

Closed
wants to merge 7 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Apr 6, 2016

What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-14432

As we have the underlying implementation to calculate the approximate quantiles for multiple columns, I think we have no reason only providing API to calculate the approximate quantiles for just one column at a time. We should add API to do multiple columns too.

How was this patch tested?

Add tests to DataFrameStatSuite.

@SparkQA
Copy link

SparkQA commented Apr 6, 2016

Test build #55116 has finished for PR 12207 at commit 47d52b9.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 6, 2016

retest this please.

@SparkQA
Copy link

SparkQA commented Apr 6, 2016

Test build #55117 has finished for PR 12207 at commit 47d52b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 7, 2016

cc @jkbradley

@@ -71,6 +71,28 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
}

/**
* Calculates the approximate quantiles of numerical columns of a DataFrame.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have the full doc from the above method, we should perhaps provide an @see link to the full info about the algorithm?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Updated it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the @see link work (as in links to the method with full doc)? Can you build the docs on your PR and check it? I'm not totally sure whether it will point to the doc of the other method or just to itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated it with specified parameter types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this will actually show up in the generated Scaladoc HTML.

@jkbradley @mengxr do you prefer to actually make links show up in the HTML API doc? If so, then it often doesn't look good in an IDE. But to do that something like this is needed:
@see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for detailed description.

@MLnick
Copy link
Contributor

MLnick commented Apr 7, 2016

Thanks @viirya - any chance to update the PySpark API at the same time? :)

@viirya
Copy link
Member Author

viirya commented Apr 7, 2016

@MLnick Thanks for review. I've updated PySpark API.

@SparkQA
Copy link

SparkQA commented Apr 7, 2016

Test build #55216 has finished for PR 12207 at commit 75edcb1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 8, 2016

ping @jkbradley @MLnick any further comments for this?

@@ -702,6 +702,14 @@ def test_approxQuantile(self):
self.assertEqual(len(aq), 3)
self.assertTrue(all(isinstance(q, float) for q in aq))

aqs = df.stat.approxQuantile(["a", "a"], [0.1, 0.5, 0.9], 0.1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we add an assert that len(aqs) is 2?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

@SparkQA
Copy link

SparkQA commented Apr 8, 2016

Test build #55349 has finished for PR 12207 at commit 619660d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 8, 2016

Test build #55350 has finished for PR 12207 at commit b64bd4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 8, 2016

@jkbradley Can you take a look too? Thanks!

@viirya
Copy link
Member Author

viirya commented Apr 11, 2016

ping @jkbradley @MLnick

@@ -1181,18 +1181,26 @@ def approxQuantile(self, col, probabilities, relativeError):
Space-efficient Online Computation of Quantile Summaries]]
by Greenwald and Khanna.

:param col: the name of the numerical column
:param cols: str, list.
The name(s) of the numerical column(s). Can be a string of the name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify this comment to: Can be a single column name, or a list of names for multiple columns. I think it's clear from the specified types that it's a string name or a list of string names.

(we mention in the method doc that it operates on numerical columns, we don't need to repeat that).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. updated.

@SparkQA
Copy link

SparkQA commented Apr 11, 2016

Test build #55512 has finished for PR 12207 at commit 89d4d3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Apr 11, 2016

LGTM, pending the discussion on the @see link. @jkbradley?

if isinstance(cols, tuple):
cols = list(cols)
if isinstance(cols, list):
cols = _to_list(self._sc, cols)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could consider verifying the contents of the list as done for probabilities right bellow (but just a minor point and probably not as important - just if people pass in a list of expressions rather than strings would be nice to have a useful error message).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@SparkQA
Copy link

SparkQA commented Apr 12, 2016

Test build #55568 has finished for PR 12207 at commit 4309001.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 13, 2016

ping @jkbradley @mengxr

@viirya
Copy link
Member Author

viirya commented Apr 15, 2016

Let me close this due to an earlier duplicate one.

@viirya viirya closed this Apr 15, 2016
@viirya viirya deleted the multi-cols-approxquantile branch December 27, 2023 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants