[SPARK-14432][SQL] Add API to calculate the approximate quantiles for multiple columns #12207

viirya · 2016-04-06T14:31:24Z

What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-14432

Since the underlying implementation can already calculate approximate quantiles for multiple columns, there is no reason to provide an API that handles only one column at a time. We should add an API for multiple columns too.

How was this patch tested?

Added tests to DataFrameStatSuite.

SparkQA · 2016-04-06T14:39:49Z

Test build #55116 has finished for PR 12207 at commit 47d52b9.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-04-06T14:42:30Z

retest this please.

SparkQA · 2016-04-06T16:06:25Z

Test build #55117 has finished for PR 12207 at commit 47d52b9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-04-07T02:43:14Z

cc @jkbradley

MLnick · 2016-04-07T07:41:03Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

@@ -71,6 +71,28 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
  }

  /**
+   * Calculates the approximate quantiles of numerical columns of a DataFrame.


If we don't have the full doc from the above method, we should perhaps provide an @see link to the full info about the algorithm?

Ok. Updated it.

Does the @see link work (as in links to the method with full doc)? Can you build the docs on your PR and check it? I'm not totally sure whether it will point to the doc of the other method or just to itself.

I've updated it with specified parameter types.

I'm not sure this will actually show up in the generated Scaladoc HTML.

@jkbradley @mengxr do you prefer to actually make links show up in the HTML API doc? If so, then it often doesn't look good in an IDE. But to do that something like this is needed:
@see [[DataFrameStatFunctions.approxQuantile(col:Str* approxQuantile]] for detailed description.

MLnick · 2016-04-07T07:48:03Z

Thanks @viirya - any chance to update the PySpark API at the same time? :)

viirya · 2016-04-07T10:37:11Z

@MLnick Thanks for review. I've updated PySpark API.

SparkQA · 2016-04-07T11:58:12Z

Test build #55216 has finished for PR 12207 at commit 75edcb1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-04-08T13:10:47Z

ping @jkbradley @MLnick: any further comments on this?

MLnick · 2016-04-08T13:45:23Z

python/pyspark/sql/tests.py

@@ -702,6 +702,14 @@ def test_approxQuantile(self):
        self.assertEqual(len(aq), 3)
        self.assertTrue(all(isinstance(q, float) for q in aq))

+        aqs = df.stat.approxQuantile(["a", "a"], [0.1, 0.5, 0.9], 0.1)


Shall we add an assert that len(aqs) is 2?
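To make the expected result shape concrete, here is a minimal sketch of the suggested assertions. It does not call Spark: `approx_quantile_stub` and the `rows` data are hypothetical stand-ins that compute exact nearest-rank quantiles in plain Python, assuming the multi-column call returns one inner list per requested column and one float per probability, as the test in the diff implies.

```python
def approx_quantile_stub(rows, cols, probabilities):
    """Hypothetical stand-in for df.stat.approxQuantile (no Spark involved).

    Uses exact nearest-rank quantiles, so the relativeError argument is omitted.
    """
    single = isinstance(cols, str)
    names = [cols] if single else list(cols)
    result = []
    for name in names:
        values = sorted(float(r[name]) for r in rows)
        n = len(values)
        # One quantile per requested probability, clamped to the last index.
        result.append([values[min(n - 1, int(p * n))] for p in probabilities])
    # A single column name yields a flat list; a list of names yields a list of lists.
    return result[0] if single else result


# Hypothetical data: one numerical column "a" with values 0..9.
rows = [{"a": i} for i in range(10)]

aqs = approx_quantile_stub(rows, ["a", "a"], [0.1, 0.5, 0.9])
assert isinstance(aqs, list)
assert len(aqs) == 2  # one result list per requested column
for aq in aqs:
    assert len(aq) == 3  # one value per probability
    assert all(isinstance(q, float) for q in aq)
```

The outer length check is the one requested here; the inner checks mirror the existing single-column assertions.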

SparkQA · 2016-04-08T15:53:36Z

Test build #55349 has finished for PR 12207 at commit 619660d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-08T15:59:59Z

Test build #55350 has finished for PR 12207 at commit b64bd4e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-04-08T23:43:44Z

@jkbradley Can you take a look too? Thanks!

viirya · 2016-04-11T03:09:02Z

ping @jkbradley @MLnick

MLnick · 2016-04-11T06:52:36Z

python/pyspark/sql/dataframe.py

@@ -1181,18 +1181,26 @@ def approxQuantile(self, col, probabilities, relativeError):
        Space-efficient Online Computation of Quantile Summaries]]
        by Greenwald and Khanna.

-        :param col: the name of the numerical column
+        :param cols: str, list.
+            The name(s) of the numerical column(s). Can be a string of the name


I think we can simplify this comment to: Can be a single column name, or a list of names for multiple columns. I think it's clear from the specified types that it's a string name or a list of string names.

(we mention in the method doc that it operates on numerical columns, we don't need to repeat that).

ok. updated.

SparkQA · 2016-04-11T08:30:41Z

Test build #55512 has finished for PR 12207 at commit 89d4d3e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-04-11T09:37:59Z

LGTM, pending the discussion on the @see link. @jkbradley?

holdenk · 2016-04-11T21:04:54Z

python/pyspark/sql/dataframe.py

+        if isinstance(cols, tuple):
+            cols = list(cols)
+        if isinstance(cols, list):
+            cols = _to_list(self._sc, cols)


We could consider verifying the contents of the list, as is done for probabilities right below. It's a minor point and probably not as important, but if people pass in a list of expressions rather than strings, it would be nice to give a useful error message.
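A rough sketch of what such a check might look like, in plain Python. `normalize_cols` is a hypothetical helper, not code from this patch, and the error messages are only illustrative; it assumes Python 3, where `str` is the right type to test (Python 2 code would test against `basestring`).

```python
def normalize_cols(cols):
    """Hypothetical helper: validate `cols` and return (names, single).

    `single` is True when the caller passed one column name rather than
    a list or tuple of names.
    """
    if isinstance(cols, str):
        return [cols], True
    if isinstance(cols, tuple):
        cols = list(cols)
    if not isinstance(cols, list):
        raise ValueError(
            "cols should be a string, list or tuple, but got %r" % type(cols))
    # Reject non-string entries (e.g. Column expressions) with a clear message.
    for c in cols:
        if not isinstance(c, str):
            raise ValueError(
                "columns should be strings, but got %r" % type(c))
    return cols, False
```

Failing early on the Python side gives the user a message about the offending element instead of an error raised later during conversion to a JVM collection.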

SparkQA · 2016-04-12T03:23:48Z

Test build #55568 has finished for PR 12207 at commit 4309001.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-04-13T07:29:33Z

ping @jkbradley @mengxr

viirya · 2016-04-15T03:53:31Z

Let me close this, since it duplicates an earlier PR.

viirya added 2 commits April 6, 2016 14:22

Add API to compute approxQuantile for multiple columns.

a8f1b33

Add comment.

47d52b9

viirya mentioned this pull request Apr 6, 2016

[SPARK-13568] [ML] Create feature transformer to impute missing values #11601

Closed

MLnick reviewed Apr 7, 2016

Address comments and change Python API too.

75edcb1

MLnick reviewed Apr 8, 2016

viirya added 2 commits April 8, 2016 14:30

Address comments.

619660d

Update comment.

b64bd4e

MLnick reviewed Apr 11, 2016

Slightly modify comment.

89d4d3e

holdenk reviewed Apr 11, 2016

Check the content of list.

4309001

MLnick mentioned this pull request Apr 14, 2016

[SPARK-14352][SQL] approxQuantile should support multi columns #12135

Closed

viirya closed this Apr 15, 2016

viirya deleted the multi-cols-approxquantile branch December 27, 2023 18:33
