[SPARK-19436][SQL] Add missing tests for approxQuantile #16776
Conversation
Test build #72278 has finished for PR 16776 at commit
Test build #72279 has finished for PR 16776 at commit
@zhengruifeng Could you split the test cases into multiple independent ones with meaningful titles? Thanks!
```scala
// quantile should be in the range [0.0, 1.0]
val e: IllegalArgumentException = intercept[IllegalArgumentException] {
```
`val e: IllegalArgumentException` -> `val e`
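The test style under discussion checks that illegal quantile arguments are rejected with a descriptive message. A minimal, Spark-free sketch of the same pattern (the `checkQuantiles` helper and the hand-rolled `intercept` are stand-ins for this sketch, not Spark's or ScalaTest's actual code):

```scala
import scala.reflect.ClassTag

// Hypothetical validator mirroring the "[0.0, 1.0]" check exercised above.
def checkQuantiles(probabilities: Seq[Double]): Unit =
  probabilities.foreach { p =>
    require(p >= 0.0 && p <= 1.0,
      s"quantile should be in the range [0.0, 1.0] but got $p")
  }

// Minimal stand-in for ScalaTest's intercept: run the body and return the
// expected exception, failing if nothing (or the wrong type) is thrown.
def intercept[T <: Throwable](body: => Unit)(implicit ct: ClassTag[T]): T =
  try { body; sys.error("expected exception but none was thrown") }
  catch { case e: Throwable if ct.runtimeClass.isInstance(e) => e.asInstanceOf[T] }

val e = intercept[IllegalArgumentException] { checkQuantiles(Seq(-0.1)) }
assert(e.getMessage.contains("quantile should be in the range [0.0, 1.0]"))
```

Binding only `val e` (rather than `val e: IllegalArgumentException`) works because `intercept[IllegalArgumentException]` already fixes the result type.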
Force-pushed from 917fd6e to 3ea1301.
Test build #72304 has finished for PR 16776 at commit
@gatorsmile Update! Thanks for reviewing!
Thank you! What is the expected output if the input dataset is empty? Could you also add a test case?
```scala
@@ -80,18 +80,16 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
 * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
```
You also need to fix this line, right?
4, update some docs for javadoc8
Force-pushed from 3ea1301 to 92bcf05.
Test build #72364 has finished for PR 16776 at commit
```scala
assert(e2.getMessage.contains("Relative Error must be non-negative"))

// dataset should be non-empty
intercept[NoSuchElementException] {
```
We should return `null` instead of throwing an exception. Could you fix it?
Do we really want to return null? Seems very un-Scala-like to me.
Force-pushed from 92bcf05 to 6262f24.
Test build #72407 has finished for PR 16776 at commit
@gatorsmile Updated! Thanks for reviewing!
```scala
StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
  probabilities, relativeError).map(_.toArray).toArray
try {
  StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
```
We will drop the whole row if any column has `null` or `NaN`. For example,

```scala
Seq[(java.lang.Long, java.lang.Double)]((null, 1.23), (3L, null), (4L, 3.45))
  .toDF("a", "b").na.drop().show()
```
That means users could get different results, depending on which API they used:

```scala
df.stat.approxQuantile("col1", Array(q1), epsilon)
df.stat.approxQuantile("col2", Array(q1), epsilon)
df.stat.approxQuantile(Array("col1", "col2"), Array(q1), epsilon)
```

I am wondering if this is the expected behavior?
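As a hedged illustration of the concern (runs in spark-shell against an active `spark` session; the column names and probe values are made up for this sketch, not taken from the PR):

```scala
import spark.implicits._

val df = Seq[(java.lang.Long, java.lang.Double)](
  (null, 1.23), (3L, null), (4L, 3.45)
).toDF("a", "b")

// Single-column path: na.drop is applied to the selected column only,
// so "b" is computed over both 1.23 and 3.45.
val single = df.stat.approxQuantile("b", Array(0.5), 0.0)

// Multi-column path: na.drop removes any row with a null in EITHER column,
// so only the row (4, 3.45) survives and "b"'s median comes from one value.
val multi = df.stat.approxQuantile(Array("a", "b"), Array(0.5), 0.0)
```

Under this reading, the same quantile of the same column can differ between the two calls, which is the inconsistency raised above.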
This does not sound right to me. cc @holdenk @MLnick @jkbradley
I think that might be the reason why we did not provide such an API at the beginning.
If we want to make them consistent, this API no longer gets any performance benefit. Should we revert PR #12135?
@gatorsmile Good catch! Agree that this will cause confusing results.
I think there are two ways to make them consistent:
1. The na-dropping behavior was introduced in SPARK-17219 to enhance NaN value handling, and the single-column version of `approxQuantile` is only used in `QuantileDiscretizer`. So we can make the na-dropping happen in `QuantileDiscretizer`, and remove the na-drop from `approxQuantile`.
2. Modify the impl `StatFunctions.multipleApproxQuantiles` to deal with null and NaN, and remove the na-drop from `approxQuantile`.
@zhengruifeng Sure. If we want to make them consistent, I am fine. How about reverting #12135 first? At the same time, we can work on the new solution.
Great catch! I vote for modifying multipleApproxQuantiles to handle null and NaN values. As far as reverting, I'm OK either way as long as we get the fix into 2.2. I'd actually recommend going ahead and merging this PR and creating a follow-up Critical Bug targeted at 2.2.
@MLnick I think dropping NAs from the cols passed as args still will not work. Say the user passes cols "a" and "b" as args, but some rows have (a = NaN, b = 1.0). Then those rows will be ignored.
Let us add a TODO comment above this function and create a JIRA for tracking this issue. Thanks!
You're correct, I missed that!
Ok, so #14858 added the na-dropping to `approxQuantile`. It wasn't there in the original impl. It can work on data that has `NaN`, in the sense that it will return `NaN`s as part of the quantiles in some cases. I'm not sure if that was the intended behavior in the original impl or not; cc @thunterdb.
In that sense it's the same as aggs like `min` or `max`: those don't exclude `NaN` automatically; it's up to the user to handle them. So it could be best to just remove the na-dropping from `approxQuantile` (which would revert to the original impl behavior) and put it in `QuantileDiscretizer`, which was the original need.
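The min/max analogy can be seen without Spark at all: plain `Double` aggregation propagates `NaN` unless the caller filters it first. A minimal, Spark-free sketch (the values are illustrative):

```scala
val values = Seq(1.0, Double.NaN, 3.0)

// java.lang.Math.max propagates NaN rather than skipping it,
// regardless of where the NaN sits in the sequence.
val naiveMax = values.reduce((a, b) => math.max(a, b))
assert(naiveMax.isNaN)

// Excluding NaN is the caller's job, just as with Spark's min/max aggs.
val cleanedMax = values.filterNot(_.isNaN).reduce((a, b) => math.max(a, b))
assert(cleanedMax == 3.0)
```

This is the behavior the comment argues `approxQuantile` should mirror, leaving the NaN handling to `QuantileDiscretizer`.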
```scala
 * @param cols the names of the numerical columns
 * @param probabilities a list of quantile probabilities
 *   Each number must belong to [0, 1].
 *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
 * @param relativeError The relative target precision to achieve (>= 0).
 * @param relativeError The relative target precision to achieve (greater or equal to 0).
```
"greater" -> "greater than"
```scala
if (res != null) {
  res.head
} else {
  null
```
The Scaladoc should describe this case
Force-pushed from 292cd02 to 4db82b4.
Test build #72634 has started for PR 16776 at commit
Jenkins, retest this please
Test build #72676 has finished for PR 16776 at commit
```scala
  res.head
} else {
  null
}
```
The above five lines can be shortened to `Option(res).map(_.head).orNull`.
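The suggested idiom behaves identically to the explicit null check. A self-contained sketch (the `headOrNull*` helpers and the `Array[Array[Double]]` stand-in type are made up for illustration; note `orNull` requires a reference element type, which is why `head` must yield an array rather than a bare `Double`):

```scala
// Verbose form, as written in the PR before the suggestion.
def headOrNullVerbose(res: Array[Array[Double]]): Array[Double] =
  if (res != null) res.head else null

// Idiomatic form: Option(null) is None, so orNull falls through to null.
def headOrNullIdiomatic(res: Array[Array[Double]]): Array[Double] =
  Option(res).map(_.head).orNull

val quantiles = Array(Array(1.0, 2.0))
assert(headOrNullVerbose(quantiles).sameElements(headOrNullIdiomatic(quantiles)))
assert(headOrNullVerbose(null) == null && headOrNullIdiomatic(null) == null)
```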
```scala
}

/**
 * Calculates the approximate quantiles of numerical columns of a DataFrame.
 * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
 *   detailed description.
 * @see `DataFrameStatsFunctions.approxQuantile` for detailed description.
```
Did this link cause problems in doc generation? It looks like there is a missing ")" in the original link. If you add the ")" then will it generate correctly (with a hyperlink) for the Scala and Java docs?
@jkbradley Do you mean `@see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]])`? I am not sure whether it works for java docs. @HyukjinKwon Could you help review this?
Force-pushed from 4db82b4 to a3171e4.
Test build #72794 has finished for PR 16776 at commit
Test build #72795 has finished for PR 16776 at commit
```scala
}

/**
 * Calculates the approximate quantiles of numerical columns of a DataFrame.
 * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
 *   detailed description.
 * @see `[[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]]` for detailed
```
`DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile` -> `DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile)`? How to generate the JAVA8 doc?
nit: `DataFrameStatsFunctions` -> `DataFrameStatFunctions`, or remove it. For example, just `approxQuantile(String, Array[Double], Double)` or `approxQuantile()` with some description.
We could just wrap them in backticks without `[[ ... ]]` in general. It seems the Scaladoc-specific annotation also does not work to disambiguate the argument types.
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error]  * @see {@link DataFrameStatFunctions.approxQuantile(col:Str* approxQuantile)} for
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:45: error: unexpected text
[error]  * @see #approxQuantile(String, Array[Double], Double) for detailed description.
[error] ^
```
I guess it does not necessarily have to make a link if it breaks.
It seems the breaks are queued up a bit. Let me sweep it soon.
Test build #72802 has started for PR 16776 at commit
https://issues.apache.org/jira/browse/SPARK-19573 is created to track the issue of inconsistent na-dropping.
Test build #72809 has finished for PR 16776 at commit
Force-pushed from 1f07901 to db23d11.
```scala
}

/**
 * Calculates the approximate quantiles of numerical columns of a DataFrame.
 * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
```
I am sorry. Actually, I initially meant removing `DataFrameStatFunctions` and leaving the method, because it is in the same class. Nevertheless, FWIW, I am fine with removing this `@see` as is, given the other functions here.
The `*` was me getting fancy to get the ScalaDoc to link to the correct single-arg method (I did test it at the time and it does work for Scala, though there may be a mistake here somewhere).
It would still be good to provide a `@see` reference even if it does not link nicely (so the simple backtick method name, as you suggested?).
OK, I will add it back. Should it be `approxQuantile(col:Str* approxQuantile)` or `approxQuantile(String, Array[Double], Double)`?
Test build #72867 has finished for PR 16776 at commit
Test build #72868 has finished for PR 16776 at commit
Test build #72983 has finished for PR 16776 at commit
## What changes were proposed in this pull request?

These errors below seem caused by unidoc that does not understand double commented block.

```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected
[error]  * MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected
[error]  * MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected
[error]  * new MapGroupsWithStateFunction<String, Integer, Integer, String>() {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected
[error]  * new MapGroupsWithStateFunction<String, Integer, Integer, String>() {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: illegal character: '#'
[error]  * &apache#64;Override
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: class, interface, or enum expected
[error]  * &apache#64;Override
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:76: error: class, interface, or enum expected
[error]  * boolean shouldRemove = ...; // Decide whether to remove the state
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:77: error: class, interface, or enum expected
[error]  * if (shouldRemove) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:79: error: class, interface, or enum expected
[error]  * } else {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:81: error: class, interface, or enum expected
[error]  * state.update(newState); // Set the new state
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:82: error: class, interface, or enum expected
[error]  * }
[error] ^
[error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:85: error: class, interface, or enum expected
[error]  * state.update(initialState);
[error] ^
[error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:86: error: class, interface, or enum expected
[error]  * }
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:90: error: class, interface, or enum expected
[error]  * </code></pre>
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:92: error: class, interface, or enum expected
[error]  * tparam S User-defined type of the state to be stored for each key. Must be encodable into
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:93: error: class, interface, or enum expected
[error]  * Spark SQL types (see {link Encoder} for more details).
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:94: error: class, interface, or enum expected
[error]  * since 2.1.1
[error] ^
```

And another link seems unrecognisable.

```
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:16: error: reference not found
[error]  * That is, in every batch of the {link streaming.StreamingQuery StreamingQuery},
[error]
```

Note that this PR does not fix the two breaks as below:

```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error]  * see {link DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile} for
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:52: error: bad use of '>'
[error]  * param relativeError The relative target precision to achieve (>= 0).
[error] ^
[error]
```

because these seem probably fixed soon in apache#16776 and I intended to avoid potential conflicts.

## How was this patch tested?

Manually via `jekyll build`

Author: hyukjinkwon <[email protected]>

Closes apache#16926 from HyukjinKwon/javadoc-break.
LGTM
Thanks! Merging to master. Please continue to finish the work https://issues.apache.org/jira/browse/SPARK-19573
Sorry I missed the conversation here. LGTM.
What changes were proposed in this pull request?
1, check the behavior with illegal `quantiles` and `relativeError`
2, add tests for `relativeError` > 1
3, update tests for `null` data
4, update some docs for javadoc8

How was this patch tested?
local test in spark-shell
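A hedged spark-shell sketch of the behaviors the PR adds tests for (assumes an active `spark` session; the column name and values are illustrative, not taken from the PR's test suite):

```scala
import spark.implicits._

val df = Seq(1.0, 2.0, 3.0, 4.0).toDF("x")

// Valid call: median with zero relative error.
val Array(median) = df.stat.approxQuantile("x", Array(0.5), 0.0)

// Illegal quantile (outside [0.0, 1.0]) should raise IllegalArgumentException,
// which is what the new tests assert via intercept.
try df.stat.approxQuantile("x", Array(1.5), 0.0)
catch { case e: IllegalArgumentException => println(e.getMessage) }
```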