[SPARK-19436][SQL] Add missing tests for approxQuantile #16776
Conversation
Test build #72278 has finished for PR 16776 at commit
Test build #72279 has finished for PR 16776 at commit
@zhengruifeng Could you split the test cases into multiple independent ones with meaningful titles? Thanks!
```scala
// quantile should be in the range [0.0, 1.0]
val e: IllegalArgumentException = intercept[IllegalArgumentException] {
```
`val e: IllegalArgumentException` -> `val e`
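The test style under discussion checks that illegal quantile arguments are rejected with a descriptive message. A minimal, Spark-free sketch of the same pattern (the `checkQuantiles` helper and the hand-rolled `intercept` are stand-ins for this sketch, not Spark's or ScalaTest's actual code):

```scala
import scala.reflect.ClassTag

// Hypothetical validator mirroring the "[0.0, 1.0]" check exercised above.
def checkQuantiles(probabilities: Seq[Double]): Unit =
  probabilities.foreach { p =>
    require(p >= 0.0 && p <= 1.0,
      s"quantile should be in the range [0.0, 1.0] but got $p")
  }

// Minimal stand-in for ScalaTest's intercept: run the body and return the
// expected exception, failing if nothing (or the wrong type) is thrown.
def intercept[T <: Throwable](body: => Unit)(implicit ct: ClassTag[T]): T =
  try { body; sys.error("expected exception but none was thrown") }
  catch { case e: Throwable if ct.runtimeClass.isInstance(e) => e.asInstanceOf[T] }

val e = intercept[IllegalArgumentException] { checkQuantiles(Seq(-0.1)) }
assert(e.getMessage.contains("quantile should be in the range [0.0, 1.0]"))
```

Binding only `val e` (rather than `val e: IllegalArgumentException`) works because `intercept[IllegalArgumentException]` already fixes the result type.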
Force-pushed from 917fd6e to 3ea1301.
Test build #72304 has finished for PR 16776 at commit
@gatorsmile Update! Thanks for reviewing!
Thank you! What is the expected output if the input dataset is empty? Could you also add a test case?
```scala
@@ -80,18 +80,16 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
 * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
```
You also need to fix this line, right?
4, update some docs for javadoc8
Force-pushed from 3ea1301 to 92bcf05.
Test build #72364 has finished for PR 16776 at commit
```scala
assert(e2.getMessage.contains("Relative Error must be non-negative"))

// dataset should be non-empty
intercept[NoSuchElementException] {
```
We should return `null` instead of throwing an exception. Could you fix it?
Do we really want to return null? Seems very un-Scala-like to me.
Force-pushed from 92bcf05 to 6262f24.
Test build #72407 has finished for PR 16776 at commit
@gatorsmile Updated! Thanks for reviewing!
```scala
StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
  probabilities, relativeError).map(_.toArray).toArray
try {
  StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
```
We will drop the whole row if any column has `null` or `NaN`. For example,

```scala
Seq[(java.lang.Long, java.lang.Double)]((null, 1.23), (3L, null), (4L, 3.45))
  .toDF("a", "b").na.drop().show()
```
That means users could get different results, depending on which API they used:

```scala
df.stat.approxQuantile("col1", Array(q1), epsilon)
df.stat.approxQuantile("col2", Array(q1), epsilon)
df.stat.approxQuantile(Array("col1", "col2"), Array(q1), epsilon)
```

I am wondering if this is the expected behavior?
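As a hedged illustration of the concern (runs in spark-shell against an active `spark` session; the column names and probe values are made up for this sketch, not taken from the PR):

```scala
import spark.implicits._

val df = Seq[(java.lang.Long, java.lang.Double)](
  (null, 1.23), (3L, null), (4L, 3.45)
).toDF("a", "b")

// Single-column path: na.drop is applied to the selected column only,
// so "b" is computed over both 1.23 and 3.45.
val single = df.stat.approxQuantile("b", Array(0.5), 0.0)

// Multi-column path: na.drop removes any row with a null in EITHER column,
// so only the row (4, 3.45) survives and "b"'s median comes from one value.
val multi = df.stat.approxQuantile(Array("a", "b"), Array(0.5), 0.0)
```

Under this reading, the same quantile of the same column can differ between the two calls, which is the inconsistency raised above.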
This does not sound right to me. cc @holdenk @MLnick @jkbradley
I think that might be the reason why we did not provide such an API at the beginning.
If we want to make them consistent, this API no longer gets any performance benefit. Should we revert PR #12135?
@gatorsmile Good catch! Agree that this will cause confusing results.
I think there are two ways to make them consistent:
1. The na-dropping behavior was introduced in SPARK-17219 to enhance NaN value handling, and the single-column version of `approxQuantile` is only used in `QuantileDiscretizer`. So we can make the na-dropping happen in `QuantileDiscretizer`, and remove the na-drop from `approxQuantile`.
2. Modify the impl `StatFunctions.multipleApproxQuantiles` to deal with null and NaN, and remove the na-drop from `approxQuantile`.
@zhengruifeng Sure. If we want to make them consistent, I am fine. How about reverting #12135 first? At the same time, we can work on the new solution.
Great catch! I vote for modifying multipleApproxQuantiles to handle null and NaN values. As far as reverting, I'm OK either way as long as we get the fix into 2.2. I'd actually recommend going ahead and merging this PR and creating a follow-up Critical Bug targeted at 2.2.
@MLnick I think dropping NAs from the cols passed as args still will not work. Say the user passes cols "a" and "b" as args, but some rows have (a = NaN, b = 1.0). Then those rows will be ignored.
Let us add a TODO comment above this function and create a JIRA for tracking this issue. Thanks!
You're correct, I missed that!
Ok, so #14858 added the na-dropping to `approxQuantile`. It wasn't there in the original impl. It can work on data that has `NaN`, in the sense that it will return `NaN`s as part of the quantiles in some cases. I'm not sure if that was the intended behavior in the original impl or not; cc @thunterdb.
In that sense it's the same as aggs like `min` or `max`: those don't exclude `NaN` automatically; it's up to the user to handle them. So it could be best to just remove the na-dropping from `approxQuantile` (which would revert to the original impl behavior) and put it in `QuantileDiscretizer`, which was the original need.
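The min/max analogy can be seen without Spark at all: plain `Double` aggregation propagates `NaN` unless the caller filters it first. A minimal, Spark-free sketch (the values are illustrative):

```scala
val values = Seq(1.0, Double.NaN, 3.0)

// java.lang.Math.max propagates NaN rather than skipping it,
// regardless of where the NaN sits in the sequence.
val naiveMax = values.reduce((a, b) => math.max(a, b))
assert(naiveMax.isNaN)

// Excluding NaN is the caller's job, just as with Spark's min/max aggs.
val cleanedMax = values.filterNot(_.isNaN).reduce((a, b) => math.max(a, b))
assert(cleanedMax == 3.0)
```

This is the behavior the comment argues `approxQuantile` should mirror, leaving the NaN handling to `QuantileDiscretizer`.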
```scala
 * @param cols the names of the numerical columns
 * @param probabilities a list of quantile probabilities
 *   Each number must belong to [0, 1].
 *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
 * @param relativeError The relative target precision to achieve (>= 0).
 * @param relativeError The relative target precision to achieve (greater or equal to 0).
```
"greater" -> "greater than"
```scala
if (res != null) {
  res.head
} else {
  null
```
The Scaladoc should describe this case
Force-pushed from 292cd02 to 4db82b4.
Test build #72634 has started for PR 16776 at commit
Jenkins, retest this please
Test build #72676 has finished for PR 16776 at commit
```scala
  res.head
} else {
  null
}
```
The above five lines can be shortened to `Option(res).map(_.head).orNull`.
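The suggested idiom behaves identically to the explicit null check. A self-contained sketch (the `headOrNull*` helpers and the `Array[Array[Double]]` stand-in type are made up for illustration; note `orNull` requires a reference element type, which is why `head` must yield an array rather than a bare `Double`):

```scala
// Verbose form, as written in the PR before the suggestion.
def headOrNullVerbose(res: Array[Array[Double]]): Array[Double] =
  if (res != null) res.head else null

// Idiomatic form: Option(null) is None, so orNull falls through to null.
def headOrNullIdiomatic(res: Array[Array[Double]]): Array[Double] =
  Option(res).map(_.head).orNull

val quantiles = Array(Array(1.0, 2.0))
assert(headOrNullVerbose(quantiles).sameElements(headOrNullIdiomatic(quantiles)))
assert(headOrNullVerbose(null) == null && headOrNullIdiomatic(null) == null)
```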
```scala
}

/**
 * Calculates the approximate quantiles of numerical columns of a DataFrame.
 * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
 *   detailed description.
 * @see `DataFrameStatsFunctions.approxQuantile` for detailed description.
```
Did this link cause problems in doc generation? It looks like there is a missing ")" in the original link. If you add the ")" then will it generate correctly (with a hyperlink) for the Scala and Java docs?
@jkbradley Do you mean `@see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]])`? I am not sure whether it works for java docs. @HyukjinKwon Could you help review this?
Force-pushed from 4db82b4 to a3171e4.
Test build #72794 has finished for PR 16776 at commit
Test build #72795 has finished for PR 16776 at commit
```scala
}

/**
 * Calculates the approximate quantiles of numerical columns of a DataFrame.
 * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
 *   detailed description.
 * @see `[[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]]` for detailed
```
`DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile` -> `DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile)`? How to generate the JAVA8 doc?
nit: `DataFrameStatsFunctions` -> `DataFrameStatFunctions`, or remove it. For example, just `approxQuantile(String, Array[Double], Double)` or `approxQuantile()` with some description.
We could just wrap them in backticks without `[[ ... ]]` in general. It seems the Scaladoc-specific annotation also does not work to disambiguate the argument types.
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error]  * @see {@link DataFrameStatFunctions.approxQuantile(col:Str* approxQuantile)} for
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:45: error: unexpected text
[error]  * @see #approxQuantile(String, Array[Double], Double) for detailed description.
[error] ^
```
I guess it does not necessarily have to make a link if it breaks.
It seems the breaks are queued up a bit. Let me sweep it soon.
Test build #72802 has started for PR 16776 at commit
https://issues.apache.org/jira/browse/SPARK-19573 is created to track the issue of inconsistent na-dropping.
Test build #72809 has finished for PR 16776 at commit
Force-pushed from 1f07901 to db23d11.
```scala
}

/**
 * Calculates the approximate quantiles of numerical columns of a DataFrame.
 * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
```
I am sorry. Actually, I initially meant removing `DataFrameStatFunctions` and leaving the method, because it is in the same class. Nevertheless, FWIW, I am fine with removing this `@see` as is, given the other functions here.
The `*` was me getting fancy to get the ScalaDoc to link to the correct single-arg method (I did test it at the time and it does work for Scala, though there may be a mistake here somewhere).
It would still be good to provide a `@see` reference even if it does not link nicely (so the simple backtick method name, as you suggested?).
OK, I will add it back. Should it be `approxQuantile(col:Str* approxQuantile)` or `approxQuantile(String, Array[Double], Double)`?
Test build #72867 has finished for PR 16776 at commit
Test build #72868 has finished for PR 16776 at commit
Test build #72983 has finished for PR 16776 at commit
## What changes were proposed in this pull request?

These errors below seem caused by unidoc that does not understand double commented block.

```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected
[error]  * MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected
[error]  * MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected
[error]  * new MapGroupsWithStateFunction<String, Integer, Integer, String>() {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected
[error]  * new MapGroupsWithStateFunction<String, Integer, Integer, String>() {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: illegal character: '#'
[error]  * &apache#64;Override
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: class, interface, or enum expected
[error]  * &apache#64;Override
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:76: error: class, interface, or enum expected
[error]  * boolean shouldRemove = ...; // Decide whether to remove the state
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:77: error: class, interface, or enum expected
[error]  * if (shouldRemove) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:79: error: class, interface, or enum expected
[error]  * } else {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:81: error: class, interface, or enum expected
[error]  * state.update(newState); // Set the new state
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:82: error: class, interface, or enum expected
[error]  * }
[error] ^
[error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:85: error: class, interface, or enum expected
[error]  * state.update(initialState);
[error] ^
[error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:86: error: class, interface, or enum expected
[error]  * }
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:90: error: class, interface, or enum expected
[error]  * </code></pre>
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:92: error: class, interface, or enum expected
[error]  * tparam S User-defined type of the state to be stored for each key. Must be encodable into
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:93: error: class, interface, or enum expected
[error]  * Spark SQL types (see {link Encoder} for more details).
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:94: error: class, interface, or enum expected
[error]  * since 2.1.1
[error] ^
```

And another link seems unrecognisable.

```
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:16: error: reference not found
[error]  * That is, in every batch of the {link streaming.StreamingQuery StreamingQuery},
[error]
```

Note that this PR does not fix the two breaks as below:

```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error]  * see {link DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile} for
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:52: error: bad use of '>'
[error]  * param relativeError The relative target precision to achieve (>= 0).
[error] ^
[error]
```

because these seem probably fixed soon in apache#16776 and I intended to avoid potential conflicts.

## How was this patch tested?

Manually via `jekyll build`

Author: hyukjinkwon <[email protected]>

Closes apache#16926 from HyukjinKwon/javadoc-break.
LGTM
Thanks! Merging to master. Please continue to finish the work https://issues.apache.org/jira/browse/SPARK-19573
Sorry I missed the conversation here. LGTM.
What changes were proposed in this pull request?
1, check the behavior with illegal `quantiles` and `relativeError`
2, add tests for `relativeError` > 1
3, update tests for `null` data
4, update some docs for javadoc8

How was this patch tested?
local test in spark-shell
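A hedged spark-shell sketch of the behaviors the PR adds tests for (assumes an active `spark` session; the column name and values are illustrative, not taken from the PR's test suite):

```scala
import spark.implicits._

val df = Seq(1.0, 2.0, 3.0, 4.0).toDF("x")

// Valid call: median with zero relative error.
val Array(median) = df.stat.approxQuantile("x", Array(0.5), 0.0)

// Illegal quantile (outside [0.0, 1.0]) should raise IllegalArgumentException,
// which is what the new tests assert via intercept.
try df.stat.approxQuantile("x", Array(1.5), 0.0)
catch { case e: IllegalArgumentException => println(e.getMessage) }
```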