[SPARK-15709][SQL] Prevent freqItems from raising `UnsupportedOperationException: empty.min`

## What changes were proposed in this pull request?

Currently, `freqItems` raises `UnsupportedOperationException: empty.min`, typically when its `support` argument is too large:
```scala
scala> spark.createDataset(Seq(1, 2, 2, 3, 3, 3)).stat.freqItems(Seq("value"), 2)
16/06/01 11:11:38 ERROR Executor: Exception in task 5.0 in stage 0.0 (TID 5)
java.lang.UnsupportedOperationException: empty.min
...
```
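A minimal sketch of why a large `support` leads to this failure, using plain Scala collections rather than Spark. It assumes, based on the diff, that the map capacity is computed as `(1 / support).toInt`. The names `sizeOfMap`, `capacity`, `emptyCounts`, and `failure` are illustrative only.

```scala
import scala.util.Try

// Assumption taken from the diff: capacity is derived as (1 / support).toInt.
def sizeOfMap(support: Double): Int = (1 / support).toInt

// With support = 2 the capacity truncates to 0, so no item can ever be stored.
val capacity = sizeOfMap(2.0)

// Stand-in for the counter map, which stays empty at capacity 0.
val emptyCounts = Map.empty[Any, Long]

// The check `size < capacity` is false (0 < 0), so the code falls through
// to taking the minimum of an empty collection, which throws.
val failure = Try(emptyCounts.values.min)
```

Under these assumptions, any `support` greater than 1 yields a capacity of 0 and reaches the unguarded `min` call.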

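Also, the parameter-checking message is inaccurate: the check accepts `support == 1e-4`, while the message says "greater than", and no upper bound is enforced.
```scala
require(support >= 1e-4, s"support ($support) must be greater than 1e-4.")
```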

This PR changes the logic to handle the `empty` case and also improves parameter checking.
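
The two changes can be sketched in isolation as follows. This is an illustrative approximation, not the Spark code itself: the helper names `minCountOrZero` and `checkSupport` are hypothetical, and only the guarded expression and the `require` condition mirror the diff.

```scala
// Hypothetical helper: fall back to 0 when there are no counts,
// instead of calling min on an empty collection.
def minCountOrZero(counts: Iterable[Long]): Long =
  if (counts.isEmpty) 0L else counts.min

// Hypothetical helper: reject support values outside [1e-4, 1].
// require throws IllegalArgumentException when the condition is false.
def checkSupport(support: Double): Unit =
  require(support >= 1e-4 && support <= 1.0,
    s"Support must be in [1e-4, 1], but got $support.")
```

With the range check in place, a call such as `freqItems(Seq("value"), 2)` is expected to fail fast with `IllegalArgumentException`, which is what the new test case asserts.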

## How was this patch tested?

Passes the Jenkins tests (with a new test case).

Author: Dongjoon Hyun <[email protected]>

Closes apache#13449 from dongjoon-hyun/SPARK-15709.
dongjoon-hyun authored and srowen committed Jun 2, 2016
1 parent 4fe7c7b commit b85d18f
Showing 2 changed files with 13 additions and 2 deletions.
@@ -40,7 +40,7 @@ private[sql] object FrequentItems extends Logging {
       if (baseMap.size < size) {
         baseMap += key -> count
       } else {
-        val minCount = baseMap.values.min
+        val minCount = if (baseMap.values.isEmpty) 0 else baseMap.values.min
         val remainder = count - minCount
         if (remainder >= 0) {
           baseMap += key -> count // something will get kicked out, so we can add this
@@ -83,7 +83,7 @@ private[sql] object FrequentItems extends Logging {
       df: DataFrame,
       cols: Seq[String],
       support: Double): DataFrame = {
-    require(support >= 1e-4, s"support ($support) must be greater than 1e-4.")
+    require(support >= 1e-4 && support <= 1.0, s"Support must be in [1e-4, 1], but got $support.")
     val numCols = cols.length
     // number of max items to keep counts for
     val sizeOfMap = (1 / support).toInt
@@ -235,6 +235,17 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
     assert(items.length === 1)
   }
 
+  test("SPARK-15709: Prevent `UnsupportedOperationException: empty.min` in `freqItems`") {
+    val ds = spark.createDataset(Seq(1, 2, 2, 3, 3, 3))
+
+    intercept[IllegalArgumentException] {
+      ds.stat.freqItems(Seq("value"), 0)
+    }
+    intercept[IllegalArgumentException] {
+      ds.stat.freqItems(Seq("value"), 2)
+    }
+  }
+
   test("sampleBy") {
     val df = spark.range(0, 100).select((col("id") % 3).as("key"))
     val sampled = df.stat.sampleBy("key", Map(0 -> 0.1, 1 -> 0.2), 0L)