[Minor][ML] Refactor clustering summary. #15555

yanboliang · 2016-10-19T13:51:48Z

What changes were proposed in this pull request?

Abstract ClusteringSummary from KMeansSummary, GaussianMixtureSummary and BisectingSummary, and eliminate duplicated pieces of code.

How was this patch tested?

Existing tests.
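For readers who have not opened the diff, the shape of the refactor can be sketched roughly as follows. This is a simplified, dependency-free model and not the code in the patch: in Spark the `predictions` argument is a `DataFrame`, while a plain `Seq[Int]` of cluster ids stands in for it here so the sketch runs without Spark on the classpath. The `clusterSizes` member is an assumed example of logic that would be shared by the subclasses.

```scala
// Simplified model of the refactor. Assumption: Seq[Int] of cluster ids
// replaces Spark's DataFrame so the example has no external dependencies.
class ClusteringSummary(
    val predictions: Seq[Int],
    val predictionCol: String,
    val featuresCol: String,
    val k: Int) extends Serializable {

  // Size of each cluster, indexed by cluster id (empty clusters stay at 0).
  lazy val clusterSizes: Array[Long] = {
    val sizes = Array.fill[Long](k)(0L)
    predictions.foreach(id => sizes(id) += 1L)
    sizes
  }
}

// Subclasses keep only what is specific to the algorithm.
class KMeansSummary(
    predictions: Seq[Int],
    predictionCol: String,
    featuresCol: String,
    k: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k)

class GaussianMixtureSummary(
    predictions: Seq[Int],
    predictionCol: String,
    val probabilityCol: String,
    featuresCol: String,
    k: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k)
```

With this layout, anything that depends only on the predictions and `k` is written once in the parent and inherited by every summary.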

SparkQA · 2016-10-19T14:48:34Z

Test build #67198 has finished for PR 15555 at commit 3169267.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-10-20T02:10:18Z

cc @zhengruifeng @srowen

zhengruifeng · 2016-10-20T02:19:26Z

mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala

+ * @param featuresCol  Name for column of features in `predictions`
+ * @param k  Number of clusters
+ */
+@Experimental


What about adding @Since("2.1.0") here?
How about creating a new Scala file named Clustering.scala and moving ClusteringSummary into it?

yanboliang · 2016-10-20

ClusteringSummary will be extended by summaries that were added in different versions, so I think we should not add a since version here. As for a new file, I think ClusteringSummary is a small class, so we can keep it here for now.

sethah · 2016-10-24

I'm not entirely certain on the official policy for the @Since tags, but it seems better to me to put @Since("2.1.0") here for the class and the methods. It will be correct for some and will at least not be incorrect for others. I'm not positive though.

yanboliang · 2016-10-25

I'm also ambivalent about this. The reason behind my change is that some classes, such as KMeansSummary and GaussianMixtureSummary, were added in 2.0. If I put @Since("2.1.0") here, it doesn't look quite right, but I'm not sure whether it's OK. @jkbradley What's your opinion? Thanks.

yanboliang · 2016-10-22T09:09:13Z

@srowen Would you mind having a look when you are available? Thanks.

sethah

Left only minor comments.

sethah · 2016-10-24T21:53:38Z

mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala

+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (


minor: this can fit on one line like: k: Int) extends ClusteringSummary(..., ..., ...)

sethah · 2016-10-24T21:54:36Z

mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala

+ * :: Experimental ::
+ * Summary of clustering.
+ *
+ * @param predictions  [[DataFrame]] produced by model.transform()


nit: Add periods for each line

sethah · 2016-10-24T22:33:10Z

mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala

+ * @param featuresCol  Name for column of features in `predictions`
+ * @param k  Number of clusters
+ */
+@Experimental


I'm not entirely certain on the official policy for the @Since tags, but it seems better to me to put @Since("2.1.0") here for the class and the methods. It will be correct for some and will at least not be incorrect for others. I'm not positive though.

sethah · 2016-10-24T22:34:09Z

mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala

+
+/**
+ * :: Experimental ::
+ * Summary of clustering.


"Summary of clustering algorithms." ?

SparkQA · 2016-10-25T04:48:35Z

Test build #67490 has finished for PR 15555 at commit f13f240.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-10-25T06:31:58Z

jenkins test this please

SparkQA · 2016-10-25T07:28:26Z

Test build #67494 has finished for PR 15555 at commit f13f240.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2016-10-25T10:30:04Z

Looks good to me.

sethah · 2016-10-25T20:28:36Z

@yanboliang What do you think about adding the @Since tags?

jkbradley · 2016-10-25T20:48:46Z

I'd say add Since tags wherever applicable in concrete classes. But if it's an abstract method, we probably should not, since the tags will be incorrect for new child classes later on.
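A minimal sketch of that policy, under stated assumptions: the `Since` class below is a local stand-in for `org.apache.spark.annotation.Since` so the example compiles without Spark, the class names ending in `Sketch` and `SummaryBase` are hypothetical, and the version strings are illustrative (the thread only says the KMeans and GaussianMixture summaries were added in 2.0).

```scala
import scala.annotation.StaticAnnotation

// Local stand-in for Spark's Since annotation (assumption: used here only
// so the sketch is self-contained).
class Since(version: String) extends StaticAnnotation

// Shared parent: members carry no @Since, because any version written here
// would be wrong for a child class introduced in a later release.
abstract class SummaryBase(val predictionCol: String, val k: Int)

// Concrete classes: tagged with the release in which each one first appeared.
@Since("2.0.0")
class KMeansSummarySketch(predictionCol: String, k: Int)
  extends SummaryBase(predictionCol, k)

@Since("2.0.0")
class GaussianMixtureSummarySketch(
    predictionCol: String,
    @Since("2.0.0") val probabilityCol: String, // member specific to this class
    k: Int) extends SummaryBase(predictionCol, k)
```

The trade-off is that inherited members show no version in the generated docs, but nothing in the docs is wrong for subclasses added later.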

jkbradley · 2016-10-25T20:53:02Z

mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala

-  @transient lazy val cluster: DataFrame = predictions.select(predictionCol)
+    predictions: DataFrame,
+    predictionCol: String,
+    val probabilityCol: String,


could do a Since tag here

jkbradley · 2016-10-25T20:53:45Z

mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala

+ * @param k  Number of clusters.
+ */
+@Experimental
+class ClusteringSummary private[clustering] (


If this is generic to clustering, how about putting it in a new file?

SparkQA · 2016-10-26T10:55:07Z

Test build #67575 has finished for PR 15555 at commit 946ee73.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-10-26T18:48:11Z

LGTM
I'll merge this with master
Thanks!

## What changes were proposed in this pull request?

Abstract ```ClusteringSummary``` from ```KMeansSummary```, ```GaussianMixtureSummary``` and ```BisectingSummary```, and eliminate duplicated pieces of code.

## How was this patch tested?

Existing tests.

Author: Yanbo Liang <[email protected]>

Closes apache#15555 from yanboliang/clustering-summary.

Refactor clustering summary.

3169267

zhengruifeng reviewed Oct 20, 2016

sethah reviewed Oct 24, 2016

Address comments.

f13f240

jkbradley reviewed Oct 25, 2016

Move ClusteringSummary to a separate file.

946ee73

asfgit closed this in ea3605e Oct 26, 2016

yanboliang deleted the clustering-summary branch October 26, 2016 23:06

zhengruifeng mentioned this pull request Oct 27, 2016

[SPARK-17139][ML] Add model summary for MultinomialLogisticRegression #15435

Closed

sethah mentioned this pull request Nov 5, 2016

[SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark #13557

Closed
