[SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM #15777

sethah · 2016-11-05T00:33:17Z

What changes were proposed in this pull request?

Add model summary APIs for GaussianMixtureModel and BisectingKMeansModel in pyspark.

How was this patch tested?

Unit tests.

sethah · 2016-11-05T00:34:25Z

python/pyspark/ml/classification.py

+            # Note: Once multiclass is added, update this to return correct summary
+            return BinaryLogisticRegressionTrainingSummary(java_blrt_summary)
+        else:
+            raise RuntimeError("No training summary available for this %s" %


Before, this would throw a Py4JJavaError. I think it's slightly better to throw a RuntimeError here as is done in Scala.

I think that's generally a good improvement; the Py4J errors are often confusing to end users.

I like this change; we should always throw an exception that is easy for users to understand.
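
The guard discussed in this thread can be sketched in plain Python, without a Spark session. `ToyModel`, its `_java_summary` attribute, and the `describe_summary` helper are hypothetical stand-ins, not the PySpark API; only the `hasSummary` / `summary` names and the start of the error message come from the diff above, and the class name used as the format argument is an assumption.

```python
class ToyModel(object):
    """Hypothetical stand-in for a PySpark model wrapper (not the real API)."""

    def __init__(self, java_summary=None):
        # In PySpark this state lives on the JVM side; here it is a plain attribute.
        self._java_summary = java_summary

    @property
    def hasSummary(self):
        return self._java_summary is not None

    @property
    def summary(self):
        if self.hasSummary:
            return self._java_summary
        # Raise a Python-level error instead of letting a JVM exception surface.
        # The format argument (the class name) is an assumption for illustration.
        raise RuntimeError("No training summary available for this %s" %
                           self.__class__.__name__)


def describe_summary(model):
    """Return the summary, or the error message if none is available."""
    try:
        return model.summary
    except RuntimeError as e:
        return str(e)
```

A caller can then catch an ordinary `RuntimeError` with a readable message rather than having to unwrap a JVM stack trace.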

sethah · 2016-11-05T00:37:26Z

python/pyspark/ml/tests.py

+        self.assertEqual(len(s.clusterSizes), 2)
+        self.assertEqual(s.k, 2)
+
+        # TODO: test when there is no summary


We should test that hasSummary returns False when there is no summary available, and that summary throws an exception. The problem I'm having is that I'm not sure how to create this test case. The only way to get a model is by calling fit, which will produce a model with a summary. Calling model._call_java("setSummary", None) doesn't work either. Is there some way that I'm missing?

It might make sense to update setSummary to treat null as empty (e.g. use Option instead of Some) for easy testing.

SparkQA · 2016-11-05T01:00:48Z

Test build #68167 has finished for PR 15777 at commit 89f87ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ClusteringSummary(JavaWrapper):
- class GaussianMixtureSummary(ClusteringSummary):
- class BisectingKMeansSummary(ClusteringSummary):

SparkQA · 2016-11-05T19:58:44Z

Test build #68216 has finished for PR 15777 at commit fc248b5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-11-06T00:47:56Z

Thanks for working on this @sethah - more work towards increased parity is good :)

sethah · 2016-11-14T16:02:25Z

ping @yanboliang

sethah · 2016-11-14T16:48:22Z

mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala

@@ -132,7 +132,7 @@ class BisectingKMeansModel private[ml] (
  private var trainingSummary: Option[BisectingKMeansSummary] = None

  private[clustering] def setSummary(summary: BisectingKMeansSummary): this.type = {
-    this.trainingSummary = Some(summary)
+    this.trainingSummary = Option(summary)


Per @holdenk's suggestion, I changed setSummary to use Option.apply, which treats null as None. This allows us to exercise the test case for hasSummary when summary == None. This was never tested in Scala either, so I added unit tests for both Scala and Python.

I'd prefer to make the argument an Option[BisectingKMeansSummary], like:

  private[clustering] def setSummary(summary: Option[BisectingKMeansSummary]): this.type = {
    this.trainingSummary = summary
    this
  }

And test summary == None with:

  model.setSummary(None)
  assert(!model.hasSummary)

I think calling setSummary(null) and then testing whether the summary exists is tricky. trainingSummary has type Option[BisectingKMeansSummary] with None as its default value, so setSummary(None) makes more sense for the case where the model has no summary.
I see that the reason for this change is that you want to call setSummary from the Python side, where Python None is converted to null in Scala. But since this is a private function, I don't think we need to test it across Scala and Python; private functions should not be called by users.

SparkQA · 2016-11-14T17:52:37Z

Test build #68624 has finished for PR 15777 at commit 29b6496.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2016-11-15T08:20:22Z

python/pyspark/ml/clustering.py

 from pyspark.ml.param.shared import *
 from pyspark.ml.common import inherit_doc

-__all__ = ['BisectingKMeans', 'BisectingKMeansModel',
+__all__ = ['BisectingKMeans', 'BisectingKMeansModel', 'BisectingKMeansSummary',
+           'ClusteringSummary',


I don't think we need to expose ClusteringSummary, since on the Scala side ClusteringSummary is private[clustering].

+1 @zhengruifeng
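
As a rough illustration of what leaving a name out of `__all__` does, the sketch below builds a throwaway in-memory module and star-imports from it. The module name `toy_clustering` is hypothetical and the two empty classes only mirror the class names in the diff; this is not the PySpark source.

```python
import sys
import types

# Build a throwaway module in memory; "toy_clustering" is a hypothetical name.
source = '''
__all__ = ['BisectingKMeansSummary']

class ClusteringSummary(object):
    pass

class BisectingKMeansSummary(ClusteringSummary):
    pass
'''
module = types.ModuleType("toy_clustering")
exec(source, module.__dict__)
sys.modules["toy_clustering"] = module

# A star-import only binds the names listed in __all__.
namespace = {}
exec("from toy_clustering import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
```

Here `exported` contains only `BisectingKMeansSummary`. The base class stays reachable through an explicit attribute access on the module, so omitting it from `__all__` marks it as non-public without breaking inheritance.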

yanboliang · 2016-11-15T14:55:25Z

@sethah I will take a look in a few days. Thanks.

yanboliang · 2016-11-16T15:20:21Z

python/pyspark/ml/clustering.py

+        training set. An exception is thrown if no summary exists.
+        """
+        if self.hasSummary:
+            return GaussianMixtureSummary(self._call_java("summary"))


Typo, should be BisectingKMeansSummary?

Wow, good catch!

yanboliang · 2016-11-16T15:43:54Z

python/pyspark/ml/tests.py

+        self.assertEqual(len(s.clusterSizes), 2)
+        self.assertEqual(s.k, 2)
+
+        model._call_java("setSummary", None)


Why do we test this? setSummary is private, and in the ordinary case we should never end up with a model without a summary. I think we should only test the public API for Python. Furthermore, if we want to test a model without a summary, we would need to write a dummy Scala model without a summary and check hasSummary directly on the Python side. If the Scala function is not public, we cannot guarantee Scala/Python compatibility for it, so testing it does not make much sense.

This came out of a suggestion in the other PR to add KMeansSummary. Adding a hasSummary method implies that models can both have and not have summaries. How can we test that hasSummary works properly when we can only exercise one of its test cases?

Edit: hasSummary will return false if the model is saved and then loaded back since the summary is not saved with the model, so it doesn't always return true. We could test the hasSummary method by calling save/load but that seems expensive just to test a simple function.

Yeah, after loading a saved model, the summary should be None and hasSummary should return false. I think this is the correct way to test it, although it has some extra cost. What about adding the hasSummary check to the save/load doc test? Then it would not incur any extra cost.

We can do it, though typically the doc tests are for things that we want to test that also illustrate functionality to the users. And @jkbradley seemed against adding it as a doc test here. I'll add it for now and we can revert it if we decide that's best.

Yeah, I think it makes sense to add summary-related doc tests for algorithms to illustrate the output of summary. So one more line to check hasSummary does not seem to have much impact. @jkbradley What's your opinion? Thanks.

@yanboliang I switched it up. Let me know what you think
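
The save/load route discussed above can be sketched with a toy model that persists its parameters but not its summary. `ToyClusteringModel`, its JSON file format, and the `roundtrip` helper are all hypothetical and stand in for a real PySpark model; the only behavior taken from the thread is that a loaded model reports hasSummary as false.

```python
import json
import os
import tempfile


class ToyClusteringModel(object):
    """Hypothetical model: persists its parameters but not its training summary."""

    def __init__(self, k, summary=None):
        self.k = k
        self._summary = summary

    @property
    def hasSummary(self):
        return self._summary is not None

    def save(self, path):
        # Only the model parameters are written; the summary is dropped on purpose.
        with open(path, "w") as f:
            json.dump({"k": self.k}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))


def roundtrip(model):
    """Save the model to a temporary file and load it back."""
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, "model.json")
    try:
        model.save(path)
        return ToyClusteringModel.load(path)
    finally:
        # Clean up the temporary file and directory.
        if os.path.exists(path):
            os.remove(path)
        os.rmdir(tmp_dir)
```

A freshly constructed model with a summary has `hasSummary` true, while the result of `roundtrip` has it false, which exercises both branches through the public surface only.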

SparkQA · 2016-11-16T17:31:51Z

Test build #68722 has finished for PR 15777 at commit 7c2c9ee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-17T17:04:20Z

Test build #68789 has finished for PR 15777 at commit b6062b9.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-17T19:39:45Z

Test build #68794 has finished for PR 15777 at commit 428348d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang

@sethah Only one last comment; otherwise, LGTM. I'd like to get this in before 2.1. Thanks.

yanboliang · 2016-11-19T11:44:49Z

mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala

@@ -95,8 +95,7 @@ class BisectingKMeansModel private[ml] (
  @Since("2.0.0")
  override def copy(extra: ParamMap): BisectingKMeansModel = {
    val copied = copyValues(new BisectingKMeansModel(uid, parentModel), extra)
-    if (trainingSummary.isDefined) copied.setSummary(trainingSummary.get)
-    copied.setParent(this.parent)
+    copied.setSummary(trainingSummary).setParent(this.parent)


This looks better. Could you make the change for Scala LiR, LoR, GLM and KMeans as well? I think they should be consistent. Thanks.

Updated. I also added tests. Thanks for reviewing!

SparkQA · 2016-11-21T07:11:21Z

Test build #68919 has finished for PR 15777 at commit d6caa02.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-11-21T13:36:28Z

LGTM, merged into master and branch-2.1. Thanks!

…d BKM

## What changes were proposed in this pull request?

Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark.

## How was this patch tested?

Unit tests.

Author: sethah

Closes #15777 from sethah/pyspark_cluster_summaries.

(cherry picked from commit e811fbf)
Signed-off-by: Yanbo Liang

jkbradley · 2016-11-21T23:16:02Z

@yanboliang I noticed the JIRA is still pending. Are there follow-up tasks?

yanboliang · 2016-11-22T00:33:26Z

No follow-up; I forgot to close it. Thanks for the reminder.

sethah mentioned this pull request Nov 5, 2016

[SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark #13557

Closed

sethah added 6 commits November 17, 2016 09:15

add python clustering summaries and tests

edc2c44

update __all__

c3859da

use Option.apply and add tests to scala for hasSummary

f599175

correct BKM summary return type

6f89617

add doc tests and change set summary input to option

952d24a

rebase

428348d

sethah force-pushed the pyspark_cluster_summaries branch from b6062b9 to 428348d Compare November 17, 2016 18:30

update setSummary for other algos

d6caa02

asfgit closed this in e811fbf Nov 21, 2016

wangmiao1981 mentioned this pull request Feb 15, 2017

[SPARK-14894][PySpark] Add result summary api to Gaussian Mixture #12675

Closed
