[SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM #15777

Closed
wants to merge 7 commits into from

sethah
Contributor

@sethah sethah commented Nov 5, 2016

What changes were proposed in this pull request?

Add model summary APIs for GaussianMixtureModel and BisectingKMeansModel in pyspark.

How was this patch tested?

Unit tests.

# Note: Once multiclass is added, update this to return correct summary
return BinaryLogisticRegressionTrainingSummary(java_blrt_summary)
else:
raise RuntimeError("No training summary available for this %s" %
Contributor Author

Before, this would throw a Py4JJavaError. I think it's slightly better to throw a RuntimeError here as is done in Scala.

Contributor

I think that's generally a good improvement; the Py4J errors are often confusing to end users.

Contributor

I like this change; we should always throw an exception that is easy for users to understand.
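
The guard discussed in this thread can be sketched in plain Python. This is only an illustration under stated assumptions, not the PySpark source: `FakeJavaModel` and `SketchModel` are hypothetical stand-ins for the Py4J-wrapped model, and the opaque error text is invented for the sketch. Only the `RuntimeError` message mirrors the fragment quoted above.

```python
class FakeJavaModel:
    """Hypothetical stand-in for the JVM-side model reached through Py4J."""

    def __init__(self, summary=None):
        self._summary = summary

    def has_summary(self):
        return self._summary is not None

    def summary(self):
        if self._summary is None:
            # Stands in for the opaque JVM-side failure (made-up message).
            raise Exception("opaque JVM error (stand-in)")
        return self._summary


class SketchModel:
    """Python-side wrapper that checks hasSummary before touching the JVM."""

    def __init__(self, java_model):
        self._java_model = java_model

    @property
    def hasSummary(self):
        return self._java_model.has_summary()

    @property
    def summary(self):
        if self.hasSummary:
            return self._java_model.summary()
        # Raise a readable Python error instead of surfacing the JVM one.
        raise RuntimeError(
            "No training summary available for this %s" % self.__class__.__name__)
```

With this shape, a caller who asks for a missing summary sees a plain `RuntimeError` naming the model class, and the JVM-side call is never made.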

self.assertEqual(len(s.clusterSizes), 2)
self.assertEqual(s.k, 2)

# TODO: test when there is no summary
Contributor Author

We should test that hasSummary returns False when there is no summary available, and that summary throws an exception. The problem I'm having is that I'm not sure how to create this test case. The only way to get a model is by calling fit, which will produce a model with a summary. Calling model._call_java("setSummary", None) doesn't work either. Is there some way that I'm missing?

Contributor

It might make sense to update setSummary to treat null as empty (e.g. Option instead of Some) for easy testing.

@SparkQA

SparkQA commented Nov 5, 2016

Test build #68167 has finished for PR 15777 at commit 89f87ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ClusteringSummary(JavaWrapper):
    • class GaussianMixtureSummary(ClusteringSummary):
    • class BisectingKMeansSummary(ClusteringSummary):

@SparkQA

SparkQA commented Nov 5, 2016

Test build #68216 has finished for PR 15777 at commit fc248b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Nov 6, 2016

Thanks for working on this @sethah - more work towards increased parity is good :)

@sethah
Contributor Author

sethah commented Nov 14, 2016

ping @yanboliang

@@ -132,7 +132,7 @@ class BisectingKMeansModel private[ml] (
private var trainingSummary: Option[BisectingKMeansSummary] = None

private[clustering] def setSummary(summary: BisectingKMeansSummary): this.type = {
- this.trainingSummary = Some(summary)
+ this.trainingSummary = Option(summary)
Contributor Author

Per @holdenk's suggestion, I changed setSummary to use Option.apply, which treats null as None. This allows us to exercise the test case for hasSummary when summary == None. This was never tested in Scala either, so I added unit tests for both Scala and Python.
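
As a rough Python analogy of the Scala semantics described here (a sketch only: `Some`, `option`, and `SketchModel` are hypothetical names, and Python's `None` plays the role of both Scala's `null` and Scala's `None`), wrapping with `Some` always produces a defined value, while `Option.apply` maps a null input to the empty case:

```python
class Some:
    """Hypothetical stand-in for Scala's Some: wraps any value, even None."""

    def __init__(self, value):
        self.value = value


def option(value):
    """Mimics Scala's Option.apply: a null (None) input becomes 'no value'."""
    return None if value is None else Some(value)


class SketchModel:
    def __init__(self):
        self._training_summary = None  # plays the role of Scala's None

    def set_summary_with_some(self, summary):
        # Old behaviour: always wraps, so a null summary still counts as defined.
        self._training_summary = Some(summary)
        return self

    def set_summary_with_option(self, summary):
        # New behaviour: a null summary clears the field.
        self._training_summary = option(summary)
        return self

    @property
    def has_summary(self):
        return self._training_summary is not None
```

Under the old wrapping, passing a null summary would still leave `has_summary` true, which is why the no-summary case could not be reached from Python before the change.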

Contributor

I'd prefer to make the argument an Option[BisectingKMeansSummary], like:

private[clustering] def setSummary(summary: Option[BisectingKMeansSummary]): this.type = {
    this.trainingSummary = summary
    this
}

And test summary == None by:

model.setSummary(None)
assert(!model.hasSummary)

I think calling setSummary(null) and then testing whether the summary exists is tricky. The summary is stored as an Option[BisectingKMeansSummary] with None as the default value, so setSummary(None) makes more sense for the scenario where the model does not have a summary.
I see that the reason for this change is that you want to call setSummary from the Python side, where Python None is converted to null in Scala. But this is a private function, so I don't think we need to test it across Scala and Python; private functions should not be called by users.

@SparkQA

SparkQA commented Nov 14, 2016

Test build #68624 has finished for PR 15777 at commit 29b6496.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

from pyspark.ml.param.shared import *
from pyspark.ml.common import inherit_doc

- __all__ = ['BisectingKMeans', 'BisectingKMeansModel',
+ __all__ = ['BisectingKMeans', 'BisectingKMeansModel', 'BisectingKMeansSummary',
+            'ClusteringSummary',
Contributor

I think we don't need to expose ClusteringSummary, since on the Scala side ClusteringSummary is private[clustering].

@yanboliang
Contributor

@sethah I will take a look in a few days. Thanks.

training set. An exception is thrown if no summary exists.
"""
if self.hasSummary:
return GaussianMixtureSummary(self._call_java("summary"))
Contributor

Typo, should be BisectingKMeansSummary?

Contributor Author

Wow, good catch!

self.assertEqual(len(s.clusterSizes), 2)
self.assertEqual(s.k, 2)

model._call_java("setSummary", None)
Contributor

Why do we test this? setSummary is private, and it should not produce a model without a summary in the ordinary case. I think we should only test the public API for Python. Furthermore, if we want to test a model without a summary, we would need to write a dummy Scala model without a summary and check hasSummary directly on the Python side. If the Scala function is not public, we need not confirm Scala/Python compatibility for it, so testing it does not make much sense either.

Contributor Author

@sethah sethah Nov 16, 2016

This came out of a suggestion in the other PR to add KMeansSummary. Adding a hasSummary method implies that models can both have and not have summaries. How can we test that hasSummary works properly when we can only exercise one of its test cases?

Edit: hasSummary will return false if the model is saved and then loaded back since the summary is not saved with the model, so it doesn't always return true. We could test the hasSummary method by calling save/load but that seems expensive just to test a simple function.

Contributor

@yanboliang yanboliang Nov 17, 2016

Yeah, after loading a saved model, the summary should be None and hasSummary should return false. I think this is the correct test route, although it has some extra cost. What about adding a hasSummary check to the save/load doc test? Then it would not incur any extra cost.

Contributor Author

We can do it, though typically the doc tests are for things that we want to test that also illustrate functionality to the users. And @jkbradley seemed against adding it as a doc test here. I'll add it for now and we can revert it if we decide that's best.

Contributor

Yeah, I think it makes sense to add summary-related doc tests for algorithms to illustrate the output of summary. So one more line to check hasSummary does not seem to have much impact. @jkbradley What's your opinion? Thanks.

Contributor Author

@yanboliang I switched it up. Let me know what you think
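
The save/load route agreed on in this thread can be sketched without Spark. This is a minimal illustration, not the doc test that was actually added: `SketchModel` is a hypothetical model that persists only its parameters to JSON, so the summary is lost on reload in the same way the thread describes for the real models.

```python
import json
import os
import tempfile


class SketchModel:
    """Hypothetical model: only the parameters are persisted, never the summary."""

    def __init__(self, k, summary=None):
        self.k = k
        self._summary = summary

    @property
    def hasSummary(self):
        return self._summary is not None

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"k": self.k}, f)  # the summary is deliberately not written

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))


def check_summary_dropped_on_reload():
    """Return hasSummary for a freshly 'fitted' model and for its reloaded copy."""
    fitted = SketchModel(k=2, summary={"clusterSizes": [3, 3]})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.json")
        fitted.save(path)
        reloaded = SketchModel.load(path)
    return fitted.hasSummary, reloaded.hasSummary
```

This exercises both branches of `hasSummary` through public methods only, which is the point of preferring save/load over calling the private setter.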

@SparkQA

SparkQA commented Nov 16, 2016

Test build #68722 has finished for PR 15777 at commit 7c2c9ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2016

Test build #68789 has finished for PR 15777 at commit b6062b9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2016

Test build #68794 has finished for PR 15777 at commit 428348d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@yanboliang yanboliang left a comment

@sethah Only one last comment; otherwise, LGTM. I'd like to get this in before 2.1. Thanks.

@@ -95,8 +95,7 @@ class BisectingKMeansModel private[ml] (
@Since("2.0.0")
override def copy(extra: ParamMap): BisectingKMeansModel = {
val copied = copyValues(new BisectingKMeansModel(uid, parentModel), extra)
- if (trainingSummary.isDefined) copied.setSummary(trainingSummary.get)
- copied.setParent(this.parent)
+ copied.setSummary(trainingSummary).setParent(this.parent)
Contributor

This looks better. Could you make the change for Scala LiR, LoR, GLM and KMeans as well? I think they should be consistent. Thanks.

Contributor Author

@sethah sethah Nov 21, 2016

Updated. I also added tests. Thanks for reviewing!

@SparkQA

SparkQA commented Nov 21, 2016

Test build #68919 has finished for PR 15777 at commit d6caa02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor

LGTM, merged into master and branch-2.1. Thanks!

asfgit pushed a commit that referenced this pull request Nov 21, 2016
…d BKM

## What changes were proposed in this pull request?

Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark.

## How was this patch tested?

Unit tests.

Author: sethah <[email protected]>

Closes #15777 from sethah/pyspark_cluster_summaries.

(cherry picked from commit e811fbf)
Signed-off-by: Yanbo Liang <[email protected]>
@asfgit asfgit closed this in e811fbf Nov 21, 2016
@jkbradley
Member

@yanboliang I noticed the JIRA is still pending. Are there follow-up tasks?

@yanboliang
Contributor

No follow-up; I forgot to close it. Thanks for the reminder.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017