-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM #15777
Conversation
# Note: Once multiclass is added, update this to return correct summary | ||
return BinaryLogisticRegressionTrainingSummary(java_blrt_summary) | ||
else: | ||
raise RuntimeError("No training summary available for this %s" % |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before, this would throw a Py4JJavaError
. I think it's slightly better to throw a RuntimeError
here as is done in Scala.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think thats generally a good improvement, the Py4J
errors are often confusing to end users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this change, we should always throw an exception easy to understand by users.
self.assertEqual(len(s.clusterSizes), 2) | ||
self.assertEqual(s.k, 2) | ||
|
||
# TODO: test when there is no summary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should test that hasSummary
returns False
when there is no summary available, and that summary
throws an exception. The problem I'm having is that I'm not sure how to create this test case. The only way to get a model is by calling fit, which will produce a model with a summary. Calling model._call_java("setSummary", None)
doesn't work either. Is there some way that I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might make sense to update setSummary
to treat null as empty (e.g. Option
instead of Some
)) for easy testing.
Test build #68167 has finished for PR 15777 at commit
|
Test build #68216 has finished for PR 15777 at commit
|
Thanks for working on this @sethah - more work towards increased parity is good :) |
ping @yanboliang |
@@ -132,7 +132,7 @@ class BisectingKMeansModel private[ml] ( | |||
private var trainingSummary: Option[BisectingKMeansSummary] = None | |||
|
|||
private[clustering] def setSummary(summary: BisectingKMeansSummary): this.type = { | |||
this.trainingSummary = Some(summary) | |||
this.trainingSummary = Option(summary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per @holdenk's suggestion, I changed the setSummary
to use Option.apply
which treats null
as None
. This allows us to exercise the test case for hasSummary
when summary == None
. This was never tested in Scala either so I added unit tests for both Scala and Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd more prefer to make the argument as Option[BisectingKMeansSummary]
like:
private[clustering] def setSummary(summary: Option[BisectingKMeansSummary]): this.type = {
this.trainingSummary = summary
this
}
And test summary == None
by:
model.setSummary(None)
assert(!model.hasSummary)
Since I think setSummary(null)
and test whether it existing is very tricky. The type of summary
is Option[BisectingKMeansSummary]
and with None
as default value, so setSummary(None)
should make more sense for the scenario that the model does not have summary.
I saw the reason for make this change is that you want to call setSummary
at Python side, and Python None would be converted to null
in Scala. But I think this is private function, we don't need to run test across Scala and Python, since private function should not be called by users.
Test build #68624 has finished for PR 15777 at commit
|
from pyspark.ml.param.shared import * | ||
from pyspark.ml.common import inherit_doc | ||
|
||
__all__ = ['BisectingKMeans', 'BisectingKMeansModel', | ||
__all__ = ['BisectingKMeans', 'BisectingKMeansModel', 'BisectingKMeansSummary', | ||
'ClusteringSummary', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we dont need to expose ClusteringSummary
, for in the scala side ClusteringSummary
is private in [clustering].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sethah I will take a look in a few days. Thanks. |
@@ -132,7 +132,7 @@ class BisectingKMeansModel private[ml] ( | |||
private var trainingSummary: Option[BisectingKMeansSummary] = None | |||
|
|||
private[clustering] def setSummary(summary: BisectingKMeansSummary): this.type = { | |||
this.trainingSummary = Some(summary) | |||
this.trainingSummary = Option(summary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd more prefer to make the argument as Option[BisectingKMeansSummary]
like:
private[clustering] def setSummary(summary: Option[BisectingKMeansSummary]): this.type = {
this.trainingSummary = summary
this
}
And test summary == None
by:
model.setSummary(None)
assert(!model.hasSummary)
Since I think setSummary(null)
and test whether it existing is very tricky. The type of summary
is Option[BisectingKMeansSummary]
and with None
as default value, so setSummary(None)
should make more sense for the scenario that the model does not have summary.
I saw the reason for make this change is that you want to call setSummary
at Python side, and Python None would be converted to null
in Scala. But I think this is private function, we don't need to run test across Scala and Python, since private function should not be called by users.
from pyspark.ml.param.shared import * | ||
from pyspark.ml.common import inherit_doc | ||
|
||
__all__ = ['BisectingKMeans', 'BisectingKMeansModel', | ||
__all__ = ['BisectingKMeans', 'BisectingKMeansModel', 'BisectingKMeansSummary', | ||
'ClusteringSummary', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Note: Once multiclass is added, update this to return correct summary | ||
return BinaryLogisticRegressionTrainingSummary(java_blrt_summary) | ||
else: | ||
raise RuntimeError("No training summary available for this %s" % |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this change, we should always throw an exception easy to understand by users.
training set. An exception is thrown if no summary exists. | ||
""" | ||
if self.hasSummary: | ||
return GaussianMixtureSummary(self._call_java("summary")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, should be BisectingKMeansSummary
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, good catch!
self.assertEqual(len(s.clusterSizes), 2) | ||
self.assertEqual(s.k, 2) | ||
|
||
model._call_java("setSummary", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we test this? Actually setSummary
is private and it should not generate a model w/o summary in ordinary case. I think we should only test the public API for Python. Further more, if we want to test model w/o summary, we need to write a dummy Scala model w/o summary, and check hasSummary
directly at Python side. I think if the Scala function is not public, we may not confirm Scala/Python compatibility, so testing is also not very make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This came out of a suggestion in the other PR to add KMeansSummary
. Adding a hasSummary
method implies that models can both have and not have summaries. How can we test that hasSummary
works properly when we can only exercise one of its test cases?
Edit: hasSummary
will return false if the model is saved and then loaded back since the summary is not saved with the model, so it doesn't always return true. We could test the hasSummary
method by calling save/load but that seems expensive just to test a simple function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, after loading a saved model, the summary
should be None
and hasSummary
return false. I think this is the correct test route, although with some extra cost. What about add hasSummary
test at save/load
doc test? Then it should not need extra cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do it, though typically the doc tests are for things that we want to test that also illustrate functionality to the users. And @jkbradley seemed against adding it as a doc test here. I'll add it for now and we can revert it if we decide that's best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it makes sense to add summary related doc tests for algorithms to illustrate the output of summary. So one more line to check hasSummary
does not seam to have much impact. @jkbradley What's your opinion? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanboliang I switched it up. Let me know what you think
Test build #68722 has finished for PR 15777 at commit
|
Test build #68789 has finished for PR 15777 at commit
|
b6062b9
to
428348d
Compare
Test build #68794 has finished for PR 15777 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sethah Only last comment, otherwise, LGTM. I'd like to get this in before 2.1. Thanks.
@@ -95,8 +95,7 @@ class BisectingKMeansModel private[ml] ( | |||
@Since("2.0.0") | |||
override def copy(extra: ParamMap): BisectingKMeansModel = { | |||
val copied = copyValues(new BisectingKMeansModel(uid, parentModel), extra) | |||
if (trainingSummary.isDefined) copied.setSummary(trainingSummary.get) | |||
copied.setParent(this.parent) | |||
copied.setSummary(trainingSummary).setParent(this.parent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks better. Could you make the change for Scala LiR, LoR, GLM and KMeans as well? I think they should be consistent. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. I also added tests. Thanks for reviewing!
Test build #68919 has finished for PR 15777 at commit
|
LGTM, merged into master and branch-2.1. Thanks! |
…d BKM ## What changes were proposed in this pull request? Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark. ## How was this patch tested? Unit tests. Author: sethah <[email protected]> Closes #15777 from sethah/pyspark_cluster_summaries. (cherry picked from commit e811fbf) Signed-off-by: Yanbo Liang <[email protected]>
@yanboliang I noticed the JIRA is still pending. Are there follow-up tasks? |
No follow-up, forgetting to close it. Thanks for reminding. |
…d BKM ## What changes were proposed in this pull request? Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark. ## How was this patch tested? Unit tests. Author: sethah <[email protected]> Closes apache#15777 from sethah/pyspark_cluster_summaries.
What changes were proposed in this pull request?
Add model summary APIs for
GaussianMixtureModel
andBisectingKMeansModel
in pyspark.How was this patch tested?
Unit tests.