-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark #13557
Conversation
Test build #60157 has finished for PR 13557 at commit
|
Test build #60160 has finished for PR 13557 at commit
|
Test build #60163 has finished for PR 13557 at commit
|
Test build #60167 has finished for PR 13557 at commit
|
Test build #60171 has finished for PR 13557 at commit
|
Test build #60175 has finished for PR 13557 at commit
|
Test build #60236 has finished for PR 13557 at commit
|
@jkbradley Could you help review it ? Thanks |
Sorry for the delay! I'll try to review it as soon as QA for 2.0 is done. |
PR rebased, ping @jkbradley Please help review when you have time. |
Test build #63600 has finished for PR 13557 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @zjffdu :)
The scala version appears to also have a clusters
value for the cluster centers and predictions
for the predictions it would probably be good to expose these too.
@@ -201,7 +203,74 @@ def computeCost(self, dataset): | |||
""" | |||
return self._call_java("computeCost", dataset) | |||
|
|||
@since("2.0.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are now post 2.0, so we will need to update the since annotations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
""" | ||
Summary of KMeans. | ||
|
||
.. versionadded:: 2.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for versionadded need to be updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@since("2.0.0") | ||
def summary(self): | ||
""" | ||
Gets summary of model on training set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe good to add a notice this causes an exception if there is no summary present (as done in the scala docs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Test build #67048 has finished for PR 13557 at commit
|
from pyspark.ml.param.shared import * | ||
from pyspark.ml.common import inherit_doc | ||
from pyspark.rdd import ignore_unicode_prefix | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in the other random files I sampled in this directory we only have one new line, was there a reason for this change?
@since("2.1.0") | ||
def summary(self): | ||
""" | ||
Gets summary of model on training set. Or rasei exception is no summary is present. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo (rasei)
Test build #67127 has finished for PR 13557 at commit
|
ping @jkbradley @holdenk |
jenkins test this please. |
Test build #67585 has finished for PR 13557 at commit
|
""" | ||
Return true if there exists summary of model. | ||
""" | ||
return self.summary is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a unit test (not a doc test) to make sure this works for both cases, with and without a summary? It looks like it will throw an error if there is no summary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Other summaries have implemented this method by calling:
return self._call_java("hasSummary")
About the unit test - how can you create a situation where the model does not have a summary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zjffdu Can you make this match the other summary implementations? i.e. self._call_java("hasSummary")
I created SPARK-18282 and the PR: #15777 to implement this interface for GMM and BisectingKMeans. These two PRs will affect one another, I'm not sure it matters too much which gets merged first, but whichever one does get merged first needs to implement a parent class |
ping @zjffdu Could you address @jkbradley 's comment, then I can help to get this in. It's better we can merge this into Spark 2.1. Thanks. |
Test build #69245 has finished for PR 13557 at commit
|
Test build #69246 has finished for PR 13557 at commit
|
""" | ||
Return true if there exists summary of model. | ||
""" | ||
return self.summary is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zjffdu Can you make this match the other summary implementations? i.e. self._call_java("hasSummary")
@@ -330,6 +357,20 @@ class KMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasTol | |||
>>> df = spark.createDataFrame(data, ["features"]) | |||
>>> kmeans = KMeans(k=2, seed=1) | |||
>>> model = kmeans.fit(df) | |||
>>> summary = model.summary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this doc test the same as the other clustering classes? We should test model.hasSummary
before and after save/load as was done in the other implementations. Also, we need to add a test to tests.py
for this summary (see that file and reference the other summary tests).
>>> summary = model.summary | ||
>>> summary.k | ||
2 | ||
>>> summary.predictionCol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
things like this are better left for the tests.py
file IMO, since this shows up in the doc and might not be useful to users
|
||
@property | ||
@since("2.1.0") | ||
def summary(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you match the other implementations here as well?
Test build #69291 has finished for PR 13557 at commit
|
Test build #69292 has finished for PR 13557 at commit
|
@sethah Thanks for the review, I have updated the PR. |
Test build #69302 has finished for PR 13557 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few small things. I think it would be nice to get this into 2.1 considering we have the other summaries implemented, but if it slips we have to update the version tags.
|
||
class KMeansSummary(ClusteringSummary): | ||
""" | ||
Summary of KMeans. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add .. note:: Experimental
self.__class__.__name__) | ||
|
||
|
||
class KMeansSummary(ClusteringSummary): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move it after the KMeans
class like the others.
@@ -349,6 +379,8 @@ class KMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasTol | |||
>>> model_path = temp_path + "/kmeans_model" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add
>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
2
>>> summary.clusterSizes
[2, 2]
to match other doctests
Test build #69368 has finished for PR 13557 at commit
|
LGTM, merged into master and branch-2.1. Thank you all. |
## What changes were proposed in this pull request? Add python api for KMeansSummary ## How was this patch tested? unit test added Author: Jeff Zhang <[email protected]> Closes #13557 from zjffdu/SPARK-15819. (cherry picked from commit 4c82ca8) Signed-off-by: Yanbo Liang <[email protected]>
## What changes were proposed in this pull request? Add python api for KMeansSummary ## How was this patch tested? unit test added Author: Jeff Zhang <[email protected]> Closes apache#13557 from zjffdu/SPARK-15819.
## What changes were proposed in this pull request? Add python api for KMeansSummary ## How was this patch tested? unit test added Author: Jeff Zhang <[email protected]> Closes apache#13557 from zjffdu/SPARK-15819.
What changes were proposed in this pull request?
Add python api for KMeansSummary
How was this patch tested?
unit test added