[SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark #13557

zjffdu · 2016-06-08T08:55:21Z

What changes were proposed in this pull request?

Add Python API for KMeansSummary.

How was this patch tested?

unit test added

SparkQA · 2016-06-08T09:16:08Z

Test build #60157 has finished for PR 13557 at commit 711c26e.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeansSummary(JavaWrapper):

SparkQA · 2016-06-08T10:18:51Z

Test build #60160 has finished for PR 13557 at commit 72634c7.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-08T10:44:01Z

Test build #60163 has finished for PR 13557 at commit aacc4db.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-08T11:43:28Z

Test build #60167 has finished for PR 13557 at commit d1d2222.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-08T12:33:26Z

Test build #60171 has finished for PR 13557 at commit 73bfb05.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-08T13:43:43Z

Test build #60175 has finished for PR 13557 at commit 003af9f.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-09T15:10:57Z

Test build #60236 has finished for PR 13557 at commit d2fd75a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zjffdu · 2016-06-11T03:00:19Z

@jkbradley Could you help review this? Thanks.

jkbradley · 2016-06-17T19:30:58Z

Sorry for the delay! I'll try to review it as soon as QA for 2.0 is done.

zjffdu · 2016-08-11T08:33:14Z

PR rebased. Ping @jkbradley, please help review when you have time.

SparkQA · 2016-08-11T09:01:39Z

Test build #63600 has finished for PR 13557 at commit 099d42d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

Thanks for working on this @zjffdu :)
The Scala version also appears to have a clusters value for the cluster centers and a predictions value for the predictions. It would probably be good to expose these too.

holdenk · 2016-10-07T20:00:06Z

python/pyspark/ml/clustering.py

@@ -201,7 +203,74 @@ def computeCost(self, dataset):
        """
        return self._call_java("computeCost", dataset)

+    @since("2.0.0")


We are now post 2.0, so we will need to update the since annotations.

holdenk · 2016-10-07T20:00:48Z

python/pyspark/ml/clustering.py

+    """
+    Summary of KMeans.
+
+    .. versionadded:: 2.0.0


Same for versionadded: it needs to be updated.

holdenk · 2016-10-07T20:02:04Z

python/pyspark/ml/clustering.py

+    @since("2.0.0")
+    def summary(self):
+        """
+        Gets summary of model on training set.


It may be good to add a note that this raises an exception if no summary is present (as done in the Scala docs).

SparkQA · 2016-10-17T04:20:28Z

Test build #67048 has finished for PR 13557 at commit 19f871f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-10-18T10:12:04Z

python/pyspark/ml/clustering.py

 from pyspark.ml.param.shared import *
 from pyspark.ml.common import inherit_doc
+from pyspark.rdd import ignore_unicode_prefix
+


In the other files I sampled in this directory we only have one new line here. Was there a reason for this change?

holdenk · 2016-10-18T10:12:18Z

python/pyspark/ml/clustering.py

+    @since("2.1.0")
+    def summary(self):
+        """
+        Gets summary of model on training set. Or rasei exception is no summary is present.


Typo (rasei)

SparkQA · 2016-10-18T13:52:49Z

Test build #67127 has finished for PR 13557 at commit c11b6a0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zjffdu · 2016-10-21T00:54:04Z

ping @jkbradley @holdenk

yanboliang · 2016-10-26T15:32:43Z

jenkins test this please.

SparkQA · 2016-10-26T16:03:22Z

Test build #67585 has finished for PR 13557 at commit c11b6a0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-10-26T18:44:21Z

python/pyspark/ml/clustering.py

+        """
+        Return true if there exists summary of model.
+        """
+        return self.summary is not None


Could you please add a unit test (not a doc test) to make sure this works for both cases, with and without a summary? It looks like it will throw an error if there is no summary.

Good point. Other summaries have implemented this method by calling:

return self._call_java("hasSummary")

About the unit test - how can you create a situation where the model does not have a summary?

@zjffdu Can you make this match the other summary implementations? i.e. self._call_java("hasSummary")

sethah · 2016-11-05T00:43:48Z

I created SPARK-18282 and PR #15777 to implement this interface for GMM and BisectingKMeans. These two PRs will affect one another. I'm not sure it matters much which gets merged first, but whichever one does needs to implement a parent class ClusteringSummary, which will have (for now) three child classes for KMeans, BisectingKMeans, and GMM. This is a result of #15555.
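A rough sketch of the parent/child layout described above, in plain Python. This is an assumption about the shape only: the real classes wrap a JVM object, whereas this stand-in stores values directly. `ClusteringSummary` and `KMeansSummary` are named in this thread. The other two child class names are guesses used for illustration, and the accessors shown (`k`, `clusterSizes`, `predictionCol`) are the ones that appear in the doctest snippets in this PR.

```python
class ClusteringSummary:
    """Shared accessors for clustering summaries (illustrative stand-in)."""

    def __init__(self, k, cluster_sizes, prediction_col="prediction"):
        self._k = k
        self._cluster_sizes = list(cluster_sizes)
        self._prediction_col = prediction_col

    @property
    def k(self):
        """Number of clusters the model was trained with."""
        return self._k

    @property
    def clusterSizes(self):
        """Number of data points assigned to each cluster."""
        return list(self._cluster_sizes)

    @property
    def predictionCol(self):
        """Name of the column holding the predicted cluster index."""
        return self._prediction_col


class KMeansSummary(ClusteringSummary):
    """Summary of KMeans; inherits every accessor from the parent."""


class BisectingKMeansSummary(ClusteringSummary):
    """Hypothetical name for the BisectingKMeans child class."""


class GaussianMixtureSummary(ClusteringSummary):
    """Hypothetical name for the GMM child class."""
```

Putting the common accessors in one parent means whichever PR lands second only needs to add a thin subclass.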

yanboliang · 2016-11-14T07:06:57Z

ping @zjffdu Could you address @jkbradley's comment? Then I can help to get this in. It would be better if we could merge this into Spark 2.1. Thanks.

holdenk · 2016-11-24T09:56:24Z

ping @zjffdu - @sethah's PR for GMM and BKM has gone in, so it might be good to update this PR now.

SparkQA · 2016-11-28T13:28:21Z

Test build #69245 has finished for PR 13557 at commit f94564e.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class KMeansSummary(ClusteringSummary):

SparkQA · 2016-11-28T14:02:03Z

Test build #69246 has finished for PR 13557 at commit b3a8068.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class KMeansSummary(ClusteringSummary):

sethah · 2016-11-28T16:19:18Z

python/pyspark/ml/clustering.py

+        """
+        Return true if there exists summary of model.
+        """
+        return self.summary is not None


@zjffdu Can you make this match the other summary implementations? i.e. self._call_java("hasSummary")

sethah · 2016-11-28T16:23:34Z

python/pyspark/ml/clustering.py

@@ -330,6 +357,20 @@ class KMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasTol
    >>> df = spark.createDataFrame(data, ["features"])
    >>> kmeans = KMeans(k=2, seed=1)
    >>> model = kmeans.fit(df)
+    >>> summary = model.summary


Can we make this doc test the same as the other clustering classes? We should test model.hasSummary before and after save/load as was done in the other implementations. Also, we need to add a test to tests.py for this summary (see that file and reference the other summary tests).
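For reference, the shape of such a test could look roughly like the sketch below. It is not the test from tests.py: `StubModel` is a hypothetical stand-in so the example runs without Spark, and it assumes (as the save/load suggestion above implies) that a reloaded model carries no training summary.

```python
import unittest


class StubModel:
    """Hypothetical stand-in for a fitted clustering model; not PySpark."""

    def __init__(self, centers, summary=None):
        self.centers = centers
        self._summary = summary

    @property
    def hasSummary(self):
        return self._summary is not None

    @property
    def summary(self):
        if not self.hasSummary:
            raise RuntimeError("No training summary available for this %s"
                               % self.__class__.__name__)
        return self._summary

    def save(self, store, path):
        # Assumption: only model parameters are persisted, not the summary.
        store[path] = list(self.centers)

    @classmethod
    def load(cls, store, path):
        return cls(store[path])


class SummaryTest(unittest.TestCase):
    def test_summary_before_and_after_reload(self):
        model = StubModel([[0.5, 0.5], [8.5, 8.5]],
                          summary={"k": 2, "clusterSizes": [2, 2]})
        # Freshly fitted model: summary is present.
        self.assertTrue(model.hasSummary)
        self.assertEqual(model.summary["k"], 2)
        self.assertEqual(model.summary["clusterSizes"], [2, 2])

        # Reloaded model: no summary, and accessing it raises.
        store = {}
        model.save(store, "/kmeans_model")
        reloaded = StubModel.load(store, "/kmeans_model")
        self.assertFalse(reloaded.hasSummary)
        with self.assertRaises(RuntimeError):
            reloaded.summary
```

This covers both cases asked for earlier in the review: a model with a summary and one without.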

sethah · 2016-11-28T16:25:30Z

python/pyspark/ml/clustering.py

+    >>> summary = model.summary
+    >>> summary.k
+    2
+    >>> summary.predictionCol


Things like this are better left for the tests.py file IMO, since this shows up in the docs and might not be useful to users.

sethah · 2016-11-28T16:25:53Z

python/pyspark/ml/clustering.py

+
+    @property
+    @since("2.1.0")
+    def summary(self):


Can you match the other implementations here as well?

SparkQA · 2016-11-29T03:34:05Z

Test build #69291 has finished for PR 13557 at commit 99b03be.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-29T04:03:40Z

Test build #69292 has finished for PR 13557 at commit cfd2212.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zjffdu · 2016-11-29T04:20:14Z

@sethah Thanks for the review, I have updated the PR.

SparkQA · 2016-11-29T07:11:34Z

Test build #69302 has finished for PR 13557 at commit 01c6da9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah

Just a few small things. I think it would be nice to get this into 2.1 considering we have the other summaries implemented, but if it slips we have to update the version tags.

sethah · 2016-11-29T22:44:58Z

python/pyspark/ml/clustering.py

+
+class KMeansSummary(ClusteringSummary):
+    """
+    Summary of KMeans.


add .. note:: Experimental

sethah · 2016-11-29T22:45:31Z

python/pyspark/ml/clustering.py

+                               self.__class__.__name__)
+
+
+class KMeansSummary(ClusteringSummary):


Let's move it after the KMeans class like the others.

sethah · 2016-11-29T22:47:40Z

python/pyspark/ml/clustering.py

@@ -349,6 +379,8 @@ class KMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasTol
    >>> model_path = temp_path + "/kmeans_model"


Add

>>> model.hasSummary
True
>>> summary = model.summary
>>> summary.k
2
>>> summary.clusterSizes
[2, 2]

to match other doctests

holdenk · 2016-11-29T23:24:20Z

Thanks for updating this @zjffdu, it looks good to me pending @sethah's comments - maybe we can get @davies or @MLnick to take a final pass?

SparkQA · 2016-11-30T01:59:07Z

Test build #69368 has finished for PR 13557 at commit 032bb9d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class KMeansSummary(ClusteringSummary):

yanboliang · 2016-11-30T04:48:33Z

LGTM, merged into master and branch-2.1. Thank you all.

## What changes were proposed in this pull request?
Add python api for KMeansSummary

## How was this patch tested?
unit test added

Author: Jeff Zhang <[email protected]>

Closes #13557 from zjffdu/SPARK-15819.

(cherry picked from commit 4c82ca8)
Signed-off-by: Yanbo Liang <[email protected]>

zjffdu force-pushed the SPARK-15819 branch from d1d2222 to 73bfb05 Compare June 8, 2016 12:12

zjffdu force-pushed the SPARK-15819 branch from 73bfb05 to d47e671 Compare June 8, 2016 13:23

zjffdu changed the title ~~[SPARK-15819][PYSPARK] Add KMeanSummary in KMeans of PySpark~~ [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark Jun 11, 2016

zjffdu force-pushed the SPARK-15819 branch from d2fd75a to 099d42d Compare August 11, 2016 08:32

holdenk reviewed Oct 7, 2016

zjffdu force-pushed the SPARK-15819 branch from 099d42d to 19f871f Compare October 17, 2016 03:55

holdenk reviewed Oct 18, 2016

zjffdu force-pushed the SPARK-15819 branch from 19f871f to c11b6a0 Compare October 18, 2016 13:27

jkbradley reviewed Oct 26, 2016

sethah mentioned this pull request Nov 17, 2016

[SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM #15777

Closed

[SPARK-15819][PYSPARK] Add KMeanSummary in KMeans of PySpark

59e67c8

zjffdu added 6 commits November 28, 2016 20:54

fix unit test failure

112316d

fix unit test failure

11e5d8e

fix code style

73b3763

trigger build

356885b

fix test failure

2c78c69

address comments

ef9436a

zjffdu force-pushed the SPARK-15819 branch from c11b6a0 to f94564e Compare November 28, 2016 13:12

extends ClusteringSummary

b3a8068

zjffdu force-pushed the SPARK-15819 branch from f94564e to b3a8068 Compare November 28, 2016 13:32

sethah reviewed Nov 28, 2016

zjffdu force-pushed the SPARK-15819 branch from 99b03be to cfd2212 Compare November 29, 2016 03:36

zjffdu force-pushed the SPARK-15819 branch from cfd2212 to 01c6da9 Compare November 29, 2016 06:41

sethah reviewed Nov 29, 2016

address the comments

032bb9d

zjffdu force-pushed the SPARK-15819 branch from 01c6da9 to 032bb9d Compare November 30, 2016 01:29

asfgit closed this in 4c82ca8 Nov 30, 2016
