[SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark #13557

Closed
wants to merge 9 commits into from

Conversation

zjffdu
Contributor

@zjffdu zjffdu commented Jun 8, 2016

What changes were proposed in this pull request?

Add python api for KMeansSummary

How was this patch tested?

unit test added

@SparkQA

SparkQA commented Jun 8, 2016

Test build #60157 has finished for PR 13557 at commit 711c26e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KMeansSummary(JavaWrapper):

@SparkQA

SparkQA commented Jun 8, 2016

Test build #60160 has finished for PR 13557 at commit 72634c7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 8, 2016

Test build #60163 has finished for PR 13557 at commit aacc4db.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 8, 2016

Test build #60167 has finished for PR 13557 at commit d1d2222.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 8, 2016

Test build #60171 has finished for PR 13557 at commit 73bfb05.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 8, 2016

Test build #60175 has finished for PR 13557 at commit 003af9f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 9, 2016

Test build #60236 has finished for PR 13557 at commit d2fd75a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu zjffdu changed the title [SPARK-15819][PYSPARK] Add KMeanSummary in KMeans of PySpark [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark Jun 11, 2016
@zjffdu
Contributor Author

zjffdu commented Jun 11, 2016

@jkbradley Could you help review it? Thanks.

@jkbradley
Member

Sorry for the delay! I'll try to review it as soon as QA for 2.0 is done.

@zjffdu
Contributor Author

zjffdu commented Aug 11, 2016

PR rebased. Ping @jkbradley: please help review when you have time.

@SparkQA

SparkQA commented Aug 11, 2016

Test build #63600 has finished for PR 13557 at commit 099d42d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@holdenk holdenk left a comment

Thanks for working on this @zjffdu :)
The Scala version also appears to have a clusters value for the cluster centers and a predictions value for the predictions; it would probably be good to expose these too.

@@ -201,7 +203,74 @@ def computeCost(self, dataset):
        """
        return self._call_java("computeCost", dataset)

    @since("2.0.0")
Contributor

We are now post 2.0, so we will need to update the since annotations.

Contributor Author

Fixed

"""
Summary of KMeans.

.. versionadded:: 2.0.0
Contributor

Same for versionadded; it needs to be updated.

Contributor Author

Fixed

    @since("2.0.0")
    def summary(self):
        """
        Gets summary of model on training set.
Contributor

It may be good to add a note that this raises an exception if no summary is present (as done in the Scala docs).

Contributor Author

Fixed

@SparkQA

SparkQA commented Oct 17, 2016

Test build #67048 has finished for PR 13557 at commit 19f871f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

from pyspark.ml.param.shared import *
from pyspark.ml.common import inherit_doc
from pyspark.rdd import ignore_unicode_prefix

Contributor

In the other files I sampled at random in this directory we only have one blank line here; was there a reason for this change?

    @since("2.1.0")
    def summary(self):
        """
        Gets summary of model on training set. Or rasei exception is no summary is present.
Contributor

Typo (rasei)

@SparkQA

SparkQA commented Oct 18, 2016

Test build #67127 has finished for PR 13557 at commit c11b6a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu
Contributor Author

zjffdu commented Oct 21, 2016

ping @jkbradley @holdenk

@yanboliang
Contributor

jenkins test this please.

@SparkQA

SparkQA commented Oct 26, 2016

Test build #67585 has finished for PR 13557 at commit c11b6a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"""
Return true if there exists summary of model.
"""
return self.summary is not None
Member

Could you please add a unit test (not a doc test) to make sure this works for both cases, with and without a summary? It looks like it will throw an error if there is no summary.

Contributor

Good point. Other summaries have implemented this method by calling:

return self._call_java("hasSummary")

About the unit test - how can you create a situation where the model does not have a summary?

Contributor

@zjffdu Can you make this match the other summary implementations? i.e. self._call_java("hasSummary")
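
The concern raised in this thread can be reproduced without Spark. The following is a minimal sketch using a hypothetical stand-in class: `SketchModel` and `hasSummaryNaive` are illustrative names, not PySpark API. It assumes, as the review notes, that reading `summary` raises when no summary exists. In PySpark the suggested fix delegates to `self._call_java("hasSummary")`; the stand-in checks its own stored state instead.

```python
class SketchModel:
    """Hypothetical stand-in for a clustering model (not a PySpark class)."""

    def __init__(self, summary=None):
        # None models the "no training summary" case, e.g. a loaded model.
        self._summary = summary

    @property
    def summary(self):
        # Assumed behaviour: accessing a missing summary raises.
        if self._summary is None:
            raise RuntimeError("No training summary available for this model")
        return self._summary

    @property
    def hasSummaryNaive(self):
        # Pattern under review: evaluating self.summary raises before
        # the comparison with None is ever reached.
        return self.summary is not None

    @property
    def hasSummary(self):
        # Asks the backing state directly, so it returns False instead of raising.
        return self._summary is not None


fitted = SketchModel(summary={"k": 2})
loaded = SketchModel()

print(fitted.hasSummary)   # summary present
print(loaded.hasSummary)   # summary absent, no exception
try:
    loaded.hasSummaryNaive
except RuntimeError as err:
    print("naive check raised:", err)
```

A unit test for both cases would then assert `True` on a freshly fitted model and `False` on one without a summary, rather than expecting an exception.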

@sethah
Contributor

sethah commented Nov 5, 2016

I created SPARK-18282 and the PR: #15777 to implement this interface for GMM and BisectingKMeans. These two PRs will affect one another, I'm not sure it matters too much which gets merged first, but whichever one does get merged first needs to implement a parent class ClusteringSummary which will have (for now) three child classes for KMeans, BisectingKMeans, and GMM. This is a result of #15555.
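
The class layout described above can be sketched in plain Python. This is an illustration under stated assumptions, not the merged code: the `backend` dictionary is a hypothetical stand-in for the JVM summary object, and `probabilityCol` is shown only as an example of a field that would be specific to one child class.

```python
class ClusteringSummary:
    """Shared accessors for all clustering summaries (sketch only)."""

    def __init__(self, backend):
        # Hypothetical stand-in for the wrapped Java summary object.
        self._backend = backend

    @property
    def k(self):
        return self._backend["k"]

    @property
    def clusterSizes(self):
        return self._backend["clusterSizes"]

    @property
    def predictionCol(self):
        return self._backend["predictionCol"]


class KMeansSummary(ClusteringSummary):
    """Inherits every accessor; adds nothing of its own in this sketch."""


class BisectingKMeansSummary(ClusteringSummary):
    """Inherits every accessor; adds nothing of its own in this sketch."""


class GaussianMixtureSummary(ClusteringSummary):
    """Adds a model-specific accessor on top of the shared ones."""

    @property
    def probabilityCol(self):
        return self._backend["probabilityCol"]


shared = {"k": 2, "clusterSizes": [2, 2], "predictionCol": "prediction"}
kmeans_summary = KMeansSummary(shared)
gmm_summary = GaussianMixtureSummary(dict(shared, probabilityCol="probability"))
print(kmeans_summary.k, kmeans_summary.clusterSizes)
print(gmm_summary.probabilityCol)
```

Putting the common fields in the parent means whichever PR merges second only has to add its child class.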

@yanboliang
Contributor

ping @zjffdu Could you address @jkbradley's comment? Then I can help to get this in. It would be better if we could merge this into Spark 2.1. Thanks.

@holdenk
Contributor

holdenk commented Nov 24, 2016

ping @zjffdu - @sethah's PR for GMM and BKM has gone in, so it might be good to update this PR now.

@SparkQA

SparkQA commented Nov 28, 2016

Test build #69245 has finished for PR 13557 at commit f94564e.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KMeansSummary(ClusteringSummary):

@SparkQA

SparkQA commented Nov 28, 2016

Test build #69246 has finished for PR 13557 at commit b3a8068.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KMeansSummary(ClusteringSummary):

"""
Return true if there exists summary of model.
"""
return self.summary is not None
Contributor

@zjffdu Can you make this match the other summary implementations? i.e. self._call_java("hasSummary")

@@ -330,6 +357,20 @@ class KMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasTol
    >>> df = spark.createDataFrame(data, ["features"])
    >>> kmeans = KMeans(k=2, seed=1)
    >>> model = kmeans.fit(df)
    >>> summary = model.summary
Contributor

Can we make this doc test the same as the other clustering classes? We should test model.hasSummary before and after save/load as was done in the other implementations. Also, we need to add a test to tests.py for this summary (see that file and reference the other summary tests).
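
A check along the lines requested here might look like the sketch below. It assumes a local `SparkSession` is passed in as `spark`, that the model path is on the local filesystem, and that the four sample points are acceptable test data; the function name `check_summary_roundtrip` and the sample points are illustrative and do not come from this thread.

```python
import os
import tempfile


def check_summary_roundtrip(spark):
    """Fit KMeans, then check hasSummary before save and after load."""
    # Imported lazily so this module can be imported without Spark installed.
    from pyspark.ml.clustering import KMeans, KMeansModel
    from pyspark.ml.linalg import Vectors

    # Illustrative data: two well-separated pairs of points.
    data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
            (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
    df = spark.createDataFrame(data, ["features"])
    model = KMeans(k=2, seed=1).fit(df)

    # A freshly fitted model carries a training summary.
    assert model.hasSummary
    assert model.summary.k == 2
    assert len(model.summary.clusterSizes) == 2

    # The summary is not persisted, so a loaded model reports no summary.
    path = os.path.join(tempfile.mkdtemp(), "kmeans_model")
    model.save(path)
    loaded = KMeansModel.load(path)
    assert not loaded.hasSummary
    return loaded
```

In `tests.py` the same assertions would normally be written with `self.assertTrue` and `self.assertEqual` inside a test case that already provides `self.spark`.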

    >>> summary = model.summary
    >>> summary.k
    2
    >>> summary.predictionCol
Contributor

@sethah sethah Nov 28, 2016

Things like this are better left for the tests.py file IMO, since this shows up in the docs and might not be useful to users.


    @property
    @since("2.1.0")
    def summary(self):
Contributor

Can you match the other implementations here as well?

@SparkQA

SparkQA commented Nov 29, 2016

Test build #69291 has finished for PR 13557 at commit 99b03be.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2016

Test build #69292 has finished for PR 13557 at commit cfd2212.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu
Contributor Author

zjffdu commented Nov 29, 2016

@sethah Thanks for the review, I have updated the PR.

@SparkQA

SparkQA commented Nov 29, 2016

Test build #69302 has finished for PR 13557 at commit 01c6da9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@sethah sethah left a comment

Just a few small things. I think it would be nice to get this into 2.1 considering we have the other summaries implemented, but if it slips we have to update the version tags.


class KMeansSummary(ClusteringSummary):
    """
    Summary of KMeans.
Contributor

add .. note:: Experimental

self.__class__.__name__)


class KMeansSummary(ClusteringSummary):
Contributor

Let's move it after the KMeans class like the others.

@@ -349,6 +379,8 @@ class KMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasTol
    >>> model_path = temp_path + "/kmeans_model"
Contributor

Add

    >>> model.hasSummary
    True
    >>> summary = model.summary
    >>> summary.k
    2
    >>> summary.clusterSizes
    [2, 2]

to match other doctests

@holdenk
Contributor

holdenk commented Nov 29, 2016

Thanks for updating this @zjffdu, it looks good to me pending @sethah's comments - maybe we can get @davies or @MLnick to take a final pass?

@SparkQA

SparkQA commented Nov 30, 2016

Test build #69368 has finished for PR 13557 at commit 032bb9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KMeansSummary(ClusteringSummary):

@yanboliang
Contributor

LGTM, merged into master and branch-2.1. Thank you all.

asfgit pushed a commit that referenced this pull request Nov 30, 2016
## What changes were proposed in this pull request?

Add python api for KMeansSummary
## How was this patch tested?

unit test added

Author: Jeff Zhang <[email protected]>

Closes #13557 from zjffdu/SPARK-15819.

(cherry picked from commit 4c82ca8)
Signed-off-by: Yanbo Liang <[email protected]>
@asfgit asfgit closed this in 4c82ca8 Nov 30, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
6 participants