Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Minor][ML] Refactor clustering summary. #15555

Closed
wants to merge 3 commits into from

Conversation

yanboliang
Copy link
Contributor

What changes were proposed in this pull request?

Abstract ClusteringSummary from KMeansSummary, GaussianMixtureSummary and BisectingSummary, and eliminate duplicated pieces of code.

How was this patch tested?

Existing tests.

@SparkQA
Copy link

SparkQA commented Oct 19, 2016

Test build #67198 has finished for PR 15555 at commit 3169267.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

cc @zhengruifeng @srowen

* @param featuresCol Name for column of features in `predictions`
* @param k Number of clusters
*/
@Experimental
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about adding @Since("2.1.0") here?
Create a new scala file named Clustering.scala and move ClusteringSummary into it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClusteringSummary will be succeeded by summaries who were added in different version, so I think we should not add since version here. To the issue for a new file, I think ClusteringSummary is a small class, we can place it here temporarily.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely certain on the official policy for the @Since tags, but it seems better to me to put @Since("2.1.0") here for the class and the methods. It will be correct for some and will at least not be incorrect for others. I'm not positive though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also ambivalent about this, the reason behind my change is that some classes such as KMeansSummary and GaussianMixtureSummary were added at 2.0. If I put @Since("2.1.0") here, it looks not quite right, but I'm not sure whether it's OK. @jkbradley What's your opinion? Thanks.

@yanboliang
Copy link
Contributor Author

@srowen Would you mind to have a look when you available? Thanks.

Copy link
Contributor

@sethah sethah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left only minor comments.

predictionCol: String,
featuresCol: String,
k: Int)
extends ClusteringSummary (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: this can fit on one line like: k: Int) extends ClusteringSummary(..., ..., ...)

* :: Experimental ::
* Summary of clustering.
*
* @param predictions [[DataFrame]] produced by model.transform()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Add periods for each line

* @param featuresCol Name for column of features in `predictions`
* @param k Number of clusters
*/
@Experimental
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely certain on the official policy for the @Since tags, but it seems better to me to put @Since("2.1.0") here for the class and the methods. It will be correct for some and will at least not be incorrect for others. I'm not positive though.


/**
* :: Experimental ::
* Summary of clustering.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Summary of clustering algorithms." ?

@SparkQA
Copy link

SparkQA commented Oct 25, 2016

Test build #67490 has finished for PR 15555 at commit f13f240.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

jenkins test this please

@SparkQA
Copy link

SparkQA commented Oct 25, 2016

Test build #67494 has finished for PR 15555 at commit f13f240.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor

Looks good to me.

@sethah
Copy link
Contributor

sethah commented Oct 25, 2016

@yanboliang What do you think about adding the @Since tags?

@jkbradley
Copy link
Member

I'd say add Since tags wherever applicable in concrete classes. But if it's an abstract method, we probably should not since they will be incorrect for new child classes later on.

@transient lazy val cluster: DataFrame = predictions.select(predictionCol)
predictions: DataFrame,
predictionCol: String,
val probabilityCol: String,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could do a Since tag here

* @param k Number of clusters.
*/
@Experimental
class ClusteringSummary private[clustering] (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is generic to clustering, how about putting it in a new file?

@SparkQA
Copy link

SparkQA commented Oct 26, 2016

Test build #67575 has finished for PR 15555 at commit 946ee73.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM
I'll merge this with master
Thanks!

@asfgit asfgit closed this in ea3605e Oct 26, 2016
@yanboliang yanboliang deleted the clustering-summary branch October 26, 2016 23:06
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
## What changes were proposed in this pull request?
Abstract ```ClusteringSummary``` from ```KMeansSummary```, ```GaussianMixtureSummary``` and ```BisectingSummary```, and eliminate duplicated pieces of code.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <[email protected]>

Closes apache#15555 from yanboliang/clustering-summary.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?
Abstract ```ClusteringSummary``` from ```KMeansSummary```, ```GaussianMixtureSummary``` and ```BisectingSummary```, and eliminate duplicated pieces of code.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <[email protected]>

Closes apache#15555 from yanboliang/clustering-summary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants