-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Minor][ML] Refactor clustering summary. #15555
Conversation
Test build #67198 has finished for PR 15555 at commit
|
* @param featuresCol Name for column of features in `predictions` | ||
* @param k Number of clusters | ||
*/ | ||
@Experimental |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about adding @Since("2.1.0")
here?
Create a new scala file named Clustering.scala
and move ClusteringSummary
into it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClusteringSummary
will be succeeded by summaries who were added in different version, so I think we should not add since version here. To the issue for a new file, I think ClusteringSummary
is a small class, we can place it here temporarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely certain on the official policy for the @Since
tags, but it seems better to me to put @Since("2.1.0")
here for the class and the methods. It will be correct for some and will at least not be incorrect for others. I'm not positive though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also ambivalent about this, the reason behind my change is that some classes such as KMeansSummary
and GaussianMixtureSummary
were added at 2.0. If I put @Since("2.1.0")
here, it looks not quite right, but I'm not sure whether it's OK. @jkbradley What's your opinion? Thanks.
@srowen Would you mind to have a look when you available? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left only minor comments.
predictionCol: String, | ||
featuresCol: String, | ||
k: Int) | ||
extends ClusteringSummary ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: this can fit on one line like: k: Int) extends ClusteringSummary(..., ..., ...)
* :: Experimental :: | ||
* Summary of clustering. | ||
* | ||
* @param predictions [[DataFrame]] produced by model.transform() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Add periods for each line
* @param featuresCol Name for column of features in `predictions` | ||
* @param k Number of clusters | ||
*/ | ||
@Experimental |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely certain on the official policy for the @Since
tags, but it seems better to me to put @Since("2.1.0")
here for the class and the methods. It will be correct for some and will at least not be incorrect for others. I'm not positive though.
|
||
/** | ||
* :: Experimental :: | ||
* Summary of clustering. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Summary of clustering algorithms." ?
Test build #67490 has finished for PR 15555 at commit
|
jenkins test this please |
Test build #67494 has finished for PR 15555 at commit
|
Looks good to me. |
@yanboliang What do you think about adding the |
I'd say add Since tags wherever applicable in concrete classes. But if it's an abstract method, we probably should not since they will be incorrect for new child classes later on. |
@transient lazy val cluster: DataFrame = predictions.select(predictionCol) | ||
predictions: DataFrame, | ||
predictionCol: String, | ||
val probabilityCol: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could do a Since tag here
* @param k Number of clusters. | ||
*/ | ||
@Experimental | ||
class ClusteringSummary private[clustering] ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is generic to clustering, how about putting it in a new file?
Test build #67575 has finished for PR 15555 at commit
|
LGTM |
## What changes were proposed in this pull request? Abstract ```ClusteringSummary``` from ```KMeansSummary```, ```GaussianMixtureSummary``` and ```BisectingSummary```, and eliminate duplicated pieces of code. ## How was this patch tested? Existing tests. Author: Yanbo Liang <[email protected]> Closes apache#15555 from yanboliang/clustering-summary.
## What changes were proposed in this pull request? Abstract ```ClusteringSummary``` from ```KMeansSummary```, ```GaussianMixtureSummary``` and ```BisectingSummary```, and eliminate duplicated pieces of code. ## How was this patch tested? Existing tests. Author: Yanbo Liang <[email protected]> Closes apache#15555 from yanboliang/clustering-summary.
What changes were proposed in this pull request?
Abstract
ClusteringSummary
fromKMeansSummary
,GaussianMixtureSummary
andBisectingSummary
, and eliminate duplicated pieces of code.How was this patch tested?
Existing tests.