[SPARK-39446][MLLIB] Add relevance score for nDCG evaluation #36843
Conversation
Can one of the admins verify this patch?
Force-pushed from 3289d65 to 74ac0a5.
class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
  extends Logging with Serializable {
class RankingMetrics[T: ClassTag](
    predictionAndLabels: RDD[(Array[T], Array[T], Array[(T, Double)])])
Hm, why does the last element need to be (T, Double) pairs? Wouldn't this be an attribute of the ground truth? Looking this up via a Map seems clunky later. The problem is not changing the binary signature, but, at the least, how about a third array of Double only, parallel to the second array?
Thank you for your review.
You definitely have a point. I changed from (T, Double) to Double.
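For context, with a plain Double array parallel to the ground-truth array, the per-item lookup mentioned above is still easy to build where it is needed. A small sketch with assumed names follows; whether the binary case defaults every relevant item to a relevance of 1.0 is an assumption here, not a quote of the PR:

```scala
// lab: ground-truth items, rel: relevance scores parallel to lab (assumed names).
def buildRelevanceMap[T](lab: Array[T], rel: Array[Double]): Map[T, Double] =
  if (rel.isEmpty) lab.map(_ -> 1.0).toMap // binary ground truth: treat every relevant item equally
  else lab.zip(rel).toMap
```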
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
if (i < pred.length && labSet.contains(pred(i))) {
  dcg += gain
predictionAndLabels
  .map {
Can you revert these changes that turned one line into three? The original was better, IMHO.
    maxDcg += gain
  }
} else {
  if (i < pred.length) {
Does this not need labSet.contains(pred(i))? I would imagine the only difference is the computation of gain, between these two cases.
We manage this by relMap.getOrElse(pred(i), 0.0) below.

> I would imagine the only difference is the computation of gain, between these two cases

Basically yes. So we could write it in the way you suggest, without getOrElse.
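To make that concrete, here is a hedged sketch of the non-binary accumulation; relMap and pred are names taken from the discussion, and the linear gain formula shown is illustrative rather than a quote of the merged code:

```scala
// DCG@k with a relevance lookup: items outside the ground truth are absent
// from relMap and contribute zero gain, so no explicit
// labSet.contains(pred(i)) check is required.
def dcgAt[T](pred: Array[T], relMap: Map[T, Double], k: Int): Double = {
  var dcg = 0.0
  var i = 0
  while (i < math.min(k, pred.length)) {
    dcg += relMap.getOrElse(pred(i), 0.0) / math.log(i + 2) // log-discounted gain at rank i
    i += 1
  }
  dcg
}
```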
  countRelevantItemRatio(pred, lab, k, k)
}.mean()
predictionAndLabels.map { case (pred, lab, _) =>
  countRelevantItemRatio(pred, lab, k, k)
Unindent - just match how the code was
Fixed!
@@ -35,8 +35,15 @@ import org.apache.spark.rdd.RDD
 * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
 */
@Since("1.2.0")
Oh, we need to update the @Since tags. I think you can somehow add a @Since annotation to the default constructor args, as it's since 3.4.0? I'm not sure exactly where it goes. The old constructor can remain since 1.2.0. If that doesn't work, maybe we can leave this constructor and add the new one as a new def this(...) since 3.4.0?
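As a concrete illustration, here is one way the annotation could sit directly on the primary constructor while the class keeps its original tag. This is a hedged sketch of how it would appear inside the Spark source tree (where the Since annotation is visible), assuming the triple-based constructor is the one tagged 3.4.0; it is not necessarily the exact code that was merged, and the real class also mixes in Logging:

```scala
import scala.reflect.ClassTag

import org.apache.spark.annotation.Since
import org.apache.spark.rdd.RDD

// Sketch: the class-level tag keeps the original version, while the primary
// constructor carries the version in which its new signature appeared.
@Since("1.2.0")
class RankingMetrics[T: ClassTag] @Since("3.4.0") (
    predictionAndLabels: RDD[(Array[T], Array[T], Array[Double])])
  extends Serializable
```

One caveat with the alternative def this(...) route: a second constructor taking the old RDD[(Array[T], Array[T])] would erase to the same single-RDD parameter as the new one, so the single-declaration approach discussed further down sidesteps that clash.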
I made changes and updated the docs.
I think that we should refer to BinaryClassificationMetrics and MulticlassMetrics, in which RDD[_ <: Product] was used as the input.
Since end users now mainly use .ml, is there any plan to expose this function in .ml?
Sorry in advance for my poor understanding. I have some questions.
- What is .ml? Do you mean org.apache.spark.ml?
- Is RDD[_ <: Product] used to make the MulticlassMetrics class available to .ml?
For the first one - yeah, this change is in the 'older' .mllib package. I don't think there is an equivalent for it in the DataFrame-based .ml packages, so maybe we can ignore that here. But if nDCG is supported in the .ml package somewhere and I forgot it, it would be good to add it there too.
The declaration suggested here might actually work for both input types without a separate constructor. Try it, maybe? If it works, yes, that is simpler, and it lets this API support even more inputs.
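For illustration, a hedged sketch of what that single declaration could look like, mirroring how MulticlassMetrics accepts RDD[_ <: Product]; the class name, pattern match, and field names below are assumptions for the sketch, not the exact merged code:

```scala
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

// Sketch: one constructor accepts both the (pred, labels) and the
// (pred, labels, relevance) tuple shapes and normalizes them internally.
class RankingMetricsSketch[T: ClassTag](predictionAndLabels: RDD[_ <: Product])
  extends Serializable {

  // An empty relevance array marks the "binary" ground-truth case.
  private val rdd: RDD[(Array[T], Array[T], Array[Double])] =
    predictionAndLabels.map {
      case (pred: Array[T], lab: Array[T], rel: Array[Double]) => (pred, lab, rel)
      case (pred: Array[T], lab: Array[T]) => (pred, lab, Array.empty[Double])
      case other =>
        throw new IllegalArgumentException(s"Expected a tuple of 2 or 3 arrays, got: $other")
    }
}
```

This also keeps existing binary (two-array) call sites compiling unchanged, which addresses the earlier point about not breaking the binary signature.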
Thank you for your explanation!
Could you review this?
#36920
…thon

### What changes were proposed in this pull request?
- Updated `RankingMetrics` for Java and Python
- Modified the interface for Java and Python
- Added a test for Java

### Why are the changes needed?
- To expose the change in #36843 to Java and Python.
- To update the document for Java and Python.

### Does this PR introduce _any_ user-facing change?
- Java users can use a JavaRDD of (predicted ranking, ground truth set, relevance value of ground truth set) for `RankingMetrics`.

### How was this patch tested?
- Added a test for Java

Closes #37019 from uchiiii/modify_ranking_metrics_for_java_and_python.

Authored-by: uchiiii <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…cgAk

### What changes were proposed in this pull request?
This PR fixes the condition for raising the following warning in MLlib's RankingMetrics ndcgAt function: "# of ground truth set and # of relevance value set should be equal, check input data"

The logic for raising the warning is faulty at the moment: it raises a warning if the `rel` input is empty and `lab.size` and `rel.size` are not equal. The logic should be to raise a warning if the `rel` input is **not empty** and `lab.size` and `rel.size` are not equal. This warning was added in the following PR: #36843

### Why are the changes needed?
With the current logic, RankingMetrics will:
- raise an incorrect warning when a user is using it in the "binary" mode (i.e. no relevance values in the input)
- not raise a warning (that could be necessary) when the user is using it in the "non-binary" mode (i.e. with relevance values in the input)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No change made to the test suite for RankingMetrics: https://github.com/uchiiii/spark/blob/a172172329cc78b50f716924f2a344517deb71fc/mllib/src/test/scala/org/apache/spark/mllib/evaluation/RankingMetricsSuite.scala

Closes #42207 from guilhem-depop/patch-1.

Authored-by: Guilhem Vuillier <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
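For clarity, the corrected condition amounts to the following sketch; the helper and its warn parameter are illustrative, and only the boolean test itself is taken from the description above:

```scala
// Warn only in the "non-binary" mode: relevance values were supplied but
// their count does not match the ground-truth set.
def checkRelevanceSize[T](lab: Array[T], rel: Array[Double], warn: String => Unit): Unit =
  if (rel.nonEmpty && lab.length != rel.length) {
    warn("# of ground truth set and # of relevance value set should be equal, check input data")
  }
```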
What changes were proposed in this pull request?
- Added a relevance score for `ndcgAt` in the `RankingMetrics` class.

Why are the changes needed?
- To enable the `ndcgAt` function to take relevance scores into account.

Does this PR introduce any user-facing change?
- Changed the constructor of the `RankingMetrics` class for the `ndcgAt` function: it now accepts `RDD[(Array[T], Array[T], Array[Double])]` or `RDD[(Array[T], Array[T])]` as the constructor argument, whereas previously only `RDD[(Array[T], Array[T])]` was accepted.

How was this patch tested?
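Finally, a hypothetical end-to-end usage sketch of the new input shape; the item IDs and relevance values are made up, `sc` is an existing SparkContext, and a Spark version containing this change is assumed:

```scala
import org.apache.spark.mllib.evaluation.RankingMetrics

// Each record: (predicted ranking, ground-truth set, relevance of each ground-truth item).
val predictionLabelsAndRelevance = sc.parallelize(Seq(
  (Array(1, 6, 2, 7, 8), Array(1, 2, 3, 4, 5), Array(3.0, 2.0, 1.0, 1.0, 1.0)),
  (Array(4, 1, 5, 6, 2), Array(1, 2, 3), Array(2.0, 0.0, 1.0))
))

// The type parameter is given explicitly since it cannot be inferred from an
// RDD[_ <: Product]-style constructor argument.
val metrics = new RankingMetrics[Int](predictionLabelsAndRelevance)
println(metrics.ndcgAt(5))
```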