
Investigate other ranking evaluation metrics #29653

Closed
cbuescher opened this issue Apr 23, 2018 · 11 comments
Labels
>feature · Meta · :Search Relevance/Ranking (Scoring, rescoring, rank evaluation) · Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments

@cbuescher
Member

From an old discussion in our forums I just learned about another interesting-looking ranking evaluation metric called "bpref", used in some TREC competitions, which is advertised to work well with incomplete relevance data; a rough sketch of how it is computed is included below.

I'm opening this issue to do some more investigation into this and other evaluation metrics that we haven't considered yet.

Regarding bpref, it is currently unclear to me:

  • how widely used it is
  • in which use cases it might perform better than the metrics we currently offer
  • whether we can implement it with our current API, which is based on msearch, or whether we would need to change something to make it work
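
For context, here is a rough sketch of one common formulation of bpref (after Buckley & Voorhees, 2004), restricted to the documents in a single result list for simplicity. This is only an illustration of the idea, not a proposal for how it would look in the rank evaluation API:

```python
from typing import Optional, Sequence

def bpref(judgments: Sequence[Optional[bool]]) -> float:
    """Sketch of bpref over a single ranked result list.

    judgments[i] is True (judged relevant), False (judged non-relevant) or
    None (unjudged) for the hit at rank i + 1. Unjudged hits are simply
    ignored, which is what makes the metric attractive for incomplete data.
    """
    R = sum(1 for j in judgments if j is True)   # judged relevant
    N = sum(1 for j in judgments if j is False)  # judged non-relevant
    if R == 0:
        return 0.0
    bound = min(R, N)
    nonrel_above = 0  # judged non-relevant documents seen so far
    score = 0.0
    for judgment in judgments:
        if judgment is False:
            nonrel_above += 1
        elif judgment is True:
            # each relevant hit is penalized by the judged non-relevant hits ranked above it
            penalty = min(nonrel_above, bound) / bound if bound > 0 else 0.0
            score += 1.0 - penalty
    return score / R

# e.g. bpref([True, None, False, True]) -> 0.5
```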
@javanna added the :Search Relevance/Ranking label Apr 23, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@cbuescher changed the title from "Investigate more ranking evaluation metrics" to "Investigate other ranking evaluation metrics" Apr 24, 2018
@rpedela

rpedela commented Apr 24, 2018

I would love to see expected reciprocal rank (ERR) added.

@cbuescher
Member Author

@rpedela great suggestion, I will look into this as well and into how it fits the current design of the API. Do you already use ERR? If so, for which kind of use case, and how does it compare to other metrics (e.g. nDCG) in your experience?

@rpedela

rpedela commented Apr 24, 2018

Doug Turnbull from Open Source Connections does a great job answering your question in this talk starting at 21:18.

@rpedela

rpedela commented Apr 24, 2018

One more data point: RankLib is the de facto learning-to-rank library, and ERR is the default optimization metric used for training.

@cbuescher
Member Author

@rpedela I started looking into ERR and found it to be a great additional metric, so I've opened a PR at #31891. If you are familiar with how this metric is calculated, maybe you'd like to comment and check whether my understanding of the algorithm is correct.

In particular I was wondering about the handling of ungraded search results. The paper assumes complete relevance labels, but that is unrealistic in a real-world scenario. For now I opted for an optional, user-supplied "unknown_doc_rating" parameter that gets substituted for search results without a relevance judgment (in most cases it could simply be 0). If this parameter is not present, unrated documents are skipped in the metric calculation. I'm not sure whether that is common practice, so I would like to hear thoughts or get pointers on this.
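
To make the proposal a bit more concrete, here is a rough sketch of the calculation I have in mind; this is not the actual code in #31891, and "unknown_doc_rating" is the proposed optional parameter described above:

```python
from typing import Optional, Sequence

def expected_reciprocal_rank(ratings: Sequence[Optional[int]],
                             max_relevance: int,
                             unknown_doc_rating: Optional[int] = None) -> float:
    """Sketch of ERR (Chapelle et al., 2009) with optional handling of unrated hits.

    ratings holds the relevance grade of each search hit in rank order;
    None marks a hit without a relevance judgment. If unknown_doc_rating is
    given it is substituted for unrated hits, otherwise those hits are
    skipped (one possible interpretation of the behaviour described above).
    """
    score = 0.0
    p_reached = 1.0  # probability that the user gets down to the current rank
    rank = 0
    for grade in ratings:
        if grade is None:
            if unknown_doc_rating is None:
                continue  # no substitute rating: leave the hit out of the calculation
            grade = unknown_doc_rating
        rank += 1
        # probability that a document with this grade satisfies the user (cascade model)
        p_satisfied = (2 ** grade - 1) / 2 ** max_relevance
        score += p_reached * p_satisfied / rank
        p_reached *= 1.0 - p_satisfied
    return score

# e.g. expected_reciprocal_rank([3, None, 1], max_relevance=3, unknown_doc_rating=0)
```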

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Jul 10, 2018
This change adds Expected Reciprocal Rank (ERR) as a ranking evaluation metric
as described in:

Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009).
Expected reciprocal rank for graded relevance.
Proceedings of the 18th ACM Conference on Information and Knowledge Management.
https://doi.org/10.1145/1645953.1646033

ERR is an extension of the classical reciprocal rank to the graded relevance
case and assumes a cascade browsing model. It quantifies the usefulness of a
document at rank `i` conditioned on the degree of relevance of the items at ranks
less than `i`. ERR seems to be gaining traction as an alternative to (n)DCG, so it
seems like a good metric to support. ERR also seems to be the default optimization
metric used for training in RankLib, a widely used learning-to-rank library.

Relates to elastic#29653
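
For reference, the definition from the cited paper, with g_i the relevance grade of the document at rank i and g_max the maximum possible grade, can be written as:

```latex
\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r}\, R_r \prod_{i=1}^{r-1} \bigl(1 - R_i\bigr),
\qquad
R_i = \frac{2^{g_i} - 1}{2^{g_{\max}}}
```

The product term is the probability that none of the documents ranked above r satisfied the user, which is the cascade browsing model the commit message refers to.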
cbuescher pushed a commit that referenced this issue Jul 12, 2018
cbuescher pushed a commit that referenced this issue Jul 12, 2018
@cbuescher added the Meta label Oct 2, 2018
@cbuescher
Member Author

Another possible metric that I recently encountered in a presentation is Average Precision (or, when averaged across multiple information needs, Mean Average Precision).
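
For illustration, a minimal sketch of Average Precision for a single query with binary relevance labels (not tied to any particular API):

```python
from typing import Sequence

def average_precision(relevant_at_rank: Sequence[bool], total_relevant: int) -> float:
    """Average Precision for one ranked result list with binary relevance.

    relevant_at_rank[i] says whether the hit at rank i + 1 is relevant;
    total_relevant is the number of relevant documents for the query,
    including any that were not retrieved at all.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant_at_rank, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision@rank at each relevant hit
    return precision_sum / total_relevant if total_relevant > 0 else 0.0

# Mean Average Precision would then be the mean of this value over all
# queries (or "user needs") in the evaluation set.
```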

@cbuescher
Member Author

Moving some thoughts from #20441 here since this issue seems a better fit for tracking them:

In case users are able to label entire datasets (more likely in academic / ML use cases), they might be interested in metrics that include some notion of recall, like F-score or the AUC of a ROC curve. However, we are doubtful that this is likely in any practical setting.

@joshdevins
Member

In case users are able to label entire datasets (more likely in academic / ML use cases), they might be interested in metrics that include some notion of recall, like F-score or the AUC of a ROC curve. However, we are doubtful that this is likely in any practical setting.

I think the "entire datasets of labels" is covered by what we offer today in Machine Learning.

@rjernst added the Team:Search label May 4, 2020
@cbuescher removed their assignment Sep 23, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@javanna
Member

javanna commented Jun 25, 2024

There are no concrete plans to work on this issue. Closing.

@javanna closed this as not planned Jun 25, 2024
@javanna added the Team:Search Relevance label and removed the Team:Search label Jul 12, 2024