We currently leverage some LLM-based evaluation metrics from ragas (https://github.com/explodinggradients/ragas), namely `llm_context_precision`, `llm_context_recall`, and `llm_answer_relevance`, computed in the function `compute_llm_based_score`. Together these form the RAG triad of metrics.
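For reference, a minimal sketch of computing the same triad with ragas' public API directly (this is not how `compute_llm_based_score` is wired internally; dataset column names such as `ground_truth` have shifted between ragas versions, so treat the schema below as an assumption):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall

# Toy single-row evaluation dataset; column names follow recent ragas releases.
eval_dataset = Dataset.from_dict({
    "question": ["What is the capital of Japan?"],
    "answer": ["The capital of Japan is Tokyo."],
    "contexts": [["Tokyo is the capital and most populous city of Japan."]],
    "ground_truth": ["Tokyo"],
})

# Runs one LLM-as-a-judge call per metric per row.
result = evaluate(
    eval_dataset,
    metrics=[context_precision, context_recall, answer_relevancy],
)
print(result)
```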
For RAG use cases, however, there is an alternative LLM-as-a-judge framework, promptflow-evals (supported by Microsoft and part of promptflow): https://pypi.org/project/promptflow-evals/

This evaluation framework has quality metrics such as `relevance` that can be leveraged for answer relevance or context precision, and it has a targeted prompt for groundedness. promptflow-evals also offers other quality metrics such as coherence, style, fluency, and similarity. Moreover, the package can enable the inclusion of safety metrics such as hate/unfairness, violence, and sexual content, among others.
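A rough sketch of how the relevance and groundedness evaluators could be called, based on the promptflow-evals documentation at the time of writing; the endpoint, key, and deployment values are placeholders, and exact parameter names may differ between package versions:

```python
from promptflow.core import AzureOpenAIModelConfiguration
from promptflow.evals.evaluators import GroundednessEvaluator, RelevanceEvaluator

# Placeholder Azure OpenAI configuration for the judge model.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-endpoint>.openai.azure.com/",
    api_key="<your-api-key>",
    azure_deployment="gpt-4",
)

relevance_eval = RelevanceEvaluator(model_config)
groundedness_eval = GroundednessEvaluator(model_config)

# Relevance scores the answer against the question given the retrieved context.
relevance_score = relevance_eval(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.",
    context="Tokyo is the capital and most populous city of Japan.",
)

# Groundedness checks whether the answer is supported by the context alone.
groundedness_score = groundedness_eval(
    answer="The capital of Japan is Tokyo.",
    context="Tokyo is the capital and most populous city of Japan.",
)

print(relevance_score, groundedness_score)
```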
Ideally, promptflow-evals could serve as a full replacement for the ragas metrics, but we can integrate promptflow-evals first and decide on removing ragas in a subsequent issue, given that many users might still rely on the ragas metrics.