We currently leverage some LLM-based evaluation metrics from ragas (https://github.com/explodinggradients/ragas), namely `llm_context_precision`, `llm_context_recall`, and `llm_answer_relevance`, computed in the function `compute_llm_based_score`. Together these form the RAG triad of metrics.
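For reference, a minimal sketch of computing the same triad with ragas' public API directly (this is not how `compute_llm_based_score` is wired internally; dataset column names such as `ground_truth` have shifted between ragas versions, so treat the schema below as an assumption):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall

# Toy single-row evaluation dataset; column names follow recent ragas releases.
eval_dataset = Dataset.from_dict({
    "question": ["What is the capital of Japan?"],
    "answer": ["The capital of Japan is Tokyo."],
    "contexts": [["Tokyo is the capital and most populous city of Japan."]],
    "ground_truth": ["Tokyo"],
})

# Runs one LLM-as-a-judge call per metric per row.
result = evaluate(
    eval_dataset,
    metrics=[context_precision, context_recall, answer_relevancy],
)
print(result)
```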
For RAG use cases, however, there is an alternative LLM-as-a-judge framework, promptflow-evals (supported by Microsoft and part of promptflow): https://pypi.org/project/promptflow-evals/

This evaluation framework has quality metrics such as `relevance` that can be leveraged for answer relevance or context precision, and it has a targeted prompt for groundedness. promptflow-evals also offers other quality metrics such as coherence, style, fluency, and similarity. Moreover, the package can enable the inclusion of safety metrics such as hate/unfairness, violence, and sexual content, among others.
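A rough sketch of how the relevance and groundedness evaluators could be called, based on the promptflow-evals documentation at the time of writing; the endpoint, key, and deployment values are placeholders, and exact parameter names may differ between package versions:

```python
from promptflow.core import AzureOpenAIModelConfiguration
from promptflow.evals.evaluators import GroundednessEvaluator, RelevanceEvaluator

# Placeholder Azure OpenAI configuration for the judge model.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-endpoint>.openai.azure.com/",
    api_key="<your-api-key>",
    azure_deployment="gpt-4",
)

relevance_eval = RelevanceEvaluator(model_config)
groundedness_eval = GroundednessEvaluator(model_config)

# Relevance scores the answer against the question given the retrieved context.
relevance_score = relevance_eval(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.",
    context="Tokyo is the capital and most populous city of Japan.",
)

# Groundedness checks whether the answer is supported by the context alone.
groundedness_score = groundedness_eval(
    answer="The capital of Japan is Tokyo.",
    context="Tokyo is the capital and most populous city of Japan.",
)

print(relevance_score, groundedness_score)
```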
Ideally, promptflow-evals could serve as a full replacement for the ragas metrics, but we can integrate promptflow-evals first and decide on removing ragas in a subsequent issue, given that many users might still rely on the ragas metrics.