
feat: Add Semantic Answer Similarity metric #6877

Merged: 6 commits into deepset-ai:main on Feb 2, 2024

Conversation

@awinml (Contributor) commented on Jan 31, 2024

Related Issues

Fixes #6069

Proposed Changes:

Adds support for the Semantic Answer Similarity (SAS) metric to EvaluationResult.calculate_metrics(...)

The _calculate_sas method of EvaluationResult has been updated to compute the SAS metric:

def _calculate_sas(
    self,
    output_key: str,
    regexes_to_ignore: Optional[List[str]] = None,
    ignore_case: bool = False,
    ignore_punctuation: bool = False,
    ignore_numbers: bool = False,
    model: str = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    batch_size: int = 32,
    device: Optional[ComponentDevice] = None,
    token: Optional[Union[str, bool]] = None,
) -> MetricsResult:
    ...
    return MetricsResult({"sas": sas_score, "scores": similarity_scores})

Usage:

For evaluation of a pipeline:

from haystack import Pipeline
# Imports from the haystack.evaluation package this PR extends:
from haystack.evaluation.eval import eval
from haystack.evaluation.metrics import Metric

pipeline = Pipeline()
inputs = [...]
expected_outputs = [...]

eval_result = eval(pipeline, inputs=inputs, expected_outputs=expected_outputs)
sas_metric = eval_result.calculate_metrics(Metric.SAS, output_key="answers")
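
The keys of the returned object match the MetricsResult({"sas": ..., "scores": ...}) construction shown in the signature above; assuming MetricsResult supports dict-style access, the results can be read back like this:

print(sas_metric["sas"])     # aggregate SAS score (mean of the per-answer scores)
print(sas_metric["scores"])  # per-answer similarity scores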

How did you test it?

Unit tests have been added.
End-to-end tests with the following pipelines have been added:

  • Extractive QA Pipeline
  • RAG Pipeline with BM25 Retriever
  • RAG Pipeline with Embedding Retriever

Notes for the reviewer:

Certain cross encoders (like "ms-marco-MiniLM-L-6-v2") return un-normalized similarity scores because they output raw logits. Since the final SAS score is the mean of the normalized scores, the logits must be normalized by applying the sigmoid function. For more information, please have a look at this issue.

In this implementation, we apply the sigmoid to the logits returned by the cross encoder if they are greater than 1 (i.e., un-normalized).
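
A minimal sketch of that behavior, assuming the scores arrive as a NumPy array of raw cross-encoder outputs (function and variable names here are illustrative, not the PR's exact code):

import numpy as np

def normalize_scores(similarity_scores: np.ndarray) -> np.ndarray:
    # Cross encoders that emit raw logits produce values outside [0, 1].
    # If any score exceeds 1, treat the batch as un-normalized logits and
    # squash them into [0, 1] with the sigmoid before computing the mean.
    if (similarity_scores > 1).any():
        similarity_scores = 1 / (1 + np.exp(-similarity_scores))
    return similarity_scores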

Alternatively, we could have provided an optional normalize parameter. We decided against this approach because passing a cross-encoder model with normalize=False would produce SAS scores greater than 1, making the results hard to interpret and compare.

The tests in test_eval_sas.py have been marked as integration tests, since they need to send an API call to HuggingFace to fetch the SAS model config.
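
For reference, a sketch of how such a test might be gated (assuming pytest with an integration marker registered in the project configuration; the test name is illustrative):

import pytest

@pytest.mark.integration  # requires a network call to the HuggingFace Hub
def test_sas_metric():
    ...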


This code was written collaboratively with @vrunm.

@awinml requested review from a team as code owners on January 31, 2024
@awinml requested review from dfokina and silvanocerza and removed the request for a team on January 31, 2024
@github-actions bot added the labels topic:tests, 2.x (Related to Haystack v2.0), and type:documentation (Improvements on the docs) on Jan 31, 2024
@coveralls (Collaborator) commented on Jan 31, 2024

Pull Request Test Coverage Report for Build 7758146880

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 35 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.3%) to 88.647%

Files with Coverage Reduction   New Missed Lines   %
evaluation/eval.py              35                 66.98%

Totals (Coverage Status)
Change from base Build 7725584602: -0.3%
Covered Lines: 4638
Relevant Lines: 5232

💛 - Coveralls

@silvanocerza (Contributor) left a comment

Great job as always. Thank you both. 🙏

@silvanocerza merged commit 393a799 into deepset-ai:main on Feb 2, 2024
23 checks passed

Review comment on the diff:

# Not all cross encoders return normalized scores;
# we normalize the scores if any are larger than 1
if (similarity_scores > 1).any():

Reviewer suggestion: if (abs(similarity_scores) > 1).any(): ?
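
The abs() suggestion matters because raw logits can also be strongly negative, in which case the > 1 check never fires and the sigmoid is skipped. A quick illustration with invented values:

import numpy as np

# Hypothetical batch where every pair is dissimilar: all logits are negative.
scores = np.array([-4.2, -1.7, -0.3])

print((scores > 1).any())          # False: the sigmoid branch is skipped
print((np.abs(scores) > 1).any())  # True: abs() detects un-normalized logits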
