
feat: Add Semantic Answer Similarity metric #6877

Merged: 6 commits into deepset-ai:main on Feb 2, 2024

Conversation

@awinml (Contributor) commented on Jan 31, 2024

Related Issues

Fixes #6069

Proposed Changes:

Adds support for the Semantic Answer Similarity (SAS) metric to EvaluationResult.calculate_metrics(...)

The _calculate_sas method of EvaluationResult has been updated to compute the SAS metric:

def _calculate_sas(
    self,
    output_key: str,
    regexes_to_ignore: Optional[List[str]] = None,
    ignore_case: bool = False,
    ignore_punctuation: bool = False,
    ignore_numbers: bool = False,
    model: str = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    batch_size: int = 32,
    device: Optional[ComponentDevice] = None,
    token: Optional[Union[str, bool]] = None,
) -> MetricsResult:
    ...
    return MetricsResult({"sas": sas_score, "scores": similarity_scores})

Usage:

For evaluation of a pipeline:

from haystack import Pipeline
# Imports from the haystack.evaluation package this PR extends:
from haystack.evaluation.eval import eval
from haystack.evaluation.metrics import Metric

pipeline = Pipeline()
inputs = [...]
expected_outputs = [...]

eval_result = eval(pipeline, inputs=inputs, expected_outputs=expected_outputs)
sas_metric = eval_result.calculate_metrics(Metric.SAS, output_key="answers")
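
The keys of the returned object match the MetricsResult({"sas": ..., "scores": ...}) construction shown in the signature above; assuming MetricsResult supports dict-style access, the results can be read back like this:

print(sas_metric["sas"])     # aggregate SAS score (mean of the per-answer scores)
print(sas_metric["scores"])  # per-answer similarity scores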

How did you test it?

Unit tests have been added.
End-to-end tests with the following pipelines have been added:

  • Extractive QA Pipeline
  • RAG Pipeline with BM25 Retriever
  • RAG Pipeline with Embedding Retriever

Notes for the reviewer:

Certain cross encoders (like "ms-marco-MiniLM-L-6-v2") return un-normalized similarity scores because they output raw logits. Since the final SAS score is the mean of the normalized scores, the logits must be normalized by applying the sigmoid function. For more information, please have a look at this issue.

In this implementation, we apply the sigmoid to the logits returned by the cross encoder if they are greater than 1 (i.e., un-normalized).
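
A minimal sketch of that behavior, assuming the scores arrive as a NumPy array of raw cross-encoder outputs (function and variable names here are illustrative, not the PR's exact code):

import numpy as np

def normalize_scores(similarity_scores: np.ndarray) -> np.ndarray:
    # Cross encoders that emit raw logits produce values outside [0, 1].
    # If any score exceeds 1, treat the batch as un-normalized logits and
    # squash them into [0, 1] with the sigmoid before computing the mean.
    if (similarity_scores > 1).any():
        similarity_scores = 1 / (1 + np.exp(-similarity_scores))
    return similarity_scores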

Alternatively, we could have provided an optional normalize parameter. We decided against this approach because passing a cross-encoder model with normalize=False would produce SAS scores greater than 1, making the results hard to interpret and compare.

The tests in test_eval_sas.py have been marked as integration tests, since they need to send an API call to HuggingFace to fetch the SAS model config.
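
For reference, a sketch of how such a test might be gated (assuming pytest with an integration marker registered in the project configuration; the test name is illustrative):

import pytest

@pytest.mark.integration  # requires a network call to the HuggingFace Hub
def test_sas_metric():
    ...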


This code was written collaboratively with @vrunm.

@awinml requested review from a team as code owners on January 31, 2024
@awinml requested review from dfokina and silvanocerza and removed the request for a team on January 31, 2024
@github-actions bot added the labels topic:tests, 2.x (Related to Haystack v2.0), and type:documentation (Improvements on the docs) on Jan 31, 2024
@coveralls (Collaborator) commented on Jan 31, 2024

Pull Request Test Coverage Report for Build 7758146880

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 35 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.3%) to 88.647%

Files with Coverage Reduction   New Missed Lines   %
evaluation/eval.py              35                 66.98%

Totals (Coverage Status)
Change from base Build 7725584602: -0.3%
Covered Lines: 4638
Relevant Lines: 5232

💛 - Coveralls

@silvanocerza (Contributor) left a comment

Great job as always. Thank you both. 🙏

@silvanocerza merged commit 393a799 into deepset-ai:main on Feb 2, 2024
23 checks passed

Review comment on the diff:

# Not all cross encoders return normalized scores;
# we normalize the scores if any are larger than 1
if (similarity_scores > 1).any():

Reviewer suggestion: if (abs(similarity_scores) > 1).any(): ?
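
The abs() suggestion matters because raw logits can also be strongly negative, in which case the > 1 check never fires and the sigmoid is skipped. A quick illustration with invented values:

import numpy as np

# Hypothetical batch where every pair is dissimilar: all logits are negative.
scores = np.array([-4.2, -1.7, -0.3])

print((scores > 1).any())          # False: the sigmoid branch is skipped
print((np.abs(scores) > 1).any())  # True: abs() detects un-normalized logits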
