feat: Add Evaluation and metrics calculation for components and Pipelines #6464

awinml · 2023-11-30T21:07:33Z

Related Issues

fixes Implement statistical-based evaluation and metrics calculation #6061

Proposed Changes:

Based on the design in the Evaluation proposal (#5794), we have implemented the following classes for the evaluation of components and Pipelines:

Eval function:

def eval(
    runnable: Union[Pipeline, component], inputs: List[Dict[str, Any]], expected_outputs: List[Dict[str, Any]]
) -> EvaluationResult:
    ...
    return EvaluationResult(runnable, inputs, outputs, expected_outputs)

The eval function runs the given runnable (a Pipeline or a component) on the given inputs list. It also takes a list of expected_outputs.

The eval function returns an EvaluationResult object.
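As a rough illustration of the flow described above, the following sketch runs a toy component once per input and collects the outputs. It is not the PR's implementation: `eval_sketch`, `UppercaseComponent`, the simplified `EvaluationResult` dataclass, the keyword-argument call to `run()` and the length check are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class EvaluationResult:
    # Simplified stand-in for the EvaluationResult class described in this PR.
    runnable: Any
    inputs: List[Dict[str, Any]]
    outputs: List[Dict[str, Any]]
    expected_outputs: List[Dict[str, Any]]


class UppercaseComponent:
    # Hypothetical toy component; real runnables are Haystack Pipelines or components.
    def run(self, text: str) -> Dict[str, Any]:
        return {"text": text.upper()}


def eval_sketch(
    runnable: Any, inputs: List[Dict[str, Any]], expected_outputs: List[Dict[str, Any]]
) -> EvaluationResult:
    # Assumption: inputs and expected outputs are paired one-to-one.
    if len(inputs) != len(expected_outputs):
        raise ValueError("inputs and expected_outputs must have the same length")
    # Assumption: each input dict is passed to run() as keyword arguments.
    outputs = [runnable.run(**item) for item in inputs]
    return EvaluationResult(runnable, inputs, outputs, expected_outputs)


result = eval_sketch(UppercaseComponent(), [{"text": "hello"}], [{"text": "HELLO"}])
```

Keeping the outputs next to the expected outputs in one object is what lets metrics be computed later without re-running the runnable.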

EvaluationResult:

class EvaluationResult:
    def __init__(
        self,
        runnable: Union[Pipeline, component],
        inputs: List[Dict[str, Any]],
        outputs: List[Dict[str, Any]],
        expected_outputs: List[Dict[str, Any]],
    ) -> None:
        ...

EvaluationResult keeps track of all the information used by eval(), namely the runnable (Pipeline or component), inputs, outputs and expected_outputs.

EvaluationResult has the following methods:

serialize:

For serialization, this implementation proposes the use of orjson to store the components of EvaluationResult.

orjson is a fast Python library for JSON serialization; its published benchmarks report it as faster than the standard json module and other third-party JSON libraries.

orjson serializes Python objects to JSON, and its output is a bytes object containing UTF-8.

We serialize the following contents of EvaluationResult to JSON using orjson:

Serialized runnable (Pipeline or Component), Dict[str, Any]
Inputs, List[Dict[str, Any]]
Outputs, List[Dict[str, Any]]
Expected outputs, List[Dict[str, Any]]

The serialize method has the following signature:

def serialize(self) -> bytes:

To convert and store the Haystack objects to a format that orjson can serialize, we use the following approach:

The Pipeline/component is converted to its serialized form using its corresponding dumps() or to_dict() method.
The outputs and expected_outputs are unpacked using the helper functions convert_objects_to_dict. These helper functions iterate over the outputs of the Pipeline/component and call the respective to_dict() and from_dict() methods to convert the objects to a serializable format. The dictionary outputs are stored back in the same Pipeline/component output format as before.
A dictionary of these converted objects is passed to orjson for serialization.

In this approach, rather than implementing specific methods in the EvaluationResult class to handle and serialize each output type individually, we use the serialization methods of each component and dataclass to convert the objects to Python data types. These objects are then passed to orjson to store as JSON.
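The conversion step above can be sketched roughly as follows. The helper name `convert_objects_to_dict` comes from this PR, but its body here is an assumption, as are `ToyAnswer` and `serialize_sketch`. The standard-library json module stands in for orjson so the example has no third-party dependency; with orjson installed, `orjson.dumps(payload)` would return bytes directly.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToyAnswer:
    # Hypothetical stand-in for a dataclass such as Answer.
    data: str
    query: str

    def to_dict(self) -> Dict[str, Any]:
        return {"data": self.data, "query": self.query}


def convert_objects_to_dict(value: Any) -> Any:
    # Assumed behaviour: recursively replace objects that expose to_dict()
    # with plain dicts, keeping the surrounding dict/list structure intact.
    if hasattr(value, "to_dict"):
        return value.to_dict()
    if isinstance(value, dict):
        return {key: convert_objects_to_dict(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [convert_objects_to_dict(item) for item in value]
    return value


def serialize_sketch(
    runnable_dict: Dict[str, Any],
    inputs: List[Dict[str, Any]],
    outputs: List[Dict[str, Any]],
    expected_outputs: List[Dict[str, Any]],
) -> bytes:
    payload = {
        "runnable": runnable_dict,  # already serialized by the runnable itself
        "inputs": inputs,
        "outputs": convert_objects_to_dict(outputs),
        "expected_outputs": convert_objects_to_dict(expected_outputs),
    }
    # json is a stand-in here; encode to UTF-8 to mirror a bytes result.
    return json.dumps(payload).encode("utf-8")


answers = [{"answers": [ToyAnswer(data="Paris", query="Capital of France?")]}]
blob = serialize_sketch(
    {"type": "ToyComponent"}, [{"query": "Capital of France?"}], answers, answers
)
```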

deserialize:

The deserialize method returns an EvaluationResult instance with:

Deserialized runnable (Pipeline or Component)
Inputs
Outputs
Expected outputs

The deserialize method has the following signature:

@classmethod
def deserialize(cls, data: bytes):
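A minimal round-trip sketch of the serialize/deserialize pair is shown below. `ToyEvaluationResult` is a hypothetical, simplified class, the runnable is kept as a plain dict instead of being rebuilt into a Pipeline or component, and the standard-library json module is used as a stand-in for orjson.

```python
import json
from typing import Any, Dict, List


class ToyEvaluationResult:
    # Hypothetical simplified class; the runnable is assumed to be a plain dict.
    def __init__(
        self,
        runnable: Dict[str, Any],
        inputs: List[Dict[str, Any]],
        outputs: List[Dict[str, Any]],
        expected_outputs: List[Dict[str, Any]],
    ) -> None:
        self.runnable = runnable
        self.inputs = inputs
        self.outputs = outputs
        self.expected_outputs = expected_outputs

    def serialize(self) -> bytes:
        payload = {
            "runnable": self.runnable,
            "inputs": self.inputs,
            "outputs": self.outputs,
            "expected_outputs": self.expected_outputs,
        }
        return json.dumps(payload).encode("utf-8")

    @classmethod
    def deserialize(cls, data: bytes) -> "ToyEvaluationResult":
        # json.loads accepts UTF-8 encoded bytes as well as str.
        payload = json.loads(data)
        return cls(
            payload["runnable"],
            payload["inputs"],
            payload["outputs"],
            payload["expected_outputs"],
        )


original = ToyEvaluationResult(
    {"type": "ToyComponent"}, [{"query": "q"}], [{"answer": "a"}], [{"answer": "a"}]
)
restored = ToyEvaluationResult.deserialize(original.serialize())
```

Using a classmethod means callers can rebuild a result from stored bytes without first constructing an empty instance.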

calculate_metrics:

The calculate_metrics method is used to calculate predetermined metrics or custom ones.
The known metrics are defined in the Metric class as an enum.

def calculate_metrics(self, metric: Union[Metric, Callable[..., MetricsResult]], **kwargs) -> MetricsResult:

The method takes either a Metric or a custom callable that returns a MetricsResult, and returns a MetricsResult instance.

Based on the Metric specified, private methods such as _calculate_recall and _calculate_f1 are called to compute the metric values during the evaluation.

Initial implementations for Recall, Accuracy, F1, EM and SAS metrics have been added. The implementations for the metrics are still a work-in-progress.
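The dispatch described above could look roughly like the sketch below. It is a free function, not the PR's method, the enum is trimmed to two members, and the exact-match computation, the `"exact_match"` key and the argument order passed to custom callables are assumptions for illustration only.

```python
from enum import Enum
from typing import Any, Callable, List, Union


class Metric(Enum):
    # Trimmed copy of the enum for this sketch.
    EM = "Exact Match"
    RECALL = "Recall"


class MetricsResult(dict):
    pass


def _calculate_em(outputs: List[str], expected_outputs: List[str], **kwargs: Any) -> MetricsResult:
    # Assumed definition: fraction of predictions identical to the expected string.
    matches = sum(out == exp for out, exp in zip(outputs, expected_outputs))
    score = matches / len(outputs) if outputs else 0.0
    return MetricsResult({"exact_match": score})


def calculate_metrics(
    outputs: List[str],
    expected_outputs: List[str],
    metric: Union[Metric, Callable[..., MetricsResult]],
    **kwargs: Any,
) -> MetricsResult:
    if isinstance(metric, Metric):
        handlers = {Metric.EM: _calculate_em}
        if metric not in handlers:
            raise NotImplementedError(f"{metric.value} is not implemented in this sketch")
        return handlers[metric](outputs, expected_outputs, **kwargs)
    # Anything else is treated as a user-supplied callable.
    return metric(outputs, expected_outputs, **kwargs)


def count_outputs(outputs: List[str], expected_outputs: List[str], **kwargs: Any) -> MetricsResult:
    # Example custom metric.
    return MetricsResult({"count": len(outputs)})


em = calculate_metrics(["a", "b"], ["a", "c"], Metric.EM)
custom = calculate_metrics(["a", "b"], ["a", "c"], count_outputs)
```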

Metric:

The Metric class inherits from Enum to ease discoverability and documentation. It keeps a list of the standard supported metrics.

class Metric(Enum):
    ACCURACY = "Accuracy"
    RECALL = "Recall"
    MRR = "Mean Reciprocal Rank"
    MAP = "Mean Average Precision"
    EM = "Exact Match"
    F1 = "F1"
    SAS = "SemanticAnswerSimilarity"

MetricsResult:

The MetricsResult class inherits from dict and is used to store the computed metrics after calling calculate_metrics() on an EvaluationResult.

MetricsResult has a save() method to store the metrics in a JSON file.
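A dict subclass with a save() method can be sketched in a few lines. The parameter name `file`, the indentation setting and the use of the standard-library json module are assumptions; the PR does not show the method body here.

```python
import json
import os
import tempfile


class MetricsResult(dict):
    def save(self, file: str) -> None:
        # A dict subclass can be passed to json.dump like a plain dict.
        with open(file, "w", encoding="utf-8") as handle:
            json.dump(self, handle, indent=2)


metrics = MetricsResult({"f1": 0.5, "recall": 0.75})

# Write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "metrics.json")
    metrics.save(path)
    with open(path, encoding="utf-8") as handle:
        loaded = json.load(handle)
```

Inheriting from dict keeps the result usable anywhere a mapping is expected, for example `metrics["f1"]`.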

Added serialization for `Answer`, `ExtractedAnswer` and `GeneratedAnswer`

For serialization of Answer, ExtractedAnswer and GeneratedAnswer, to_dict() and from_dict() methods have been added to these classes for serializing them to dictionaries.
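The to_dict()/from_dict() pattern can be illustrated on a toy dataclass. `ToyGeneratedAnswer` and its field names are hypothetical and do not reproduce the actual Answer, ExtractedAnswer or GeneratedAnswer definitions.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class ToyGeneratedAnswer:
    # Hypothetical fields chosen for the example.
    data: str
    query: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # asdict() recursively converts the dataclass into plain Python types.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyGeneratedAnswer":
        return cls(**data)


answer = ToyGeneratedAnswer(data="Paris", query="Capital of France?", metadata={"score": 0.9})
as_dict = answer.to_dict()
rebuilt = ToyGeneratedAnswer.from_dict(as_dict)
```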

Corresponding Unit Tests have also been added.

How did you test it?

The following pipelines were used to test the evaluation:

RAG Pipeline with BM25 Retriever
RAG Pipeline with Embedding Retriever
Hybrid RAG Pipeline

Ongoing Work:

Add to_dict() and from_dict() methods to ChatMessage and ByteStream, so that components returning these types can be serialized.
Add Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) metrics.
Add unit tests for _calculate_accuracy(), _calculate_recall(), _calculate_f1(), _calculate_em() and _calculate_sas().
Add evaluation of isolated components, progress tracking and simulated outputs.
(The current implementation only performs evaluation on the final outputs of the Pipeline/component.)

This code was written collaboratively with @vrunm.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

silvanocerza · 2023-12-06T09:54:58Z

Hey @awinml! Love that you're working on this, really appreciated! 🙏

I would suggest to split this into multiple PRs though. It will make it much easier for us to review and will get in the project much faster.

awinml · 2023-12-07T10:26:58Z

Thanks! @silvanocerza

I have split the code into multiple PRs. The implementation for eval and EvaluationResult has been added to #6505. I will add the remaining implementation in subsequent PRs.

julian-risch · 2023-12-08T10:43:10Z

Thank you @awinml ! We need some time to review the three PRs and will get back to you next week!

awinml · 2023-12-08T11:33:56Z

Thanks @julian-risch. The code from this draft PR has been moved to:

Separate PRs for remaining functionality (_calculate_metrics(), MetricsResult and implementation of the metrics) will be opened after the implementation of eval and EvaluationResult has been finalized.

Closing this draft PR for now, in favour of #6505 and #6506.

awinml added 2 commits December 1, 2023 02:17

Add eval, EvaluationResult, Metric and MetricsResult

6f370c8

Add release notes

345a39c

github-actions bot added topic:tests type:documentation Improvements on the docs labels Nov 30, 2023

awinml mentioned this pull request Dec 1, 2023

Implement statistical-based evaluation and metrics calculation #6061

Closed

Add to_dict() and from_dict() to Answer, ExtractedAnswer and Generate…

50eea38

…dAnswer

awinml closed this Dec 8, 2023
