feat: Add Evaluation and metrics calculation for components and Pipelines #6464
Related Issues
Proposed Changes:
Based on the design in the Evaluation proposal (#5794), we have implemented the following classes for the evaluation of components and Pipelines:
Eval function:
The `eval` function evaluates the given runnable (a `Pipeline` or `component`) using the given `inputs` list. It also takes a list of `expected_outputs`. The `eval` function returns an `EvaluationResult` object.
EvaluationResult:
`EvaluationResult` keeps track of all the information used by `eval()`, namely the runnable (`Pipeline`/`component`), `inputs`, `outputs` and `expected_outputs`. `EvaluationResult` has the following methods:
serialize:
For serialization, this implementation proposes the use of orjson to store the components of `EvaluationResult`.
According to its published benchmarks, orjson is among the fastest Python libraries for JSON serialization.
orjson serializes Python objects to JSON and returns a `bytes` object containing UTF-8.
We serialize the following contents of `EvaluationResult` to JSON using orjson:
runnable: Dict[str, Any]
inputs: List[Dict[str, Any]]
outputs: List[Dict[str, Any]]
expected_outputs: List[Dict[str, Any]]
The serialize method has the following signature:
To convert and store the Haystack objects to a format that orjson can serialize, we use the following approach:
The Pipeline/component is converted to its serialized form using the corresponding `dumps()`/`to_dict()` methods.
The `outputs` and `expected_outputs` are unpacked using the `convert_objects_to_dict` helper functions. These helper functions iterate over the outputs of the Pipeline/component and call the respective `to_dict()` and `from_dict()` methods to convert the objects to a serializable format. The dictionary outputs are stored back in the same Pipeline/component output format as before.
A dictionary of these converted objects is passed to orjson for serialization.
In this approach, rather than implementing specific methods in the EvaluationResult class to handle each output type and serialize it individually, we use the serialization methods for each component and dataclass to convert the objects to python data types. These objects are then passed to orjson to store as JSON.
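The approach above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `FakeAnswer`, the recursive body of `convert_objects_to_dict`, the `serialize_result` function and the payload key names are all hypothetical. The sketch uses orjson when it is installed and falls back to the standard library `json` module otherwise, so it runs without third-party dependencies.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List

try:
    import orjson  # third-party and optional for this sketch
except ImportError:
    orjson = None


@dataclass
class FakeAnswer:
    """Hypothetical stand-in for a Haystack dataclass that exposes to_dict()."""

    data: str
    query: str

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)


def convert_objects_to_dict(value: Any) -> Any:
    """Recursively replace objects that expose to_dict() with plain dicts,
    keeping the surrounding list/dict structure unchanged (assumed behaviour)."""
    if hasattr(value, "to_dict"):
        return value.to_dict()
    if isinstance(value, dict):
        return {key: convert_objects_to_dict(item) for key, item in value.items()}
    if isinstance(value, list):
        return [convert_objects_to_dict(item) for item in value]
    return value


def serialize_result(
    runnable: Dict[str, Any],
    inputs: List[Dict[str, Any]],
    outputs: List[Dict[str, Any]],
    expected_outputs: List[Dict[str, Any]],
) -> bytes:
    """Build one dictionary of plain Python types and dump it to UTF-8 JSON bytes."""
    payload = {
        "runnable": runnable,  # already in its to_dict() form
        "inputs": inputs,
        "outputs": convert_objects_to_dict(outputs),
        "expected_outputs": convert_objects_to_dict(expected_outputs),
    }
    if orjson is not None:
        return orjson.dumps(payload)  # orjson.dumps returns bytes
    return json.dumps(payload).encode("utf-8")


outputs = [{"reader": {"answers": [FakeAnswer(data="Berlin", query="Capital?")]}}]
blob = serialize_result({"type": "Pipeline"}, [{"query": "Capital?"}], outputs, outputs)
restored = json.loads(blob)
```

Keeping the conversion in one generic helper means a new output type only needs its own `to_dict()` method; the result class itself does not have to change.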
deserialize:
The deserialize method returns an `EvaluationResult` instance with:
The deserialize method has the following signature:
calculate_metrics:
The `calculate_metrics` method is used to calculate predetermined metrics or custom ones. The known metrics are defined in the `Metric` class as an enum. The method takes a `Metric` and returns a `MetricsResult` instance. Based on the `Metric` specified, private methods such as `_calculate_recall`, `_calculate_f1`, etc. are called to compute the values of the metrics during the evaluation.
Initial implementations for the Recall, Accuracy, F1, Exact Match (EM) and Semantic Answer Similarity (SAS) metrics have been added. The implementations for the metrics are still a work-in-progress.
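The dispatch from a `Metric` member to a private `_calculate_*` function could look roughly like the sketch below. It is illustrative only: the enum values, the module-level function layout, the result key names and the token-overlap definitions of recall and F1 are assumptions, not the implementations in this PR. It also assumes every label is a non-empty string.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Dict, List


class Metric(Enum):
    """Hypothetical subset of the supported metrics."""

    RECALL = "Recall"
    F1 = "F1"
    EM = "Exact Match"


class MetricsResult(dict):
    """Stores computed metric values; a plain dict subclass."""


def _token_overlap(prediction: str, label: str) -> int:
    # Number of whitespace tokens shared by prediction and label (multiset overlap).
    common = Counter(prediction.split()) & Counter(label.split())
    return sum(common.values())


def _calculate_em(predictions: List[str], labels: List[str]) -> MetricsResult:
    matches = sum(p == l for p, l in zip(predictions, labels))
    return MetricsResult({"exact_match": matches / len(labels)})


def _calculate_recall(predictions: List[str], labels: List[str]) -> MetricsResult:
    scores = [_token_overlap(p, l) / len(l.split()) for p, l in zip(predictions, labels)]
    return MetricsResult({"recall": sum(scores) / len(scores)})


def _calculate_f1(predictions: List[str], labels: List[str]) -> MetricsResult:
    scores = []
    for p, l in zip(predictions, labels):
        overlap = _token_overlap(p, l)
        if overlap == 0:
            scores.append(0.0)
            continue
        precision = overlap / len(p.split())
        recall = overlap / len(l.split())
        scores.append(2 * precision * recall / (precision + recall))
    return MetricsResult({"f1": sum(scores) / len(scores)})


# Map each enum member to the function that computes it.
_DISPATCH: Dict[Metric, Callable[[List[str], List[str]], MetricsResult]] = {
    Metric.RECALL: _calculate_recall,
    Metric.F1: _calculate_f1,
    Metric.EM: _calculate_em,
}


def calculate_metrics(metric: Metric, predictions: List[str], labels: List[str]) -> MetricsResult:
    return _DISPATCH[metric](predictions, labels)


predictions = ["the cat sat", "blue"]
labels = ["the cat", "red"]
em = calculate_metrics(Metric.EM, predictions, labels)
recall = calculate_metrics(Metric.RECALL, predictions, labels)
f1 = calculate_metrics(Metric.F1, predictions, labels)
```

Using an enum as the dispatch key keeps the list of supported metrics discoverable in one place, and an unsupported metric fails early with a `KeyError` instead of silently returning nothing.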
Metric:
The `Metric` class inherits from `enum` to ease discoverability and documentation. It keeps a list of standard supported metrics.
MetricsResult:
The `MetricsResult` class inherits from `dict` and is used to store the computed metrics after calling `calculate_metrics()` on an `EvaluationResult`. `MetricsResult` has a `save()` method to store the metrics to a JSON file.
Added serialization for `Answer`, `ExtractedAnswer` and `GeneratedAnswer`
`to_dict()` and `from_dict()` methods have been added to `Answer`, `ExtractedAnswer` and `GeneratedAnswer` so that they can be serialized to and from dictionaries. Corresponding unit tests have also been added.
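A `to_dict()`/`from_dict()` pair of this kind might look like the following sketch. `SimpleGeneratedAnswer` and its fields are hypothetical and much simpler than the real `GeneratedAnswer`; the point is only to show the round trip that the unit tests would check.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class SimpleGeneratedAnswer:
    """Hypothetical, simplified answer class; the fields are illustrative only."""

    data: str
    query: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # Convert the dataclass (including nested containers) to plain Python types.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SimpleGeneratedAnswer":
        # Rebuild the object from the dictionary produced by to_dict().
        return cls(**data)


answer = SimpleGeneratedAnswer(
    data="Paris", query="Capital of France?", metadata={"model": "demo"}
)
as_dict = answer.to_dict()
restored = SimpleGeneratedAnswer.from_dict(as_dict)
```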
How did you test it?
The following pipelines were used to test the evaluation:
Ongoing Work:
Add `to_dict()` and `from_dict()` methods to `ChatMessage` and `ByteStream`, so that components returning these can be serialized.
Add Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) metrics.
Add unit tests for `_calculate_accuracy()`, `_calculate_recall()`, `_calculate_f1()`, `_calculate_em()` and `_calculate_sas()`.
Add evaluation of isolated components, progress tracking and simulated outputs. (The current implementation only performs evaluation on the final outputs of the Pipeline/component.)
This code was written collaboratively with @vrunm.
Checklist
`fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.