feat: Add Evaluation and metrics calculation for components and Pipelines #6464

Closed

Conversation

@awinml (Contributor) commented Nov 30, 2023

Related Issues

Proposed Changes:

Based on the design in the Evaluation proposal (#5794), we have implemented the following classes for the evaluation of components and Pipelines:

Eval function:

def eval(
    runnable: Union[Pipeline, component], inputs: List[Dict[str, Any]], expected_outputs: List[Dict[str, Any]]
) -> EvaluationResult:
    ...
    return EvaluationResult(runnable, inputs, outputs, expected_outputs)

The eval function evaluates the given runnable (a Pipeline or component) using the given list of inputs. It also takes a list of expected_outputs to compare against.

The eval function returns an EvaluationResult object.
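For example, evaluating a pipeline could look like the following minimal sketch. The pipeline object rag_pipeline, the component names ("retriever", "llm") and the input/output keys are illustrative assumptions, not part of this PR:

# Minimal usage sketch; `rag_pipeline`, the component names and the
# input/output keys are illustrative assumptions.
inputs = [
    {"retriever": {"query": "Who lives in Paris?"}},
    {"retriever": {"query": "Who lives in Berlin?"}},
]
expected_outputs = [
    {"llm": {"replies": ["Jean lives in Paris."]}},
    {"llm": {"replies": ["Mark lives in Berlin."]}},
]

result = eval(rag_pipeline, inputs=inputs, expected_outputs=expected_outputs)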

EvaluationResult:

class EvaluationResult:
    def __init__(
        self,
        runnable: Union[Pipeline, component],
        inputs: List[Dict[str, Any]],
        outputs: List[Dict[str, Any]],
        expected_outputs: List[Dict[str, Any]],
    ) -> None:

EvaluationResult keeps track of all the information used by eval(): the runnable (Pipeline/component), inputs, outputs and expected_outputs.

EvaluationResult has the following methods:

  1. serialize:

For serialization, this implementation proposes using orjson to store the contents of EvaluationResult.

orjson serializes Python objects to JSON and returns a bytes object containing UTF-8. According to its benchmarks, it is the fastest JSON serialization library for Python, and it is designed with correctness and safety in mind.

We serialize the following contents of EvaluationResult to JSON using orjson:

  • Serialized runnable (Pipeline or Component), Dict[str, Any]
  • Inputs, List[Dict[str, Any]]
  • Outputs, List[Dict[str, Any]]
  • Expected outputs, List[Dict[str, Any]]

The serialize method has the following signature:

def serialize(self) -> bytes:

To convert the Haystack objects into a format that orjson can serialize and store, we use the following approach:

  • The Pipeline/component is converted to its serialized form using its corresponding dumps()/to_dict() method.

  • The outputs and expected_outputs are unpacked using helper functions such as convert_objects_to_dict. These helpers iterate over the outputs of the Pipeline/component and call the objects' respective to_dict() and from_dict() methods to convert them to and from a serializable format. The converted dictionaries keep the same structure as the original Pipeline/component output.

  • A dictionary of these converted objects is passed to orjson for serialization.

In this approach, rather than implementing specific methods in the EvaluationResult class to handle and serialize each output type individually, we rely on the serialization methods of each component and dataclass to convert the objects to native Python data types. These are then passed to orjson and stored as JSON.
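A minimal sketch of this approach follows. It is not the exact implementation: the attribute names on self, the use of to_dict() rather than dumps(), and the exact signature of convert_objects_to_dict are assumptions.

import orjson

def serialize(self) -> bytes:
    # Sketch only: attribute names and helper usage are assumptions.
    # The PR mentions dumps()/to_dict(); to_dict() is used here for illustration.
    serialized_runnable = self.runnable.to_dict()
    payload = {
        "runnable": serialized_runnable,
        "inputs": self.inputs,
        # The helper converts dataclasses (Answer, Document, ...) inside the
        # outputs into plain dicts while keeping the output structure intact.
        "outputs": convert_objects_to_dict(self.outputs),
        "expected_outputs": convert_objects_to_dict(self.expected_outputs),
    }
    # orjson.dumps() returns UTF-8 encoded bytes.
    return orjson.dumps(payload)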

  2. deserialize:

The deserialize method returns an EvaluationResult instance with:

  • Deserialized runnable (Pipeline or Component)
  • Inputs
  • Outputs
  • Expected outputs

The deserialize method has the following signature:

@classmethod
def deserialize(cls, data: bytes):
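A minimal sketch of what deserialize could do, mirroring the serialize sketch above. Assumptions: the JSON payload layout from that sketch, a runnable that is a Pipeline restored via Pipeline.from_dict(), and the import path of current Haystack 2.x.

import orjson
from haystack import Pipeline

@classmethod
def deserialize(cls, data: bytes) -> "EvaluationResult":
    # Sketch only: the payload layout matches the serialize sketch above.
    payload = orjson.loads(data)
    runnable = Pipeline.from_dict(payload["runnable"])
    # Outputs are kept as plain dicts here; restoring dataclasses would use
    # the corresponding from_dict() methods, as described in the PR.
    return cls(
        runnable=runnable,
        inputs=payload["inputs"],
        outputs=payload["outputs"],
        expected_outputs=payload["expected_outputs"],
    )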

  3. calculate_metrics:

The calculate_metrics method is used to calculate predetermined metrics or custom ones.
The known metrics are defined in the Metric class as an enum.

def calculate_metrics(self, metric: Union[Metric, Callable[..., MetricsResult]], **kwargs) -> MetricsResult:

The method takes a Metric (or a custom callable) and returns a MetricsResult instance.

Based on the Metric specified, private methods such as _calculate_recall(), _calculate_f1(), etc. are called to compute the metric values during the evaluation.

Initial implementations for the Recall, Accuracy, F1, EM and SAS metrics have been added; they are still a work in progress.
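As a rough illustration (not the final implementation), the dispatch inside calculate_metrics could look like this; the private helper names follow the PR description, while the way custom callables receive the evaluation data is an assumption:

def calculate_metrics(self, metric: Union[Metric, Callable[..., MetricsResult]], **kwargs) -> MetricsResult:
    if isinstance(metric, Metric):
        # Known metrics are dispatched to the private helpers named in the PR.
        handlers = {
            Metric.ACCURACY: self._calculate_accuracy,
            Metric.RECALL: self._calculate_recall,
            Metric.F1: self._calculate_f1,
            Metric.EM: self._calculate_em,
            Metric.SAS: self._calculate_sas,
        }
        return handlers[metric](**kwargs)
    # Custom metrics: any callable that returns a MetricsResult.
    # Passing outputs/expected_outputs this way is an assumption of the sketch.
    return metric(self.outputs, self.expected_outputs, **kwargs)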

Metric:

The Metric class inherits from Enum to ease discoverability and documentation. It lists the standard supported metrics.

class Metric(Enum):
    ACCURACY = "Accuracy"
    RECALL = "Recall"
    MRR = "Mean Reciprocal Rank"
    MAP = "Mean Average Precision"
    EM = "Exact Match"
    F1 = "F1"
    SAS = "SemanticAnswerSimilarity"

MetricsResult:

The MetricsResult class inherits from dict and is used to store the computed metrics after calling calculate_metrics() on an EvaluationResult.

MetricsResult has a save() method to store the metrics to a JSON file.
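Because MetricsResult subclasses dict, a minimal sketch could look like this (the file-handling details are an assumption):

import json

class MetricsResult(dict):
    """Holds computed metric values, e.g. {"recall": 0.83}."""

    def save(self, file: str) -> None:
        # Write the stored metric values to a JSON file.
        with open(file, "w", encoding="utf-8") as f:
            json.dump(self, f, indent=2)

Typical usage would then be metrics = eval_result.calculate_metrics(Metric.RECALL) followed by metrics.save("recall.json"), where eval_result is the EvaluationResult returned by eval().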

Added serialization for Answer, ExtractedAnswer and GeneratedAnswer

to_dict() and from_dict() methods have been added to the Answer, ExtractedAnswer and GeneratedAnswer classes so that they can be converted to and from dictionaries.

Corresponding Unit Tests have also been added.
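For illustration, the intended round trip looks roughly like this; the import path and the GeneratedAnswer field names follow current Haystack 2.x and are assumptions here:

from haystack.dataclasses import GeneratedAnswer

# Hypothetical round trip; field names are assumptions of this sketch.
answer = GeneratedAnswer(data="Jean lives in Paris.", query="Who lives in Paris?", documents=[], meta={})
as_dict = answer.to_dict()                     # plain dict, safe to pass to orjson
restored = GeneratedAnswer.from_dict(as_dict)  # rebuilds the dataclass
assert restored == answer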

How did you test it?

The following pipelines were used to test the evaluation:

  • RAG Pipeline with BM25 Retriever
  • RAG Pipeline with Embedding Retriever
  • Hybrid RAG Pipeline

Ongoing Work:

  • Add to_dict() and from_dict() methods to ChatMessage and ByteStream, so that components returning these types can be serialized.

  • Add Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) metrics.

  • Add unit tests for _calculate_accuracy(), _calculate_recall(), _calculate_f1(), _calculate_em() and _calculate_sas().

  • Add evaluation of isolated components, progress tracking and simulated outputs.
    (The current implementation only performs evaluation on the final outputs of the Pipeline/component.)

This code was written collaboratively with @vrunm.


@silvanocerza (Contributor) commented:

Hey @awinml! Love that you're working on this, really appreciated! 🙏

I would suggest splitting this into multiple PRs though. It will make it much easier for us to review and it will get into the project much faster.

@awinml (Contributor, Author) commented Dec 7, 2023

Thanks! @silvanocerza

I have split the code into multiple PRs. The implementation for eval and EvaluationResult has been added to #6505. I will add the remaining implementation in subsequent PRs.

@julian-risch (Member) commented:

Thank you @awinml! We need some time to review the three PRs and will get back to you next week!

@awinml (Contributor, Author) commented Dec 8, 2023

Thanks @julian-risch. The code from this draft PR has been moved to #6505 and #6506.

Separate PRs for remaining functionality (_calculate_metrics(), MetricsResult and implementation of the metrics) will be opened after the implementation of eval and EvaluationResult has been finalized.

Closing this draft PR for now, in favour of #6505 and #6506.

@awinml closed this on Dec 8, 2023
Labels
topic:tests, type:documentation

Development
Successfully merging this pull request may close these issues: Implement statistical-based evaluation and metrics calculation

3 participants