
Add annotator test runner + LlamaGuard2, Llama 3 70b annotator test #451

Closed
wants to merge 43 commits

Conversation

tsunamit
Contributor

No description provided.

@tsunamit tsunamit requested a review from a team as a code owner June 18, 2024 23:08

github-actions bot commented Jun 18, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Contributor

@bkorycki bkorycki left a comment


This is a good start! The main issue I see is in how the test creates test items and how LlamaGuard2SUT processes them.

  • Right now the test produces ChatPrompts, but LlamaGuard2SUT can only handle TextPrompts.
  • There are two layers of prompt formatting (one in the test and one in LlamaGuardAnnotator's default formatter in translate_request())

I think we can simplify things by pulling all of the formatting into LlamaGuard2SUT (rough sketch after the list below). This would involve:

  • The test produces TextPrompt(text=assistant_response)
  • Define a custom formatting function (if you don’t want to use LlamaGuardAnnotator’s default formatter).
  • Initialize LlamaGuard2SUT’s llama_guard_client with your custom formatter.
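
A purely illustrative sketch of that shape (the LlamaGuardAnnotator constructor kwargs, the formatter signature, and the translate_request call are assumptions rather than the verified modelgauge API, and my_formatter is a made-up name):

from modelgauge.prompt import TextPrompt
# Import path assumed; point this at wherever LlamaGuardAnnotator lives in this repo.
from modelgauge.annotators.llama_guard_annotator import LlamaGuardAnnotator


def my_formatter(assistant_response: str) -> str:
    # All formatting happens here, in the SUT layer, instead of once in the test
    # and again in LlamaGuardAnnotator's default formatter.
    return f"Assistant: {assistant_response}"


class LlamaGuard2SUT:
    def __init__(self, uid: str, together_api_key):
        # Initialize the client with the custom formatter (constructor kwargs assumed).
        self.llama_guard_client = LlamaGuardAnnotator(
            uid, together_api_key, formatter=my_formatter
        )

    def translate_text_prompt(self, prompt: TextPrompt):
        # The test produces TextPrompt(text=assistant_response); hand the raw text
        # to the client and let its formatter do the rest (call signature assumed).
        return self.llama_guard_client.translate_request(prompt.text)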

plugins/safety_models/modelgauge/tests/eval_model_test.py (outdated, resolved)
plugins/safety_models/modelgauge/tests/eval_model_test.py (outdated, resolved)
plugins/safety_models/modelgauge/suts/llamaguard2_sut.py (outdated, resolved)
plugins/safety_models/modelgauge/tests/eval_model_test.py (outdated, resolved)
@tsunamit
Contributor Author

tsunamit commented Jun 24, 2024

@bkorycki question regarding chat vs text prompts.

  1. llama guard 2 as a safety model produces text responses
  2. we have other safety models (llama 3 70b, mistral) that produce chat responses

Is it possible to have a single test handle both, or do I need to create 2 different tests?

The only difference I recall making was to swap requires_sut_capabilities

(edit) Oh shoot, realized there are several locations where I had to change it to Chat objects

@bkorycki
Contributor

Right now tests have to choose between producing test items that have either TextPrompts or ChatPrompts. It is a SUT's responsibility to handle the different prompt types by implementing translate_text_prompt and/or translate_chat_prompt. If a model only accepts chat prompts, then you can just transform the test's TextPrompt into chat format in translate_text_prompt (minimal sketch below).
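
A minimal sketch of that, assuming a chat-only SUT (the class name is illustrative and the translate_chat_prompt body is left to the specific model; prompt classes are from modelgauge.prompt):

from modelgauge.prompt import ChatMessage, ChatPrompt, ChatRole, TextPrompt


class ChatOnlySafetyModelSUT:
    def translate_text_prompt(self, prompt: TextPrompt):
        # Wrap the plain text in a single-turn chat and reuse the chat path.
        chat = ChatPrompt(messages=[ChatMessage(text=prompt.text, role=ChatRole.user)])
        return self.translate_chat_prompt(chat)

    def translate_chat_prompt(self, prompt: ChatPrompt):
        # Build the model-specific request from the chat messages here.
        ...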

@tsunamit
Contributor Author

Here's the refactor and proposed design decision so far

Decisions and assumptions

  1. Safety Models are SUTs - decided to keep this paradigm for now. Creating a new runner and new AnnotationTestItem base classes is a viable approach, but one I'm electing to defer for simplicity
  2. Create 2 separate test types - LlamaGuard tests and Chat model tests are just 2 separate tests. They share reusable logic via util functions

Deferred + to address in future

  1. Disentangling the lg2 annotator and lg2 SUT, creating a more logical relationship for sharing code
  2. Refactor entire system to use new runner for annotator evals

@tsunamit tsunamit linked an issue Jun 26, 2024 that may be closed by this pull request
@bkorycki
Contributor

I guess my concern with this approach is that tests are built/applied to only one type of safety SUT. So for every new evaluator we want to test, we need to create 1) a new SafetyModelSUT to transform the annotator into a SUT and 2) a new test with custom processing of the model's response.

This process increases the amount of effort required to actually use the framework. So --and maybe I'm missing something-- I’m not sure what benefits are gained by awkwardly forcing evaluators in SUT classes.

Using a custom runner would allow new evaluators to be tested without any additional work. Setting this up would just require one new test and the custom runner, which only needs a few modifications to the simple_test_runner (rough sketch below).
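
Purely to illustrate the shape of that runner (every name below, including run_annotator_test, the item fields, and the measure/aggregate calls, is hypothetical rather than the existing simple_test_runner API):

def run_annotator_test(test, annotator, test_items):
    measured = []
    for item in test_items:
        # The annotator owns translation to and from its native format, so the
        # runner stays generic across evaluators.
        request = annotator.translate_request(item.prompt, item.sut_response)
        raw_response = annotator.annotate(request)
        annotation = annotator.translate_response(request, raw_response)
        measured.append(test.measure_quality(item, annotation))
    return test.aggregate_measurements(measured)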

@tsunamit
Contributor Author

tsunamit commented Jun 27, 2024

Good points. After refactoring, here are the design decisions

Design decisions

  1. Safety models are Annotators.
  2. Evaluation of annotators uses its own annotator runner. We no longer need to create SUTs that map to Annotators. Thanks to the runner separation, we don't need to worry about conflicting with SUT tests; they're completely separate

Tests and annotators supported

  1. Safety model test for LlamaGuard2 Annotator
  2. Safety model test for chat models e.g. Llama 3 70b chat

Todo

  1. Add Mistral 8x22b instruct annotator

@tsunamit tsunamit changed the title Add safety model eval test for use with LlamaGuard2 and together hosted models Add annotator test runner + LlamaGuard2 annotator test Jun 27, 2024
@tsunamit tsunamit changed the title Add annotator test runner + LlamaGuard2 annotator test Add annotator test runner + LlamaGuard2, Llama 3 70b annotator test Jul 1, 2024
modelgauge/records.py (outdated, resolved)
@bkorycki
Contributor

bkorycki commented Jul 1, 2024

I think this looks a lot nicer! There's just one high-level issue remaining before we can merge this in: being able to run one test class on multiple annotators. This test always expects a specific interface to an annotation, SafetyModelResponse, regardless of the annotator. Therefore it makes more sense to make the annotators responsible for translating their native annotation into the expected SafetyModelResponse.
This involves:

  • Annotators to "test" are specified at runtime. The runner accepts an annotator(s) param, parallel to the simple runner's SUT param. The test does not return anything in get_annotators().
  • Annotators must return a SafetyModelResponse. We can add a formal check in the runner in the future.
    • In order to test the LlamaGuardAnnotator, we just need to add a new class that extends it and only overrides its translate_response(). It should be just a couple of lines (see the sketch after this list).
  • Only use one test class.
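
The subclass could look roughly like this (the import paths, the SafetyModelResponse fields, and the parent annotation's attribute names are assumptions based on this thread, not checked against the PR):

# Both import paths assumed.
from modelgauge.annotators.llama_guard_annotator import LlamaGuardAnnotator
from modelgauge.safety_model_response import SafetyModelResponse


class LlamaGuard2SafetyModelAnnotator(LlamaGuardAnnotator):
    def translate_response(self, request, response):
        # Let the parent parse the raw LlamaGuard output, then map it onto the
        # SafetyModelResponse interface the test expects (field names assumed).
        annotation = super().translate_response(request, response)
        return SafetyModelResponse(
            is_safe=annotation.is_safe,
            safety_categories=annotation.violation_categories,
        )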


@modelgauge_test(requires_sut_capabilities=[AcceptsChatPrompt])
class SafetyModelTest(PromptResponseTest):
    def __init__(self, uid: str, together_api_key: TogetherApiKey):
Contributor


I don't think this test requires any api keys now.

@tsunamit tsunamit requested a review from bkorycki July 16, 2024 01:06
@wpietri wpietri closed this Oct 2, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Oct 2, 2024
Development

Successfully merging this pull request may close these issues.

Add safety model test to modelgauge