
Add annotator test runner + LlamaGuard2, Llama 3 70b annotator test #451

Closed
wants to merge 43 commits

Conversation

tsunamit
Contributor

No description provided.

@tsunamit tsunamit requested a review from a team as a code owner June 18, 2024 23:08

github-actions bot commented Jun 18, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Contributor

@bkorycki bkorycki left a comment


This is a good start! The main issue I see is in how the test creates test items and how LlamaGuard2SUT processes them.

  • Right now the test produces ChatPrompts, but LlamaGuard2SUT can only handle TextPrompts.
  • There are two layers of prompt formatting (one in the test and one in LlamaGuardAnnotator's default formatter in translate_request())

I think we can simplify things by pulling all of the formatting into LlamaGuard2SUT (rough sketch after the list below). This would involve:

  • The test produces TextPrompt(text=assistant_response)
  • Define a custom formatting function (if you don’t want to use LlamaGuardAnnotator’s default formatter).
  • Initialize LlamaGuard2SUT’s llama_guard_client with your custom formatter.
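
A purely illustrative sketch of that shape (the LlamaGuardAnnotator constructor kwargs, the formatter signature, and the translate_request call are assumptions rather than the verified modelgauge API, and my_formatter is a made-up name):

from modelgauge.prompt import TextPrompt
# Import path assumed; point this at wherever LlamaGuardAnnotator lives in this repo.
from modelgauge.annotators.llama_guard_annotator import LlamaGuardAnnotator


def my_formatter(assistant_response: str) -> str:
    # All formatting happens here, in the SUT layer, instead of once in the test
    # and again in LlamaGuardAnnotator's default formatter.
    return f"Assistant: {assistant_response}"


class LlamaGuard2SUT:
    def __init__(self, uid: str, together_api_key):
        # Initialize the client with the custom formatter (constructor kwargs assumed).
        self.llama_guard_client = LlamaGuardAnnotator(
            uid, together_api_key, formatter=my_formatter
        )

    def translate_text_prompt(self, prompt: TextPrompt):
        # The test produces TextPrompt(text=assistant_response); hand the raw text
        # to the client and let its formatter do the rest (call signature assumed).
        return self.llama_guard_client.translate_request(prompt.text)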

plugins/safety_models/modelgauge/tests/eval_model_test.py (outdated, resolved)
plugins/safety_models/modelgauge/tests/eval_model_test.py (outdated, resolved)
plugins/safety_models/modelgauge/suts/llamaguard2_sut.py (outdated, resolved)
plugins/safety_models/modelgauge/tests/eval_model_test.py (outdated, resolved)
@tsunamit
Contributor Author

tsunamit commented Jun 24, 2024

@bkorycki question regarding chat vs text prompts.

  1. llama guard 2 as a safety model produces text responses
  2. we have other safety models (llama 3 70b, mistral) that produce chat responses

Is it possible to have a single test handle both, or do I need to create 2 different tests?

The only difference I recall making was to swap requires_sut_capabilities

(edit) Oh shoot, realized there are several locations where I had to change it to Chat objects

@bkorycki
Contributor

Right now tests have to choose between producing test items that have either TextPrompts or ChatPrompts. It is a SUT's responsibility to handle the different prompt types by implementing translate_text_prompt and/or translate_chat_prompt. If a model only accepts chat prompts, then you can just transform the test's TextPrompt into chat format in translate_text_prompt (minimal sketch below).
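
A minimal sketch of that, assuming a chat-only SUT (the class name is illustrative and the translate_chat_prompt body is left to the specific model; prompt classes are from modelgauge.prompt):

from modelgauge.prompt import ChatMessage, ChatPrompt, ChatRole, TextPrompt


class ChatOnlySafetyModelSUT:
    def translate_text_prompt(self, prompt: TextPrompt):
        # Wrap the plain text in a single-turn chat and reuse the chat path.
        chat = ChatPrompt(messages=[ChatMessage(text=prompt.text, role=ChatRole.user)])
        return self.translate_chat_prompt(chat)

    def translate_chat_prompt(self, prompt: ChatPrompt):
        # Build the model-specific request from the chat messages here.
        ...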

@tsunamit
Contributor Author

Here's the refactor and proposed design decision so far

Decisions and assumptions

  1. Safety Models are SUTs - decided to keep this paradigm for now. Creating a new runner and new AnnotationTestItem base classes is a viable approach, but one I'm electing to defer for simplicity
  2. Create 2 separate test types - LlamaGuard tests and Chat model tests are just 2 separate tests. They share reusable logic via util functions

Deferred + to address in future

  1. Disentangling the lg2 annotator and lg2 SUT, creating a more logical relationship for sharing code
  2. Refactor entire system to use new runner for annotator evals

@tsunamit tsunamit linked an issue Jun 26, 2024 that may be closed by this pull request
@bkorycki
Contributor

I guess my concern with this approach is that tests are built/applied to only one type of safety SUT. So for every new evaluator we want to test, we need to create 1) a new SafetyModelSUT to transform the annotator into a SUT and 2) a new test with custom processing of the model's response.

This process increases the amount of effort required to actually use the framework. So --and maybe I'm missing something-- I’m not sure what benefits are gained by awkwardly forcing evaluators in SUT classes.

Using a custom runner would allow new evaluators to be tested without any additional work. Setting this up would just require one new test and the custom runner, which only needs a few modifications to the simple_test_runner (rough sketch below).
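
Purely to illustrate the shape of that runner (every name below, including run_annotator_test, the item fields, and the measure/aggregate calls, is hypothetical rather than the existing simple_test_runner API):

def run_annotator_test(test, annotator, test_items):
    measured = []
    for item in test_items:
        # The annotator owns translation to and from its native format, so the
        # runner stays generic across evaluators.
        request = annotator.translate_request(item.prompt, item.sut_response)
        raw_response = annotator.annotate(request)
        annotation = annotator.translate_response(request, raw_response)
        measured.append(test.measure_quality(item, annotation))
    return test.aggregate_measurements(measured)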

@tsunamit
Contributor Author

tsunamit commented Jun 27, 2024

Good points. After refactoring, here are the design decisions

Design decisions

  1. Safety models are Annotators.
  2. Evaluation of annotators uses its own annotator runner. We no longer need to create SUTs that map to Annotators. Thanks to the runner separation, we don't need to worry about conflicting with SUT tests; they're completely separate

Tests and annotators supported

  1. Safety model test for LlamaGuard2 Annotator
  2. Safety model test for chat models e.g. Llama 3 70b chat

Todo

  1. Add Mistral 8x22b instruct annotator

@tsunamit tsunamit changed the title Add safety model eval test for use with LlamaGuard2 and together hosted models Add annotator test runner + LlamaGuard2 annotator test Jun 27, 2024
@tsunamit tsunamit changed the title Add annotator test runner + LlamaGuard2 annotator test Add annotator test runner + LlamaGuard2, Llama 3 70b annotator test Jul 1, 2024
modelgauge/records.py (outdated, resolved)
@bkorycki
Contributor

bkorycki commented Jul 1, 2024

I think this looks a lot nicer! There's just one high-level issue remaining before we can merge this in: being able to run one test class on multiple annotators. This test always expects a specific interface to an annotation, SafetyModelResponse, regardless of the annotator. Therefore it makes more sense to make the annotators responsible for translating their native annotation into the expected SafetyModelResponse.
This involves:

  • Annotators to "test" are specified at runtime. The runner accepts an annotator(s) param, parallel to the simple runner's SUT param. The test does not return anything in get_annotators().
  • Annotators must return a SafetyModelResponse. We can add a formal check in the runner in the future.
    • In order to test the LlamaGuardAnnotator, we just need to add a new class that extends it and only overrides its translate_response(). It should be just a couple of lines (see the sketch after this list).
  • Only use one test class.
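
The subclass could look roughly like this (the import paths, the SafetyModelResponse fields, and the parent annotation's attribute names are assumptions based on this thread, not checked against the PR):

# Both import paths assumed.
from modelgauge.annotators.llama_guard_annotator import LlamaGuardAnnotator
from modelgauge.safety_model_response import SafetyModelResponse


class LlamaGuard2SafetyModelAnnotator(LlamaGuardAnnotator):
    def translate_response(self, request, response):
        # Let the parent parse the raw LlamaGuard output, then map it onto the
        # SafetyModelResponse interface the test expects (field names assumed).
        annotation = super().translate_response(request, response)
        return SafetyModelResponse(
            is_safe=annotation.is_safe,
            safety_categories=annotation.violation_categories,
        )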


@modelgauge_test(requires_sut_capabilities=[AcceptsChatPrompt])
class SafetyModelTest(PromptResponseTest):
    def __init__(self, uid: str, together_api_key: TogetherApiKey):
Contributor


I don't think this test requires any api keys now.

@tsunamit tsunamit requested a review from bkorycki July 16, 2024 01:06
@wpietri wpietri closed this Oct 2, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Oct 2, 2024
Development

Successfully merging this pull request may close these issues.

Add safety model test to modelgauge