[RFC]: Reward Modelling in OpenAI Compatible Server #8967

Closed
noamgat opened this issue Sep 30, 2024 · 14 comments · Fixed by #9759

noamgat (Contributor) commented Sep 30, 2024

Motivation.

Reward models are an important tool in NLP / AI workflows, especially in agentic flows, which use them to verify the quality of intermediate outputs or to rank several attempts at a single task.

vLLM just added support for a reward model in #8896 (comment).
Using it with the OpenAI Compatible Server requires a workaround: it piggybacks on the embedding endpoint.
The workaround requires the client to know which tokenizer the server is using, apply the chat template to the conversation itself, and send the resulting string to the embedding endpoint. This isn't ideal, and it breaks the decoupling between the client and the server.
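
For context, a rough sketch of the current workaround (model name, server URL, and messages are illustrative; it assumes the client has the matching Hugging Face tokenizer available locally):

```python
# Sketch of the current workaround: the client loads the *same* tokenizer the
# server uses, renders the chat template itself, and calls /v1/embeddings.
# Model name, URL, and messages are illustrative.
from openai import OpenAI
from transformers import AutoTokenizer

MODEL = "Qwen/Qwen2.5-Math-RM-72B"  # example reward model from #8896
tokenizer = AutoTokenizer.from_pretrained(MODEL)

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4"},
]
# The client applies the chat template itself; this is the coupling problem.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.embeddings.create(model=MODEL, input=prompt)
print(response.data[0].embedding)  # reward score(s) piggybacked on the embedding field
```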

A short discussion in that thread led to the creation of this RFC.

Proposed Change.

No existing endpoint currently matches the needs of reward models, for the following reasons:

  • The embedding endpoint receives a string as input, not a conversation.
  • The chat endpoint returns a string, not a series of numbers. Even if you ask for logprobs, they are computed after softmax has been applied, which is not a reversible process.

I see several ways to more elegantly support reward models in the OpenAI compatible server, and this RFC will hopefully be the discussion point for them.

Option 1:
Add a conversation object (List[Dict[str, str]]) as a potential input to the EmbeddingRequest class, whose input field already supports a variety of options:
input: Union[List[int], List[List[int]], str, List[str]]
Upon detecting that a conversation object was given, the OpenAI Compatible Server would apply the chat template using the tokenizer and proceed as if it had received a str input.
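
For illustration, a minimal sketch of what Option 1 could look like on the request-handling side (class and field names here are assumed for the example, not vLLM's actual definitions):

```python
# Hypothetical sketch of Option 1: accept a conversation in the embedding request.
# Class and field names are illustrative, not vLLM's actual protocol definitions.
from typing import Dict, List, Optional, Union

from pydantic import BaseModel


class EmbeddingRequest(BaseModel):
    model: str
    # Existing union of accepted inputs...
    input: Optional[Union[List[int], List[List[int]], str, List[str]]] = None
    # ...plus an optional conversation, mirroring the Chat Completions API.
    messages: Optional[List[Dict[str, str]]] = None


def resolve_input(request: EmbeddingRequest, tokenizer) -> str:
    """If a conversation was given, render the chat template server-side and
    fall through to the existing string-input path."""
    if request.messages is not None:
        return tokenizer.apply_chat_template(request.messages, tokenize=False)
    assert isinstance(request.input, str)  # other input forms omitted for brevity
    return request.input
```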

Option 2:
Add a way to get output logits instead of output logprobs from the chat endpoint. This could be either a new per-request parameter (similar to top_logprobs) or a server-side flag that overrides what data is returned in that field (for example, a --return_logits_instead_of_logprobs flag on the OpenAI-compatible server).

Option 3:
Add a dedicated endpoint to vLLM.

Option 4:
Do nothing. Since there is a /tokenize endpoint that also accepts a conversation, the sample code in #8896 (comment) could be changed to call the tokenize endpoint, receive the token list, and send that to the embeddings endpoint, which addresses the coupling problem.
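
For comparison, a sketch of the Option 4 flow, which leaves the server unchanged (the payloads below are based on the existing /tokenize and /v1/embeddings routes; exact field names may differ):

```python
# Sketch of Option 4: ask the server to apply the chat template via /tokenize,
# then pass the resulting token IDs to /v1/embeddings. No local tokenizer needed.
# URLs, model name, and response field names are illustrative.
import requests

BASE = "http://localhost:8000"
MODEL = "Qwen/Qwen2.5-Math-RM-72B"  # example reward model

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4"},
]

# 1) Server-side chat templating + tokenization.
tokens = requests.post(
    f"{BASE}/tokenize", json={"model": MODEL, "messages": messages}
).json()["tokens"]

# 2) Feed the token IDs straight into the embeddings endpoint.
reward = requests.post(
    f"{BASE}/v1/embeddings", json={"model": MODEL, "input": tokens}
).json()
print(reward["data"][0]["embedding"])
```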

I personally support Option 1, as it feels the least hacky of the bunch, and also does not require a whole lot of new code.

What do you think?

Feedback Period.

It's not my call to say when a conclusion has been reached here, but I don't think it should take more than a couple of weeks.

CC List.

@simon-mo
@DarkLight1337
@zhuzilin

Any Other Things.

vLLM is awesome!

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
noamgat added the RFC label on Sep 30, 2024

DarkLight1337 (Member) commented Sep 30, 2024

To keep the semantics consistent (Chat Completions is essentially the Completions API with a chat conversation), I prefer having a separate Chat Embeddings API (the Embeddings API with a chat conversation). So our endpoint map would look something like:

  • v1/completions (Completions API)
  • v1/chat/completions (Chat Completions API)
  • v1/embeddings (Embeddings API)
  • v1/chat/embeddings (Chat Embeddings API) [new]

Since most of the logic related to chat conversation parsing is already in chat_utils.py, it should not take that much effort to add this.
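
For illustration, a request to the proposed endpoint might look like this (the route and payload are hypothetical at this point):

```python
# Hypothetical request to the proposed v1/chat/embeddings endpoint.
# The payload mirrors Chat Completions; the route does not exist yet.
import requests

payload = {
    "model": "Qwen/Qwen2.5-Math-RM-72B",  # example reward model
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4"},
    ],
}
response = requests.post("http://localhost:8000/v1/chat/embeddings", json=payload)
print(response.json()["data"][0]["embedding"])
```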

simon-mo (Collaborator) commented:

Thank you for the RFC! Adding @youkaichao as a stakeholder; please feel free to add others.

natolambert commented:

I'm not a vLLM contributor (at least not a heavy one; I may have had a PR I don't remember), but I'm a heavy reward model user and a heavy infrastructure builder (you can see my basic pipelines for most public reward models on Hugging Face in RewardBench).

I do not think getting reward models right will be easy, but it is worthwhile and a sign of a maturing ecosystem. Here are some things to keep in mind; I think a dedicated architecture will eventually be worthwhile. These align with a few common use cases for RMs.

  1. LLM-as-a-judge / evals: This is likely the biggest use at first: rating responses for filtering. In this case, you normally will not be using the text at the token level; outputting a score is all you need. Hacky solutions are fine (iirc Option 1 above).
  2. RLHF training (synchronous, e.g. PPO): Here, the reward model scores candidate samples and the value is used to update the LM loss; the weights of the RM are held constant. This likely just works on tokens, and re-applying the chat template is slow. That said, being able to switch chat templates easily is very nice for the open community, who may be training Llama 3.1 but using the Qwen RM (different tokenizer).
  3. RLHF training (async, e.g. rejection sampling): Here, the RM can likely be used as an OpenAI-style server; we just need to pass a big list of texts through the reward model. This is very similar to the LLM-as-a-judge case, but the framing is different :)

Some comments:

  • I suspect passing a "messages" List[Dict[str, str]] would be used less than just List[str], as people have usually formatted their messages before passing them into the RM, but that may just be me.

Can someone provide more examples on how the embedding API and the existing Qwen model work? I didn't see that much in the PR.

THANKS!

youkaichao (Member) commented:

cc @zhuzilin

zhuzilin (Contributor) commented Oct 1, 2024

@natolambert please take a look at the PR description: #8896

noamgat (Contributor, Author) commented Oct 1, 2024

> I'm not a vLLM contributor (at least not a heavy one), but I'm a heavy reward model user and a heavy infrastructure builder […]
>
> * I suspect passing a "messages" `List[Dict[str, str]]` would be used less than just `List[str]`, as people have usually formatted their messages before passing them into the RM, but that may just be me.
>
> Can someone provide more examples on how the embedding API and the existing Qwen model work? I didn't see that much in the PR.

Thanks for chipping in! I really appreciate your work on RewardBench!

  • I agree with your division into the three main use cases. I think all three use cases can be covered by all three options I listed, so it's a matter of elegance, simplicity, and maintainability IMO.
  • I think List[dict] is better than List[str] for supporting system turns at the API level (is the first message a user or a system message?). It also follows the current API patterns better. An illustrative payload is shown below.
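
For example, a conversation input makes the role of each turn explicit, which a bare list of strings cannot (a purely illustrative payload):

```python
# Illustrative: List[Dict[str, str]] makes the system turn explicit, whereas with
# List[str] it is ambiguous whether the first string is a system or a user message.
conversation = [
    {"role": "system", "content": "You are a strict math grader."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4"},
]
```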

natolambert commented:

@zhuzilin I think the initial implementation looks good on a quick pass. It covers the biggest things.
(Mostly acknowledging that I did look at it; without actually using it I am unlikely to uncover weird corner cases.)

noamgat (Contributor, Author) commented Oct 2, 2024

@zhuzilin @youkaichao - which of the approaches sounds best to you?

Note that I also added a fourth option: do nothing, and guide clients to call the tokenize endpoint with the conversation and then the embeddings endpoint.

zankner commented Oct 13, 2024

Not sure how useful this is, but one thought is that reward models will eventually be generative. I did some work on this along with others (https://arxiv.org/abs/2408.11791, https://arxiv.org/abs/2408.15240). It might be worthwhile to scope out doing both generation and scoring from a single interface.

noamgai21 commented:

> Not sure how useful this is, but one thought is that reward models will eventually be generative. I did some work on this along with others (https://arxiv.org/abs/2408.11791, https://arxiv.org/abs/2408.15240). It might be worthwhile to scope out doing both generation and scoring from a single interface.

Thanks for pitching in! From looking at the paper, that kind of model can be served with today's chat interface, as text generation + logprobs is enough (from what I see) to use the trained model. Am I wrong?

zankner commented Oct 15, 2024

That's true for the second paper (https://arxiv.org/abs/2408.15240). For the first paper, it's actually a second linear head that is applied to the hidden state of the EOS token generated by the reward model, so logprobs can't be used, sadly.
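
To make the distinction concrete, here is a rough sketch of that kind of scalar head (not the paper's actual code; the model name and the untrained head are placeholders). The score comes from a hidden state and never passes through the LM softmax, so it cannot be recovered from logprobs:

```python
# Rough sketch: a scalar value head applied to the hidden state at the EOS position.
# Model name and the (untrained) head are placeholders, not the paper's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
value_head = torch.nn.Linear(model.config.hidden_size, 1)  # trained separately in practice

inputs = tokenizer("Some candidate response" + tokenizer.eos_token, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states[-1]  # (batch, seq_len, hidden_size)
    score = value_head(hidden[:, -1, :])        # hidden state of the final (EOS) token
print(score.item())
```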

arthrod commented Oct 21, 2024

> Not sure how useful this is, but one thought is that reward models will eventually be generative. I did some work on this along with others (https://arxiv.org/abs/2408.11791, https://arxiv.org/abs/2408.15240). It might be worthwhile to scope out doing both generation and scoring from a single interface.

nvidia/Llama-3.1-Nemotron-70B-Reward-HF's architecture is LlamaForCausalLM. I was able to deploy it with torch and inference is working.

DarkLight1337 (Member) commented:

> To keep the semantics consistent (Chat Completions is essentially the Completions API with a chat conversation), I prefer having a separate Chat Embeddings API (the Embeddings API with a chat conversation). […]

We will add a Chat Embeddings API soon in order to support multi-modal embeddings in online inference. This will also provide support for embeddings from text-only conversations.

Went-Liang (Contributor) commented:

> Not sure how useful this is, but one thought is that reward models will eventually be generative. […]
>
> nvidia/Llama-3.1-Nemotron-70B-Reward-HF's architecture is LlamaForCausalLM. I was able to deploy it with torch and inference is working.

@arthrod Excuse me, would it be convenient for you to share the script? I'm encountering an error when testing Llama-3.1-Nemotron-70B-Reward-HF with --task embedding.
