[RFC]: Reward Modelling in OpenAI Compatible Server #8967
Comments
To keep the semantics consistent (where Chat Completions is like Completions API with chat conversation), I prefer having a separate Chat Embeddings API (Embeddings API with chat conversation). So our endpoint map would be something like …
Since most of the logic related to chat conversation parsing is already in …
Thank you for the RFC! Adding @youkaichao as a stakeholder; please feel free to add others.
I'm not a vLLM contributor (at least not heavily; I may have had a PR I don't remember), but I'm a heavy reward model user and a heavy infrastructure builder (you can see my basic pipelines for most public reward models on HuggingFace in RewardBench). I do not think getting reward models right will be easy, but it is worthwhile and a sign of a maturing ecosystem. Some things to keep in mind below; I think a dedicated architecture will eventually be worthwhile, in line with a few common use cases for RMs.
Some comments:
Can someone provide more examples of how the embedding API and the existing Qwen model work? I didn't see that much in the PR. Thanks!
cc @zhuzilin
@natolambert please take a look at the PR description: #8896
Thanks for chipping in! I really appreciate your work on RewardBench!
@zhuzilin I think the initial implementation looks good on a quick pass. It covers the biggest things.
@zhuzilin @youkaichao - which of the approaches sounds best to you? Note that I also added a fourth option: do nothing, and guide clients to use the tokenize(conversation) endpoint and then the embeddings endpoint.
Not sure how useful this is, but one thought is that reward models will eventually be generative. I did some work on this along with others (https://arxiv.org/abs/2408.11791, https://arxiv.org/abs/2408.15240). Might be worthwhile to scope out doing both generation and scoring from a single interface.
Thanks for pitching in! From looking at the paper, that kind of model can be served with today's chat interface, as text generation + logprobs is enough (from what I see) to use the trained model. Am I wrong?
That's true for the second paper (https://arxiv.org/abs/2408.15240). For the first paper, it's actually a second linear head that is applied to the hidden state of the EOS token generated by the reward model, so it can't use logprobs, sadly.
nvidia/Llama-3.1-Nemotron-70B-Reward-HF's architecture is LlamaForCausalLM. I was able to deploy it with torch and inference is working.
We will add a Chat Embeddings API soon in order to support multi-modal embeddings in online inference. This will also provide support for embeddings from text-only conversations. |
@arthrod Excuse me, would it be convenient for you to share the script? I encounter an error when testing Llama-3.1-Nemotron-70B-Reward-HF with …
Motivation.
Reward models are an important tool in NLP / AI workflows, especially in agentic flows, which use them to verify the quality of intermediate outputs or to rank several attempts at a single task.
vLLM just added support for a reward model in #8896 (comment).
This requires a workaround to work with the OpenAI Compatible Server: it piggybacks on the embeddings endpoint.
The workaround requires the client to know which tokenizer the server is using, apply the chat template to the conversation, and send the resulting string to the embeddings endpoint. This isn't ideal, and it breaks the decoupling between the client and server.
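For concreteness, here is a minimal sketch of the current workaround (the model name, server URL, and conversation are illustrative; it assumes a vLLM OpenAI-compatible server running locally with the reward model from #8896):

```python
# Sketch of the current workaround: the client must know the server's tokenizer,
# apply the chat template itself, and send the rendered string to /v1/embeddings.
import requests
from transformers import AutoTokenizer

MODEL = "Qwen/Qwen2.5-Math-RM-72B"  # illustrative; any served reward model
conversation = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

# Client-side chat templating -- this is the client/server coupling the RFC wants to remove.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
prompt = tokenizer.apply_chat_template(conversation, tokenize=False)

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": MODEL, "input": prompt},
)
# The reward score(s) come back piggybacked on the embedding field.
print(resp.json()["data"][0]["embedding"])
```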
A short discussion in the same issue led to the creation of the RFC.
Proposed Change.
No existing endpoint currently matches the needs of reward models: the Embeddings API returns the right kind of output but only accepts raw strings or token lists as input, while the Chat Completions API accepts conversations but only returns generated text and logprobs.
I see several ways to more elegantly support reward models in the OpenAI compatible server, and this RFC will hopefully be the discussion point for them.
Option 1:
Add a conversation object (`List[Dict[str, str]]`) as a potential input to the `EmbeddingRequest` class. It already supports a variety of input options: `input: Union[List[int], List[List[int]], str, List[str]]`. Upon detecting that a conversation object was given, the OpenAI Compatible Server will apply the chat template using the tokenizer and proceed as if it received `str` input.
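Under Option 1, a client request might look roughly like the following. This is a hypothetical sketch, not an implemented API; in particular, reusing the existing `input` field to carry the conversation is an assumption of this sketch.

```python
# Hypothetical Option 1 request: the embeddings endpoint accepts a conversation
# directly and applies the chat template server-side.
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "my-reward-model",  # placeholder model name
        "input": [  # conversation object instead of a plain string (hypothetical)
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "2 + 2 = 4."},
        ],
    },
)
print(resp.json()["data"][0]["embedding"])
```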
Option 2:
Add a way to get output logits instead of output logprobs from the chat endpoint. This can be either a new per-request parameter (similar to `top_logprobs`) or a server-side flag that overrides the behavior of the data returned in that field (for example, a `--return_logits_instead_of_logprobs` flag to the OpenAI server).
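As a rough illustration of the per-request variant, a request could look something like this; `return_logits` is an invented parameter name used purely for illustration.

```python
# Hypothetical Option 2 request: ask the chat endpoint to return raw logits
# instead of logprobs. "return_logits" is not a real vLLM parameter.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "my-reward-model",  # placeholder model name
        "messages": [
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "2 + 2 = 4."},
        ],
        "max_tokens": 1,
        "return_logits": True,  # hypothetical per-request flag
    },
)
print(resp.json())
```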
Option 3:
Add a dedicated endpoint to vLLM.
Option 4:
Do nothing. Since there is a `/tokenize` endpoint that also accepts a conversation, the sample code in #8896 (comment) could be changed to use the `tokenize` endpoint, receive the token list, and send that to the `embeddings` endpoint, which addresses the coupling problem.
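A sketch of what the Option 4 client flow could look like, assuming the `/tokenize` endpoint accepts a `messages` field and returns a `tokens` list (field names may differ in practice):

```python
# Option 4 sketch: let the server apply the chat template via /tokenize,
# then feed the resulting token IDs to /v1/embeddings.
import requests

BASE = "http://localhost:8000"
MODEL = "my-reward-model"  # placeholder model name
conversation = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

# Server-side chat templating + tokenization (no tokenizer needed on the client).
tok = requests.post(
    f"{BASE}/tokenize",
    json={"model": MODEL, "messages": conversation},
).json()

# Reuse the existing embeddings endpoint with the token IDs as input.
emb = requests.post(
    f"{BASE}/v1/embeddings",
    json={"model": MODEL, "input": tok["tokens"]},
).json()
print(emb["data"][0]["embedding"])
```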
I personally support Option 1, as it feels the least hacky of the bunch, and it also does not require a whole lot of new code.
What do you think?
Feedback Period.
It's not my place to say when a conclusion has been reached here, but I don't think it should take more than a couple of weeks.
CC List.
@simon-mo
@DarkLight1337
@zhuzilin
Any Other Things.
vLLM is awesome!