Maybe Wrong implementation of AttentionWithRoPE for GPTJ and GPT-NeoX? #747
Comments
The implementation looks correct to me, in GPTNeoXAttention @ vllm/model_executor/models/gpt_neox.py. Strange inference results have especially been reported for GPT-J though: #590
I think one of the main problems is how the tensor is rotated. Referencing HF transformers' implementation, GPT-J rotates every pair of adjacent elements (interleaved), while GPT-NeoX rotates the two halves of the rotary dimensions.
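For reference, a paraphrase of the two rotation helpers as they appear in HF transformers (the names rotate_every_two and rotate_half come from the transformers source; this is an illustrative sketch, not the snippet originally posted in this comment):

```python
import torch

def rotate_every_two(x: torch.Tensor) -> torch.Tensor:
    # GPT-J style: rotate adjacent (even, odd) element pairs, i.e. an interleaved layout.
    x1 = x[..., ::2]
    x2 = x[..., 1::2]
    x = torch.stack((-x2, x1), dim=-1)
    return x.flatten(-2)

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # GPT-NeoX style: rotate the first and second halves of the last dimension.
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
```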
I haven't read the .cu code yet. But if my understanding is correct, it should always apply the position embeddings to the top rot_dim columns of the tensor's last dimension; if so, I think GPT-NeoX's implementation should be correct.
It should be addressed, since many users report that vLLM's output is not aligned with HF's; judging from the results, it looks worse than HF's output.
Additional notes: I think one can claim a vLLM model's generation quality is worse than HF's only after doing the following:
How can I make sure beam search is enabled in vLLM?
Turn on the flag.
@PanQiWei Does n = 2 mean using beam search with 2 beams? Does it work in streaming mode?
Sorry, my bad. In the OpenAI API it should be best_of=2, i.e. beam_size=2; and I don't think it works in streaming mode.
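For reference, a minimal sketch of enabling beam search with vLLM's offline API, assuming the SamplingParams flags available around the time of this thread (use_beam_search and best_of, with temperature required to be 0 for beam search):

```python
from vllm import LLM, SamplingParams

# Sketch: beam search with a beam width of 2, assuming the use_beam_search
# flag on SamplingParams (present in vLLM releases of this era).
llm = LLM(model="EleutherAI/gpt-j-6b")
params = SamplingParams(
    n=1,                    # number of sequences to return
    best_of=2,              # beam width
    use_beam_search=True,
    temperature=0.0,        # beam search requires temperature 0
    max_tokens=64,
)
outputs = llm.generate(["The quick brown fox"], params)
print(outputs[0].outputs[0].text)
```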
@PanQiWei So there is still some discrepancy between HF and vLLM.
Hi @PanQiWei @lucasjinreal @syskn, thanks for letting us know about the bug and the solution. As you pointed out, I misunderstood the rotary embedding in GPT-J and treated it as equal to the RoPE used by GPT-NeoX. #941 fixes the bug. Apologies for the confusion and inconvenience.
I think there may be a wrong implementation for GPT-J and GPT-NeoX when applying rotary embeddings.

The currently implemented PagedAttentionWithRoPE always uses the whole query and key, which is compatible with models like llama and baichuan; however, GPT-J and GPT-NeoX may only use part of the query and key when doing RoPE.

If the current implementation is not compatible with those two models, I would suggest that models which do not use the whole query and key when applying rotary embeddings get another attention class that inherits from PagedAttentionWithRoPE and does something like: