[LLM Inference] Refactor BlockInferencePredictor #8879
Conversation
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##           develop    #8879      +/-   ##
===========================================
+ Coverage    55.29%   55.40%   +0.10%
===========================================
  Files          631      632       +1
  Lines        98888    99762     +874
===========================================
+ Hits         54681    55271     +590
- Misses       44207    44491     +284

View full report in Codecov by Sentry.
@@ -61,7 +61,11 @@ def main():
         },
     )
     predictor.model.config.save_pretrained(export_args.output_path)
-    predictor.model.generation_config.save_pretrained(export_args.output_path)
+    if predictor.generation_config is not None:
Fixes a bug where generation_config.json was not saved correctly.
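For context, a minimal sketch of the guarded save in the export script. The diff above is truncated right after the `if`, so everything past that line (including the fallback branch) is an assumption:

```python
predictor.model.config.save_pretrained(export_args.output_path)
if predictor.generation_config is not None:
    # Prefer the predictor-level generation config, which reflects the
    # arguments the predictor was actually constructed with.
    predictor.generation_config.save_pretrained(export_args.output_path)
else:
    # Assumed fallback: whatever generation config is attached to the model.
    predictor.model.generation_config.save_pretrained(export_args.output_path)
```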
@@ -103,7 +103,7 @@ std::vector<paddle::DataType> GetPaddingOffsetV2InferDtype(const paddle::DataTyp
}

PD_BUILD_OP(get_padding_offset_v2)
-    .Inputs({"input_ids", "token_num", "cum_offsets", "seq_len"})
+    .Inputs({"input_ids", "cum_offsets", "token_num", "seq_len"})
Fixes the misordered input names of the operator.
Curious how this managed to run correctly before.
It only affected the order of the declared input names; the tensors themselves were still passed in the correct positions, so nothing was actually computed wrong.
@@ -172,15 +172,15 @@ void LaunchRotaryQK(const paddle::Tensor& q,
        head_num,
        seq_len * rotary_emb_dims,
        last_dim);
-    NeoXRotaryKernel<<<grid, BlockSize, 0, cu_stream>>>(
+    NeoXRotaryKernel<<<grid_k, BlockSize, 0, cu_stream>>>(
Fixes a GQA kernel bug: the rotary kernel for K was launched with the Q grid (`grid`) instead of the K grid (`grid_k`). Under grouped-query attention the number of KV heads differs from the number of Q heads, so the two grids are not interchangeable.
    get_default_max_decoding_length,
    get_default_max_encoding_length,
Removes the logic that auto-configured src_length and max_length, and instead gives both fixed default values. The auto-configuration was unsound: it could not accommodate all models and raised errors in some cases.
llm/utils/utils.py (Outdated)
@@ -725,58 +724,11 @@ def init_chat_template(
    tokenizer.init_chat_template(chat_template_file)


def get_model_max_position_embeddings(config: PretrainedConfig) -> Optional[int]:
Please don't delete these functions; other parts of the codebase still call them, e.g. the PPO code.
OK.
    def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer):
        BasePredictor.__init__(self, config, tokenizer)
So generation_config is now produced by BasePredictor, right?
Yes.
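A plausible sketch (not the verbatim source) of how BasePredictor can own the generation config. The try/except fallback to None is an assumption, but it would also explain the `is not None` guard in the export script above:

```python
from paddlenlp.generation import GenerationConfig

class BasePredictor:
    def __init__(self, config, tokenizer):
        self.config = config
        self.tokenizer = tokenizer
        try:
            # Load generation defaults shipped with the checkpoint, if any.
            self.generation_config = GenerationConfig.from_pretrained(config.model_name_or_path)
        except Exception:
            # Older checkpoints may not ship a generation_config.json.
            self.generation_config = None
```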
        self.input_ids = paddle.full(
            shape=[config.batch_size, config.total_max_length], fill_value=self.tokenizer.pad_token_id, dtype="int64"
        )
        self.model_inputs = {}
The variable was renamed from self.inputs to self.model_inputs — does that affect any logic?
No, it doesn't.
        length = len(input_ids)
        self.inputs["input_ids"][i : i + 1, :length] = input_ids
        self.inputs["penalty_score"][i : i + 1] = self.config.repetition_penalty
        self.inputs["frequency_score"][i : i + 1] = 0.0
Don't these need to be kept?
Or does init_model_inputs already set these earlier, so this just removes the duplication?
Right, they don't need to be kept.
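An illustrative sketch of the deduplication, using names from the diffs above (exact shapes and fields are assumptions): init_model_inputs fills the batch-wide sampling tensors once, so the per-request preprocessing no longer re-assigns them:

```python
import paddle

def init_model_inputs(self, config):
    # Set batch-wide defaults exactly once, for the whole batch.
    self.model_inputs = {}
    self.model_inputs["penalty_score"] = paddle.full(
        shape=[config.batch_size, 1], fill_value=config.repetition_penalty, dtype="float32"
    )
    self.model_inputs["frequency_score"] = paddle.full(
        shape=[config.batch_size, 1], fill_value=0.0, dtype="float32"
    )

def _preprocess_request(self, input_ids, i):
    # Only the per-request token ids still need to be written here.
    length = len(input_ids)
    self.input_ids[i : i + 1, :length] = paddle.to_tensor(input_ids, dtype="int64")
```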
-        qkv_weight_tensor = paddle.to_tensor(concated_qkv_weight)
+        qkv_weight_tensor = paddle.to_tensor(concated_qkv_weight).cast(paddle.get_default_dtype())
Why cast to the default dtype?
So the same code path can run in both bf16 and fp16.
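A minimal sketch of why the cast helps: checkpoint weights often arrive as float32 numpy arrays, while the predictor runs under whichever default dtype was configured. Casting once at load time lets one code path serve both precisions:

```python
import numpy as np
import paddle

paddle.set_default_dtype("bfloat16")  # the same code works with "float16"

# Weights loaded from a state dict are often float32 numpy arrays.
fp32_weight = np.ones([8, 8], dtype="float32")

# Cast to whatever the run was configured with, instead of hard-coding a dtype.
weight = paddle.to_tensor(fp32_weight).cast(paddle.get_default_dtype())
print(weight.dtype)  # paddle.bfloat16
```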
-        linear_weight_tensor = paddle.to_tensor(state_dict["llama.layers.{}.self_attn.o_proj.weight".format(idx)])
+        linear_weight_tensor = paddle.to_tensor(
+            state_dict["llama.layers.{}.self_attn.o_proj.weight".format(idx)]
+        ).cast(paddle.get_default_dtype())
Same question here.
Same answer.
-    src_length: int = field(default=None, metadata={"help": "The max length of source text."})
-    max_length: int = field(default=None, metadata={"help": "the max length for decoding."})
+    src_length: int = field(default=4096, metadata={"help": "The max length of source text."})
+    min_length: int = field(default=1, metadata={"help": "the min length for decoding."})
Rename min_length and max_length to min_decode_length and max_decode_length; otherwise they are easy to misread.
Can't be changed; these names are used in too many places.
tests/llm/test_predictor.py (Outdated)
@@ -231,9 +253,9 @@ def setUp(self) -> None:
        AutoTokenizer.from_pretrained(self.model_name_or_path).save_pretrained(self.output_dir)

    def test_blha(self):
-        self.run_predictor({"inference_model": True, "block_attn": True})
+        self.run_predictor({"inference_model": True, "block_attn": True, "src_length": 1024, "max_length": 48})
Change max_length to max_decode_length.
Can't be changed.
            shape=[config.batch_size, 1], fill_value=config.temperature, dtype="float32"
        )
        self.model_inputs["eos_token_id"] = paddle.to_tensor(
            np.array(get_eos_token_id(self.tokenizer, self.generation_config)).reshape(-1, 1).astype("int64")
Won't this cause problems with multi-batch inference?
No. The kernel does not split this by batch; every sequence is checked against the same set of eos ids.
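For illustration, with hypothetical eos ids: the tensor built above is a shared (num_eos_ids, 1) list of stop tokens, not a per-batch-row tensor, which is why batch size does not matter here:

```python
import numpy as np

eos_ids = [2, 151643]  # hypothetical eos token ids
eos_tensor = np.array(eos_ids).reshape(-1, 1).astype("int64")
print(eos_tensor.shape)  # (2, 1) -- shared across the whole batch
```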
Force-pushed from 87f5510 to 0946ab9.
LGTM
PR types
Others
PR changes
Others
Description
Refactors some code and fixes several bugs.
Note in particular: if src_length and max_length are specified for static-graph inference, the same values must also be specified during dynamic-to-static export.
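A minimal sketch of that constraint, reusing the test-style argument dict from tests/llm/test_predictor.py above; the config names here are illustrative, only the argument keys come from this PR:

```python
# The two length arguments are baked into the exported static graph's input
# shapes, so the inference-time configuration must repeat them exactly.
export_config = {"inference_model": True, "block_attn": True, "src_length": 1024, "max_length": 48}
predict_config = {"inference_model": True, "block_attn": True, "src_length": 1024, "max_length": 48}

assert export_config["src_length"] == predict_config["src_length"]
assert export_config["max_length"] == predict_config["max_length"]
```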