
[LLM Inference] Refactor BlockInferencePredictor #8879

Merged
7 commits merged into PaddlePaddle:develop on Aug 12, 2024

Conversation

@yuanlehome (Collaborator) commented Aug 6, 2024

PR types

Others

PR changes

Others

Description

Refactor some code and fix a few bugs.

In particular, note that if src_length and max_length are specified for static-graph inference, the same values must also be specified when exporting the model (dynamic-to-static conversion).


paddle-bot bot commented Aug 6, 2024

Thanks for your contribution!

@yuanlehome yuanlehome mentioned this pull request Aug 6, 2024

codecov bot commented Aug 6, 2024

Codecov Report

Attention: Patch coverage is 5.55556% with 17 lines in your changes missing coverage. Please review.

Project coverage is 55.40%. Comparing base (678843e) to head (88ea827).
Report is 246 commits behind head on develop.

Files with missing lines Patch % Lines
...dlenlp/experimental/transformers/llama/modeling.py 0.00% 11 Missing ⚠️
paddlenlp/experimental/model_utils.py 0.00% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8879      +/-   ##
===========================================
+ Coverage    55.29%   55.40%   +0.10%     
===========================================
  Files          631      632       +1     
  Lines        98888    99762     +874     
===========================================
+ Hits         54681    55271     +590     
- Misses       44207    44491     +284     


@yuanlehome yuanlehome changed the title from TEMP to [LLM Inference] Refactor BlockInferencePredictor on Aug 8, 2024
@@ -61,7 +61,11 @@ def main():
},
)
predictor.model.config.save_pretrained(export_args.output_path)
predictor.model.generation_config.save_pretrained(export_args.output_path)
if predictor.generation_config is not None:
Collaborator Author

Fixes a bug where generation_config.json was not saved correctly.
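
A minimal sketch of the guarded save this hunk introduces (the fallback branch is an assumption, not necessarily the PR's exact code):

predictor.model.config.save_pretrained(export_args.output_path)
# Prefer the generation config the predictor actually uses, so the exported
# generation_config.json matches inference-time behavior.
if predictor.generation_config is not None:
    predictor.generation_config.save_pretrained(export_args.output_path)
else:
    predictor.model.generation_config.save_pretrained(export_args.output_path)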

@@ -103,7 +103,7 @@ std::vector<paddle::DataType> GetPaddingOffsetV2InferDtype(const paddle::DataTyp
}

PD_BUILD_OP(get_padding_offset_v2)
.Inputs({"input_ids", "token_num", "cum_offsets", "seq_len"})
.Inputs({"input_ids", "cum_offsets", "token_num", "seq_len"})
Collaborator Author

Fixes the misordered operator input names.

Contributor

Curious how this managed to run correctly before.

Collaborator Author

It only affected the order of the input names; the tensor order was correct.
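
For context, a rough sketch of why only the names were affected: the compiled custom op is called positionally from Python, so the tensors were already in the order the kernel expects (the import path and variable names below are assumptions):

from paddlenlp_ops import get_padding_offset_v2  # assumed name of the compiled custom-op module

# input_ids, cum_offsets, token_num and seq_len stand in for the real tensors.
# The call is positional, so only the registered input *names* were swapped;
# the fix reorders the names to match the kernel's actual parameter order.
outputs = get_padding_offset_v2(input_ids, cum_offsets, token_num, seq_len)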

@@ -172,15 +172,15 @@ void LaunchRotaryQK(const paddle::Tensor& q,
head_num,
seq_len * rotary_emb_dims,
last_dim);
NeoXRotaryKernel<<<grid, BlockSize, 0, cu_stream>>>(
NeoXRotaryKernel<<<grid_k, BlockSize, 0, cu_stream>>>(
Collaborator Author

Fixes a computation bug in the GQA kernel.

Comment on lines -34 to -35
get_default_max_decoding_length,
get_default_max_encoding_length,
Collaborator Author

Removes the logic that auto-configured src_length and max_length and gives them default values instead. The auto-configuration was not a reasonable approach: it could not accommodate all models and raised errors in some cases.

@@ -725,58 +724,11 @@ def init_chat_template(
tokenizer.init_chat_template(chat_template_file)


def get_model_max_position_embeddings(config: PretrainedConfig) -> Optional[int]:
Contributor

Please don't delete these functions; they are called elsewhere in the codebase, e.g. by the PPO code.

Collaborator Author

OK.

def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer):
BasePredictor.__init__(self, config, tokenizer)
Contributor

So generation_config is now created through BasePredictor, right?

Collaborator Author

Yes.
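
A minimal sketch of how the base class can load the generation config once for all predictors (the error handling here is an assumption, not the PR's literal code):

from paddlenlp.generation import GenerationConfig

class BasePredictor:
    def __init__(self, config, tokenizer):
        self.config = config
        self.tokenizer = tokenizer
        try:
            # Pick up generation_config.json from the model directory if it exists.
            self.generation_config = GenerationConfig.from_pretrained(config.model_name_or_path)
        except Exception:
            # Fall back to None; callers then rely on explicit arguments instead.
            self.generation_config = None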

self.input_ids = paddle.full(
shape=[config.batch_size, config.total_max_length], fill_value=self.tokenizer.pad_token_id, dtype="int64"
)
self.model_inputs = {}
Contributor

The variable was renamed from self.inputs to self.model_inputs; does that affect any logic?

Collaborator Author

No, it doesn't.

length = len(input_ids)
self.inputs["input_ids"][i : i + 1, :length] = input_ids
self.inputs["penalty_score"][i : i + 1] = self.config.repetition_penalty
self.inputs["frequency_score"][i : i + 1] = 0.0
Contributor

Don't these need to be kept?

Contributor

Or does init_model_inputs above already set them, and this just removes the duplication?

Collaborator Author

Right, they don't need to be kept.
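
A minimal sketch of that idea, assuming paddle is imported and using the tensor names from the diff (values are illustrative):

# init_model_inputs fills whole-batch tensors once from the predictor config,
# so per-request insertion only has to copy the tokenized input_ids into slot i.
self.model_inputs["penalty_score"] = paddle.full(
    shape=[config.batch_size, 1], fill_value=config.repetition_penalty, dtype="float32"
)
self.model_inputs["frequency_score"] = paddle.full(
    shape=[config.batch_size, 1], fill_value=0.0, dtype="float32"
)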


qkv_weight_tensor = paddle.to_tensor(concated_qkv_weight)
qkv_weight_tensor = paddle.to_tensor(concated_qkv_weight).cast(paddle.get_default_dtype())
Contributor

Why cast to the default dtype?

Collaborator Author

So the same code can run in both bf16 and fp16.
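
A small sketch of why the cast matters (the default-dtype setting and shapes are illustrative):

import numpy as np
import paddle

paddle.set_default_dtype("bfloat16")  # the model may be running in bfloat16 or float16

# Weights concatenated in numpy come back as float32 tensors...
concated_qkv_weight = np.zeros([3 * 1024, 1024], dtype="float32")
qkv_weight_tensor = paddle.to_tensor(concated_qkv_weight)  # dtype: float32
# ...so casting to the default dtype makes the fused weight match whichever
# precision the rest of the model uses.
qkv_weight_tensor = qkv_weight_tensor.cast(paddle.get_default_dtype())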

linear_weight_tensor = paddle.to_tensor(state_dict["llama.layers.{}.self_attn.o_proj.weight".format(idx)])
linear_weight_tensor = paddle.to_tensor(
state_dict["llama.layers.{}.self_attn.o_proj.weight".format(idx)]
).cast(paddle.get_default_dtype())
Contributor

Same question.

Collaborator Author

Same answer.

src_length: int = field(default=None, metadata={"help": "The max length of source text."})
max_length: int = field(default=None, metadata={"help": "the max length for decoding."})
src_length: int = field(default=4096, metadata={"help": "The max length of source text."})
min_length: int = field(default=1, metadata={"help": "the min length for decoding."})
Contributor

Please rename min_length and max_length to min_decode_length and max_decode_length; otherwise they are easy to misread.

Collaborator Author

Can't change them; these names are used in too many places.

@@ -231,9 +253,9 @@ def setUp(self) -> None:
AutoTokenizer.from_pretrained(self.model_name_or_path).save_pretrained(self.output_dir)

def test_blha(self):
self.run_predictor({"inference_model": True, "block_attn": True})
self.run_predictor({"inference_model": True, "block_attn": True, "src_length": 1024, "max_length": 48})
Contributor

Rename max_length to max_decode_length.

Collaborator Author

Can't change it.

shape=[config.batch_size, 1], fill_value=config.temperature, dtype="float32"
)
self.model_inputs["eos_token_id"] = paddle.to_tensor(
np.array(get_eos_token_id(self.tokenizer, self.generation_config)).reshape(-1, 1).astype("int64")
Contributor

Won't this be a problem with a multi-sample batch?

Collaborator Author

No; the kernel does not handle this per batch.
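
A small sketch of the resulting shape (the example ids are illustrative):

import numpy as np

eos_token_ids = [2, 128009]  # a tokenizer may define more than one EOS id
arr = np.array(eos_token_ids).reshape(-1, 1).astype("int64")
print(arr.shape)  # (2, 1): one row per EOS id, independent of batch size;
                  # every sequence in the batch is checked against the same list.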

Collaborator

@wawltor wawltor left a comment

LGTM

@wawltor wawltor merged commit 5bc040a into PaddlePaddle:develop Aug 12, 2024
9 of 12 checks passed