
Only Apply the TP in language_model #219

Merged: 1 commit merged into huggingface:habana-main on Sep 6, 2024

Conversation

@yuanwu2017 (Author)

What does this PR do?

Fix the llava-next crash on multi-card runs by applying tensor parallelism (TP) only to the language_model.

Depends on:
huggingface/optimum-habana#1309
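
For context, here is a rough sketch of what applying TP only in the language_model could look like. It is an illustration rather than the actual diff: it assumes DeepSpeed inference with the Mistral injection policy added by optimum-habana#1309, and a llava-next model object that exposes a language_model submodule.

import deepspeed
import torch
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer

def shard_language_model_only(model, world_size: int):
    # Tensor-parallelize only the text decoder; the vision tower and the
    # multimodal projector stay replicated on every card, which is what
    # avoids the multi-card crash.
    engine = deepspeed.init_inference(
        model.language_model,
        mp_size=world_size,
        dtype=torch.bfloat16,
        # Injection policy for Mistral decoder layers (the piece added
        # upstream by optimum-habana#1309).
        injection_policy={MistralDecoderLayer: ("self_attn.o_proj", "mlp.down_proj")},
    )
    model.language_model = engine.module
    return model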

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yuanwu2017 (Author)

@mandy-li @tthakkal Please review.

@regisss (Collaborator) left a comment

LGTM!
I just merged huggingface/optimum-habana#1309.
@tthakkal @mandy-li Waiting for your approval before merging.

@tthakkal (Collaborator) commented Sep 4, 2024

@yuanwu2017 the problem is that on the habana-main branch we don't build against OH main; we use a specific OH release. Not sure how we should merge this change, @mandy-li.

@mandy-li (Collaborator) left a comment

LGTM

@regisss (Collaborator) commented Sep 4, 2024

> @yuanwu2017 the problem is that on the habana-main branch we don't build against OH main; we use a specific OH release. Not sure how we should merge this change, @mandy-li.

I can do a patch release of Optimum Habana if you need this now.

@mandy-li (Collaborator) commented Sep 4, 2024

@regisss, a patch release would be great. @yuanwu2017, what other PRs do you want to include in the patch release? #217?

If we have a patch release, @yuanwu2017, please modify the README to remove the limitation that Llava-next can only work on 1 card, and also add the multi-card config to the tested model configurations table.

@tthakkal (Collaborator) commented Sep 5, 2024

> @regisss, a patch release would be great. @yuanwu2017, what other PRs do you want to include in the patch release? #217?
>
> If we have a patch release, @yuanwu2017, please modify the README to remove the limitation that Llava-next can only work on 1 card, and also add the multi-card config to the tested model configurations table.

@regisss Please also include OH commit huggingface/optimum-habana@7e4d7f1 in the OH patch release, and also include TGI PR #220.

@yuanwu2017 (Author)

> @regisss, a patch release would be great. @yuanwu2017, what other PRs do you want to include in the patch release? #217?

Don't include #217; it depends on huggingface/optimum#2003.

> If we have a patch release, @yuanwu2017, please modify the README to remove the limitation that Llava-next can only work on 1 card, and also add the multi-card config to the tested model configurations table.

OK.

@hchauhan123

@yuanwu2017 I tried your PR with the docker command below (bf16). I also tried fp8, and it fails with the same error during warmup:

docker run -it --rm -p 8085:80 \
 --runtime=habana \
 -v /sys/kernel/debug:/sys/kernel/debug \
 -v /tmp:/tmp \
 -e HUGGING_FACE_HUB_TOKEN=your_token \
 -e HABANA_VISIBLE_DEVICES=0,1 \
 -e PT_HPU_RECIPE_CACHE_CONFIG=/root/data2/tgi_cache/tgi-llava_v1_6/mistral_cache,false,8192 \
 -e DBG_TRACE_FILENAME=/root/data2/tgi_cache/tgi-llava_v1_6/llava_v1_6_logfile.log \
 -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
 -e ENABLE_HPU_GRAPH=true \
 -e LIMIT_HPU_GRAPH=true \
 -e USE_FLASH_ATTENTION=true \
 -e FLASH_ATTENTION_RECOMPUTE=true \
 -e PREFILL_BATCH_BUCKET_SIZE=1 \
 -e BATCH_BUCKET_SIZE=1 \
 --cap-add=sys_nice \
 --ipc=host \
 --name llava_pr219 tgi_gaudi_hsc_pr219_pr220:latest \
 --model-id llava-hf/llava-v1.6-mistral-7b-hf \
 --sharded true --num-shard 2 \
 --max-input-tokens 4096 --max-batch-prefill-tokens 16384 --max-total-tokens 8192 --max-batch-total-tokens 32768

It fails with the error below:

2024-09-05T02:13:33.896967Z  INFO text_generation_router: router/src/main.rs:516: Serving revision 216670a16460adb7c41ce3e123ceb3859f73ab12 of model llava-hf/llava-v1.6-mistral-7b-hf
2024-09-05T02:13:33.936579Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<image>' was expected to have ID '32000' but was given ID 'None'
2024-09-05T02:13:33.936589Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<pad>' was expected to have ID '32001' but was given ID 'None'
2024-09-05T02:13:33.936735Z  INFO text_generation_router: router/src/main.rs:317: Using config Some(LlavaNext(LlavaNext { text_config: TextConfig, vision_config: VisionConfig { image_size: 336, patch_size: 14 }, image_grid_pinpoints: [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)] }))
2024-09-05T02:13:33.941632Z  INFO text_generation_router: router/src/main.rs:345: Warming up model
2024-09-05T02:13:46.022225Z ERROR warmup{max_input_length=4096 max_prefill_tokens=16384 max_total_tokens=8192 max_batch_size=Some(4) model_id="llava-hf/llava-v1.6-mistral-7b-hf"}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
2024-09-05T02:13:46.486976Z ERROR warmup{max_input_length=4096 max_prefill_tokens=16384 max_total_tokens=8192 max_batch_size=Some(4) model_id="llava-hf/llava-v1.6-mistral-7b-hf"}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
Error: Warmup(Generation("CANCELLED"))
2024-09-05T02:13:46.514321Z ERROR text_generation_launcher: Webserver Crashed
2024-09-05T02:13:46.514337Z  INFO text_generation_launcher: Shutting down shards
2024-09-05T02:13:46.601775Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-09-05T02:13:46.601799Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-09-05T02:14:18.027719Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed


@tthakkal (Collaborator) left a comment

2-card bf16 doesn't work; please share the multi-card TGI-Gaudi command you tested with.

@yuanwu2017 (Author) commented Sep 5, 2024

Run the docker container:

docker run -it -p 8083:80 \
   --runtime=habana \
   -v $volume:/data \
   -v ~/workspace:/workspace \
   -e HABANA_VISIBLE_DEVICES=6,7 \
   -e HUGGING_FACE_HUB_TOKEN=$hf_token \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
   -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
   -e http_proxy=${http_proxy}     -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
   -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
   -e ENABLE_HPU_GRAPH=true \
   -e LIMIT_HPU_GRAPH=true \
   -e USE_FLASH_ATTENTION=true \
   -e FLASH_ATTENTION_RECOMPUTE=true \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-yuanwu:1.17

Enter the container and install the latest optimum-habana from source:

cd optimum-habana
pip install -e .

Run the TGI server in BF16 with FLASH_ATTENTION:

export MODEL_ID=llava-hf/llava-v1.6-mistral-7b-hf
export DBG_TRACE_FILENAME=./dbg_trace.log
rm dbg_trace.log
export LOG_LEVEL=trace 
text-generation-launcher --model-id $MODEL_ID --max-input-tokens 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 8192 --sharded true --num-shard 2

Run the TGI server in BF16 without FLASH_ATTENTION:

export MODEL_ID=llava-hf/llava-v1.6-mistral-7b-hf
export DBG_TRACE_FILENAME=./dbg_trace.log
rm dbg_trace.log
export LOG_LEVEL=trace 
export USE_FLASH_ATTENTION=false
export FLASH_ATTENTION_RECOMPUTE=false 
text-generation-launcher --model-id $MODEL_ID --max-input-tokens 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 8192 --sharded true --num-shard 2
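
Once the server is up, a quick sanity check could look like the snippet below. This is an assumed example rather than part of the PR: it uses the standard TGI /generate route on the host port mapped above (8083) and the markdown-style image reference that TGI expects in prompts for llava-next models; adjust host and port to your setup.

import requests

# Send one multimodal prompt to the freshly launched server.
response = requests.post(
    "http://localhost:8083/generate",
    json={
        "inputs": "![](https://llava-vl.github.io/static/images/view.jpg) What is shown in this image?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=300,
)
print(response.json()["generated_text"])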

@tthakkal (Collaborator) commented Sep 5, 2024

Looks like we missed adding PT_HPU_ENABLE_LAZY_COLLECTIVES=true. Tested bf16 on 2 cards; it works as expected. Thanks.

-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \

@yuanwu2017 (Author) commented Sep 5, 2024

For FP8 multi-card, I made a patch to generate the quantization files:
QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py \
   --world_size 2 --use_deepspeed run_pipeline.py \
   --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
   --image_path "https://llava-vl.github.io/static/images/view.jpg" \
   --use_hpu_graphs --bf16 --use_flash_attention --flash_attention_recompute

But I got an error when running the 2-card inference with FP8. I have no idea about it. Have any of you encountered this error? @mandy-li @tthakkal

(error screenshot attached)

@regisss (Collaborator) commented Sep 5, 2024

So I can:

  1. Do a patch release of Optimum Habana with "Add the deepspeed injection_policy of mistral" (optimum-habana#1309) and "Llava: Added flash_attention_recompute arg to provide an option to enable/disable recompute" (optimum-habana#1278).
  2. Then, we can merge this PR and "Llava-next: Added flash_attention_recompute option" (#220), and do a new release of TGI with the updated Optimum Habana dependency.

Does that sound good? Should we wait for the AutoGPTQ PRs?

@yuanwu2017 (Author)

> So I can:
>
>   1. Do a patch release of Optimum Habana with "Add the deepspeed injection_policy of mistral" (optimum-habana#1309) and "Llava: Added flash_attention_recompute arg to provide an option to enable/disable recompute" (optimum-habana#1278).
>   2. Then, we can merge this PR and "Llava-next: Added flash_attention_recompute option" (#220), and do a new release of TGI with the updated Optimum Habana dependency.
>
> Does that sound good? Should we wait for the AutoGPTQ PRs?

We can merge #217. It doesn't affect non-GPTQ models. When the optimum patch is merged and a new optimum is released, I will update the documents.

@yuanwu2017 (Author)

We can merge this patch first, because BF16 is OK. I will keep debugging the FP8 issue. @tthakkal @mandy-li @regisss Do you think this is okay?

@tthakkal (Collaborator) commented Sep 5, 2024

> We can merge this patch first, because BF16 is OK. I will keep debugging the FP8 issue. @tthakkal @mandy-li @regisss Do you think this is okay?

@yuanwu2017 The FP8 issue is probably related to PR #220; you may need the changes in that PR to run FP8.

@hchauhan123 commented Sep 5, 2024

@yuanwu2017 I tested the model for an FP8 run on 8 cards and was able to run it. I created my TGI image with both PR #219 and PR #220 and built it using the latest optimum-habana main branch. The quantization file was also generated in optimum-habana based on the latest (i.e. main) branch.

I used the command below for FP8, and it works well:

docker run -it --rm -p 8085:80 \
 --runtime=habana \
 -v /sys/kernel/debug:/sys/kernel/debug \
 -v /tmp:/tmp \
 -e HUGGING_FACE_HUB_TOKEN=your_token \
 -v /home_local/labuser/tf/all_hqt_config/hqt_output_llava_v1_6_mistral_7b_v17_495_pr219_new/:/root/all_hqt_config/hqt_output_llava_v1_6_mistral_7b_v17_495_pr219_new/ \
 -e QUANT_CONFIG=/root/all_hqt_config/hqt_output_llava_v1_6_mistral_7b_v17_495_pr219_new/maxabs_quant.json \
 -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
 -e HABANA_VISIBLE_DEVICES=all \
 -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
 -e ENABLE_HPU_GRAPH=true \
 -e LIMIT_HPU_GRAPH=true \
 -e USE_FLASH_ATTENTION=true \
 -e FLASH_ATTENTION_RECOMPUTE=true \
 -e PREFILL_BATCH_BUCKET_SIZE=1 \
 -e BATCH_BUCKET_SIZE=1 \
 --cap-add=sys_nice  \
 --ipc=host \
 --name llava_pr219 tgi_gaudi_hsc_pr219_pr220:latest \
 --model-id llava-hf/llava-v1.6-mistral-7b-hf \
 --sharded true --num-shard 8 \
 --max-input-tokens 4096 --max-batch-prefill-tokens 16384 --max-total-tokens 8192 --max-batch-total-tokens 32768

@yuanwu2017 (Author) commented Sep 5, 2024 via email

@regisss (Collaborator) commented Sep 5, 2024

Waiting for @libinta to do the patch release in case there are other commits that should be included.


@regisss merged commit 2299b73 into huggingface:habana-main on Sep 6, 2024.