Only Apply the TP in language_model #219
Conversation
Signed-off-by: yuanwu <[email protected]>
LGTM!
I just merged huggingface/optimum-habana#1309.
@tthakkal @mandy-li Waiting for your approval before merging.
@yuanwu2017 The problem is that on the habana-main branch we don't build from OH main; we use a specific OH release. Not sure how we should merge this change, @mandy-li.
LGTM
I can do a patch release of Optimum Habana if you need this now.
@regisss, a patch release would be great. @yuanwu2017, what other PRs do you want to include in the patch release? #217? If we have a patch release, @yuanwu2017, please also modify the README to remove the limitation that Llava-next can only work on a single card, and add the multi-card config to the tested model configurations table.
@regisss Please also include OH commit huggingface/optimum-habana@7e4d7f1 in the OH patch release, and also include TGI PR #220.
@yuanwu2017 I tried your PR with the docker command below (bf16), and it fails with the error below. I also tried fp8, and it fails with the same error during warmup.
2-card bf16 doesn't work; please share the multi-card tgi-gaudi command you tested with.
Looks like we missed adding
For FP8 multi-card, I made a patch to generate the quantization files. But I got an error when running 2-card inference with FP8, and I have no idea about it. Have any of you encountered this error? @mandy-li @tthakkal
So I can:
Does that sound good? Should we wait for the AutoGPTQ PRs?
@yuanwu2017 The FP8 issue is probably related to PR #220; you may need the changes in that PR to run FP8.
@yuanwu2017 I tested the model with an fp8 run on 8 cards and was able to run it. I created my TGI image with both PR #219 and PR #220, built using the latest optimum-habana main branch. The quantization file was also generated in optimum-habana based on the latest main branch. I used the command below for fp8, and it works well:
docker run -it --rm -p 8085:80 \
  --runtime=habana \
  -v /sys/kernel/debug:/sys/kernel/debug \
  -v /tmp:/tmp \
  -e HUGGING_FACE_HUB_TOKEN=your_token \
  -v /home_local/labuser/tf/all_hqt_config/hqt_output_llava_v1_6_mistral_7b_v17_495_pr219_new/:/root/all_hqt_config/hqt_output_llava_v1_6_mistral_7b_v17_495_pr219_new/ \
  -e QUANT_CONFIG=/root/all_hqt_config/hqt_output_llava_v1_6_mistral_7b_v17_495_pr219_new/maxabs_quant.json \
  -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e ENABLE_HPU_GRAPH=true \
  -e LIMIT_HPU_GRAPH=true \
  -e USE_FLASH_ATTENTION=true \
  -e FLASH_ATTENTION_RECOMPUTE=true \
  -e PREFILL_BATCH_BUCKET_SIZE=1 \
  -e BATCH_BUCKET_SIZE=1 \
  --cap-add=sys_nice \
  --ipc=host \
  --name llava_pr219 tgi_gaudi_hsc_pr219_pr220:latest \
  --model-id llava-hf/llava-v1.6-mistral-7b-hf \
  --sharded true --num-shard 8 \
  --max-input-tokens 4096 --max-batch-prefill-tokens 16384 --max-total-tokens 8192 --max-batch-total-tokens 32768
👍 PR #220 should fix the error.
Waiting for @libinta to do the patch release in case there are other commits that should be included.
LGTM!
@mandy-li @tthakkal @yuanwu2017 I just published the patch release: https://github.com/huggingface/optimum-habana/releases/tag/v1.13.2
What does this PR do?
Fix the llava-next crash when running with multiple cards.
Depends on:
huggingface/optimum-habana#1309
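For context on the approach in the title, here is a minimal, hypothetical PyTorch sketch of the idea (not the actual tgi-gaudi or optimum-habana code): shard only the Linear layers under the language_model submodule across ranks, and keep the vision tower and multi-modal projector replicated. The helper names and the simple column-sharding scheme are assumptions for illustration only.

import torch.nn as nn

def column_shard_linear(linear: nn.Linear, rank: int, world_size: int) -> nn.Linear:
    # Keep only this rank's slice of the output dimension (column parallelism).
    out_features = linear.out_features // world_size
    shard = nn.Linear(linear.in_features, out_features, bias=linear.bias is not None)
    start = rank * out_features
    shard.weight.data.copy_(linear.weight.data[start:start + out_features])
    if linear.bias is not None:
        shard.bias.data.copy_(linear.bias.data[start:start + out_features])
    return shard

def apply_tp_to_language_model_only(model: nn.Module, rank: int, world_size: int) -> None:
    # Collect targets first, then replace, so the module tree is not mutated while iterating.
    targets = []
    for name, module in model.named_modules():
        if not name.startswith("language_model"):
            continue  # vision tower / projector stay replicated on every rank
        for child_name, child in module.named_children():
            if isinstance(child, nn.Linear) and child.out_features % world_size == 0:
                targets.append((module, child_name, child))
    for module, child_name, child in targets:
        setattr(module, child_name, column_shard_linear(child, rank, world_size))
    # Gathering the sharded outputs across ranks (all-gather / all-reduce) is omitted here.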
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.