
Only Apply the TP in language_model #219

Merged: 1 commit merged into huggingface:habana-main on Sep 6, 2024

Conversation

@yuanwu2017 (Author)

What does this PR do?

Fix the llava-next crash on multi-card runs by applying tensor parallelism (TP) only to the language_model.

Depends on:
huggingface/optimum-habana#1309
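
For context, here is a rough sketch of what applying TP only in the language_model could look like. It is an illustration rather than the actual diff: it assumes DeepSpeed inference with the Mistral injection policy added by optimum-habana#1309, and a llava-next model object that exposes a language_model submodule.

import deepspeed
import torch
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer

def shard_language_model_only(model, world_size: int):
    # Tensor-parallelize only the text decoder; the vision tower and the
    # multimodal projector stay replicated on every card, which is what
    # avoids the multi-card crash.
    engine = deepspeed.init_inference(
        model.language_model,
        mp_size=world_size,
        dtype=torch.bfloat16,
        # Injection policy for Mistral decoder layers (the piece added
        # upstream by optimum-habana#1309).
        injection_policy={MistralDecoderLayer: ("self_attn.o_proj", "mlp.down_proj")},
    )
    model.language_model = engine.module
    return model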

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yuanwu2017 (Author)

@mandy-li @tthakkal Please review.

@regisss (Collaborator) left a comment

LGTM!
I just merged huggingface/optimum-habana#1309.
@tthakkal @mandy-li Waiting for your approval before merging.

@tthakkal (Collaborator) commented Sep 4, 2024

@yuanwu2017 the problem is that on the habana-main branch we don't build against OH main; we use a specific OH release. Not sure how we should merge this change, @mandy-li.

@mandy-li (Collaborator) left a comment

LGTM

@regisss (Collaborator) commented Sep 4, 2024

> @yuanwu2017 the problem is that on the habana-main branch we don't build against OH main; we use a specific OH release. Not sure how we should merge this change, @mandy-li.

I can do a patch release of Optimum Habana if you need this now.

@mandy-li (Collaborator) commented Sep 4, 2024

@regisss, a patch release would be great. @yuanwu2017, what other PRs do you want to include in the patch release? #217?

If we have a patch release, @yuanwu2017, please modify the README to remove the limitation that Llava-next can only work on 1 card, and also add the multi-card config to the tested model configurations table.

@tthakkal (Collaborator) commented Sep 5, 2024

> @regisss, a patch release would be great. @yuanwu2017, what other PRs do you want to include in the patch release? #217?
>
> If we have a patch release, @yuanwu2017, please modify the README to remove the limitation that Llava-next can only work on 1 card, and also add the multi-card config to the tested model configurations table.

@regisss Please also include OH commit huggingface/optimum-habana@7e4d7f1 in the OH patch release, and also include TGI PR #220.

@yuanwu2017 (Author)

> @regisss, a patch release would be great. @yuanwu2017, what other PRs do you want to include in the patch release? #217?

Don't include #217; it depends on huggingface/optimum#2003.

> If we have a patch release, @yuanwu2017, please modify the README to remove the limitation that Llava-next can only work on 1 card, and also add the multi-card config to the tested model configurations table.

OK.

@hchauhan123

@yuanwu2017 I tried your PR with the docker command below (bf16). I also tried fp8, and it fails with the same error during warmup:

docker run -it --rm -p 8085:80 \
 --runtime=habana \
 -v /sys/kernel/debug:/sys/kernel/debug \
 -v /tmp:/tmp \
 -e HUGGING_FACE_HUB_TOKEN=your_token \
 -e HABANA_VISIBLE_DEVICES=0,1 \
 -e PT_HPU_RECIPE_CACHE_CONFIG=/root/data2/tgi_cache/tgi-llava_v1_6/mistral_cache,false,8192 \
 -e DBG_TRACE_FILENAME=/root/data2/tgi_cache/tgi-llava_v1_6/llava_v1_6_logfile.log \
 -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
 -e ENABLE_HPU_GRAPH=true \
 -e LIMIT_HPU_GRAPH=true \
 -e USE_FLASH_ATTENTION=true \
 -e FLASH_ATTENTION_RECOMPUTE=true \
 -e PREFILL_BATCH_BUCKET_SIZE=1 \
 -e BATCH_BUCKET_SIZE=1 \
 --cap-add=sys_nice \
 --ipc=host \
 --name llava_pr219 tgi_gaudi_hsc_pr219_pr220:latest \
 --model-id llava-hf/llava-v1.6-mistral-7b-hf \
 --sharded true --num-shard 2 \
 --max-input-tokens 4096 --max-batch-prefill-tokens 16384 --max-total-tokens 8192 --max-batch-total-tokens 32768

It fails with the error below:

2024-09-05T02:13:33.896967Z  INFO text_generation_router: router/src/main.rs:516: Serving revision 216670a16460adb7c41ce3e123ceb3859f73ab12 of model llava-hf/llava-v1.6-mistral-7b-hf
2024-09-05T02:13:33.936579Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<image>' was expected to have ID '32000' but was given ID 'None'
2024-09-05T02:13:33.936589Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<pad>' was expected to have ID '32001' but was given ID 'None'
2024-09-05T02:13:33.936735Z  INFO text_generation_router: router/src/main.rs:317: Using config Some(LlavaNext(LlavaNext { text_config: TextConfig, vision_config: VisionConfig { image_size: 336, patch_size: 14 }, image_grid_pinpoints: [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)] }))
2024-09-05T02:13:33.941632Z  INFO text_generation_router: router/src/main.rs:345: Warming up model
2024-09-05T02:13:46.022225Z ERROR warmup{max_input_length=4096 max_prefill_tokens=16384 max_total_tokens=8192 max_batch_size=Some(4) model_id="llava-hf/llava-v1.6-mistral-7b-hf"}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
2024-09-05T02:13:46.486976Z ERROR warmup{max_input_length=4096 max_prefill_tokens=16384 max_total_tokens=8192 max_batch_size=Some(4) model_id="llava-hf/llava-v1.6-mistral-7b-hf"}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
Error: Warmup(Generation("CANCELLED"))
2024-09-05T02:13:46.514321Z ERROR text_generation_launcher: Webserver Crashed
2024-09-05T02:13:46.514337Z  INFO text_generation_launcher: Shutting down shards
2024-09-05T02:13:46.601775Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-09-05T02:13:46.601799Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-09-05T02:14:18.027719Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed


@tthakkal (Collaborator) left a comment

2-card bf16 doesn't work; please share the multi-card TGI-Gaudi command you tested with.

@yuanwu2017 (Author) commented Sep 5, 2024

Run the docker container:

docker run -it -p 8083:80 \
   --runtime=habana \
   -v $volume:/data \
   -v ~/workspace:/workspace \
   -e HABANA_VISIBLE_DEVICES=6,7 \
   -e HUGGING_FACE_HUB_TOKEN=$hf_token \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
   -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
   -e http_proxy=${http_proxy}     -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
   -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
   -e ENABLE_HPU_GRAPH=true \
   -e LIMIT_HPU_GRAPH=true \
   -e USE_FLASH_ATTENTION=true \
   -e FLASH_ATTENTION_RECOMPUTE=true \
   --cap-add=sys_nice \
   --ipc=host \
   tgi-yuanwu:1.17

Enter the container and install the latest optimum-habana from source:

cd optimum-habana
pip install -e .

Run the TGI server in BF16 with FLASH_ATTENTION:

export MODEL_ID=llava-hf/llava-v1.6-mistral-7b-hf
export DBG_TRACE_FILENAME=./dbg_trace.log
rm dbg_trace.log
export LOG_LEVEL=trace 
text-generation-launcher --model-id $MODEL_ID --max-input-tokens 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 8192 --sharded true --num-shard 2

Run the TGI server in BF16 without FLASH_ATTENTION:

export MODEL_ID=llava-hf/llava-v1.6-mistral-7b-hf
export DBG_TRACE_FILENAME=./dbg_trace.log
rm dbg_trace.log
export LOG_LEVEL=trace 
export USE_FLASH_ATTENTION=false
export FLASH_ATTENTION_RECOMPUTE=false 
text-generation-launcher --model-id $MODEL_ID --max-input-tokens 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 8192 --sharded true --num-shard 2
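
Once the server is up, a quick sanity check could look like the snippet below. This is an assumed example rather than part of the PR: it uses the standard TGI /generate route on the host port mapped above (8083) and the markdown-style image reference that TGI expects in prompts for llava-next models; adjust host and port to your setup.

import requests

# Send one multimodal prompt to the freshly launched server.
response = requests.post(
    "http://localhost:8083/generate",
    json={
        "inputs": "![](https://llava-vl.github.io/static/images/view.jpg) What is shown in this image?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=300,
)
print(response.json()["generated_text"])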

@tthakkal (Collaborator) commented Sep 5, 2024

Looks like we missed adding PT_HPU_ENABLE_LAZY_COLLECTIVES=true. Tested bf16 on 2 cards; it works as expected. Thanks.

-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \

@yuanwu2017 (Author) commented Sep 5, 2024

For FP8 multi-card, I made a patch to generate the quantization files:
QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py \
   --world_size 2 --use_deepspeed run_pipeline.py \
   --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
   --image_path "https://llava-vl.github.io/static/images/view.jpg" \
   --use_hpu_graphs --bf16 --use_flash_attention --flash_attention_recompute

But I got an error when running the 2-card inference with FP8. I have no idea about it. Have any of you encountered this error? @mandy-li @tthakkal

(error screenshot attached)

@regisss (Collaborator) commented Sep 5, 2024

So I can:

  1. Do a patch release of Optimum Habana with "Add the deepspeed injection_policy of mistral" (optimum-habana#1309) and "Llava: Added flash_attention_recompute arg to provide an option to enable/disable recompute" (optimum-habana#1278).
  2. Then, we can merge this PR and "Llava-next: Added flash_attention_recompute option" (#220), and do a new release of TGI with the updated Optimum Habana dependency.

Does that sound good? Should we wait for the AutoGPTQ PRs?

@yuanwu2017 (Author)

> So I can:
>
>   1. Do a patch release of Optimum Habana with "Add the deepspeed injection_policy of mistral" (optimum-habana#1309) and "Llava: Added flash_attention_recompute arg to provide an option to enable/disable recompute" (optimum-habana#1278).
>   2. Then, we can merge this PR and "Llava-next: Added flash_attention_recompute option" (#220), and do a new release of TGI with the updated Optimum Habana dependency.
>
> Does that sound good? Should we wait for the AutoGPTQ PRs?

We can merge #217. It doesn't affect non-GPTQ models. When the optimum patch is merged and a new optimum is released, I will update the documents.

@yuanwu2017 (Author)

We can merge this patch first, because BF16 is OK. I will keep debugging the FP8 issue. @tthakkal @mandy-li @regisss Do you think this is okay?

@tthakkal (Collaborator) commented Sep 5, 2024

> We can merge this patch first, because BF16 is OK. I will keep debugging the FP8 issue. @tthakkal @mandy-li @regisss Do you think this is okay?

@yuanwu2017 The FP8 issue is probably related to PR #220; you may need the changes in that PR to run FP8.

@hchauhan123 commented Sep 5, 2024

@yuanwu2017 I tested the model for an FP8 run on 8 cards and was able to run it. I created my TGI image with both PR #219 and PR #220 and built it using the latest optimum-habana main branch. The quantization file was also generated in optimum-habana based on the latest (i.e. main) branch.

I used the command below for FP8, and it works well:

docker run -it --rm -p 8085:80 \
 --runtime=habana \
 -v /sys/kernel/debug:/sys/kernel/debug \
 -v /tmp:/tmp \
 -e HUGGING_FACE_HUB_TOKEN=your_token \
 -v /home_local/labuser/tf/all_hqt_config/hqt_output_llava_v1_6_mistral_7b_v17_495_pr219_new/:/root/all_hqt_config/hqt_output_llava_v1_6_mistral_7b_v17_495_pr219_new/ \
 -e QUANT_CONFIG=/root/all_hqt_config/hqt_output_llava_v1_6_mistral_7b_v17_495_pr219_new/maxabs_quant.json \
 -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
 -e HABANA_VISIBLE_DEVICES=all \
 -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
 -e ENABLE_HPU_GRAPH=true \
 -e LIMIT_HPU_GRAPH=true \
 -e USE_FLASH_ATTENTION=true \
 -e FLASH_ATTENTION_RECOMPUTE=true \
 -e PREFILL_BATCH_BUCKET_SIZE=1 \
 -e BATCH_BUCKET_SIZE=1 \
 --cap-add=sys_nice  \
 --ipc=host \
 --name llava_pr219 tgi_gaudi_hsc_pr219_pr220:latest \
 --model-id llava-hf/llava-v1.6-mistral-7b-hf \
 --sharded true --num-shard 8 \
 --max-input-tokens 4096 --max-batch-prefill-tokens 16384 --max-total-tokens 8192 --max-batch-total-tokens 32768

@yuanwu2017 (Author) commented Sep 5, 2024 via email

@regisss (Collaborator) commented Sep 5, 2024

Waiting for @libinta to do the patch release in case there are other commits that should be included.


@regisss merged commit 2299b73 into huggingface:habana-main on Sep 6, 2024.