Multi GPU --split-mode row speed regression #6476

Closed
8XXD8 opened this issue Apr 4, 2024 · 12 comments

@8XXD8

8XXD8 commented Apr 4, 2024

Since b2475, row split and layer split have had the same performance.
llama-bench is not affected, but main and server show this regression.

b2474

main
llama_print_timings: load time = 9945.29 ms
llama_print_timings: sample time = 4.05 ms / 128 runs ( 0.03 ms per token, 31565.97 tokens per second)
llama_print_timings: prompt eval time = 1712.75 ms / 15 tokens ( 114.18 ms per token, 8.76 tokens per second)
llama_print_timings: eval time = 9521.36 ms / 127 runs ( 74.97 ms per token, 13.34 tokens per second)
llama_print_timings: total time = 11268.98 ms / 142 tokens

server

{"function":"print_timings","id_slot":0,"id_task":0,"level":"INFO","line":322,"msg":"generation eval time = 23176.51 ms / 281 runs ( 82.48 ms per token, 12.12 tokens per second)","n_decoded":281,"n_tokens_second":12.124345910954315,"t_token":82.47867615658363,"t_token_generation":23176.508,"tid":"139827722453120","timestamp":1712220482}

b2475

main
llama_print_timings: sample time = 3.13 ms / 128 runs ( 0.02 ms per token, 40933.80 tokens per second)
llama_print_timings: prompt eval time = 3413.32 ms / 15 tokens ( 227.55 ms per token, 4.39 tokens per second)
llama_print_timings: eval time = 14874.55 ms / 127 runs ( 117.12 ms per token, 8.54 tokens per second)
llama_print_timings: total time = 18340.76 ms / 142 tokens

server
{"function":"print_timings","id_slot":0,"id_task":0,"level":"INFO","line":322,"msg":"generation eval time = 38207.86 ms / 313 runs ( 122.07 ms per token, 8.19 tokens per second)","n_decoded":313,"n_tokens_second":8.192032335129394,"t_token":122.06983067092654,"t_token_generation":38207.857,"tid":"139892693971072","timestamp":1712220597}

llama-bench

| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | ROCm | 99 | row | pp 512 | 21.78 ± 0.03 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | ROCm | 99 | row | tg 128 | 12.96 ± 0.02 |

build: ccf58aa (1)

b2600

main
llama_print_timings: load time = 9996.37 ms
llama_print_timings: sample time = 3.06 ms / 128 runs ( 0.02 ms per token, 41871.12 tokens per second)
llama_print_timings: prompt eval time = 3380.90 ms / 15 tokens ( 225.39 ms per token, 4.44 tokens per second)
llama_print_timings: eval time = 14900.67 ms / 127 runs ( 117.33 ms per token, 8.52 tokens per second)
llama_print_timings: total time = 18311.99 ms / 142 tokens

server
{"tid":"139684675611648","timestamp":1712223015,"level":"INFO","function":"print_timings","line":332,"msg":"generation eval time = 35799.18 ms / 295 runs ( 121.35 ms per token, 8.24 tokens per second)","id_slot":0,"id_task":0,"t_token_generation":35799.182,"n_decoded":295,"t_token":121.3531593220339,"n_tokens_second":8.240411750190269}

llama-bench

| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | ROCm | 99 | row | pp 512 | 22.02 ± 0.04 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | ROCm | 99 | row | tg 128 | 12.93 ± 0.01 |

build: 4399f13 (2600)

Commands used:

HIP_VISIBLE_DEVICES=0,1,2 ./llama-bench -sm row -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf

HIP_VISIBLE_DEVICES=0,1,2 ./main -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf -t 4 -ngl 99 --seed 1234 -n 128 --ignore-eos -p "USER: Tell me a joke ASSISTANT: " --split-mode row

HIP_VISIBLE_DEVICES=0,1,2 ./server -t 4 -ngl 99 -sm row -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf -c 4096 -ts 8,10,10 -b 512 --port 8080 --host 192.168.0.87
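
(For anyone reproducing the comparison, a minimal llama-bench sketch; the model path is a placeholder, and it relies on llama-bench accepting comma-separated parameter values, so -sm layer,row benchmarks both split modes in one run:)

```bash
# Benchmark layer split vs row split back to back on the same model.
# HIP_VISIBLE_DEVICES restricts the run to the first three GPUs.
HIP_VISIBLE_DEVICES=0,1,2 ./llama-bench \
  -m /path/to/model.Q4_K_M.gguf \
  -sm layer,row \
  -p 512 -n 128
```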

8XXD8 changed the title from "Row split is not working" to "Multi GPU --split-mode row speed regression" on Apr 6, 2024
@8XXD8
Author

8XXD8 commented Apr 6, 2024

With a large model, like a 120B Q3_K_M, there is a small increase in token generation speed: 3.5 t/s vs 3 t/s with layer split.
This was 4.95 t/s with b2474 and earlier.
Interestingly, there is no regression in prompt processing speed; it scales as expected with the number of GPUs.
Has anyone else noticed this slowdown with an Nvidia multi-GPU setup?

@JohannesGaessler
Collaborator

I am observing no performance difference whatsoever on 3x P40.

@jukofyork
Contributor

jukofyork commented Apr 8, 2024

+1

I have 2x A6000 and an NVLink bridge.

Using --split-mode row I used to get around a 40% improvement in tokens/s, but now I get almost no improvement with quantized models and only around 10-15% more for fp16 models.

I also used to be able to get some improvement in evaluation speed by raising the batch size to 1024 or 2048, but now this actually slightly reduces my tokens/s in both "row" and "layer" modes.

@phymbert
Collaborator

phymbert commented Apr 8, 2024

Have you also played with --ubatch-size?

@jukofyork
Contributor

> Have you also played with --ubatch-size?

No, I'd never seen that option until now! I just looked in the code and see:

- `-b N`, `--batch-size N`: Set the batch size for prompt processing. Default: `2048`
- `-ub N`, `--ubatch-size N`: Physical maximum batch size. Default: `512`

What is the difference, and has batch-size now had its default raised to 2048 from the old 512?
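
(For reference, a minimal sketch with a placeholder model path, showing both flags passed explicitly to main with the defaults quoted above:)

```bash
# -b sets the logical batch size (default 2048); -ub sets the physical
# batch size actually submitted to the backend per step (default 512).
./main -m /path/to/model.Q4_K_M.gguf -ngl 99 -sm row \
  -b 2048 -ub 512 \
  -n 128 --ignore-eos -p "USER: Tell me a joke ASSISTANT: "
```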

@phymbert
Collaborator

phymbert commented Apr 8, 2024

@jukofyork
Contributor

Thanks! I also found a couple of other threads about this:

#6075
#6017

I'll have a play with it tomorrow and report back.

@jukofyork
Contributor

#6263 also looks important to consider.

@jukofyork
Contributor

Just a quick follow-up to say that using --batch-size 1024 and --ubatch-size 1024 has fixed my problems (thanks @phymbert!).

I'm now actually getting a slightly bigger improvement from --split-mode row than I did previously (~50% increase vs ~40% before).
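
(For reference, a sketch of the kind of invocation described above; the model path and port are placeholders:)

```bash
# Row split with both the logical (-b) and physical (-ub) batch sizes at 1024.
./server -m /path/to/model.Q4_K_M.gguf -ngl 99 -sm row \
  -c 4096 -b 1024 -ub 1024 --port 8080
```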

@phymbert
Collaborator

phymbert commented Apr 10, 2024

Great to hear, but from what I understand from @slaren, ubatch-size is now meant to be at most 256/512, not equal to the logical batch size. Please confirm.

@slaren
Collaborator

slaren commented Apr 10, 2024

There is no problem increasing ubatch-size if it improves performance on some hardware.

slaren closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 10, 2024
@jukofyork
Contributor

jukofyork commented Apr 10, 2024

> Great to hear, but from what I understand from @slaren, ubatch-size is now meant to be at most 256/512, not equal to the logical batch size. Please confirm.

I just set both to 1024, since that is what I used to set the old --batch-size to, for comparison.

> There is no problem increasing ubatch-size if it improves performance on some hardware.

I have 2x A6000 and an NVLink bridge. They are in a dual-CPU board, in PCIe 3.0 x16 slots that are not on the same CPU. I think that is probably why I get such a large speed increase from --split-mode row: it uses the ~56 GB/s NVLink bridge.

From memory, compared to a 1-2 month old version of llama.cpp, I think both --split-mode row and --split-mode layer are running slightly faster than they were (around ~10% more tokens/s each).
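
(One way to confirm that peer traffic between the two cards actually goes over NVLink rather than across the CPUs is the driver's topology matrix; a sketch, assuming a standard CUDA/nvidia-smi install:)

```bash
# Print the GPU interconnect matrix; an NV# entry between the two A6000s
# means they communicate over NVLink instead of PCIe plus the CPU link.
nvidia-smi topo -m
```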
