Multi GPU --split-mode row speed regression #6476

Closed
8XXD8 opened this issue Apr 4, 2024 · 12 comments

@8XXD8

8XXD8 commented Apr 4, 2024

Since b2475, row split and layer split have had the same performance.
llama-bench is not affected, but main and server show this regression.

b2474

main
llama_print_timings: load time = 9945.29 ms
llama_print_timings: sample time = 4.05 ms / 128 runs ( 0.03 ms per token, 31565.97 tokens per second)
llama_print_timings: prompt eval time = 1712.75 ms / 15 tokens ( 114.18 ms per token, 8.76 tokens per second)
llama_print_timings: eval time = 9521.36 ms / 127 runs ( 74.97 ms per token, 13.34 tokens per second)
llama_print_timings: total time = 11268.98 ms / 142 tokens

server

{"function":"print_timings","id_slot":0,"id_task":0,"level":"INFO","line":322,"msg":"generation eval time = 23176.51 ms / 281 runs ( 82.48 ms per token, 12.12 tokens per second)","n_decoded":281,"n_tokens_second":12.124345910954315,"t_token":82.47867615658363,"t_token_generation":23176.508,"tid":"139827722453120","timestamp":1712220482}

b2475

main
llama_print_timings: sample time = 3.13 ms / 128 runs ( 0.02 ms per token, 40933.80 tokens per second)
llama_print_timings: prompt eval time = 3413.32 ms / 15 tokens ( 227.55 ms per token, 4.39 tokens per second)
llama_print_timings: eval time = 14874.55 ms / 127 runs ( 117.12 ms per token, 8.54 tokens per second)
llama_print_timings: total time = 18340.76 ms / 142 tokens

server
{"function":"print_timings","id_slot":0,"id_task":0,"level":"INFO","line":322,"msg":"generation eval time = 38207.86 ms / 313 runs ( 122.07 ms per token, 8.19 tokens per second)","n_decoded":313,"n_tokens_second":8.192032335129394,"t_token":122.06983067092654,"t_token_generation":38207.857,"tid":"139892693971072","timestamp":1712220597}

llama-bench

| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | ROCm | 99 | row | pp 512 | 21.78 ± 0.03 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | ROCm | 99 | row | tg 128 | 12.96 ± 0.02 |

build: ccf58aa (1)

b2600

main
llama_print_timings: load time = 9996.37 ms
llama_print_timings: sample time = 3.06 ms / 128 runs ( 0.02 ms per token, 41871.12 tokens per second)
llama_print_timings: prompt eval time = 3380.90 ms / 15 tokens ( 225.39 ms per token, 4.44 tokens per second)
llama_print_timings: eval time = 14900.67 ms / 127 runs ( 117.33 ms per token, 8.52 tokens per second)
llama_print_timings: total time = 18311.99 ms / 142 tokens

server
{"tid":"139684675611648","timestamp":1712223015,"level":"INFO","function":"print_timings","line":332,"msg":"generation eval time = 35799.18 ms / 295 runs ( 121.35 ms per token, 8.24 tokens per second)","id_slot":0,"id_task":0,"t_token_generation":35799.182,"n_decoded":295,"t_token":121.3531593220339,"n_tokens_second":8.240411750190269}

llama-bench

| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | ROCm | 99 | row | pp 512 | 22.02 ± 0.04 |
| llama 70B Q4_K - Medium | 38.58 GiB | 68.98 B | ROCm | 99 | row | tg 128 | 12.93 ± 0.01 |

build: 4399f13 (2600)

Commands used:

HIP_VISIBLE_DEVICES=0,1,2 ./llama-bench -sm row -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf

HIP_VISIBLE_DEVICES=0,1,2 ./main -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf -t 4 -ngl 99 --seed 1234 -n 128 --ignore-eos -p "USER: Tell me a joke ASSISTANT: " --split-mode row

HIP_VISIBLE_DEVICES=0,1,2 ./server -t 4 -ngl 99 -sm row -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf -c 4096 -ts 8,10,10 -b 512 --port 8080 --host 192.168.0.87
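
(For anyone reproducing the comparison, a minimal llama-bench sketch; the model path is a placeholder, and it relies on llama-bench accepting comma-separated parameter values, so -sm layer,row benchmarks both split modes in one run:)

```bash
# Benchmark layer split vs row split back to back on the same model.
# HIP_VISIBLE_DEVICES restricts the run to the first three GPUs.
HIP_VISIBLE_DEVICES=0,1,2 ./llama-bench \
  -m /path/to/model.Q4_K_M.gguf \
  -sm layer,row \
  -p 512 -n 128
```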

8XXD8 changed the title from "Row split is not working" to "Multi GPU --split-mode row speed regression" on Apr 6, 2024
@8XXD8
Author

8XXD8 commented Apr 6, 2024

With a large model, like a 120B Q3_K_M, there is a small increase in token generation speed: 3.5 t/s vs 3 t/s with layer split.
This was 4.95 t/s with b2474 and earlier.
Interestingly, there is no regression in prompt processing speed; it scales as expected with the number of GPUs.
Has anyone else noticed this slowdown with an Nvidia multi-GPU setup?

@JohannesGaessler
Collaborator

I am observing no performance difference whatsoever on 3x P40.

@jukofyork
Contributor

jukofyork commented Apr 8, 2024

+1

I have 2x A6000 and an NVLink bridge.

Using --split-mode row I used to get around a 40% improvement in tokens/s, but now I get almost no improvement with quantized models and only around 10-15% more for fp16 models.

I also used to be able to get some improvement in evaluation speed by raising the batch size to 1024 or 2048, but now this actually slightly reduces my tokens/s in both "row" and "layer" modes.

@phymbert
Collaborator

phymbert commented Apr 8, 2024

Have you also played with --ubatch-size?

@jukofyork
Contributor

> Have you also played with --ubatch-size?

No, I'd never seen that option until now! I just looked in the code and see:

- `-b N`, `--batch-size N`: Set the batch size for prompt processing. Default: `2048`
- `-ub N`, `--ubatch-size N`: Physical maximum batch size. Default: `512`

What is the difference, and has batch-size now had its default raised to 2048 from the old 512?
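
(For reference, a minimal sketch with a placeholder model path, showing both flags passed explicitly to main with the defaults quoted above:)

```bash
# -b sets the logical batch size (default 2048); -ub sets the physical
# batch size actually submitted to the backend per step (default 512).
./main -m /path/to/model.Q4_K_M.gguf -ngl 99 -sm row \
  -b 2048 -ub 512 \
  -n 128 --ignore-eos -p "USER: Tell me a joke ASSISTANT: "
```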

@phymbert
Collaborator

phymbert commented Apr 8, 2024

@jukofyork
Contributor

Thanks! I also found a couple of other threads about this:

#6075
#6017

I'll have a play with it tomorrow and report back.

@jukofyork
Contributor

#6263 also looks important to consider.

@jukofyork
Contributor

Just a quick follow-up to say that using --batch-size 1024 and --ubatch-size 1024 has fixed my problems (thanks @phymbert!).

I'm now actually getting a slightly bigger improvement from --split-mode row than I did previously (~50% increase vs ~40% before).
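
(For reference, a sketch of the kind of invocation described above; the model path and port are placeholders:)

```bash
# Row split with both the logical (-b) and physical (-ub) batch sizes at 1024.
./server -m /path/to/model.Q4_K_M.gguf -ngl 99 -sm row \
  -c 4096 -b 1024 -ub 1024 --port 8080
```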

@phymbert
Collaborator

phymbert commented Apr 10, 2024

Great to hear, but from what I understand from @slaren, ubatch-size is now meant to be at most 256/512, not equal to the logical batch size. Please confirm.

@slaren
Collaborator

slaren commented Apr 10, 2024

There is no problem increasing ubatch-size if it improves performance on some hardware.

slaren closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 10, 2024
@jukofyork
Contributor

jukofyork commented Apr 10, 2024

> Great to hear, but from what I understand from @slaren, ubatch-size is now meant to be at most 256/512, not equal to the logical batch size. Please confirm.

I just set both to 1024, since that is what I used to set the old --batch-size to, for comparison.

> There is no problem increasing ubatch-size if it improves performance on some hardware.

I have 2x A6000 and an NVLink bridge. They are in a dual-CPU board, in PCIe 3.0 x16 slots that are not on the same CPU. I think that is probably why I get such a large speed increase from --split-mode row: it uses the ~56 GB/s NVLink bridge.

From memory, compared to a 1-2 month old version of llama.cpp, I think both --split-mode row and --split-mode layer are running slightly faster than they were (around ~10% more tokens/s each).
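
(One way to confirm that peer traffic between the two cards actually goes over NVLink rather than across the CPUs is the driver's topology matrix; a sketch, assuming a standard CUDA/nvidia-smi install:)

```bash
# Print the GPU interconnect matrix; an NV# entry between the two A6000s
# means they communicate over NVLink instead of PCIe plus the CPU link.
nvidia-smi topo -m
```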
