
Regression in prompt processing speed using a batch size of 1024 #6075

Closed

Dampfinchen opened this issue Mar 15, 2024 · 3 comments

Comments


Dampfinchen commented Mar 15, 2024

Hello,

I've noticed a significant speed reduction in prompt processing when comparing the latest llama.cpp builds to slightly older ones.

I think it has something to do with the batch size. The speed at a batch size of 512 is the same as it has always been, but with -b 1024 it is significantly slower.

Comparison, latest llama.cpp vs. an older build: -n 180 -c 4096 -t 6 --gpu-layers 5 --ignore-eos -b 1024, Mixtral IQ4_XS, Core i7 9750H, 32 GB RAM, RTX 2060

version: 2431 (4755afd)

llama_print_timings:        load time =    2339,43 ms
llama_print_timings:      sample time =      67,74 ms /   180 runs   (    0,38 ms per token,  2657,10 tokens per second)
llama_print_timings: prompt eval time =   72387,34 ms /  3602 tokens (   20,10 ms per token,    49,76 tokens per second)
llama_print_timings:        eval time =   44119,33 ms /   179 runs   (  246,48 ms per token,     4,06 tokens per second)
llama_print_timings:       total time =  116631,73 ms /  3781 tokens

version: 2405 (5cdb371)

llama_print_timings:        load time =    2482,92 ms
llama_print_timings:      sample time =      69,55 ms /   180 runs   (    0,39 ms per token,  2587,99 tokens per second)
llama_print_timings: prompt eval time =   51669,64 ms /  3602 tokens (   14,34 ms per token,    69,71 tokens per second)
llama_print_timings:        eval time =   42287,08 ms /   179 runs   (  236,24 ms per token,     4,23 tokens per second)
llama_print_timings:       total time =   94085,31 ms /  3781 tokens

@slaren Do you think there is a commit that could have caused this? Listening to the coil whine of my laptop while it processes the prompt, there is a very noticeable difference in the sound. With the recent commit, it sounds like it is processing two 512-token batches instead of one 1024-token batch (there is a noticeable pause in the coil whine at some point), even though the terminal reports the usual 1024 batch size. With the older commit there is no such pause and the sound is continuous for the whole 1024 tokens.

The speed difference is quite stark (20 ms/t vs. 14 ms/t). I hope you can take a look at this. Thank you!

LostRuins (Collaborator) commented Mar 15, 2024

Probably happened after #6017

From that PR:

- Automatic batch splitting in llama_decode
  - llama_decode automatically splits the batches into multiple smaller batches if they are too big for the configured compute batch size
  - The largest batch size that can be submitted to llama_decode is still limited by n_batch, to reduce the size of the logits and embeddings buffers
- Adds n_ubatch (-ub on the command line) to llama_context_params
  - n_batch sets the size of the logits and embeddings buffers, which limits the maximum batch size passed to llama_decode
  - n_ubatch sets the maximum batch size for computation
  - By default n_batch is 4096 and n_ubatch is 512
- This allows current applications to take advantage of pipeline parallelism by setting a larger n_batch without having to update their logic

What happens if you run with -ub 1024?
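
For completeness, here is a minimal sketch of the same thing from the library side, assuming the llama.h API as changed by that PR (the helper name below is made up for illustration):

```c
// Minimal sketch, assuming the llama.h API after PR #6017; the helper name is
// hypothetical. Setting n_ubatch to match n_batch makes the compute batch as
// large as the logical batch again.
#include "llama.h"

static struct llama_context * make_large_batch_ctx(struct llama_model * model) {
    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx    = 4096;
    cparams.n_batch  = 1024; // logical batch: sizes the logits/embeddings buffers
    cparams.n_ubatch = 1024; // physical batch: defaults to 512, so -b 1024 alone
                             // is now computed as two 512-token chunks
    return llama_new_context_with_model(model, cparams);
}
```

On the command line, the equivalent is passing -ub alongside -b.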

Dampfinchen (Author) commented Mar 15, 2024

Yes, I can confirm that running with -ub 1024 fixes it.

Although I have the feeling it uses more VRAM than before. Needs more testing.

Edit: Nope, my testing shows no increase in VRAM. All is good.

slaren (Collaborator) commented Mar 15, 2024

Looks like you already figured it out: the parameter to change the physical batch size is now -ub. I will open a PR later today that should significantly improve batch performance with partial offloading.
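
(Given the flags in the original report, that means passing e.g. -b 1024 -ub 1024; the rest of the command line stays the same.)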
