
Have n_batch default to 512 when BLAS is enabled #1091

Merged
merged 5 commits on Apr 22, 2023

Conversation

@ghost commented Apr 20, 2023

As GGML only uses BLAS when n_batch is 32 or larger, BLAS is not used by default even if llama.cpp is compiled with a BLAS library, because the default n_batch is 8. My patch sets n_batch to the maximum of 512 when BLAS is enabled at compile time and keeps it at 8 if there is no BLAS.

Personally I see the best performance with OpenBLAS and a size of 512, which is why I chose this value. Experimentation may be needed to come up with a good default (as long as it's at least 32).

This came out of a discussion at #1065 (comment).
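
For reference, a minimal sketch of the compile-time variant described above. The GGML_USE_OPENBLAS define is an assumption here; the actual patch may key off a different Makefile flag.

#ifdef GGML_USE_OPENBLAS
    int32_t n_batch = 512;    // BLAS is linked in, so batches of 32+ reach the BLAS mat-mul path
#else
    int32_t n_batch = 8;      // no BLAS: keep the previous default
#endif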

@slaren (Collaborator) commented Apr 20, 2023

It may be better to check for BLAS availability with ggml_cpu_has_blas() instead.

@ghost (Author) commented Apr 20, 2023

Yeah, that would also work. It really comes down to a preference between hardcoding this default in the struct itself and bumping it up manually after instantiating gpt_params. I'll let @ggerganov comment.

@slaren (Collaborator) commented Apr 20, 2023

The initializer doesn't have to be constant, you can do something like this:

int32_t n_batch       = ggml_cpu_has_blas() ? 512 : 8;    // batch size for prompt processing

@ghost (Author) commented Apr 21, 2023

The initializer doesn't have to be constant, you can do something like this:

int32_t n_batch       = ggml_cpu_has_blas() ? 512 : 8;    // batch size for prompt processing

Aargh, I'm getting rusty 😉

94cb00a is the alternate implementation with ggml_cpu_has_blas() and c6dfc44 is the original. The alternate version has the advantage of not requiring a Makefile update.

@ghost (Author) commented Apr 21, 2023

Here are some prompt eval time results with q4_0 13B Llama, OpenBLAS, and my standard 320-token prompt. I ran this on Linux with an i5-6500 and 16 GB of RAM. An n_batch of 512 is a clear winner in my environment.

n_batch      time (ms/token)
8 (no BLAS)  ~270
32           ~415
128          ~200
256          ~170
512          ~145

@slaren (Collaborator) commented Apr 21, 2023

512 is also good for cuBLAS.

@ggerganov (Owner) commented Apr 21, 2023

Wonder if we should use 512 even without BLAS.
Any reason not to do it?

Adding ggml.h to common.h is not very desirable. I think we should just simplify it, unless it is not a good default for some use cases.
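
In that case the simplified version would just be an unconditional default in gpt_params, with no ggml.h include needed in common.h. A sketch (this mirrors what b2e8a32 is described as doing further down):

int32_t n_batch = 512;    // batch size for prompt processing (>= 32, so BLAS is used when available)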

@DannyDaemonic (Contributor)

From a user experience point of view, the smaller batch sizes let you know where you are in your initial prompt evaluation, as each batch is written out right before evaluation. Personally I'd happily trade that away for quicker processing, but I do see how it might be nice to watch it churn through your prompt on those slow larger models.

I doubt anyone would make anything so intricate, but it might be nice if there were some sort of callback that gives a percentage update, for anyone who wants to build a GUI and use larger batch sizes to take advantage of BLAS. (They could probably just evaluate a token or two and then estimate the time for a block of 512.)
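
A purely hypothetical sketch of that callback idea; none of these names exist in llama.cpp's API, and eval_batch stands in for whatever actually evaluates the tokens (e.g. llama_eval):

#include <algorithm>
#include <cstddef>
#include <functional>

using prompt_progress_cb = std::function<void(float /* fraction processed, 0..1 */)>;

// walk the prompt in n_batch-sized chunks and report progress after each chunk
static void eval_prompt_with_progress(const int * tokens, std::size_t n_tokens, std::size_t n_batch,
                                      const std::function<void(const int *, std::size_t)> & eval_batch,
                                      const prompt_progress_cb & on_progress) {
    for (std::size_t i = 0; i < n_tokens; i += n_batch) {
        const std::size_t n = std::min(n_batch, n_tokens - i);
        eval_batch(tokens + i, n);
        on_progress(static_cast<float>(i + n) / static_cast<float>(n_tokens));
    }
}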

@ghost (Author) commented Apr 22, 2023

b2e8a32 has n_batch set to 512 by default, regardless of whether BLAS is enabled.

In my tests without BLAS I saw a prompt eval time of around 260-270 ms/token regardless of batch size, using the same setup as my previous tests. I tried batch sizes between 8 and 512, so in my case the batch size does not matter when BLAS is off.

As an aside I'm curious as to why n_batch is hard limited to 512 - is there a technical reason why we can't use larger values?

@ggerganov (Owner)

As an aside I'm curious as to why n_batch is hard limited to 512 - is there a technical reason why we can't use larger values?

I think going above 512 will require increasing some of the buffer sizes in llama.cpp:

llama.cpp/llama.cpp

Lines 45 to 97 in 872c365

static const size_t MB = 1024*1024;

// computed for n_ctx == 2048
// TODO: dynamically determine these sizes
//       needs modifications in ggml

static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
{
    static std::map<e_model, size_t> _MEM_REQ_SCRATCH0 = {
        { MODEL_7B,   512ull * MB },
        { MODEL_13B,  512ull * MB },
        { MODEL_30B,  512ull * MB },
        { MODEL_65B,  512ull * MB },
    };
    return _MEM_REQ_SCRATCH0;
}

static const std::map<e_model, size_t> & MEM_REQ_SCRATCH1()
{
    static std::map<e_model, size_t> _MEM_REQ_SCRATCH1 = {
        { MODEL_7B,   512ull * MB },
        { MODEL_13B,  512ull * MB },
        { MODEL_30B,  512ull * MB },
        { MODEL_65B,  512ull * MB },
    };
    return _MEM_REQ_SCRATCH1;
}

// 2*n_embd*n_ctx*n_layer*sizeof(float16)
static const std::map<e_model, size_t> & MEM_REQ_KV_SELF()
{
    static std::map<e_model, size_t> _MEM_REQ_KV_SELF = {
        { MODEL_7B,  1026ull * MB },
        { MODEL_13B, 1608ull * MB },
        { MODEL_30B, 3124ull * MB },
        { MODEL_65B, 5120ull * MB },
    };
    return _MEM_REQ_KV_SELF;
}

// this is mostly needed for temporary mul_mat buffers to dequantize the data
// not actually needed if BLAS is disabled
static const std::map<e_model, size_t> & MEM_REQ_EVAL()
{
    static std::map<e_model, size_t> _MEM_REQ_EVAL = {
        { MODEL_7B,   768ull * MB },
        { MODEL_13B, 1024ull * MB },
        { MODEL_30B, 1280ull * MB },
        { MODEL_65B, 1536ull * MB },
    };
    return _MEM_REQ_EVAL;
}

But I'm not 100% sure.
If you give it a try with n_batch == 2048 and a very large prompt and it does not crash, we can remove this restriction.

@ghost (Author) commented Apr 24, 2023

Yeah, it segfaults alright with the 512 limit removed, an n_batch of 2048, and a 2k+ token prompt. The same prompt has no issues with an n_batch of 512.

ggml_new_tensor_impl: not enough space in the scratch memory
<REDACTED>.sh: line 6:  5156 Segmentation fault      ./main -m <REDACTED>/ggml-model-llama13b-q4_0.bin -c 2048 -n 1 --keep -1 -s 1 --repeat_penalty 1.1 --top_k 0 --top_p 0.73 --temp 0.72 --color -f <REDACTED>.txt -b 2048

In valgrind:

ggml_new_tensor_impl: not enough space in the scratch memory
==5916== Invalid write of size 4
==5916==    at 0x125071: ggml_mul_mat (ggml.c:4918)
==5916==    by 0x132162: llama_eval_internal(llama_context&, int const*, int, int, int) (llama.cpp:1137)
==5916==    by 0x132842: llama_eval (llama.cpp:2268)
==5916==    by 0x10F7AC: main (main.cpp:295)
==5916==  Address 0x48 is not stack'd, malloc'd or (recently) free'd
==5916== 
==5916== 
==5916== Process terminating with default action of signal 11 (SIGSEGV)
==5916==  Access not within mapped region at address 0x48
==5916==    at 0x125071: ggml_mul_mat (ggml.c:4918)
==5916==    by 0x132162: llama_eval_internal(llama_context&, int const*, int, int, int) (llama.cpp:1137)
==5916==    by 0x132842: llama_eval (llama.cpp:2268)
==5916==    by 0x10F7AC: main (main.cpp:295)

This is easy to reproduce (just remove the limit, set n_batch to 2048, and use a big prompt).

With an n_batch of 1024 and no limit, llama.cpp works fine with 2k+ token prompts, though with OpenBLAS I don't see a performance improvement in prompt ingestion (still around 150 ms/token). So for my use case I don't see the need to support even larger n_batch sizes, though for GPU users it may be a different story.

@ggerganov (Owner)

Regarding the crash, see #1152 (comment)

@gjmulder (Collaborator) commented Apr 26, 2023

@eiery

With an n_batch of 1024 and no limit, llama.cpp works fine with 2k+ token prompts, though with OpenBLAS I don't see a performance improvement in prompt ingestion (still around 150 ms/token). So for my use case I don't see the need to support even larger n_batch sizes, though for GPU users it may be a different story.

Upvote here for an n_batch of 1024 and higher, as I have 128 GB of RAM and was seeing a clear performance trend as reported in #1129 (comment).

@ghost (Author) commented May 1, 2023

@gjmulder From your comment it looks like you have a GPU available to test with. Could you run with a batch size of 1024 or higher and see if it improves your results? Again, on CPU it does nothing for me.

@gjmulder (Collaborator) commented May 1, 2023

@eiery ./perplexity is still reporting a batch size of 512:

$ git log | head -3
commit 7f15c5c477d9933689a9d1c40794483e350c2f19
Author: Georgi Gerganov <[email protected]>
Date:   Fri Apr 28 21:32:52 2023 +0300

$ ./perplexity -t 16 -m /data/llama/alpaca-lora-65B-GGML/alpaca-lora-65B.GGML.q4_0.bin -c 512 -b 1024 -s 42 -f /data/llama/wikitext-2-raw/wiki.test.raw.406
main: seed = 42
llama.cpp: loading model from /data/llama/alpaca-lora-65B-GGML/alpaca-lora-65B.GGML.q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 4 (mostly Q4_1, some F16)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 146.86 KB
llama_model_load_internal: mem required  = 42501.67 MB (+ 5120.00 MB per state)
llama_init_from_file: kv self size  = 1280.00 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 72 chunks, batch_size=512
^C

EDIT:

$ ldd ./perplexity | grep cuda
	libcublas.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 (0x00007efbf9400000)
	libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x00007efbf9000000)
	libcublasLt.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.12 (0x00007efbd6a00000)

@ghost (Author) commented May 1, 2023

For perplexity the batch size cannot be greater than the ctx size, so in your case it shows 512 as well. If you increase the ctx size to 1024, it should work.
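
In other words the value is effectively just clamped; a rough sketch of the behaviour described above (not the literal perplexity code):

#include <algorithm>

// the batch size actually used per chunk can never exceed the context size
static int effective_batch_size(int n_batch, int n_ctx) {
    return std::min(n_batch, n_ctx);
}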

./perplexity -m models/llama-13B-ggml/ggml-model-llama13b-q4_0.bin -c 1024 -b 1024 -s 42
main: seed = 42
llama.cpp: loading model from models/llama-13B-ggml/ggml-model-llama13b-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  =  800.00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 0 chunks, batch_size=1024

This of course assumes that you have already patched common.cpp by removing the line params.n_batch = std::min(512, params.n_batch);. After doing so, both generation and perplexity should work with batch sizes above 512.
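
For reference, a sketch of that clamp and one way to relax it (only the first statement is the actual quoted line from common.cpp; the rest is illustrative):

// current behaviour: whatever -b was passed, the value is clamped to 512
params.n_batch = std::min(512, params.n_batch);

// to experiment with larger batches, delete the clamp or raise the cap, e.g.:
// params.n_batch = std::min(2048, params.n_batch);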

@ghost (Author) commented May 2, 2023

Here are some CLBlast results on my HD 530 iGPU, with the n_batch limit removed and the patch in #1152 (comment) used to get around the segfault. This is on 13B with a 2000-token prompt.

n_batch   time (ms/token)
256       ~350
512       ~290
1024      ~270
2048      ~270

In this case performance plateaus at the 1024 n_batch mark.
