
Have n_batch default to 512 when BLAS is enabled #1091

Merged
merged 5 commits on Apr 22, 2023

Conversation

@ghost commented Apr 20, 2023

As GGML only uses BLAS when n_batch is 32 or larger, BLAS is not used by default even if llama.cpp is compiled with a BLAS library, because the default n_batch is 8. My patch sets n_batch to the maximum of 512 when BLAS is enabled at compile time and keeps it at 8 if there is no BLAS.

Personally I see the best performance with OpenBLAS and a size of 512, which is why I chose this value. Experimentation may be needed to come up with a good default (as long as it's at least 32).

This came out of a discussion at #1065 (comment).
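
For reference, a minimal sketch of the compile-time variant described above. The GGML_USE_OPENBLAS define is an assumption here; the actual patch may key off a different Makefile flag.

#ifdef GGML_USE_OPENBLAS
    int32_t n_batch = 512;    // BLAS is linked in, so batches of 32+ reach the BLAS mat-mul path
#else
    int32_t n_batch = 8;      // no BLAS: keep the previous default
#endif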

@slaren (Collaborator) commented Apr 20, 2023

It may be better to check for BLAS availability with ggml_cpu_has_blas() instead.

@ghost (Author) commented Apr 20, 2023

Yeah, that would also work. It really comes down to a preference between hardcoding this default in the struct itself and bumping it up manually after instantiating gpt_params. I'll let @ggerganov comment.

@slaren (Collaborator) commented Apr 20, 2023

The initializer doesn't have to be constant, you can do something like this:

int32_t n_batch       = ggml_cpu_has_blas() ? 512 : 8;    // batch size for prompt processing

@ghost (Author) commented Apr 21, 2023

The initializer doesn't have to be constant, you can do something like this:

int32_t n_batch       = ggml_cpu_has_blas() ? 512 : 8;    // batch size for prompt processing

Aargh, I'm getting rusty 😉

94cb00a is the alternate implementation with ggml_cpu_has_blas() and c6dfc44 is the original. The alternate version has the advantage of not requiring a Makefile update.

@ghost (Author) commented Apr 21, 2023

Here are some prompt eval time results with q4_0 13B Llama, OpenBLAS, and my standard 320-token prompt. I ran this on Linux with an i5-6500 and 16 GB of RAM. An n_batch of 512 is a clear winner in my environment.

n_batch      time (ms/token)
8 (no BLAS)  ~270
32           ~415
128          ~200
256          ~170
512          ~145

@slaren (Collaborator) commented Apr 21, 2023

512 is also good for cuBLAS.

@ggerganov (Owner) commented Apr 21, 2023

Wonder if we should use 512 even without BLAS.
Any reason not to do it?

Adding ggml.h to common.h is not very desirable. I think we should just simplify it, unless it is not a good default for some use cases.
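
In that case the simplified version would just be an unconditional default in gpt_params, with no ggml.h include needed in common.h. A sketch (this mirrors what b2e8a32 is described as doing further down):

int32_t n_batch = 512;    // batch size for prompt processing (>= 32, so BLAS is used when available)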

@DannyDaemonic (Contributor)

From a user experience point of view, the smaller batch sizes let you know where you are in your initial prompt evaluation, as each batch is written out right before evaluation. Personally I'd happily trade that away for quicker processing, but I do see how it might be nice to watch it churn through your prompt on those slow larger models.

I doubt anyone would make anything so intricate, but it might be nice if there were some sort of callback that gives a percentage update, for anyone who wants to build a GUI and use larger batch sizes to take advantage of BLAS. (They could probably just evaluate a token or two and then estimate the time for a block of 512.)
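
A purely hypothetical sketch of that callback idea; none of these names exist in llama.cpp's API, and eval_batch stands in for whatever actually evaluates the tokens (e.g. llama_eval):

#include <algorithm>
#include <cstddef>
#include <functional>

using prompt_progress_cb = std::function<void(float /* fraction processed, 0..1 */)>;

// walk the prompt in n_batch-sized chunks and report progress after each chunk
static void eval_prompt_with_progress(const int * tokens, std::size_t n_tokens, std::size_t n_batch,
                                      const std::function<void(const int *, std::size_t)> & eval_batch,
                                      const prompt_progress_cb & on_progress) {
    for (std::size_t i = 0; i < n_tokens; i += n_batch) {
        const std::size_t n = std::min(n_batch, n_tokens - i);
        eval_batch(tokens + i, n);
        on_progress(static_cast<float>(i + n) / static_cast<float>(n_tokens));
    }
}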

@ghost (Author) commented Apr 22, 2023

b2e8a32 has n_batch set to 512 by default, regardless of whether BLAS is enabled.

In my tests without BLAS I saw a prompt eval time of around 260-270 ms/token regardless of batch size, using the same setup as my previous tests. I tried batch sizes between 8 and 512, so in my case the batch size does not matter when BLAS is off.

As an aside I'm curious as to why n_batch is hard limited to 512 - is there a technical reason why we can't use larger values?

@ggerganov (Owner)

As an aside I'm curious as to why n_batch is hard limited to 512 - is there a technical reason why we can't use larger values?

I think going above 512 will require increasing some of the buffer sizes in llama.cpp:

llama.cpp/llama.cpp

Lines 45 to 97 in 872c365

static const size_t MB = 1024*1024;

// computed for n_ctx == 2048
// TODO: dynamically determine these sizes
//       needs modifications in ggml

static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
{
    static std::map<e_model, size_t> _MEM_REQ_SCRATCH0 = {
        { MODEL_7B,   512ull * MB },
        { MODEL_13B,  512ull * MB },
        { MODEL_30B,  512ull * MB },
        { MODEL_65B,  512ull * MB },
    };
    return _MEM_REQ_SCRATCH0;
}

static const std::map<e_model, size_t> & MEM_REQ_SCRATCH1()
{
    static std::map<e_model, size_t> _MEM_REQ_SCRATCH1 = {
        { MODEL_7B,   512ull * MB },
        { MODEL_13B,  512ull * MB },
        { MODEL_30B,  512ull * MB },
        { MODEL_65B,  512ull * MB },
    };
    return _MEM_REQ_SCRATCH1;
}

// 2*n_embd*n_ctx*n_layer*sizeof(float16)
static const std::map<e_model, size_t> & MEM_REQ_KV_SELF()
{
    static std::map<e_model, size_t> _MEM_REQ_KV_SELF = {
        { MODEL_7B,  1026ull * MB },
        { MODEL_13B, 1608ull * MB },
        { MODEL_30B, 3124ull * MB },
        { MODEL_65B, 5120ull * MB },
    };
    return _MEM_REQ_KV_SELF;
}

// this is mostly needed for temporary mul_mat buffers to dequantize the data
// not actually needed if BLAS is disabled
static const std::map<e_model, size_t> & MEM_REQ_EVAL()
{
    static std::map<e_model, size_t> _MEM_REQ_EVAL = {
        { MODEL_7B,   768ull * MB },
        { MODEL_13B, 1024ull * MB },
        { MODEL_30B, 1280ull * MB },
        { MODEL_65B, 1536ull * MB },
    };
    return _MEM_REQ_EVAL;
}

But I'm not 100% sure.
If you give it a try with n_batch == 2048 and a very large prompt and it does not crash, we can remove this restriction.

@ghost (Author) commented Apr 24, 2023

Yeah, it segfaults alright with the 512 limit removed, an n_batch of 2048, and a 2k+ token prompt. The same prompt has no issues with an n_batch of 512.

ggml_new_tensor_impl: not enough space in the scratch memory
<REDACTED>.sh: line 6:  5156 Segmentation fault      ./main -m <REDACTED>/ggml-model-llama13b-q4_0.bin -c 2048 -n 1 --keep -1 -s 1 --repeat_penalty 1.1 --top_k 0 --top_p 0.73 --temp 0.72 --color -f <REDACTED>.txt -b 2048

In valgrind:

ggml_new_tensor_impl: not enough space in the scratch memory
==5916== Invalid write of size 4
==5916==    at 0x125071: ggml_mul_mat (ggml.c:4918)
==5916==    by 0x132162: llama_eval_internal(llama_context&, int const*, int, int, int) (llama.cpp:1137)
==5916==    by 0x132842: llama_eval (llama.cpp:2268)
==5916==    by 0x10F7AC: main (main.cpp:295)
==5916==  Address 0x48 is not stack'd, malloc'd or (recently) free'd
==5916== 
==5916== 
==5916== Process terminating with default action of signal 11 (SIGSEGV)
==5916==  Access not within mapped region at address 0x48
==5916==    at 0x125071: ggml_mul_mat (ggml.c:4918)
==5916==    by 0x132162: llama_eval_internal(llama_context&, int const*, int, int, int) (llama.cpp:1137)
==5916==    by 0x132842: llama_eval (llama.cpp:2268)
==5916==    by 0x10F7AC: main (main.cpp:295)

This is easy to reproduce (just remove the limit, set n_batch to 2048, and use a big prompt).

With an n_batch of 1024 and no limit, llama.cpp works fine with 2k+ token prompts, though with OpenBLAS I don't see a performance improvement in prompt ingestion (still around 150 ms/token). So for my use case I don't see the need to support even larger n_batch sizes, though for GPU users it may be a different story.

@ggerganov (Owner)

Regarding the crash, see #1152 (comment)

@gjmulder (Collaborator) commented Apr 26, 2023

@eiery

With an n_batch of 1024 and no limit, llama.cpp works fine with 2k+ token prompts, though with OpenBLAS I don't see a performance improvement in prompt ingestion (still around 150 ms/token). So for my use case I don't see the need to support even larger n_batch sizes, though for GPU users it may be a different story.

Upvote here for an n_batch of 1024 and higher, as I have 128 GB of RAM and was seeing a clear performance trend as reported in #1129 (comment).

@ghost (Author) commented May 1, 2023

@gjmulder From your comment it looks like you have a GPU available to test with. Could you run with a batch size of 1024 or higher and see if it improves your results? Again, on CPU it does nothing for me.

@gjmulder (Collaborator) commented May 1, 2023

@eiery ./perplexity is still reporting a batch size of 512:

$ git log | head -3
commit 7f15c5c477d9933689a9d1c40794483e350c2f19
Author: Georgi Gerganov <[email protected]>
Date:   Fri Apr 28 21:32:52 2023 +0300

$ ./perplexity -t 16 -m /data/llama/alpaca-lora-65B-GGML/alpaca-lora-65B.GGML.q4_0.bin -c 512 -b 1024 -s 42 -f /data/llama/wikitext-2-raw/wiki.test.raw.406
main: seed = 42
llama.cpp: loading model from /data/llama/alpaca-lora-65B-GGML/alpaca-lora-65B.GGML.q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 4 (mostly Q4_1, some F16)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 146.86 KB
llama_model_load_internal: mem required  = 42501.67 MB (+ 5120.00 MB per state)
llama_init_from_file: kv self size  = 1280.00 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 72 chunks, batch_size=512
^C

EDIT:

$ ldd ./perplexity | grep cuda
	libcublas.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 (0x00007efbf9400000)
	libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x00007efbf9000000)
	libcublasLt.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.12 (0x00007efbd6a00000)

@ghost (Author) commented May 1, 2023

For perplexity the batch size cannot be greater than the ctx size, so in your case it shows 512 as well. If you increase the ctx size to 1024, it should work.
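
In other words the value is effectively just clamped; a rough sketch of the behaviour described above (not the literal perplexity code):

#include <algorithm>

// the batch size actually used per chunk can never exceed the context size
static int effective_batch_size(int n_batch, int n_ctx) {
    return std::min(n_batch, n_ctx);
}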

./perplexity -m models/llama-13B-ggml/ggml-model-llama13b-q4_0.bin -c 1024 -b 1024 -s 42
main: seed = 42
llama.cpp: loading model from models/llama-13B-ggml/ggml-model-llama13b-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  =  800.00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 0 chunks, batch_size=1024

This of course assumes that you have already patched common.cpp by removing the line params.n_batch = std::min(512, params.n_batch);. After doing so, both generation and perplexity should work with batch sizes above 512.
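
For reference, a sketch of that clamp and one way to relax it (only the first statement is the actual quoted line from common.cpp; the rest is illustrative):

// current behaviour: whatever -b was passed, the value is clamped to 512
params.n_batch = std::min(512, params.n_batch);

// to experiment with larger batches, delete the clamp or raise the cap, e.g.:
// params.n_batch = std::min(2048, params.n_batch);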

@ghost (Author) commented May 2, 2023

Here are some CLBlast results on my HD 530 iGPU, with the n_batch limit removed and the patch in #1152 (comment) used to get around the segfault. This is on 13B with a 2000-token prompt.

n_batch   time (ms/token)
256       ~350
512       ~290
1024      ~270
2048      ~270

In this case performance plateaus at the 1024 n_batch mark.
