llama : refactor llama_build_graph to reduce code duplication #3382

Closed
ggerganov opened this issue Sep 28, 2023 · 4 comments
Labels
good first issue (Good for newcomers) · high priority (Very important issue) · refactoring

Comments

@ggerganov
Owner

As support for new model architectures is added, we are starting to observe a lot of repeating patterns in the code that builds their compute graphs. We should find a way to refactor and reuse the repetitive code, and we should also consider splitting the implementation into separate source files if necessary.

llama.cpp/llama.cpp

Lines 3997 to 4026 in 0e76a89

static struct ggml_cgraph * llama_build_graph(
         llama_context & lctx,
     const llama_batch & batch) {
    const auto & model = lctx.model;

    struct ggml_cgraph * result = NULL;

    switch (model.arch) {
        case LLM_ARCH_LLAMA:
            {
                result = llm_build_llama(lctx, batch);
            } break;
        case LLM_ARCH_BAICHUAN:
            {
                result = llm_build_baichaun(lctx, batch);
            } break;
        case LLM_ARCH_FALCON:
            {
                result = llm_build_falcon(lctx, batch);
            } break;
        case LLM_ARCH_STARCODER:
            {
                result = llm_build_starcoder(lctx, batch);
            } break;
        default:
            GGML_ASSERT(false);
    };

    return result;
}

Open to ideas and suggestions
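
For illustration only, one possible direction (a sketch, not a decided design): pull the subgraph patterns that every llm_build_xxx() repeats, e.g. the RMS-norm + scale step, into small shared helpers, so the per-architecture builders shrink to the parts that genuinely differ. The helper name llm_build_norm_sketch below is hypothetical.

// Hypothetical shared helper (name is illustrative): the RMS-norm + scale
// pattern that the per-architecture builders currently repeat for every layer.
static struct ggml_tensor * llm_build_norm_sketch(
        struct ggml_context * ctx0,
        struct ggml_tensor  * cur,
        struct ggml_tensor  * norm_w,
        float                 eps) {
    cur = ggml_rms_norm(ctx0, cur, eps); // normalize the activations
    cur = ggml_mul(ctx0, cur, norm_w);   // apply the model's norm weight
    return cur;
}

// a builder would then call, for example:
//     cur = llm_build_norm_sketch(ctx0, cur, model.layers[il].attn_norm, norm_rms_eps);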

ggerganov added the good first issue, refactoring, and high priority labels on Sep 28, 2023
@ggerganov
Owner Author

After merging #3329, #3071, and #3417, we should put effort into resolving this issue.

@ggerganov
Copy link
Owner Author

Something we should consider within the scope of this issue is decoupling the llm_build_xxx() functions from the use of ggml_alloc and adding a post-build step where we set the input data on the input tensors. I.e., we can break the following into two steps:

        // current llm_build_llama()
        struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
        ggml_allocr_alloc(lctx.alloc, inp_tokens);
        if (!ggml_allocr_is_measure(lctx.alloc)) {
            memcpy(inp_tokens->data, batch.token, n_tokens*ggml_element_size(inp_tokens));
        }
        ggml_set_name(inp_tokens, "inp_tokens");

        // ------

        // new llm_build_llama()
        struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
        ggml_set_name(inp_tokens, "inp_tokens");

        // new llm_setup_llama()
        ggml_tensor * inp_tokens = ggml_get_tensor(ctx, "inp_tokens");
        ggml_allocr_alloc(lctx.alloc, inp_tokens);
        if (!ggml_allocr_is_measure(lctx.alloc)) {
            memcpy(inp_tokens->data, batch.token, n_tokens*ggml_element_size(inp_tokens));
        }

Having build functions that do not rely on the state of the allocator would make it easier to estimate the required memory. (cc @slaren for thoughts)
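
For concreteness, a rough sketch of how the dispatcher could drive the two steps; llm_setup_llama() and its signature are hypothetical, the rest mirrors the existing llama_build_graph():

// Sketch only: build the graph without touching lctx.alloc, then run a separate
// setup step that allocates the input tensors and copies the batch data.
static struct ggml_cgraph * llama_build_graph(
         llama_context & lctx,
     const llama_batch & batch) {
    struct ggml_cgraph * result = NULL;

    switch (lctx.model.arch) {
        case LLM_ARCH_LLAMA:
            {
                result = llm_build_llama(lctx, batch); // graph construction only
                llm_setup_llama(lctx, result, batch);  // hypothetical: allocate inputs + copy data
            } break;
        // ... the other architectures would follow the same build/setup split ...
        default:
            GGML_ASSERT(false);
    }

    return result;
}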

@slaren
Collaborator

slaren commented Oct 13, 2023

I think it would be good to pre-allocate all the input and output tensors in a different buffer. In this way, these tensors would always be allocated and the calls to ggml_allocr_alloc and ggml_allocr_is_measure would not be necessary. Having the outputs pre-allocated would remove the hack of taking the results of the evaluation from the last two tensors of the graph.

This was already done in the first version of ggml-backend, and I expect to do the same again in the current implementation of ggml-backend once it is ready to be added to llama.cpp.

llama.cpp/llama.cpp

Lines 2631 to 2651 in d273bfd

{
    size_t buf_input_size = 0;

    buf_input_size += hparams.n_ctx * ggml_type_size(GGML_TYPE_F32); // input tokens
    // TODO: input embeddings should be optional to save memory
    buf_input_size += hparams.n_embd * hparams.n_ctx * ggml_type_size(GGML_TYPE_F32); // input embeddings

    ctx->buf_input = ggml_buffer_alloc(model->backend_inp, buf_input_size, 2);

    struct ggml_init_params ggml_params = ggml_init_params_default();
    ggml_params.buffer = ctx->buf_input;
    ggml_context * ctx0 = ggml_init(ggml_params);

    ctx->graph_tokens_in = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, hparams.n_ctx);
    ggml_set_name(ctx->graph_tokens_in, "tokens_in");

    ctx->graph_embeddings_in = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, hparams.n_embd, hparams.n_ctx);
    ggml_set_name(ctx->graph_embeddings_in, "embeddings_in");

    ggml_free(ctx0);

    fprintf(stderr, "%s: input buffer size:\n", __func__);
    fprintf(stderr, "%8s = %7.2f MB\n", ggml_backend_name(ctx->buf_input->backend_buffer->backend), buf_input_size / 1024.0 / 1024.0);
}
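
To make the benefit concrete, a sketch (field and variable names taken from the excerpts above, the surrounding code is illustrative) of what the token input looks like once it is pre-allocated, with no ggml_allocr_alloc / ggml_allocr_is_measure calls left in the builder:

// Sketch only: the builder just references the pre-allocated tensor ...
struct ggml_tensor * inp_tokens = ctx->graph_tokens_in;

// ... and per decode, before evaluating the graph, the batch tokens are copied
// into the buffer that was allocated once at context creation
memcpy(ctx->graph_tokens_in->data, batch.token,
       n_tokens*ggml_element_size(ctx->graph_tokens_in));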

@jowi-dev

I don't write much C++, but I'm happy to take a stab at this.

ggerganov mentioned this issue on Nov 2, 2023
Galunid added a commit to Galunid/llama.cpp that referenced this issue Nov 5, 2023