llama : refactor llama_build_graph to reduce code duplication #3382

Closed
ggerganov opened this issue Sep 28, 2023 · 4 comments
Labels
good first issue (Good for newcomers) · high priority (Very important issue) · refactoring

Comments

@ggerganov
Owner

As support for new model architectures is added, we are starting to observe a lot of repeating patterns in the code that builds their compute graphs. We should find a way to refactor and reuse the repetitive code, and we should also consider splitting the implementation into separate source files if necessary.

llama.cpp/llama.cpp

Lines 3997 to 4026 in 0e76a89

static struct ggml_cgraph * llama_build_graph(
         llama_context & lctx,
     const llama_batch & batch) {
    const auto & model = lctx.model;

    struct ggml_cgraph * result = NULL;

    switch (model.arch) {
        case LLM_ARCH_LLAMA:
            {
                result = llm_build_llama(lctx, batch);
            } break;
        case LLM_ARCH_BAICHUAN:
            {
                result = llm_build_baichaun(lctx, batch);
            } break;
        case LLM_ARCH_FALCON:
            {
                result = llm_build_falcon(lctx, batch);
            } break;
        case LLM_ARCH_STARCODER:
            {
                result = llm_build_starcoder(lctx, batch);
            } break;
        default:
            GGML_ASSERT(false);
    };

    return result;
}

Open to ideas and suggestions
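
For illustration only, one possible direction (a sketch, not a decided design): pull the subgraph patterns that every llm_build_xxx() repeats, e.g. the RMS-norm + scale step, into small shared helpers, so the per-architecture builders shrink to the parts that genuinely differ. The helper name llm_build_norm_sketch below is hypothetical.

// Hypothetical shared helper (name is illustrative): the RMS-norm + scale
// pattern that the per-architecture builders currently repeat for every layer.
static struct ggml_tensor * llm_build_norm_sketch(
        struct ggml_context * ctx0,
        struct ggml_tensor  * cur,
        struct ggml_tensor  * norm_w,
        float                 eps) {
    cur = ggml_rms_norm(ctx0, cur, eps); // normalize the activations
    cur = ggml_mul(ctx0, cur, norm_w);   // apply the model's norm weight
    return cur;
}

// a builder would then call, for example:
//     cur = llm_build_norm_sketch(ctx0, cur, model.layers[il].attn_norm, norm_rms_eps);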

ggerganov added the good first issue, refactoring, and high priority labels on Sep 28, 2023
@ggerganov
Owner Author

After merging #3329, #3071, and #3417, we should put effort into resolving this issue.

@ggerganov
Copy link
Owner Author

Something we should consider within the scope of this issue is decoupling the llm_build_xxx() functions from the use of ggml_alloc and adding a post-build step where we set the input data on the input tensors. I.e., we can break the following into two steps:

        // current llm_build_llama()
        struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
        ggml_allocr_alloc(lctx.alloc, inp_tokens);
        if (!ggml_allocr_is_measure(lctx.alloc)) {
            memcpy(inp_tokens->data, batch.token, n_tokens*ggml_element_size(inp_tokens));
        }
        ggml_set_name(inp_tokens, "inp_tokens");

        // ------

        // new llm_build_llama()
        struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
        ggml_set_name(inp_tokens, "inp_tokens");

        // new llm_setup_llama()
        ggml_tensor * inp_tokens = ggml_get_tensor(ctx, "inp_tokens");
        ggml_allocr_alloc(lctx.alloc, inp_tokens);
        if (!ggml_allocr_is_measure(lctx.alloc)) {
            memcpy(inp_tokens->data, batch.token, n_tokens*ggml_element_size(inp_tokens));
        }

Having build functions that do not rely on the state of the allocator would make it easier to estimate the required memory. (cc @slaren for thoughts)
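
For concreteness, a rough sketch of how the dispatcher could drive the two steps; llm_setup_llama() and its signature are hypothetical, the rest mirrors the existing llama_build_graph():

// Sketch only: build the graph without touching lctx.alloc, then run a separate
// setup step that allocates the input tensors and copies the batch data.
static struct ggml_cgraph * llama_build_graph(
         llama_context & lctx,
     const llama_batch & batch) {
    struct ggml_cgraph * result = NULL;

    switch (lctx.model.arch) {
        case LLM_ARCH_LLAMA:
            {
                result = llm_build_llama(lctx, batch); // graph construction only
                llm_setup_llama(lctx, result, batch);  // hypothetical: allocate inputs + copy data
            } break;
        // ... the other architectures would follow the same build/setup split ...
        default:
            GGML_ASSERT(false);
    }

    return result;
}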

@slaren
Collaborator

slaren commented Oct 13, 2023

I think it would be good to pre-allocate all the input and output tensors in a different buffer. In this way, these tensors would always be allocated and the calls to ggml_allocr_alloc and ggml_allocr_is_measure would not be necessary. Having the outputs pre-allocated would remove the hack of taking the results of the evaluation from the last two tensors of the graph.

This was already done in the first version of ggml-backend, and I expect to do the same again in the current implementation of ggml-backend once it is ready to be added to llama.cpp.

llama.cpp/llama.cpp

Lines 2631 to 2651 in d273bfd

{
    size_t buf_input_size = 0;

    buf_input_size += hparams.n_ctx * ggml_type_size(GGML_TYPE_F32); // input tokens
    // TODO: input embeddings should be optional to save memory
    buf_input_size += hparams.n_embd * hparams.n_ctx * ggml_type_size(GGML_TYPE_F32); // input embeddings

    ctx->buf_input = ggml_buffer_alloc(model->backend_inp, buf_input_size, 2);

    struct ggml_init_params ggml_params = ggml_init_params_default();
    ggml_params.buffer = ctx->buf_input;
    ggml_context * ctx0 = ggml_init(ggml_params);

    ctx->graph_tokens_in = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, hparams.n_ctx);
    ggml_set_name(ctx->graph_tokens_in, "tokens_in");

    ctx->graph_embeddings_in = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, hparams.n_embd, hparams.n_ctx);
    ggml_set_name(ctx->graph_embeddings_in, "embeddings_in");

    ggml_free(ctx0);

    fprintf(stderr, "%s: input buffer size:\n", __func__);
    fprintf(stderr, "%8s = %7.2f MB\n", ggml_backend_name(ctx->buf_input->backend_buffer->backend), buf_input_size / 1024.0 / 1024.0);
}
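
To make the benefit concrete, a sketch (field and variable names taken from the excerpts above, the surrounding code is illustrative) of what the token input looks like once it is pre-allocated, with no ggml_allocr_alloc / ggml_allocr_is_measure calls left in the builder:

// Sketch only: the builder just references the pre-allocated tensor ...
struct ggml_tensor * inp_tokens = ctx->graph_tokens_in;

// ... and per decode, before evaluating the graph, the batch tokens are copied
// into the buffer that was allocated once at context creation
memcpy(ctx->graph_tokens_in->data, batch.token,
       n_tokens*ggml_element_size(ctx->graph_tokens_in));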

@jowi-dev

I don't write much C++, but I'm happy to take a stab at this.

ggerganov mentioned this issue on Nov 2, 2023
Galunid added a commit to Galunid/llama.cpp that referenced this issue Nov 5, 2023