llama : refactor graph build code #3837
Conversation
Force-pushed from 66a54bf to b4ad03b (ggml-ci)
* llama : add llm_build_norm helper function (ggml-ci)
* llama : add llm_build_ffn helper function (#3849) (ggml-ci)
* llama : add llm_build_k_shift helper (ggml-ci)
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper (ggml-ci)
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper (ggml-ci)
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names (ggml-ci)
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
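The `llm_build_*` commits above factor graph-building code that every model architecture used to repeat inline into shared helpers. As a rough, self-contained sketch of the pattern (toy tensor type and simplified names; not the actual ggml/llama.cpp API), a norm helper in the spirit of `llm_build_norm` centralizes the RMS-norm formula so each architecture's builder just calls it:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Toy stand-in for a tensor: llama.cpp builds ggml_tensor graph nodes,
// this sketch just computes on a vector of floats.
using tensor = std::vector<float>;

// Shared helper in the spirit of llm_build_norm: every architecture's
// graph builder calls this instead of repeating the RMS-norm math inline.
tensor build_rms_norm(const tensor & x, const tensor & weight, float eps = 1e-5f) {
    assert(x.size() == weight.size());
    double sum_sq = 0.0;
    for (float v : x) {
        sum_sq += (double) v * v;
    }
    const float scale = 1.0f / std::sqrt((float)(sum_sq / x.size()) + eps);
    tensor out(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] * scale * weight[i]; // normalize, then apply learned gain
    }
    return out;
}
```

The payoff described in the PR is that adding a new model architecture means composing a handful of such helpers rather than re-deriving each sub-graph by hand.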
Planning to merge this soon. It will be a miracle if I didn't break something, but I think the refactoring should make it a bit easier to add new model arches in the future.
This change broke Falcon.

Could you be more specific? Falcon 7B appears to still work for me on both CPU and CUDA.

The attention norm for Falcon-40B was using the wrong input tensor. This should be fixed with 523e49b.
I can confirm Persimmon is broken.
I'm uploading a Q4_K model to https://huggingface.co/Galunid/persimmon-gguf/tree/main. I tested it with 238657d and it was working; 71e3718 does not work. I don't know if I should create a separate issue for this.
* undoing more semantic renames in ggerganov/llama.cpp#3837
* llama : factor out ggml-alloc from graph build functions (ggml-ci)
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip) (ggml-ci)
* llama : offload rest of the models (ggml-ci)
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact in correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment (ggml-ci)
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (ggerganov#3848)
  * llama : add llm_build_norm helper function (ggml-ci)
  * llama : add llm_build_ffn helper function (ggerganov#3849) (ggml-ci)
  * llama : add llm_build_k_shift helper (ggml-ci)
  * llama : fix offloading after recent changes
  * llama : add llm_build_kv_store helper (ggml-ci)
  * llama : remove obsolete offload names
  * llama : fix llm_build_k_shift to use n_head_kv instead of n_head
  * llama : simplify falcon Q, K, V computation
  * llama : remove obsolete comments in build graphs
  * llama : add llm_build_kqv helper (ggml-ci)
  * llama : minor
  * llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
  * llama : fix input allocation logic
  * llama : update offload functions for KQ tensors
  * llama : normalize tensor names (ggml-ci)
  * llama : enable warning about not offloaded tensors
  * llama : remove extra ; + deduplicate gate_b logic
  * llama : add llm_build_inp_embd helper
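The "refactor tensor offloading as callback" commit above moves device-placement decisions out of the individual graph-build functions. The builder no longer decides where each tensor lives; it hands every tensor it creates to a caller-supplied callback that applies the offload policy in one place. A minimal, self-contained sketch of that shape (toy `node` type and hypothetical names, not the real ggml structures):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy node: stands in for ggml_tensor; "backend" stands in for CPU/GPU placement.
struct node {
    std::string name;
    std::string backend = "CPU";
};

// Hypothetical callback type: invoked once per tensor created during graph
// build, so the offload policy lives in one function instead of being
// scattered through every architecture's build code.
using offload_cb = std::function<void(node &)>;

// The builder records every node it creates and lets the callback place it.
std::vector<node> build_graph(const offload_cb & cb) {
    std::vector<node> graph;
    for (const char * name : {"inp_embd", "attn_norm-0", "ffn_out-0", "result_norm"}) {
        node n{name};
        cb(n); // placement decided by the callback, not by the builder
        graph.push_back(n);
    }
    return graph;
}
```

With this split, a CUDA build can pass a callback that offloads everything except input tensors, while a CPU-only build passes a no-op, without the build functions themselves changing.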
ref #3382

* `result_norm` when embeddings are not requested
* Offload tensor with token positions `inp_pos` as non-repeating (NR)
* Change offload type of `KQ_mask`, `KQ_shift` and `K_shifted` from KQ to NR. Any reason for these to be KQ?

TODO:
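The list above distinguishes per-tensor offload categories such as non-repeating (NR) versus KQ. As a purely illustrative sketch of what such a classification might look like (hypothetical enum and policy table, not llama.cpp's actual offload logic), encoding the proposed change that `KQ_mask`, `KQ_shift` and `K_shifted` become NR:

```cpp
#include <cassert>
#include <string>

// Hypothetical offload categories mirroring the discussion above:
// NR = non-repeating tensors, KQ = tensors tied to the KQ attention buffers.
enum class offload_type { NR, KQ };

// Illustrative policy only: maps a tensor name to its offload category,
// with KQ_mask / KQ_shift / K_shifted moved from KQ to NR as proposed.
offload_type classify(const std::string & name) {
    if (name == "KQ_mask" || name == "KQ_shift" || name == "K_shifted" ||
        name == "inp_pos") {
        return offload_type::NR; // treated as non-repeating
    }
    return offload_type::KQ; // default for KQ-related tensors in this sketch
}
```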