service: Move service code into separate library #12
Merged
Conversation
We'll use this type to encapsulate the notion of per-instance memory.
This is a stop-gap, since we'll be moving the execution buffer elsewhere.
It got pulled in indirectly, but it should be used directly.
This is for the case where you want to create a set of allocated weights without creating the whole model.
Basically, instead of taking a GBytes and having memory entangled with the model weights, these are now separate concepts. So create_model_desc also returns a GGMLLanguageModelDesc with separate weight-tree descriptions for the memory weights and the model weights, where there will be one set of memory weights per inference instance as opposed to per model.
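A rough sketch of the shape this takes (the field and constructor names here are assumptions for illustration, not necessarily the exact ggml-gobject API):

```c
/* Illustrative only: one description tree for the shared model weights,
 * and one for the per-instance memory weights. */
typedef struct {
  GGMLModelDescNode *weights_desc;  /* per-model weights, shared by all instances */
  GGMLModelDescNode *memory_desc;   /* per-instance memory (e.g. the KV cache) */
} GGMLLanguageModelDesc;
```

Each inference instance then allocates only the memory tree, while the weight tree is loaded once and shared.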
Let's not re-use the main computation context, as that needs to be preserved in a special way.
This can be used to build the compute graph without actually executing it, which can be useful for memory allocation.
These are based on the newly added ggml_alloc context mode in ggml, which uses an allocator to specify the graph memory layout as opposed to the naive linear allocator. We can also use the allocator to compute how much memory is actually required.

The general flow is that you first run the forward pass with worst-case inputs in "recorder" mode to compute a maximal memory usage profile. The recorder mode sets tensor data addresses to a region that doesn't exist in memory and also takes care to ensure that writes to the tensor through the ggml-gobject API don't actually write anything to memory. Afterwards, you can allocate a buffer of the required size and use the allocator in "alloc" mode to create the same layout, this time backed by a real buffer. Using the alloc mode is fairly cheap, since it doesn't require any system calls (all the memory is allocated upfront).
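A minimal sketch of that measure-then-allocate flow in terms of the plain ggml-alloc API (the "recorder"/"alloc" wrappers described above sit on top of something like this; `build_forward_graph` and the alignment value are placeholders):

```c
#include <stdlib.h>
#include <ggml.h>
#include <ggml-alloc.h>

/* Placeholder: builds the model's forward pass as a ggml_cgraph
 * without computing it. */
struct ggml_cgraph *build_forward_graph (struct ggml_context *ctx, void *inputs);

static void
plan_and_allocate (struct ggml_context *ctx)
{
  /* Measure ("recorder") pass: tensors get placeholder addresses and nothing
   * is written; we only learn the worst-case memory requirement of the graph. */
  struct ggml_allocr *measure = ggml_allocr_new_measure (32 /* alignment */);
  struct ggml_cgraph *graph = build_forward_graph (ctx, /* worst-case inputs */ NULL);
  size_t required = ggml_allocr_alloc_graph (measure, graph);
  ggml_allocr_free (measure);

  /* Alloc pass: the same layout, this time over a real buffer allocated
   * up-front, so no further system calls are needed. */
  void *buffer = malloc (required);
  struct ggml_allocr *alloc = ggml_allocr_new (buffer, required, 32 /* alignment */);
  graph = build_forward_graph (ctx, /* real inputs */ NULL);
  ggml_allocr_alloc_graph (alloc, graph);

  /* ... compute the graph, then eventually ggml_allocr_free (alloc) and free (buffer) */
}
```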
…mory size

With this we can finally remove the semi-hardcoded memory estimator for GPT2 models and instead use a real estimate based on the model's actual memory usage.
…ate library

This vastly simplifies the reference implementation (it previously had to handle all sorts of details like creating unix pipe connections for a private dbus session, etc).
It also manages its own hash table. Once the ref count drops to zero, we also remove the entry from the hash table.
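The registry pattern looks roughly like this (type and field names are illustrative, not the actual code): the hash table holds a borrowed pointer, and the instance removes itself in finalize, when its last reference is dropped.

```c
/* Sketch only: registry of live instances, keyed by an id, without
 * holding a strong reference to them. */
static GHashTable *registry;

static void
ggml_service_completion_finalize (GObject *object)
{
  GGMLServiceCompletion *self = GGML_SERVICE_COMPLETION (object);

  /* Ref count hit zero: drop the registry entry as well. */
  g_hash_table_remove (registry, self->id);

  G_OBJECT_CLASS (ggml_service_completion_parent_class)->finalize (object);
}
```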
We'll use this later to handle the terminate() edge case
This should terminate any completions and cause the object to drop off the bus (plus release any resources that are held on the side of the service).
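A hypothetical handler showing what that amounts to on the service side (the type names are illustrative; the GIO calls are the standard ones for unexporting a skeleton and cancelling work):

```c
/* Sketch: cancel in-flight completions, unexport so the object drops off
 * the bus, and release service-side resources. */
static gboolean
on_handle_terminate (GGMLDBusCompletion    *skeleton,
                     GDBusMethodInvocation *invocation,
                     gpointer               user_data)
{
  GGMLServiceCompletion *self = user_data;

  g_cancellable_cancel (self->cancellable);                                   /* stop completions */
  g_dbus_interface_skeleton_unexport (G_DBUS_INTERFACE_SKELETON (skeleton));  /* drop off the bus */
  g_clear_object (&self->model);                                              /* release held resources */

  g_dbus_method_invocation_return_value (invocation, NULL);
  return TRUE;  /* handled */
}
```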
We don't get any signal when the user types into the buffer normally (from insert-at-cursor).
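One way to catch ordinary typing is to watch the buffer itself rather than relying on the view's insert-at-cursor keybinding signal; a sketch, assuming a GtkTextBuffer is at hand:

```c
#include <gtk/gtk.h>

/* Sketch: the buffer's "insert-text" signal fires for ordinary typing
 * as well, unlike insert-at-cursor. */
static void
on_insert_text (GtkTextBuffer *buffer,
                GtkTextIter   *location,
                gchar         *text,
                gint           len,
                gpointer       user_data)
{
  /* e.g. invalidate or re-request the pending completion here */
}

static void
watch_typing (GtkTextBuffer *buffer)
{
  g_signal_connect_after (buffer, "insert-text",
                          G_CALLBACK (on_insert_text), NULL);
}
```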
We'll lazy-load on the next inference anyway.
The user can now select between local/dbus usage
The thread func expects to consume the entire state, so we have to steal the pointer from the autoptr.
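The hand-off is the usual g_steal_pointer() pattern; a sketch, where InferenceState and its constructor/thread func are placeholders (and the autoptr cleanup func is assumed to be defined):

```c
static void
start_inference (void)
{
  g_autoptr(InferenceState) state = inference_state_new ();

  /* The thread func consumes the entire state, so steal the pointer from
   * the autoptr; the autoptr is left holding NULL and frees nothing. */
  g_thread_unref (g_thread_new ("inference",
                                inference_thread_func,
                                g_steal_pointer (&state)));
}
```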
This is only because GCClosure requires marshalling in/out the function arguments and return value. In our case, type erasure will do - we only really have this so that we can pass closures as object properties.
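Since GClosure is a boxed type, exposing it as a property only needs a boxed pspec; a sketch of what that looks like in class_init (the property name and id are illustrative):

```c
/* Sketch: no argument marshalling is needed, the closure is just carried
 * around as a value. */
GParamSpec *pspec =
  g_param_spec_boxed ("sampler-closure",
                      "Sampler Closure",
                      "Closure invoked to sample from the unrolled model's logits",
                      G_TYPE_CLOSURE,
                      G_PARAM_READWRITE | G_PARAM_CONSTRUCT_ONLY);
g_object_class_install_property (object_class, PROP_SAMPLER_CLOSURE, pspec);
```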
We'll use this to override the behaviour of sampling from the logits of an unrolled language model.
This can be used to inject a little more randomness into the sampling procedure, in line with what's done in GGML proper.
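For reference, a generic temperature-sampling routine over raw logits looks like the sketch below; this illustrates the idea rather than the exact sampler used here:

```c
#include <math.h>
#include <stdlib.h>

/* Divide the logits by the temperature, take a softmax, then draw a token
 * index from the resulting distribution. Higher temperature = more random. */
static int
sample_with_temperature (const float *logits, int n_vocab, float temperature)
{
  float max_logit = -INFINITY;
  for (int i = 0; i < n_vocab; ++i)
    if (logits[i] / temperature > max_logit)
      max_logit = logits[i] / temperature;

  /* softmax over the scaled logits (shifted by the max for stability) */
  float *probs = malloc (n_vocab * sizeof (float));
  float sum = 0.0f;
  for (int i = 0; i < n_vocab; ++i)
    {
      probs[i] = expf (logits[i] / temperature - max_logit);
      sum += probs[i];
    }

  /* draw a token index proportionally to its (unnormalized) probability */
  float r = (float) rand () / (float) RAND_MAX * sum;
  float acc = 0.0f;
  int token = n_vocab - 1;
  for (int i = 0; i < n_vocab; ++i)
    {
      acc += probs[i];
      if (acc >= r)
        {
          token = i;
          break;
        }
    }

  free (probs);
  return token;
}
```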
We now have a libggml-client which an application can link to in order to facilitate the process of opening private dbus connections etc. It doesn't have any dependencies on ggml proper, in case there's some reason why the client can't link to libggml.
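In plain GIO terms, "opening a private dbus connection" amounts to something like the sketch below; libggml-client wraps details like this so the application doesn't have to (the address string is a placeholder supplied by the service side):

```c
#include <gio/gio.h>

static GDBusConnection *
open_private_connection (const char *address, GError **error)
{
  return g_dbus_connection_new_for_address_sync (
      address,                                        /* e.g. a "unix:path=..." address */
      G_DBUS_CONNECTION_FLAGS_AUTHENTICATION_CLIENT,
      NULL,   /* GDBusAuthObserver */
      NULL,   /* GCancellable */
      error);
}
```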