
ggml-gobject: Add concept of execution memory and remove hardcoded memory estimates #11

Closed
wants to merge 16 commits

Conversation

smspillaz
Owner

@smspillaz commented Aug 8, 2023

This is another API/ABI break.

Previously we had to hardcode the runtime memory usage. Now GGML has a smarter allocator which can properly
estimate the graph memory usage. This requires some changes to ggml-gobject as well.

This also means that per-instance memory is no longer global - instead it is now per-cursor.

We'll use this type to encapsulate the notion of per-instance memory.
This is a stop-gap, since we'll be moving the execution buffer
elsewhere.
It got pulled in indirectly, but it should be used directly.
This is useful if you want to create a set of allocated weights
without creating the whole model.
Basically, instead of taking GBytes and having the memory entangled with
the model weights, these are now separate concepts. So
create_model_desc also returns a GGMLLanguageModelDesc with
separate weight tree descriptions for the memory weights and
the model weights, where there will be one set of memory weights
per inference instance as opposed to per-model.
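As a rough illustration only (the field names and accessors below are hypothetical, not the real ggml-gobject API; only create_model_desc and GGMLLanguageModelDesc are named in the commit above), the split might look like this to a caller:

```c
/* Hypothetical sketch: create_model_desc() now returns a
 * GGMLLanguageModelDesc carrying two separate weight-tree descriptions.
 * The member names here are illustrative placeholders. */
GGMLLanguageModelDesc *desc = create_model_desc (hyperparameters);

/* Shared across the whole model. */
GGMLModelDescNode *model_weights = desc->weights_desc;

/* Instantiated once per inference instance, e.g. the key/value memory. */
GGMLModelDescNode *memory_weights = desc->memory_desc;
```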
Let's not re-use the main computation context, as that needs to be
preserved in a special way.
This can be used to build the compute graph without actually
executing the graph, which can be useful for memory
allocation.
These are based on the newly added ggml_alloc context mode in
ggml, which uses an allocator to specify the graph memory layout as opposed
to the naive linear allocator. We can also use the allocator to compute
how much memory is actually required.

The general flow would be that you first run the forward pass with
worst-case inputs in "recorder" mode to compute a maximal memory usage
profile. The recorder mode sets tensor data addresses to a region
that doesn't exist in memory and also takes care to ensure that writes
to the tensor using the ggml-gobject API don't actually write anything
to memory.

Afterwards, you can allocate a buffer of the required size and
use the allocator in "alloc" mode to create the same layout, this time
backed by a real buffer.

Using the alloc mode is fairly cheap, since it doesn't require
any system calls (all the memory is allocated upfront).
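A minimal sketch of that two-pass flow, assuming the ggml-alloc API as it existed around this time (ggml_allocr_new_measure, ggml_allocr_alloc_graph, ggml_allocr_new); build_forward_graph(), the alignment value, and the token counts are placeholders rather than anything from this PR:

```c
#include <stdlib.h>
#include <ggml.h>
#include <ggml-alloc.h>

/* Placeholder: builds the model's forward graph for n_tokens inputs. */
extern struct ggml_cgraph * build_forward_graph (struct ggml_context *ctx,
                                                 int n_tokens);

static struct ggml_allocr *
create_execution_allocator (struct ggml_context *ctx, int worst_case_n_tokens)
{
  const size_t alignment = 32; /* placeholder value */

  /* Pass 1: "measure" (recorder) mode. Tensor data pointers are set to a
   * fake region, so nothing is written to real memory; the return value
   * is the worst-case buffer size needed for this graph. */
  struct ggml_allocr *measure = ggml_allocr_new_measure (alignment);
  struct ggml_cgraph *graph = build_forward_graph (ctx, worst_case_n_tokens);
  size_t mem_size = ggml_allocr_alloc_graph (measure, graph);
  ggml_allocr_free (measure);

  /* Pass 2: allocate a real buffer of that size and create an allocator
   * in "alloc" mode. Laying a graph out against it is cheap, since the
   * buffer is allocated upfront and no further system calls are needed. */
  void *buf = malloc (mem_size);
  return ggml_allocr_new (buf, mem_size, alignment);
}
```

Per evaluation you would then typically reset the allocator, rebuild the graph, and call ggml_allocr_alloc_graph() again so the same layout is backed by the real buffer before computing it.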
…mory size

With this we can finally remove the semi-hardcoded memory estimator
for GPT2 models and instead use a real estimate based on the model's
actual memory usage.
@smspillaz
Owner Author

This is superseded by #12.

@smspillaz closed this Aug 28, 2023