
ggml-gobject: Add concept of execution memory and remove hardcoded memory estimates #11

Closed
wants to merge 16 commits

Conversation

smspillaz
Owner

@smspillaz commented Aug 8, 2023

This is another API/ABI break.

Previously we had to hardcode the runtime memory usage. Now GGML has a smarter allocator which can properly
estimate the graph memory usage. This requires some changes to ggml-gobject as well.

This also means that per-instance memory is no longer global - instead it is now per-cursor.

We'll use this type to encapsulate the notion of per-instance memory.
This is a stop-gap, since we'll be moving the execution buffer
elsewhere.
It got pulled in indirectly, but it should be used directly.
This is useful if you want to create a set of allocated weights
without creating the whole model.
Basically, instead of taking GBytes and having the memory entangled with
the model weights, these are now separate concepts. So
create_model_desc also returns a GGMLLanguageModelDesc with
separate weight tree descriptions for the memory weights and
the model weights, where there will be one set of memory weights
per inference instance as opposed to per-model.
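As a rough illustration only (the field names and accessors below are hypothetical, not the real ggml-gobject API; only create_model_desc and GGMLLanguageModelDesc are named in the commit above), the split might look like this to a caller:

```c
/* Hypothetical sketch: create_model_desc() now returns a
 * GGMLLanguageModelDesc carrying two separate weight-tree descriptions.
 * The member names here are illustrative placeholders. */
GGMLLanguageModelDesc *desc = create_model_desc (hyperparameters);

/* Shared across the whole model. */
GGMLModelDescNode *model_weights = desc->weights_desc;

/* Instantiated once per inference instance, e.g. the key/value memory. */
GGMLModelDescNode *memory_weights = desc->memory_desc;
```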
Let's not re-use the main computation context, as that needs to be
preserved in a special way.
This can be used to build the compute graph without actually
executing the graph, which can be useful for memory
allocation.
These are based on the newly added ggml_alloc context mode in
ggml, which uses an allocator to specify the graph memory layout as opposed
to the naive linear allocator. We can also use the allocator to compute
how much memory is actually required.

The general flow would be that you first run the forward pass with
worst-case inputs in "recorder" mode to compute a maximal memory usage
profile. The recorder mode sets tensor data addresses to a region
that doesn't exist in memory and also takes care to ensure that writes
to the tensor using the ggml-gobject API don't actually write anything
to memory.

Afterwards, you can allocate a buffer of the required size and
use the allocator in "alloc" mode to create the same layout, this time
backed by a real buffer.

Using the alloc mode is fairly cheap, since it doesn't require
any system calls (all the memory is allocated upfront).
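A minimal sketch of that two-pass flow, assuming the ggml-alloc API as it existed around this time (ggml_allocr_new_measure, ggml_allocr_alloc_graph, ggml_allocr_new); build_forward_graph(), the alignment value, and the token counts are placeholders rather than anything from this PR:

```c
#include <stdlib.h>
#include <ggml.h>
#include <ggml-alloc.h>

/* Placeholder: builds the model's forward graph for n_tokens inputs. */
extern struct ggml_cgraph * build_forward_graph (struct ggml_context *ctx,
                                                 int n_tokens);

static struct ggml_allocr *
create_execution_allocator (struct ggml_context *ctx, int worst_case_n_tokens)
{
  const size_t alignment = 32; /* placeholder value */

  /* Pass 1: "measure" (recorder) mode. Tensor data pointers are set to a
   * fake region, so nothing is written to real memory; the return value
   * is the worst-case buffer size needed for this graph. */
  struct ggml_allocr *measure = ggml_allocr_new_measure (alignment);
  struct ggml_cgraph *graph = build_forward_graph (ctx, worst_case_n_tokens);
  size_t mem_size = ggml_allocr_alloc_graph (measure, graph);
  ggml_allocr_free (measure);

  /* Pass 2: allocate a real buffer of that size and create an allocator
   * in "alloc" mode. Laying a graph out against it is cheap, since the
   * buffer is allocated upfront and no further system calls are needed. */
  void *buf = malloc (mem_size);
  return ggml_allocr_new (buf, mem_size, alignment);
}
```

Per evaluation you would then typically reset the allocator, rebuild the graph, and call ggml_allocr_alloc_graph() again so the same layout is backed by the real buffer before computing it.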
…mory size

With this we can finally remove the semi-hardcoded memory estimator
for GPT2 models and instead use a real estimate based on the model's
actual memory usage.
@smspillaz
Owner Author

This is superseded by #12.

@smspillaz closed this Aug 28, 2023