ggml-gobject: Add concept of execution memory and remove hardcoded memory estimates #11
Commits on Aug 4, 2023
- 8abea50 ggml-gobject: Add GGMLExecutionMemory
  We'll use this type to encapsulate the notion of per-instance execution memory.
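  A rough sketch of what this might look like at a call site; the constructor name and arguments below are assumptions, since the commit only introduces the type:

      /* Hypothetical sketch: ggml_execution_memory_new () and its
       * argument are assumptions, not confirmed ggml-gobject API. */
      GGMLExecutionMemory *mem_a = ggml_execution_memory_new (buffer_size);
      GGMLExecutionMemory *mem_b = ggml_execution_memory_new (buffer_size);
      /* Two inference instances can now share one set of model weights
       * while each owns its own execution memory. */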
- 175617c ggml-gpt: Expose estimate_memory_buffer_size
  This is a stop-gap, since we'll be moving the execution buffer elsewhere.
- 41bc805 ggml-language-model: Add missing ggml-cached-model include
  It was being pulled in indirectly, but it should be included directly.
- 11aae67 ggml-model: Add ggml_new_weight_set_for_flattened_desc
  This is useful when you want to create a set of allocated weights without constructing the whole model.
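  A hedged sketch of the intent, allocating weights directly from a flattened tree description; only the function name comes from the commit, the arguments and return type are assumptions:

      /* Hypothetical sketch: argument list and return type are assumptions. */
      GHashTable *weights =
          ggml_new_weight_set_for_flattened_desc (context, flattened_desc);
      /* The weight set can now back e.g. per-instance memory tensors
       * without a full model being constructed around it. */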
- e24bcdb ggml-gobject: API break - we take a GGMLExecutionMemory now
  Instead of taking a GBytes and having execution memory entangled with the model weights, the two are now separate concepts. create_model_desc now returns a GGMLLanguageModelDesc with separate weight-tree descriptions for the memory weights and the model weights; there is one set of memory weights per inference instance, as opposed to per model.
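  Roughly, the call-site change might look like the sketch below; the exact signatures are assumptions, the point is that execution memory is passed as its own object rather than as GBytes tied to the weights:

      /* Before (sketch): execution buffer entangled with the weights */
      model = ggml_language_model_new (model_desc, memory_bytes /* GBytes */);

      /* After (sketch): memory is a separate, per-instance object */
      GGMLLanguageModelDesc *desc = create_model_desc ();
      GGMLExecutionMemory *mem = ggml_execution_memory_new (buffer_size);
      model = ggml_language_model_new (desc, mem);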
- 6bcc664
Commits on Aug 8, 2023
- 19bafaa
- bcf5146 ggml-compute-plan: Allocate work tensor inside plan's own context
  Let's not re-use the main computation context, as it needs to be preserved in a special way.
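  In plain ggml terms the idea resembles the following; this is a sketch under the assumption that the plan owns a small dedicated context for its work buffer:

      /* Sketch: give the work tensor its own context, owned by the plan,
       * so the main computation context stays untouched. */
      struct ggml_init_params params = {
          /* room for one tensor plus its metadata */
          .mem_size   = work_size + ggml_tensor_overhead (),
          .mem_buffer = NULL,
          .no_alloc   = false,
      };
      struct ggml_context *plan_ctx = ggml_init (params);
      struct ggml_tensor *work =
          ggml_new_tensor_1d (plan_ctx, GGML_TYPE_I8, work_size);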
- 27317b8 ggml-model: Split out ggml_model_build_compute_graph function
  This can be used to build the compute graph without actually executing it, which is useful for memory allocation.
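  A sketch of why this helps: the graph alone is enough to measure memory, no execution needed. The ggml-gobject signature here is an assumption; the ggml_allocr_* calls are the ggml-alloc API this series builds on:

      /* Sketch: build the graph (hypothetical signature), then measure
       * its memory footprint instead of executing it. */
      struct ggml_cgraph *graph = ggml_model_build_compute_graph (model, inputs);
      struct ggml_allocr *measure = ggml_allocr_new_measure (32 /* alignment */);
      size_t needed = ggml_allocr_alloc_graph (measure, graph);
      ggml_allocr_free (measure);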
- 974d9d3
- fddf778
- 0693edc
- c273790 ggml-execution-memory: Add "alloc" and "recorder" modes
  These are based on the newly added ggml_alloc context mode in ggml, which uses an allocator to lay out graph memory rather than the naive linear allocator. The allocator can also compute how much memory is actually required.
  The general flow: first run the forward pass with worst-case inputs in "recorder" mode to compute a maximal memory usage profile. Recorder mode sets tensor data addresses to a region that doesn't exist in memory and ensures that writes to a tensor through the ggml-gobject API don't actually touch memory. Afterwards, allocate a buffer of the required size and use the allocator in "alloc" mode to create the same layout, this time backed by a real buffer. Using alloc mode is fairly cheap, since it doesn't require any system calls (all the memory is allocated upfront).
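  The underlying ggml-alloc pattern that the two modes wrap looks roughly like this (the ggml_allocr_* calls are real ggml API; wrapping them as "recorder" and "alloc" objects is ggml-gobject's own layering, so the surrounding names are a sketch):

      /* Phase 1, "recorder": measure a worst-case graph. Tensor data
       * pointers land in a fake address range; nothing is written. */
      struct ggml_allocr *recorder = ggml_allocr_new_measure (32 /* alignment */);
      size_t mem_needed = ggml_allocr_alloc_graph (recorder, worst_case_graph);
      ggml_allocr_free (recorder);

      /* Phase 2, "alloc": reproduce the same layout in a real buffer.
       * Cheap: one upfront allocation, no per-tensor system calls. */
      void *buffer = g_malloc (mem_needed);
      struct ggml_allocr *alloc = ggml_allocr_new (buffer, mem_needed, 32);
      ggml_allocr_alloc_graph (alloc, graph);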
- 59df66c ggml-language-model: Use the recorder execution memory to estimate memory size
  With this we can finally remove the semi-hardcoded memory estimator for GPT2 models and instead use a real estimate based on the model's actual memory usage.
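  For example, rather than a hand-tuned per-architecture formula, the estimate can come from recording a worst-case forward pass; the ggml-gobject entry points below are hypothetical:

      /* Sketch with hypothetical function names: record a worst-case
       * pass, then read back the required buffer size. */
      GGMLExecutionMemory *recorder = ggml_execution_memory_new_recorder ();
      ggml_language_model_forward (model, worst_case_tokens, recorder, &error);
      size_t required = ggml_execution_memory_get_recorded_size (recorder);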
- 0f30426
- c756e04