
ggml-gobject: Add concept of execution memory and remove hardcoded memory estimates #11

Closed · wants to merge 16 commits

Commits on Aug 4, 2023

  1. ggml-gobject: Add GGMLExecutionMemory

    We'll use this type to encapsulate the notion of per-instance memory.
    smspillaz committed Aug 4, 2023 · 8abea50
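    As a rough sketch of the concept in plain C (illustrative names, not the actual ggml-gobject API), an execution-memory object is little more than a refcounted scratch buffer that lives apart from the shared model weights:

    ```c
    #include <stdlib.h>

    /* Hypothetical stand-in for GGMLExecutionMemory: a refcounted,
     * per-instance scratch buffer, separate from shared model weights. */
    typedef struct {
        int    ref_count;
        size_t size;
        char  *data;   /* backs the KV cache and intermediate tensors */
    } ExecutionMemory;

    static ExecutionMemory *
    execution_memory_new (size_t size)
    {
        ExecutionMemory *mem = malloc (sizeof *mem);
        mem->ref_count = 1;
        mem->size = size;
        mem->data = calloc (1, size);
        return mem;
    }

    static void
    execution_memory_unref (ExecutionMemory *mem)
    {
        if (--mem->ref_count == 0) {
            free (mem->data);
            free (mem);
        }
    }
    ```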
  2. ggml-gpt: Expose estimate_memory_buffer_size

    This is a stop-gap, since we'll be moving the execution buffer
    elsewhere.
    smspillaz committed Aug 4, 2023 · 175617c
  3. ggml-language-model: Add missing ggml-cached-model include

    It was pulled in indirectly, but it should be included directly.
    smspillaz committed Aug 4, 2023 · 41bc805
  4. ggml-model: Add ggml_new_weight_set_for_flattened_desc

    This is useful if you want to create a set of allocated weights
    without creating the whole model.
    smspillaz committed Aug 4, 2023 · 11aae67
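    Loosely sketched (hypothetical types standing in for GGML's tensor descriptions), this amounts to allocating a flat set of named weight buffers straight from a flattened description, with no surrounding model object:

    ```c
    #include <stdlib.h>

    /* Illustrative only; the real function operates on GGML tensor descs. */
    typedef struct { const char *name; size_t n_bytes; } WeightDesc;
    typedef struct { const char *name; void *data; } Weight;

    /* Allocate one buffer per description, without building a model. */
    static Weight *
    new_weight_set_for_flattened_desc (const WeightDesc *descs, size_t n)
    {
        Weight *set = malloc (n * sizeof *set);
        for (size_t i = 0; i < n; ++i) {
            set[i].name = descs[i].name;
            set[i].data = calloc (1, descs[i].n_bytes);
        }
        return set;
    }
    ```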
  5. ggml-gobject: API break - we take a GGMLExecutionMemory now

    Instead of taking GBytes and having the execution memory entangled
    with the model weights, these are now separate concepts.
    create_model_desc now returns a GGMLLanguageModelDesc with
    separate weight tree descriptions for the memory weights and
    the model weights, and there is one set of memory weights
    per inference instance, as opposed to per model.
    smspillaz committed Aug 4, 2023 · e24bcdb
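    A self-contained sketch of the ownership split (all names hypothetical): weights are loaded once and shared by reference, while each inference instance owns its own execution memory, e.g. for its KV cache:

    ```c
    #include <stdlib.h>

    typedef struct { int ref_count; /* tensor data elided */ } ModelWeights;
    typedef struct { size_t size; char *data; } ExecutionMemory;

    typedef struct {
        ModelWeights    *weights;  /* shared: loaded once per model */
        ExecutionMemory *memory;   /* private: one per inference instance */
    } InferenceInstance;

    static InferenceInstance
    inference_instance_new (ModelWeights *shared, size_t memory_size)
    {
        InferenceInstance inst;
        shared->ref_count++;       /* weights are shared by reference */
        inst.weights = shared;
        inst.memory = malloc (sizeof *inst.memory);
        inst.memory->size = memory_size;
        inst.memory->data = calloc (1, memory_size);
        return inst;
    }
    ```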
  6. 6bcc664

Commits on Aug 8, 2023

  1. 19bafaa
  2. ggml-compute-plan: Allocate work tensor inside plan's own context

    Let's not re-use the main computation context, as it needs to be
    preserved in a special way.
    smspillaz committed Aug 8, 2023 · bcf5146
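    Sketched with illustrative names, the ownership rule is simply that the plan owns the allocation backing its work tensor, so freeing the plan cannot disturb the main computation context:

    ```c
    #include <stdlib.h>

    /* Hypothetical compute plan that owns its own work allocation. */
    typedef struct {
        size_t work_size;
        void  *work_data;   /* lives in the plan's own context */
    } ComputePlan;

    static ComputePlan *
    compute_plan_new (size_t work_size)
    {
        ComputePlan *plan = malloc (sizeof *plan);
        plan->work_size = work_size;
        plan->work_data = malloc (work_size);
        return plan;
    }

    static void
    compute_plan_free (ComputePlan *plan)
    {
        free (plan->work_data);   /* never touches the main context */
        free (plan);
    }
    ```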
  3. ggml-model: Split out ggml_model_build_compute_graph function

    This can be used to build the compute graph without actually
    executing it, which is useful for memory allocation.
    smspillaz committed Aug 8, 2023 · 27317b8
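    With made-up graph types, the point of the split looks like this: once a graph can be built without being executed, it can be walked to sum tensor sizes before any real allocation happens:

    ```c
    #include <stddef.h>

    /* Toy graph representation; the real code builds a ggml cgraph. */
    typedef struct { size_t n_bytes; } Node;
    typedef struct { Node *nodes; size_t n_nodes; } Graph;

    /* Walk a built-but-unexecuted graph to estimate required memory. */
    static size_t
    graph_required_bytes (const Graph *graph)
    {
        size_t total = 0;
        for (size_t i = 0; i < graph->n_nodes; ++i)
            total += graph->nodes[i].n_bytes;
        return total;
    }
    ```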
  4. 974d9d3
  5. fddf778
  6. 0693edc
  7. ggml-execution-memory: Add "alloc" and "recorder" modes

    These are based on the newly added ggml_alloc context mode in
    ggml, which uses an allocator to determine the graph memory layout,
    as opposed to the naive linear allocator. We can also use the
    allocator to compute how much memory is actually required.

    The general flow is that you first run the forward pass with
    worst-case inputs in "recorder" mode to compute a maximal memory
    usage profile. The recorder mode sets tensor data addresses to a
    region that doesn't exist in memory and also takes care to ensure
    that writes to the tensor through the ggml-gobject API don't
    actually write anything to memory.

    Afterwards, you can allocate a buffer of the required size, then
    use the allocator in "alloc" mode to create the same layout, this
    time backed by a real buffer.

    Using the alloc mode is fairly cheap, since it doesn't require
    any system calls (all the memory is allocated upfront).
    smspillaz committed Aug 8, 2023 · c273790
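    The two-pass flow can be sketched with a plain bump allocator standing in for ggml's allocator (this is the shape of the technique, not the ggml API): pass one ("recorder") hands out addresses that must never be dereferenced and records the high-water mark; pass two ("alloc") replays the same layout against a real buffer:

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        char  *base;        /* NULL in recorder mode: no backing memory */
        size_t offset;
        size_t high_water;  /* maximal memory usage seen so far */
    } BumpAlloc;

    static void *
    bump_alloc (BumpAlloc *a, size_t n)
    {
        size_t at = a->offset;
        a->offset += n;
        if (a->offset > a->high_water)
            a->high_water = a->offset;
        /* Recorder mode returns a fake address; never dereference it. */
        return a->base ? (void *) (a->base + at) : (void *) (uintptr_t) at;
    }

    int
    main (void)
    {
        /* Pass 1: "recorder" mode with worst-case tensor sizes. */
        BumpAlloc recorder = { NULL, 0, 0 };
        bump_alloc (&recorder, 1024);
        bump_alloc (&recorder, 4096);

        /* Pass 2: one real allocation, then replay the same layout. */
        BumpAlloc real = { malloc (recorder.high_water), 0, 0 };
        void *t0 = bump_alloc (&real, 1024);
        void *t1 = bump_alloc (&real, 4096);

        printf ("needed %zu bytes (%p, %p)\n",
                recorder.high_water, t0, t1);
        free (real.base);
        return 0;
    }
    ```

    As the commit message notes, the replay pass is cheap: the only system call is the single up-front allocation.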
  8. ggml-language-model: Use the recorder execution memory to estimate memory size
    
    With this we can finally remove the semi-hardcoded memory estimator
    for GPT2 models and instead use a real estimate based on the model's
    actual memory usage.
    smspillaz committed Aug 8, 2023 · 59df66c
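    For contrast, the estimator being removed presumably encoded closed-form arithmetic along these lines (the exact formula in ggml-gobject may differ); the recorder pass replaces guesses like this with the measured high-water mark for a worst-case input:

    ```c
    #include <stddef.h>

    /* Hypothetical closed-form guess: K and V caches, one pair per
     * layer, each holding n_ctx x n_embd elements. */
    static size_t
    gpt2_kv_cache_guess (size_t n_layer, size_t n_ctx, size_t n_embd,
                         size_t element_size)
    {
        return 2 * n_layer * n_ctx * n_embd * element_size;
    }
    ```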
  9. 0f30426
  10. c756e04