service: Move service code into separate library #12
Merged
Conversation
We'll use this type to encapsulate the notion of per-instance memory.
This is a stop-gap, since we'll be moving the execution buffer elsewhere.
It got pulled in indirectly, but it should be used directly.
This is for the case where you want to create a set of allocated weights without creating the whole model.
Basically, instead of taking a GBytes and having memory entangled with the model weights, these are now separate concepts. So create_model_desc also returns a GGMLLanguageModelDesc with separate weight-tree descriptions for the memory weights and the model weights, where there will be one set of memory weights per inference instance as opposed to per model.
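A rough sketch of the shape this takes (the field and constructor names here are assumptions for illustration, not necessarily the exact ggml-gobject API):

```c
/* Illustrative only: one description tree for the shared model weights,
 * and one for the per-instance memory weights. */
typedef struct {
  GGMLModelDescNode *weights_desc;  /* per-model weights, shared by all instances */
  GGMLModelDescNode *memory_desc;   /* per-instance memory (e.g. the KV cache) */
} GGMLLanguageModelDesc;
```

Each inference instance then allocates only the memory tree, while the weight tree is loaded once and shared.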
Let's not re-use the main computation context, as that needs to be preserved in a special way.
This can be used to build the compute graph without actually executing it, which can be useful for memory allocation.
These are based on the newly added ggml_alloc context mode in ggml, which uses an allocator to specify the graph memory layout as opposed to the naive linear allocator. We can also use the allocator to compute how much memory is actually required.

The general flow is that you first run the forward pass with worst-case inputs in "recorder" mode to compute a maximal memory usage profile. The recorder mode sets tensor data addresses to a region that doesn't exist in memory and also takes care to ensure that writes to the tensor through the ggml-gobject API don't actually write anything to memory. Afterwards, you can allocate a buffer of the required size and use the allocator in "alloc" mode to create the same layout, this time backed by a real buffer. Using the alloc mode is fairly cheap, since it doesn't require any system calls (all the memory is allocated upfront).
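A minimal sketch of that measure-then-allocate flow in terms of the plain ggml-alloc API (the "recorder"/"alloc" wrappers described above sit on top of something like this; `build_forward_graph` and the alignment value are placeholders):

```c
#include <stdlib.h>
#include <ggml.h>
#include <ggml-alloc.h>

/* Placeholder: builds the model's forward pass as a ggml_cgraph
 * without computing it. */
struct ggml_cgraph *build_forward_graph (struct ggml_context *ctx, void *inputs);

static void
plan_and_allocate (struct ggml_context *ctx)
{
  /* Measure ("recorder") pass: tensors get placeholder addresses and nothing
   * is written; we only learn the worst-case memory requirement of the graph. */
  struct ggml_allocr *measure = ggml_allocr_new_measure (32 /* alignment */);
  struct ggml_cgraph *graph = build_forward_graph (ctx, /* worst-case inputs */ NULL);
  size_t required = ggml_allocr_alloc_graph (measure, graph);
  ggml_allocr_free (measure);

  /* Alloc pass: the same layout, this time over a real buffer allocated
   * up-front, so no further system calls are needed. */
  void *buffer = malloc (required);
  struct ggml_allocr *alloc = ggml_allocr_new (buffer, required, 32 /* alignment */);
  graph = build_forward_graph (ctx, /* real inputs */ NULL);
  ggml_allocr_alloc_graph (alloc, graph);

  /* ... compute the graph, then eventually ggml_allocr_free (alloc) and free (buffer) */
}
```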
…mory size

With this we can finally remove the semi-hardcoded memory estimator for GPT2 models and instead use a real estimate based on the model's actual memory usage.
…ate library

This vastly simplifies the reference implementation (it previously had to handle all sorts of details like creating unix pipe connections for a private dbus session, etc).
It also manages its own hash table. Once the ref count drops to zero, we also remove the entry from the hash table.
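The registry pattern looks roughly like this (type and field names are illustrative, not the actual code): the hash table holds a borrowed pointer, and the instance removes itself in finalize, when its last reference is dropped.

```c
/* Sketch only: registry of live instances, keyed by an id, without
 * holding a strong reference to them. */
static GHashTable *registry;

static void
ggml_service_completion_finalize (GObject *object)
{
  GGMLServiceCompletion *self = GGML_SERVICE_COMPLETION (object);

  /* Ref count hit zero: drop the registry entry as well. */
  g_hash_table_remove (registry, self->id);

  G_OBJECT_CLASS (ggml_service_completion_parent_class)->finalize (object);
}
```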
We'll use this later to handle the terminate() edge case
This should terminate any completions and cause the object to drop off the bus (plus release any resources that are held on the side of the service).
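A hypothetical handler showing what that amounts to on the service side (the type names are illustrative; the GIO calls are the standard ones for unexporting a skeleton and cancelling work):

```c
/* Sketch: cancel in-flight completions, unexport so the object drops off
 * the bus, and release service-side resources. */
static gboolean
on_handle_terminate (GGMLDBusCompletion    *skeleton,
                     GDBusMethodInvocation *invocation,
                     gpointer               user_data)
{
  GGMLServiceCompletion *self = user_data;

  g_cancellable_cancel (self->cancellable);                                   /* stop completions */
  g_dbus_interface_skeleton_unexport (G_DBUS_INTERFACE_SKELETON (skeleton));  /* drop off the bus */
  g_clear_object (&self->model);                                              /* release held resources */

  g_dbus_method_invocation_return_value (invocation, NULL);
  return TRUE;  /* handled */
}
```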
We don't get any signal when the user types into the buffer normally (from insert-at-cursor).
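One way to catch ordinary typing is to watch the buffer itself rather than relying on the view's insert-at-cursor keybinding signal; a sketch, assuming a GtkTextBuffer is at hand:

```c
#include <gtk/gtk.h>

/* Sketch: the buffer's "insert-text" signal fires for ordinary typing
 * as well, unlike insert-at-cursor. */
static void
on_insert_text (GtkTextBuffer *buffer,
                GtkTextIter   *location,
                gchar         *text,
                gint           len,
                gpointer       user_data)
{
  /* e.g. invalidate or re-request the pending completion here */
}

static void
watch_typing (GtkTextBuffer *buffer)
{
  g_signal_connect_after (buffer, "insert-text",
                          G_CALLBACK (on_insert_text), NULL);
}
```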
We'll lazy-load on the next inference anyway.
The user can now select between local/dbus usage
The thread func expects to consume the entire state, so we have to steal the pointer from the autoptr.
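The hand-off is the usual g_steal_pointer() pattern; a sketch, where InferenceState and its constructor/thread func are placeholders (and the autoptr cleanup func is assumed to be defined):

```c
static void
start_inference (void)
{
  g_autoptr(InferenceState) state = inference_state_new ();

  /* The thread func consumes the entire state, so steal the pointer from
   * the autoptr; the autoptr is left holding NULL and frees nothing. */
  g_thread_unref (g_thread_new ("inference",
                                inference_thread_func,
                                g_steal_pointer (&state)));
}
```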
This is only because GCClosure requires marshalling in/out the function arguments and return value. In our case, type erasure will do - we only really have this so that we can pass closures as object properties.
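Since GClosure is a boxed type, exposing it as a property only needs a boxed pspec; a sketch of what that looks like in class_init (the property name and id are illustrative):

```c
/* Sketch: no argument marshalling is needed, the closure is just carried
 * around as a value. */
GParamSpec *pspec =
  g_param_spec_boxed ("sampler-closure",
                      "Sampler Closure",
                      "Closure invoked to sample from the unrolled model's logits",
                      G_TYPE_CLOSURE,
                      G_PARAM_READWRITE | G_PARAM_CONSTRUCT_ONLY);
g_object_class_install_property (object_class, PROP_SAMPLER_CLOSURE, pspec);
```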
We'll use this to override the behaviour of sampling from the logits of an unrolled language model.
This can be used to inject a little more randomness into the sampling procedure, in line with what's done in GGML proper.
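For reference, a generic temperature-sampling routine over raw logits looks like the sketch below; this illustrates the idea rather than the exact sampler used here:

```c
#include <math.h>
#include <stdlib.h>

/* Divide the logits by the temperature, take a softmax, then draw a token
 * index from the resulting distribution. Higher temperature = more random. */
static int
sample_with_temperature (const float *logits, int n_vocab, float temperature)
{
  float max_logit = -INFINITY;
  for (int i = 0; i < n_vocab; ++i)
    if (logits[i] / temperature > max_logit)
      max_logit = logits[i] / temperature;

  /* softmax over the scaled logits (shifted by the max for stability) */
  float *probs = malloc (n_vocab * sizeof (float));
  float sum = 0.0f;
  for (int i = 0; i < n_vocab; ++i)
    {
      probs[i] = expf (logits[i] / temperature - max_logit);
      sum += probs[i];
    }

  /* draw a token index proportionally to its (unnormalized) probability */
  float r = (float) rand () / (float) RAND_MAX * sum;
  float acc = 0.0f;
  int token = n_vocab - 1;
  for (int i = 0; i < n_vocab; ++i)
    {
      acc += probs[i];
      if (acc >= r)
        {
          token = i;
          break;
        }
    }

  free (probs);
  return token;
}
```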
We now have a libggml-client which an application can link to in order to facilitate the process of opening private dbus connections etc. It doesn't have any dependencies on ggml proper, in case there's some reason why the client can't link to libggml.
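In plain GIO terms, "opening a private dbus connection" amounts to something like the sketch below; libggml-client wraps details like this so the application doesn't have to (the address string is a placeholder supplied by the service side):

```c
#include <gio/gio.h>

static GDBusConnection *
open_private_connection (const char *address, GError **error)
{
  return g_dbus_connection_new_for_address_sync (
      address,                                        /* e.g. a "unix:path=..." address */
      G_DBUS_CONNECTION_FLAGS_AUTHENTICATION_CLIENT,
      NULL,   /* GDBusAuthObserver */
      NULL,   /* GCancellable */
      error);
}
```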