
Enable re-using pool across graph executions #122

Open
robertknight opened this issue Apr 26, 2024 · 2 comments
Labels: performance (Issues that affect model inference or loading performance)

Comments

@robertknight (Owner) commented Apr 26, 2024

#108 added a tensor pool that enables re-use of output buffers for different steps of graph execution. The entire pool is currently freed at the end of the run. For recurrent / autoregressive models where the caller invokes Model::run in a loop, buffer reuse could be further improved by persisting the pool across runs.

Possible APIs:

  1. Add an optional pool parameter to Model::run that lets the caller supply a pool (see the sketch after this list).
  2. Make the pool a field of the Model or Graph. This would require some changes to the pool to enable it to be used from multiple threads concurrently.
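
To make option 1 concrete, here is a minimal sketch of what a pool-aware run loop could look like. The `run_with_pool` method and the stand-in `TensorPool`, `NodeId`, `Input` and `Output` types are assumptions for illustration, not the existing rten API:

```rust
// Hypothetical sketch only: `run_with_pool` and these stand-in types are
// illustrative assumptions, not the current rten API.
struct TensorPool;
struct Model;
struct NodeId(u32);
struct Input;
struct Output;

impl TensorPool {
    fn new() -> Self {
        TensorPool
    }
}

impl Model {
    // Run the graph, drawing intermediate/output buffers from `pool` and
    // returning spare buffers to it when the run completes, instead of
    // freeing them.
    fn run_with_pool(
        &self,
        _inputs: Vec<(NodeId, Input)>,
        _outputs: &[NodeId],
        _pool: &TensorPool,
    ) -> Vec<Output> {
        Vec::new()
    }
}

fn main() {
    let model = Model;
    let pool = TensorPool::new();

    // Autoregressive loop: the same pool is passed to every run, so buffers
    // released at the end of step N can satisfy allocations in step N + 1.
    for _step in 0..8 {
        let inputs = vec![(NodeId(0), Input)];
        let _outputs = model.run_with_pool(inputs, &[NodeId(1)], &pool);
    }
}
```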
@robertknight added the performance label Apr 26, 2024
@robertknight (Owner, Author) commented:
As an extension of this, it would be useful to be able to pass owned tensors as inputs to graph execution, rather than views, so that their buffers can be added to the pool and used to fulfill allocation requests. An example of where this matters is the KV-cache outputs returned from transformer decoder models. These caches are fed back as inputs to the next graph execution. Currently, new KV-cache buffers are allocated on each run; it would be more efficient if they could simply be recycled.
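
A rough sketch of how owned inputs could enable KV-cache recycling in a decoding loop. The `Value::Owned` / `Value::View` distinction and the `run` signature here are assumptions used for illustration, not the existing API:

```rust
// Hypothetical sketch only: the `Value::Owned` / `Value::View` split and this
// `run` signature are assumptions, not the current rten API.
struct Tensor(Vec<f32>);

enum Value<'a> {
    // The runtime may take this buffer and add it to the pool once the
    // input is no longer needed.
    Owned(Tensor),
    // A borrowed view; the runtime cannot recycle its storage.
    View(&'a Tensor),
}

struct Model;

impl Model {
    // Hypothetical run method that accepts owned or borrowed inputs and
    // returns owned outputs (e.g. updated KV caches). This stub hands owned
    // input buffers straight back to illustrate reuse; a real runtime would
    // put them in its pool and allocate outputs from there.
    fn run(&self, inputs: Vec<Value<'_>>) -> Vec<Tensor> {
        inputs
            .into_iter()
            .filter_map(|value| match value {
                Value::Owned(tensor) => Some(tensor),
                Value::View(_) => None,
            })
            .collect()
    }
}

fn main() {
    let model = Model;
    let prompt = Tensor(vec![0.0; 16]);
    let mut kv_cache = Tensor(vec![0.0; 1024]);

    for _step in 0..4 {
        // Pass the previous step's KV cache by value so its buffer can back
        // this step's cache output instead of allocating a fresh one.
        let mut outputs = model.run(vec![Value::View(&prompt), Value::Owned(kv_cache)]);
        kv_cache = outputs.pop().expect("model returns an updated KV cache");
    }
}
```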

@robertknight (Owner, Author) commented:
This was done for sharing between the main graph and subgraphs in #312. That case is simpler because the interpreter loop for a subgraph runs on the same thread as the loop for the parent graph, so it doesn't require making TensorPool usable across threads.
