Add buffer pool/arena to enable re-use of temporary buffers during graph execution #108
Graph execution often spends a significant amount of time allocating and freeing large buffers via the system allocator. So far this has been mitigated for some operators by running them in place on their first input. However, many important operations cannot run in place, and operators that can run in place often do not, because an input is still needed by a subsequent operation.
This PR introduces a tensor buffer pool (`TensorPool`), which is created at the start of the graph run and used by operators as an allocator for their outputs. Once a value is no longer needed by subsequent steps of graph execution, its buffer is added to the pool and made available for reuse by later steps.

New output has been added to the timing report enabled by `RTEN_TIMING`, reporting the total number of allocation requests to the pool and the hit rate (how often buffer requests were fulfilled from the pool).

The pool is disabled by default and enabled by setting the `RTEN_USE_POOL` env var. It will be enabled once the majority of operators are converted to allocate from the pool.

To verify this works, a subset of operators have been converted to allocate from the pool, based on the ops used by the YOLOv8 example. In this example, this reduces execution times on my laptop from 210-220ms to 180-190ms, and this may improve further when additional operators are converted to use pool allocation.
Most operators do not yet allocate from the pool, and they will be converted in subsequent commits.
TODO:

- `Tensor*` methods
- `TensorPool`