Add buffer pool/arena to enable re-use of temporary buffers during graph execution #108
Graph execution often spends a significant amount of time allocating and freeing large buffers via the system allocator. So far this has been mitigated for some operators by running them in place on their first input. However, many important operations cannot run in place, and operators that can run in place often do not, because an input is still needed by a subsequent operation.
This PR introduces a tensor buffer pool (`TensorPool`), which is created at the start of the graph run and used by operators as an allocator for their outputs. Once a value is no longer needed by subsequent steps of graph execution, its buffer is added to the pool and made available for reuse by later steps.

New output has been added to the timing report enabled by `RTEN_TIMING`, reporting the total number of allocation requests to the pool and the hit rate (how often buffer requests were fulfilled from the pool).

The pool is disabled by default and enabled by setting the `RTEN_USE_POOL` env var. It will be enabled once the majority of operators are converted to allocate from the pool.

To verify this works, a subset of operators have been converted to allocate from the pool, based on the ops used by the YOLOv8 example. In this example, this reduces execution times on my laptop from 210-220ms to 180-190ms, and this may improve further when additional operators are converted to use pool allocation.
Most operators do not yet allocate from the pool, and they will be converted in subsequent commits.
TODO:

- `Tensor*` methods
- `TensorPool`