Tensor memory management and memory hierarchy redesign #136

Closed
axsaucedo opened this issue Feb 7, 2021 · 4 comments
Comments

@axsaucedo
Member

axsaucedo commented Feb 7, 2021

Following various discussions, initial research was carried out to explore better and more optimised ways to handle memory management on both the CPU (heap) and the GPU (host/device memory), mainly based on the points summarised in the discussions in #130 and #36.

Completing this issue will mean we can close #113 and #36.

The proposed design encompasses the following changes (which would be carried out one after the other):

  1. Removing the concept of staging type tensors in favour of hostVisible type tensors and using two buffers per tensor
  2. Removing the concept of persistent anonymous sequences (in conjunction with optensor not owning tensor memory)
  3. Amending the memory hierarchy such that Manager owns tensors directly instead of OpCreateTensor
  4. Remove std::vector mData; completely from Tensor in favour of always storing data in hostVisible buffer memory (TBC)

Below, each of the points is covered in more detail:

1) Removing the concept of staging type tensors in favour of hostVisible type tensors and using two buffers per tensor

This includes removing the concept of a "staging tensor". Instead we will now just have hostVisible, deviceOnlyVisible and storage (which is deviceVisible without src/dst copy). This also encompasses amending tensors to always hold two vkBuffers and two vkMemories. This will optimise memory requirements and remove the complexity of OpTensorCreate needing to hold a separate set of staging vectors just to perform the transfer.
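
As a rough illustration of the two-buffer idea (not the actual Kompute implementation; the helper names and usage flags here are illustrative), a device tensor could allocate a device-local primary buffer plus a host-visible staging buffer along the lines of this vulkan.hpp sketch:

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>
#include <stdexcept>

struct BufferAndMemory {
    vk::Buffer buffer;
    vk::DeviceMemory memory;
};

// Picks a memory type index that matches the requested property flags.
uint32_t findMemoryType(vk::PhysicalDevice physicalDevice,
                        uint32_t typeBits,
                        vk::MemoryPropertyFlags properties)
{
    vk::PhysicalDeviceMemoryProperties memProps = physicalDevice.getMemoryProperties();
    for (uint32_t i = 0; i < memProps.memoryTypeCount; i++) {
        if ((typeBits & (1u << i)) &&
            (memProps.memoryTypes[i].propertyFlags & properties) == properties) {
            return i;
        }
    }
    throw std::runtime_error("No suitable memory type found");
}

// Creates one vkBuffer and backs it with one vkMemory allocation.
BufferAndMemory createBuffer(vk::PhysicalDevice physicalDevice,
                             vk::Device device,
                             vk::DeviceSize size,
                             vk::BufferUsageFlags usage,
                             vk::MemoryPropertyFlags memoryFlags)
{
    BufferAndMemory out;
    out.buffer = device.createBuffer(
        vk::BufferCreateInfo({}, size, usage, vk::SharingMode::eExclusive));
    vk::MemoryRequirements reqs = device.getBufferMemoryRequirements(out.buffer);
    uint32_t memTypeIndex =
        findMemoryType(physicalDevice, reqs.memoryTypeBits, memoryFlags);
    out.memory = device.allocateMemory(vk::MemoryAllocateInfo(reqs.size, memTypeIndex));
    device.bindBufferMemory(out.buffer, out.memory, 0);
    return out;
}

// A "device" tensor would hold both of these; a "storage" tensor would only
// need the device-local primary buffer (no staging / transfer copies).
void createDeviceTensorBuffers(vk::PhysicalDevice physicalDevice,
                               vk::Device device,
                               vk::DeviceSize sizeBytes,
                               BufferAndMemory& primary,
                               BufferAndMemory& staging)
{
    primary = createBuffer(physicalDevice, device, sizeBytes,
                           vk::BufferUsageFlagBits::eStorageBuffer |
                               vk::BufferUsageFlagBits::eTransferSrc |
                               vk::BufferUsageFlagBits::eTransferDst,
                           vk::MemoryPropertyFlagBits::eDeviceLocal);
    staging = createBuffer(physicalDevice, device, sizeBytes,
                           vk::BufferUsageFlagBits::eTransferSrc |
                               vk::BufferUsageFlagBits::eTransferDst,
                           vk::MemoryPropertyFlagBits::eHostVisible |
                               vk::MemoryPropertyFlagBits::eHostCoherent);
}
```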

2) Removing the concept of persistent anonymous sequences (in conjunction with optensor not owning tensor memory)

This encompasses removing the concept of anonymous sequences, to avoid implicitly accumulating memory every time a new manager operation is run. Anonymous operations will instead re-use the same DEFAULT sequence. The main challenge is that OpCreateTensor will then always need to be created in a named sequence (although this will be addressed in the next phase).
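
For illustration only (the class and method names below are hypothetical, not the Kompute API), the idea is that an anonymous evaluation re-uses a sequence stored under a fixed DEFAULT key rather than creating and retaining a fresh anonymous sequence on every call:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

struct Sequence { /* records and evaluates GPU commands */ };

class Manager {
public:
    // Named sequences are created once and then re-used on lookup.
    std::shared_ptr<Sequence> sequence(const std::string& name) {
        auto it = mSequences.find(name);
        if (it == mSequences.end()) {
            it = mSequences.emplace(name, std::make_shared<Sequence>()).first;
        }
        return it->second;
    }

    // An "anonymous" evaluation no longer allocates a new, never-released
    // sequence; it re-uses the one stored under the DEFAULT key.
    std::shared_ptr<Sequence> defaultSequence() {
        return sequence("DEFAULT");
    }

private:
    std::unordered_map<std::string, std::shared_ptr<Sequence>> mSequences;
};
```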

3) Amending the memory hierarchy such that Manager owns tensors directly instead of OpCreateTensor

This step moves memory ownership away from OpBase towards Tensor. This will require a potentially large breaking change, as the most intuitive way to implement it is to ensure that Tensors always have to be created via the Manager, through one of the following two options:

  • Running the mgr.buildManagedTensor(....) function; or
  • Creating a new t = std::make_shared<Tensor>(...) and then initialising/registering it with the manager via mgr.buildManagedTensor(t)

This ensures the Manager keeps track of all the relevant Tensors created and that they are destroyed when the Manager itself is destroyed. Of course, this wouldn't be required if you initialise the tensor with your own Vulkan components, for example:

  • Create a new t = Tensor(...)
  • Initialise it manually with t.init(vkdevice, vkinstance)
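
To make the proposed ownership model more concrete, here is a minimal sketch assuming the buildManagedTensor naming from above (this is not the actual implementation): the Manager retains a shared_ptr to every Tensor it creates or registers, and frees their resources when it is itself destroyed.

```cpp
#include <memory>
#include <utility>
#include <vector>

struct Tensor {
    // ... owns its vk::Buffer / vk::DeviceMemory handles ...
    void destroy() { /* free Vulkan resources; assumed idempotent */ }
    ~Tensor() { destroy(); }
};

class Manager {
public:
    // Option 1: the Manager creates and initialises the Tensor itself.
    template <typename... Args>
    std::shared_ptr<Tensor> buildManagedTensor(Args&&... args) {
        auto tensor = std::make_shared<Tensor>(std::forward<Args>(args)...);
        mManagedTensors.push_back(tensor);
        return tensor;
    }

    // Option 2: the caller constructs the Tensor and registers it afterwards.
    void buildManagedTensor(std::shared_ptr<Tensor> tensor) {
        mManagedTensors.push_back(std::move(tensor));
    }

    // All managed tensors are torn down together with the Manager, even if
    // callers still hold references to them.
    ~Manager() {
        for (auto& tensor : mManagedTensors) {
            tensor->destroy();
        }
    }

private:
    std::vector<std::shared_ptr<Tensor>> mManagedTensors;
};
```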

4) Remove std::vector mData; completely from Tensor in favour of always storing data in hostVisible buffer memory (TBC)

After doing some research I was able to confirm the following:

  • All GPU devices will always have both HostVisible and DeviceVisible memory available for use
  • CPU access to GPU hostVisible memory is relatively efficient (even if not as efficient as RAM); an AMD document mentions "up to 64 GB/s peak theoretical transport data bandwidth from CPU to GPU per card" - this still needs to be confirmed further
  • It is possible to leverage std-compatible containers such as std::span-like containers (or Microsoft's gsl::span) to provide a non-owning reference to the host-visible memory

Given these points, it may be possible to remove the std::vector mData attribute completely whilst still keeping the best of both worlds - namely 1) access to the memory that was uploaded to the GPU, and 2) optimal memory usage (mainly by not storing an extra copy of the data).

This point still needs to be explored further, as there are some nuances around how this would be implemented in a memory-safe way, and around ensuring that the Python interface can be exposed efficiently - ideally the data can be passed to and from Python with a minimal number of copies (it's not clear what the support for non-owning containers like span is there).
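
As a hedged sketch of what point 4 could look like (assuming C++20 std::span and a host-visible, host-coherent allocation; this is not the actual implementation), the tensor would expose a non-owning view over the mapped GPU memory instead of keeping a std::vector copy:

```cpp
#include <vulkan/vulkan.hpp>
#include <span>

// Returns a non-owning float view over host-visible GPU memory. The memory
// stays mapped for the lifetime of the tensor, so no extra CPU-side copy of
// the data needs to be stored. Assumes host-coherent memory, so no explicit
// flush/invalidate calls are shown here.
std::span<float> mapTensorData(vk::Device device,
                               vk::DeviceMemory hostVisibleMemory,
                               vk::DeviceSize sizeBytes)
{
    void* ptr = device.mapMemory(hostVisibleMemory, 0, sizeBytes);
    return std::span<float>(static_cast<float*>(ptr),
                            sizeBytes / sizeof(float));
}
```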

@alexander-g
Contributor

> This also encompasses amending tensors to always have two vkBuffers and two vkMemories.

If I'm not mistaken, one buffer will be allocated on the CPU and one on the GPU, is that correct? Does this also apply to storage tensors? For them the CPU buffer seems redundant, and with something like DNNs one would have a lot of storage tensors.

@axsaucedo
Member Author

@alexander-g no, vkBuffer and vkMemory are always VRAM (GPU memory); the only difference is whether the memory is CPU/host visible or GPU-device-visible only. You can have a look at PR #137, which implements it.

@axsaucedo
Member Author

The main thing here is that instead of the concept of staging tensors, there would now be the concept of host-visible-GPU-memory tensors and device-only-memory tensors. For host-visible tensors only a single buffer in host-visible memory is used. For device-only tensors, in the case of TensorType::eDevice an extra buffer is created for staging the data into the device, and in the case of TensorType::eStorage only one buffer is used. We could also explore functionality that only creates the staging buffer when data is transferred to/from the device - this would add overhead when reading/writing to the device, but could actually be a handy option (which we can explore once the PRs are merged).
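
To summarise that mapping as a sketch (the eDevice and eStorage values come from the comment above; eHost and the struct are illustrative only, not the Kompute API):

```cpp
// Per-tensor-type buffer layout described above.
enum class TensorType { eHost, eDevice, eStorage };

struct TensorBufferConfig {
    bool primaryIsDeviceLocal; // primary buffer lives in device-local memory
    bool hasStagingBuffer;     // extra host-visible buffer for transfers
};

TensorBufferConfig bufferConfigFor(TensorType type) {
    switch (type) {
        case TensorType::eHost:    return {false, false}; // single host-visible buffer
        case TensorType::eDevice:  return {true,  true};  // device buffer + staging buffer
        case TensorType::eStorage: return {true,  false}; // device buffer only, no transfers
    }
    return {true, false};
}
```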

@axsaucedo
Member Author

Both of the main PRs, covering phases 1, 2 and 3, have been merged. Closing this issue and adding an extra issue to explore point 4 separately.
