Tensor memory management and memory hierarchy redesign #136

Closed
axsaucedo opened this issue Feb 7, 2021 · 4 comments
Comments

@axsaucedo
Member

axsaucedo commented Feb 7, 2021

Following various discussions, initial research was carried out to explore better and more optimised ways to handle memory management on both the CPU (heap) and the GPU (host/device memory), mainly based on the points summarised in the discussions in #130 and #36.

Completing this issue will mean we can close #113 and #36.

The proposed design encompasses the following changes (which would be carried out one after the other):

  1. Removing the concept of staging type tensors in favour of hostVisible type tensors and using two buffers per tensor
  2. Removing the concept of persistent anonymous sequences (in conjunction with optensor not owning tensor memory)
  3. Amending the memory hierarchy such that Manager owns tensors directly instead of OpCreateTensor
  4. Remove std::vector mData; completely from Tensor in favour of always storing data in hostVisible buffer memory (TBC)

Below, each of the points is covered in more detail:

1) Removing the concept of staging type tensors in favour of hostVisible type tensors and using two buffers per tensor

This includes removing the concept of a "staging tensor". Instead we will now just have hostVisible, deviceOnlyVisible and storage (which is deviceVisible without src/dst copy). This also encompasses amending tensors to always hold two vkBuffers and two vkMemories. This will optimise memory requirements and remove the complexity of OpTensorCreate needing to hold a separate set of staging vectors just to perform the transfer.
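
As a rough illustration of the two-buffer idea (not the actual Kompute implementation; the helper names and usage flags here are illustrative), a device tensor could allocate a device-local primary buffer plus a host-visible staging buffer along the lines of this vulkan.hpp sketch:

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>
#include <stdexcept>

struct BufferAndMemory {
    vk::Buffer buffer;
    vk::DeviceMemory memory;
};

// Picks a memory type index that matches the requested property flags.
uint32_t findMemoryType(vk::PhysicalDevice physicalDevice,
                        uint32_t typeBits,
                        vk::MemoryPropertyFlags properties)
{
    vk::PhysicalDeviceMemoryProperties memProps = physicalDevice.getMemoryProperties();
    for (uint32_t i = 0; i < memProps.memoryTypeCount; i++) {
        if ((typeBits & (1u << i)) &&
            (memProps.memoryTypes[i].propertyFlags & properties) == properties) {
            return i;
        }
    }
    throw std::runtime_error("No suitable memory type found");
}

// Creates one vkBuffer and backs it with one vkMemory allocation.
BufferAndMemory createBuffer(vk::PhysicalDevice physicalDevice,
                             vk::Device device,
                             vk::DeviceSize size,
                             vk::BufferUsageFlags usage,
                             vk::MemoryPropertyFlags memoryFlags)
{
    BufferAndMemory out;
    out.buffer = device.createBuffer(
        vk::BufferCreateInfo({}, size, usage, vk::SharingMode::eExclusive));
    vk::MemoryRequirements reqs = device.getBufferMemoryRequirements(out.buffer);
    uint32_t memTypeIndex =
        findMemoryType(physicalDevice, reqs.memoryTypeBits, memoryFlags);
    out.memory = device.allocateMemory(vk::MemoryAllocateInfo(reqs.size, memTypeIndex));
    device.bindBufferMemory(out.buffer, out.memory, 0);
    return out;
}

// A "device" tensor would hold both of these; a "storage" tensor would only
// need the device-local primary buffer (no staging / transfer copies).
void createDeviceTensorBuffers(vk::PhysicalDevice physicalDevice,
                               vk::Device device,
                               vk::DeviceSize sizeBytes,
                               BufferAndMemory& primary,
                               BufferAndMemory& staging)
{
    primary = createBuffer(physicalDevice, device, sizeBytes,
                           vk::BufferUsageFlagBits::eStorageBuffer |
                               vk::BufferUsageFlagBits::eTransferSrc |
                               vk::BufferUsageFlagBits::eTransferDst,
                           vk::MemoryPropertyFlagBits::eDeviceLocal);
    staging = createBuffer(physicalDevice, device, sizeBytes,
                           vk::BufferUsageFlagBits::eTransferSrc |
                               vk::BufferUsageFlagBits::eTransferDst,
                           vk::MemoryPropertyFlagBits::eHostVisible |
                               vk::MemoryPropertyFlagBits::eHostCoherent);
}
```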

2) Removing the concept of persistent anonymous sequences (in conjunction with optensor not owning tensor memory)

This encompasses removing the concept of anonymous sequences, to avoid implicitly accumulating memory every time a new manager operation is run. Anonymous operations will instead re-use the same DEFAULT sequence. The main challenge is that OpCreateTensor will then always need to be created in a named sequence (although this will be addressed in the next phase).
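
For illustration only (the class and method names below are hypothetical, not the Kompute API), the idea is that an anonymous evaluation re-uses a sequence stored under a fixed DEFAULT key rather than creating and retaining a fresh anonymous sequence on every call:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

struct Sequence { /* records and evaluates GPU commands */ };

class Manager {
public:
    // Named sequences are created once and then re-used on lookup.
    std::shared_ptr<Sequence> sequence(const std::string& name) {
        auto it = mSequences.find(name);
        if (it == mSequences.end()) {
            it = mSequences.emplace(name, std::make_shared<Sequence>()).first;
        }
        return it->second;
    }

    // An "anonymous" evaluation no longer allocates a new, never-released
    // sequence; it re-uses the one stored under the DEFAULT key.
    std::shared_ptr<Sequence> defaultSequence() {
        return sequence("DEFAULT");
    }

private:
    std::unordered_map<std::string, std::shared_ptr<Sequence>> mSequences;
};
```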

3) Amending the memory hierarchy such that Manager owns tensors directly instead of OpCreateTensor

This step moves memory ownership away from OpBase towards Tensor. This will require a potentially large breaking change, as the most intuitive way to implement it is to ensure that Tensors always have to be created via the Manager, through one of the following two options:

  • Running the mgr.buildManagedTensor(....) function; or
  • Creating a new t = std::make_shared<Tensor>(...) and then initialising/registering it with the manager via mgr.buildManagedTensor(t)

This ensures the Manager keeps track of all the relevant Tensors created and that they are destroyed when the Manager itself is destroyed. Of course, this wouldn't be required if you initialise the tensor with your own Vulkan components, for example:

  • Create a new t = Tensor(...)
  • Initialise it manually with t.init(vkdevice, vkinstance)
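
To make the proposed ownership model more concrete, here is a minimal sketch assuming the buildManagedTensor naming from above (this is not the actual implementation): the Manager retains a shared_ptr to every Tensor it creates or registers, and frees their resources when it is itself destroyed.

```cpp
#include <memory>
#include <utility>
#include <vector>

struct Tensor {
    // ... owns its vk::Buffer / vk::DeviceMemory handles ...
    void destroy() { /* free Vulkan resources; assumed idempotent */ }
    ~Tensor() { destroy(); }
};

class Manager {
public:
    // Option 1: the Manager creates and initialises the Tensor itself.
    template <typename... Args>
    std::shared_ptr<Tensor> buildManagedTensor(Args&&... args) {
        auto tensor = std::make_shared<Tensor>(std::forward<Args>(args)...);
        mManagedTensors.push_back(tensor);
        return tensor;
    }

    // Option 2: the caller constructs the Tensor and registers it afterwards.
    void buildManagedTensor(std::shared_ptr<Tensor> tensor) {
        mManagedTensors.push_back(std::move(tensor));
    }

    // All managed tensors are torn down together with the Manager, even if
    // callers still hold references to them.
    ~Manager() {
        for (auto& tensor : mManagedTensors) {
            tensor->destroy();
        }
    }

private:
    std::vector<std::shared_ptr<Tensor>> mManagedTensors;
};
```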

4) Remove std::vector mData; completely from Tensor in favour of always storing data in hostVisible buffer memory (TBC)

After doing some research I was able to confirm the following:

  • All GPU devices will always have both HostVisible and DeviceVisible memory available for use
  • CPU access to GPU hostVisible memory is relatively efficient (even if not as efficient as RAM); an AMD document mentions "up to 64 GB/s peak theoretical transport data bandwidth from CPU to GPU per card" - this still needs to be confirmed further
  • It is possible to leverage std-compatible containers such as std::span-like containers (or Microsoft's gsl::span) to provide a non-owning reference to the host-visible memory

Given these points, it may be possible to remove the std::vector mData attribute completely whilst still keeping the best of both worlds - namely 1) access to the memory that was uploaded to the GPU, and 2) optimal memory usage (mainly by not storing an extra copy of the data).

This point still needs to be explored further, as there are some nuances around how this would be implemented in a memory-safe way, and around ensuring that the Python interface can be exposed efficiently - ideally the data can be passed to and from Python with a minimal number of copies (it's not clear what the support for non-owning containers like span is there).
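
As a hedged sketch of what point 4 could look like (assuming C++20 std::span and a host-visible, host-coherent allocation; this is not the actual implementation), the tensor would expose a non-owning view over the mapped GPU memory instead of keeping a std::vector copy:

```cpp
#include <vulkan/vulkan.hpp>
#include <span>

// Returns a non-owning float view over host-visible GPU memory. The memory
// stays mapped for the lifetime of the tensor, so no extra CPU-side copy of
// the data needs to be stored. Assumes host-coherent memory, so no explicit
// flush/invalidate calls are shown here.
std::span<float> mapTensorData(vk::Device device,
                               vk::DeviceMemory hostVisibleMemory,
                               vk::DeviceSize sizeBytes)
{
    void* ptr = device.mapMemory(hostVisibleMemory, 0, sizeBytes);
    return std::span<float>(static_cast<float*>(ptr),
                            sizeBytes / sizeof(float));
}
```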

@alexander-g
Contributor

> This also encompasses amending tensors to always have two vkBuffers and two vkMemories.

If I'm not mistaken, one buffer will be allocated on the CPU and one on the GPU, is that correct? Does this also apply to storage tensors? For them the CPU buffer seems redundant, and with something like DNNs one would have a lot of storage tensors.

@axsaucedo
Member Author

@alexander-g no, vkBuffer and vkMemory are always VRAM (GPU memory); the only difference is whether the memory is CPU/host visible or GPU-device-visible only. You can have a look at PR #137, which implements it.

@axsaucedo
Member Author

The main thing here is that instead of the concept of staging tensors, there would now be the concept of host-visible-GPU-memory tensors and device-only-memory tensors. For host-visible tensors only a single buffer in host-visible memory is used. For device-only tensors, in the case of TensorType::eDevice an extra buffer is created for staging the data into the device, and in the case of TensorType::eStorage only one buffer is used. We could also explore functionality that only creates the staging buffer when data is transferred to/from the device - this would add overhead when reading/writing to the device, but could actually be a handy option (which we can explore once the PRs are merged).
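
To summarise that mapping as a sketch (the eDevice and eStorage values come from the comment above; eHost and the struct are illustrative only, not the Kompute API):

```cpp
// Per-tensor-type buffer layout described above.
enum class TensorType { eHost, eDevice, eStorage };

struct TensorBufferConfig {
    bool primaryIsDeviceLocal; // primary buffer lives in device-local memory
    bool hasStagingBuffer;     // extra host-visible buffer for transfers
};

TensorBufferConfig bufferConfigFor(TensorType type) {
    switch (type) {
        case TensorType::eHost:    return {false, false}; // single host-visible buffer
        case TensorType::eDevice:  return {true,  true};  // device buffer + staging buffer
        case TensorType::eStorage: return {true,  false}; // device buffer only, no transfers
    }
    return {true, false};
}
```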

@axsaucedo
Member Author

Both of the main PRs, covering phases 1, 2 and 3, have been merged. Closing this issue and adding an extra issue to explore point 4 separately.
