[Feature proposal] Support RAPIDS Memory Manager (RMM) #5861

Closed
daxiongshu opened this issue Jul 6, 2020 · 12 comments · Fixed by #5873
Comments

@daxiongshu

RAPIDS is a GPU-based data analytics framework developed by NVIDIA. A typical all-GPU workflow uses RAPIDS cuDF to read data and create features, and then uses XGBoost on the GPU to train. The two libraries end up competing for GPU memory, which can lead to out-of-memory (OOM) errors since device memory is often limited.

To mitigate this issue, I propose that XGBoost support RMM (RAPIDS Memory Manager) as an alternative memory allocator, so that it can share a memory pool with other RAPIDS libraries.

@hcho3
Collaborator

hcho3 commented Jul 6, 2020

To my knowledge, XGBoost uses thrust::device_malloc_allocator and cub::CachingDeviceAllocator to allocate GPU memory:

// Declare xgboost allocators
// Replacement of allocator with custom backend should occur here
template <typename T>
using XGBDeviceAllocator = detail::XGBDefaultDeviceAllocatorImpl<T>;
/*! Be careful that the initialization constructor is a no-op, which means calling
* `vec.resize(n)` won't initialize the memory region to 0. Instead use
* `vec.resize(n, 0)`*/
template <typename T>
using XGBCachingDeviceAllocator = detail::XGBCachingDeviceAllocatorImpl<T>;

Can RMM work with Thrust and Cub allocators?

@jrhemstad

> To my knowledge, XGBoost uses thrust::device_malloc_allocator and cub::CachingDeviceAllocator to allocate GPU memory:

What @daxiongshu is suggesting is to make XGBoost's allocator configurable: rather than hardcoding thrust::device_malloc_allocator and cub::CachingDeviceAllocator, allow a different allocator to be passed in, such as one of RMM's sub-allocator implementations.

For more detail on RMM's interface, see https://github.com/rapidsai/rmm#using-rmm-in-c.
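As a rough illustration of the configurable-allocator idea, the sketch below switches an allocator alias at compile time. Everything in it is hypothetical: `ToyAllocator`, `Backend`, `DeviceAllocator`, `DeviceVector`, and the `SKETCH_USE_POOL_BACKEND` flag are made-up names, and both "backends" allocate host memory so that the example builds without CUDA, Thrust, or RMM. It only shows the shape of the change, not how XGBoost or RMM implement it.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical tag identifying which backend an allocator stands in for.
enum class Backend { kDefault, kPool };

// Hypothetical allocator: both backends use host memory here, so only the
// tag differs. A real build would wrap a Thrust or RMM allocator instead.
template <typename T, Backend B>
struct ToyAllocator {
  using value_type = T;
  template <typename U>
  struct rebind { using other = ToyAllocator<U, B>; };

  ToyAllocator() = default;
  template <typename U>
  ToyAllocator(const ToyAllocator<U, B>&) noexcept {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(::operator new(n * sizeof(T)));
  }
  void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }

  static constexpr Backend backend() { return B; }
};

template <typename T, typename U, Backend B>
bool operator==(const ToyAllocator<T, B>&, const ToyAllocator<U, B>&) noexcept {
  return true;
}
template <typename T, typename U, Backend B>
bool operator!=(const ToyAllocator<T, B>&, const ToyAllocator<U, B>&) noexcept {
  return false;
}

// The single place where the backend is chosen (hypothetical build flag).
#ifdef SKETCH_USE_POOL_BACKEND
template <typename T>
using DeviceAllocator = ToyAllocator<T, Backend::kPool>;
#else
template <typename T>
using DeviceAllocator = ToyAllocator<T, Backend::kDefault>;
#endif

// Containers are written against the alias, never a concrete allocator.
template <typename T>
using DeviceVector = std::vector<T, DeviceAllocator<T>>;

// Usage: allocate a vector through the alias and report the active backend.
inline Backend fill_and_report(std::size_t n) {
  DeviceVector<int> v(n, 0);
  assert(v.size() == n);
  return DeviceAllocator<int>::backend();
}
```

Because all containers go through the alias, swapping the backend would not require touching the code that uses `DeviceVector`.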

@hcho3
Collaborator

hcho3 commented Jul 6, 2020

@jrhemstad Thanks for clarifying. Does RMM provide a replacement with semantics similar to the caching allocator cub::CachingDeviceAllocator? This page describes the caching semantics, and I'm not sure what the impact of replacing the caching allocator with RMM's sub-allocator would be.

EDIT. I just found rapidsai/rmm#290. RMM doesn't yet provide an equivalent of cub::CachingDeviceAllocator.

@hcho3
Collaborator

hcho3 commented Jul 6, 2020

@trivialfis @RAMitchell Can the use of cub::CachingDeviceAllocator be safely replaced with another allocator that has no caching semantics? Do you know why the caching allocator was chosen in the XGBoost codebase?

@trivialfis
Member

I don't think we need to replace the caching allocator. The Thrust allocator, on the other hand, might be worth thinking about.

@hcho3
Collaborator

hcho3 commented Jul 6, 2020

@trivialfis

> I don't think we need to replace the caching allocator.

Please elaborate on why. Does the caching allocator use only a small amount of GPU memory?

@trivialfis
Member

Yup. The allocator used for the host device vector is the Thrust allocator.

@hcho3
Collaborator

hcho3 commented Jul 6, 2020

How about the DMatrix? Does it also use the Thrust allocator? Then we can let users use RMM instead of the Thrust allocator and potentially achieve a memory saving.

@trivialfis
Member

Yup, DMatrix is backed by the Thrust allocator. Not sure about ellpack; I will check tomorrow.

@jrhemstad

jrhemstad commented Jul 6, 2020

> @jrhemstad Thanks for clarifying. Does RMM provide a replacement with semantics similar to the caching allocator cub::CachingDeviceAllocator? This page describes the caching semantics, and I'm not sure what the impact of replacing the caching allocator with RMM's sub-allocator would be.
>
> EDIT. I just found rapidsai/rmm#290. RMM doesn't yet provide an equivalent of cub::CachingDeviceAllocator.

Correct, we don't yet have a wrapper for CUB's caching allocator. As you found, I worked on this in rapidsai/rmm#290, but I quickly saw it was much slower than our pool allocator, so I didn't pursue it further.

For example, here are results from one of our benchmarks, which performs 1000 allocations/frees of uniformly random sizes in the range [1MB, Max Size] for a variety of maximum sizes. In this particular benchmark, an allocation is ~2x more likely than a deallocation, so peak memory consumption keeps growing until we reach 1000 allocations, at which point we free all remaining allocations. The graph shows the cumulative time of all 1000 allocations/frees.
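The allocation pattern described above can be sketched roughly as follows. This is not the RMM benchmark code; `random_allocations` and `BenchmarkStats` are hypothetical names, the allocator is passed in as a pair of callables so that any backend can be plugged in, and details such as which live block gets freed and the fixed seed are assumptions. Timing is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical summary of one benchmark run.
struct BenchmarkStats {
  std::size_t num_allocations = 0;
  std::size_t num_frees = 0;
  std::size_t peak_bytes = 0;
};

// Performs `num_allocations` allocations of uniformly random size in
// [1 MB, max_size]. At each step an allocation is twice as likely as a free.
// Once the allocation budget is used up, all remaining blocks are freed.
template <typename AllocFn, typename FreeFn>
BenchmarkStats random_allocations(AllocFn alloc, FreeFn free_fn,
                                  std::size_t num_allocations,
                                  std::size_t max_size, unsigned seed = 42) {
  const std::size_t min_size = std::size_t{1} << 20;  // 1 MB
  assert(max_size >= min_size);

  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> size_dist(min_size, max_size);
  std::uniform_int_distribution<int> op_dist(0, 2);  // 0, 1: allocate; 2: free

  struct Block { void* ptr; std::size_t size; };
  std::vector<Block> live;
  BenchmarkStats stats;
  std::size_t current_bytes = 0;

  while (stats.num_allocations < num_allocations) {
    const bool do_alloc = live.empty() || op_dist(rng) != 2;
    if (do_alloc) {
      const std::size_t size = size_dist(rng);
      live.push_back(Block{alloc(size), size});
      current_bytes += size;
      stats.peak_bytes = std::max(stats.peak_bytes, current_bytes);
      ++stats.num_allocations;
    } else {
      // Assumption: the block to free is picked uniformly at random.
      std::uniform_int_distribution<std::size_t> pick(0, live.size() - 1);
      const std::size_t i = pick(rng);
      free_fn(live[i].ptr);
      current_bytes -= live[i].size;
      live[i] = live.back();
      live.pop_back();
      ++stats.num_frees;
    }
  }
  // Free everything that is still live.
  for (const Block& b : live) {
    free_fn(b.ptr);
    ++stats.num_frees;
  }
  return stats;
}
```

To compare allocators, one would wrap each backend's allocate and free calls in the two callables and time the call to `random_allocations` for each maximum size.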

For the CUB caching allocator, I used the same parameters that are used in XGBoost here:

static cub::CachingDeviceAllocator *allocator = new cub::CachingDeviceAllocator(2, 9, 29);
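To make those three numbers concrete: assuming they are CUB's `bin_growth`, `min_bin`, and `max_bin` constructor arguments, the allocator caches blocks in bins of size 2^9 (512 B) through 2^29 (512 MB), rounds each request up to the nearest bin, and does not cache requests larger than the biggest bin. The sketch below mirrors that rounding rule in plain C++; `nearest_bin` and `BinInfo` are hypothetical helpers written for illustration, not part of CUB.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: how a request would be served.
struct BinInfo {
  bool cached;        // false when the request exceeds the largest bin
  std::size_t bytes;  // bytes reserved for the request
};

// Rounds `request` up to the nearest bin of size bin_growth^k with
// min_bin <= k <= max_bin. Requests beyond the largest bin are served
// at their exact size and are not cached.
inline BinInfo nearest_bin(std::size_t request, unsigned bin_growth,
                           unsigned min_bin, unsigned max_bin) {
  assert(bin_growth >= 2);
  unsigned bin = 0;
  std::size_t bin_bytes = 1;
  // Smallest power of bin_growth that can hold the request.
  while (bin_bytes < request) {
    bin_bytes *= bin_growth;
    ++bin;
  }
  if (bin > max_bin) {
    return BinInfo{false, request};
  }
  // Small requests are bumped up to the smallest bin.
  while (bin < min_bin) {
    bin_bytes *= bin_growth;
    ++bin;
  }
  return BinInfo{true, bin_bytes};
}
```

Under these assumptions, a request of 1 MB plus one byte would occupy a 2 MB bin, which is one way a caching scheme can hold more memory than was requested.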

[Figure: 1000 Uniform Random Allocations]

As you can see from this (limited) benchmark, the pool implementation is 100-10000x faster than CUB's caching scheme.

Code I used to run this is here: rapidsai/rmm#430

Benchmark code is here: https://github.com/rapidsai/rmm/blob/branch-0.15/benchmarks/random_allocations/random_allocations.cpp

@hcho3
Collaborator

hcho3 commented Jul 7, 2020

I think RMM could also give us more tools to manage and track memory usage.

@jrhemstad

> I think RMM could also give us more tools to manage and track memory usage.

Yes, RMM has several benefits beyond performance. Because it defines a single, common interface for device memory allocation, we can build additional tools around that interface: for example, logging, leak checking, thread safety, and benchmarks.
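A minimal sketch of how a common interface enables such tools is shown below. The `MemoryResource`, `HostResource`, and `LoggingResource` classes are hypothetical and use host memory; RMM's actual interface differs (for example, it deals with device memory and CUDA streams). The point is only that a wrapper can add logging and a simple leak check without the caller or the wrapped allocator changing.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>
#include <vector>

// Hypothetical common interface that every allocator implements.
class MemoryResource {
 public:
  virtual ~MemoryResource() = default;
  virtual void* allocate(std::size_t bytes) = 0;
  virtual void deallocate(void* p, std::size_t bytes) = 0;
};

// Hypothetical backend: plain host memory, standing in for a device allocator.
class HostResource : public MemoryResource {
 public:
  void* allocate(std::size_t bytes) override { return ::operator new(bytes); }
  void deallocate(void* p, std::size_t) override { ::operator delete(p); }
};

// Adaptor that records every call and tracks outstanding bytes.
// It works with any MemoryResource, which is what the shared interface buys.
class LoggingResource : public MemoryResource {
 public:
  explicit LoggingResource(MemoryResource* upstream) : upstream_(upstream) {}

  void* allocate(std::size_t bytes) override {
    void* p = upstream_->allocate(bytes);
    outstanding_ += bytes;
    log_.push_back("allocate " + std::to_string(bytes));
    return p;
  }

  void deallocate(void* p, std::size_t bytes) override {
    assert(outstanding_ >= bytes);
    upstream_->deallocate(p, bytes);
    outstanding_ -= bytes;
    log_.push_back("free " + std::to_string(bytes));
  }

  // A non-zero value at shutdown would indicate a leak.
  std::size_t outstanding_bytes() const { return outstanding_; }
  const std::vector<std::string>& log() const { return log_; }

 private:
  MemoryResource* upstream_;
  std::size_t outstanding_ = 0;
  std::vector<std::string> log_;
};
```

Code that allocates through a `MemoryResource*` can be handed a `LoggingResource` wrapping any backend, so the tooling composes with whichever allocator is in use.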
