Add default pinned pool that falls back to new pinned allocations #15665
Conversation
```cpp
/**
 * @brief Get the global stream pool.
 */
cuda_stream_pool& global_cuda_stream_pool();
```
had to expose the pool to get a stream from it without forking
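A usage sketch of why exposing the pool helps; the header path and the `get_stream()` accessor are assumptions about cudf's stream-pool interface, not quoted from the PR:

```cpp
#include <cudf/detail/utilities/stream_pool.hpp>

#include <rmm/cuda_stream_view.hpp>

// Sketch: borrow a stream from the now-exposed global pool (assuming a
// get_stream() accessor) for the allocator's internal operations.
void internal_pinned_op()
{
  rmm::cuda_stream_view stream = cudf::detail::global_cuda_stream_pool().get_stream();
  // ... enqueue internal copies/initialization on `stream` ...
  stream.synchronize();  // complete before returning control to the caller
}
```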
```cpp
size_t free{}, total{};
cudaMemGetInfo(&free, &total);
// 0.5% of the total device memory, capped at 100MB
```
Allocating a 100MB pool takes 30ms on my system. This is smaller than CUDA runtime init (~60ms) and cuFile/kvikio init (180ms), so IMO this won't significantly impact user experience.
```cpp
/**
 * @brief Configure the size of the default host memory resource.
 *
 * Must be called before any other function in this header.
```
This is a dangerous requirement, and it may not be satisfied. How about making the static function re-configurable?
For doing so (see the sketch after this list):
- The static variable is declared outside of function scope (but in an anonymous namespace, so it is static inside just this TU). In addition, it can be a smart pointer. `host_mr` will initialize it with `std::nullopt` size if it is `nullptr`, otherwise it just derefs the current pointer and returns.
- The user can specify a size parameter to recompute and overwrite that static variable with a new mr.
- All these ops should be thread safe.
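A minimal sketch of that pattern; the `pinned_pool_mr` type and the `make_default_pinned_mr` factory are stand-ins, not the actual implementation:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <optional>

struct pinned_pool_mr { /* stand-in for the real pooled pinned resource */ };

// Assumed factory: builds a pool of the given size (or a default size).
std::unique_ptr<pinned_pool_mr> make_default_pinned_mr(std::optional<std::size_t> size);

namespace {  // TU-local state, as suggested above
std::mutex mr_mutex;  // makes initialization and reconfiguration thread safe
std::unique_ptr<pinned_pool_mr> mr_ptr;
}  // namespace

// First call lazily builds the default pool; passing a size recomputes
// and overwrites the static variable with a new mr.
pinned_pool_mr& host_mr(std::optional<std::size_t> size = std::nullopt)
{
  std::lock_guard<std::mutex> lock{mr_mutex};
  if (mr_ptr == nullptr || size.has_value()) { mr_ptr = make_default_pinned_mr(size); }
  return *mr_ptr;
}
```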
To clarify, the issue with calling config after get/set is that it would have no effect.
Allowing this opens another can of worms, e.g. what is the intended effect of calling config after set?
If we don't allow this, let's add a validity check to prevent it from being accidentally misused. It sounds unsafe if we just rely on an assumption.
@abellina what behavior do you suggest when config is called after the first resource use? I'm not sure if we should throw or just warn.
I think we should throw, I agree with @ttnghia that we should do something in that case.
Added a mechanism to throw if config is called after the default resource has already been created.
@abellina might be good to test your branch with this change.
@vuule thank you for adding the configuration API. I have a local branch, which I'll PR once this goes in, that pipes that API through our JNI layer so we can set it from the plugin.
- I tested this as is and saw both pinned allocations (once for the default pinned pool, and once for ours).
- With the config set to 0 I only see 1 allocation (ours)
- I re-verified that the default resource works with a 0 size allocation: I see cudaHostAlloc.
- I also verified that we use our resource if we set it, as it used to work before.
LGTM
```diff
 {
-  static rmm::mr::pinned_host_memory_resource default_mr{};
-  return default_mr;
+  static rmm::host_async_resource_ref mr_ref = make_default_pinned_mr(size);
```
Perhaps this was asked before, but I'm curious when/how this object is destroyed?
Is it destroyed automatically when the process ends, i.e. after `main()` completes?
Are there any CUDA API calls in the destructor(s)?
Maybe this is ok for host memory resources.
Great question. Currently the pool itself is not destroyed, as destroying it caused a segfault at the end of some tests, presumably because of the call to `cudaFreeHost` after `main()`. But this is something I should revisit and verify what exactly the issue was.
Yeah, can't destroy a static pool resource object. Open to suggestions to avoid the pool leak.
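One common workaround (a sketch only, not what this PR does): heap-allocate the pool behind a function-local static pointer and deliberately never delete it, so no CUDA call such as `cudaFreeHost` runs during static destruction after `main()`. The `host_pooled_mr` alias is an assumption about the underlying rmm types:

```cpp
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/pinned_host_memory_resource.hpp>

#include <cstddef>

using host_pooled_mr = rmm::mr::pool_memory_resource<rmm::mr::pinned_host_memory_resource>;

// Intentionally leak the pool: its destructor (which would free pinned
// memory, possibly after CUDA has shut down) never runs at exit.
host_pooled_mr& default_pooled_mr(std::size_t pool_size)
{
  static auto* upstream = new rmm::mr::pinned_host_memory_resource{};
  static auto* pool     = new host_pooled_mr{upstream, pool_size};  // never deleted
  return *pool;
}
```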
```diff
-  static rmm::mr::pinned_host_memory_resource default_mr{};
-  return default_mr;
+  static rmm::host_async_resource_ref* mr_ref = nullptr;
+  CUDF_EXPECTS(mr_ref == nullptr, "The default host memory resource has already been created");
```
This does not work as intended.
- Call `config_default_host_memory_resource`.
- Call `set_host_memory_resource(cudf_jni_resource)` or skip this step; it fails either on the `set_host_memory_resource` call or when the first code tries to call `get_host_memory_resource`.
- We see parquet read blowing up with "The default host memory resource has already been created".

I think I know why:
- `config_default_host_memory_resource` calls `make_host_mr`, which sets `mr_ref` to not `nullptr`.
- Calling `set_host_memory_resource(cudf_jni_resource)` calls `host_mr`, whose `mr_ref` is NOT set yet, so it has to call `make_host_mr`, and this blows up. If you don't set a custom resource, `get_host_memory_resource` also calls `host_mr`, blowing up.

I think we need to go through `host_mr` in all code paths. That way `mr_ref` is set when `config_default_host_memory_resource` is called as well. I think this implies adding an optional size to `host_mr`.
I think we may need to use some sort of out param that gets passed to `host_mr` that says whether it was initialized or not. If we do end up creating a resource, we would indicate that by setting the out param to true (e.g. `did_initialize`). Then in `configure_default_host_memory_resource` we could detect `!did_initialize` and throw there? (See the sketch below.)
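A minimal sketch of the two suggestions combined: every path goes through `host_mr`, which reports whether this particular call created the resource. All names are illustrative, not the merged implementation:

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <stdexcept>

struct host_mr_t { /* stand-in for the real resource type */ };

// Single entry point for the config/get/set paths. `did_initialize`,
// when non-null, is set to true only if this call created the resource.
host_mr_t& host_mr(std::optional<std::size_t> size = std::nullopt,
                   bool* did_initialize            = nullptr)
{
  static std::mutex mtx;
  static host_mr_t* mr = nullptr;
  std::lock_guard<std::mutex> lock{mtx};
  if (mr == nullptr) {
    mr = new host_mr_t{/* honor `size` if provided */};
    if (did_initialize != nullptr) { *did_initialize = true; }
  }
  return *mr;
}

// Configuring is only valid if it is the call that creates the resource.
void configure_default_host_memory_resource(std::size_t size)
{
  bool did_initialize = false;
  host_mr(size, &did_initialize);
  if (!did_initialize) {
    throw std::logic_error("The default host memory resource has already been created");
  }
}
```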
updated the logic; seems to work in local tests (tested both cases this time 😅 )
@abellina Please rerun tests and let me know. I'll also try to come up with unit tests before merging.
The new throw logic isn't working as expected, please see my comment.
Approving, thanks for some offline discussions.
/merge
This PR depends on #15665 and so it won't build until that PR merges. Adds support for `cudf::io::config_host_memory_resource`, which is being worked on in #15665. In 24.06 we are going to disable the cuDF pinned pool and look into this more in 24.08. We currently have a pinned pooled resource that has been set up to share pinned memory with other APIs we use from Java, so we wanted to prevent extra pinned memory being created by default, and @vuule has added an API for us to call to accomplish this.

Authors:
- Alessandro Bellina (https://github.com/abellina)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Nghia Truong (https://github.com/ttnghia)

URL: #15745
```cpp
  pool_->deallocate_async(pool_begin_, pool_size_, stream_);
}

void* do_allocate_async(std::size_t bytes, std::size_t alignment, cuda::stream_ref stream)
```
I'm late, but these `do_` versions should probably be protected/private?
```cpp
size_t free{}, total{};
CUDF_CUDA_TRY(cudaMemGetInfo(&free, &total));
// 0.5% of the total device memory, capped at 100MB
return std::min(total / 200, size_t{100} * 1024 * 1024);
```
I'm late, but this should use `rmm::percent_of_free_device_memory`. That function only takes an integer percent; if you need a decimal percent, please file an issue. Or you can just use 1% and then divide by 2.
Or at least use `rmm::available_device_memory()`.
Yes, we could be using `rmm::available_device_memory()` to get the memory capacity. I'll address this in 24.08.
If there's a plan to add `percent_of_total_device_memory`, that would be even better.
It already exists.
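For reference, a sketch of the suggested change, keeping the 0.5%-of-total-capped-at-100MB computation from the snippet above but sizing from `rmm::available_device_memory()` (the function name `default_pinned_pool_size` is illustrative):

```cpp
#include <rmm/cuda_device.hpp>

#include <algorithm>
#include <cstddef>

// 0.5% of total device memory, capped at 100MB, without a raw
// cudaMemGetInfo call.
std::size_t default_pinned_pool_size()
{
  auto const [free_mem, total_mem] = rmm::available_device_memory();
  (void)free_mem;  // only the total capacity is needed here
  return std::min(total_mem / 200, std::size_t{100} * 1024 * 1024);
}
```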
Description

Issue #15612

Adds a pooled pinned memory resource that is created on the first call to `get_host_memory_resource` or `set_host_memory_resource`.

The pool has a fixed size: 0.5% of the device memory capacity, limited to 100MB. At 100MB, the pool takes ~30ms to initialize. The size of the pool can be overridden with the environment variable `LIBCUDF_PINNED_POOL_SIZE`.

If an allocation cannot be done within the pool, a new pinned allocation is performed.
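A sketch of this fallback behavior; the class and member names are illustrative stand-ins, not the PR's actual implementation:

```cpp
#include <rmm/mr/pinned_host_memory_resource.hpp>

#include <cstddef>

// Illustrative fixed-size pinned resource that falls back to brand-new
// pinned allocations when the pool cannot serve a request.
class fixed_pinned_pool_sketch {
 public:
  void* allocate(std::size_t bytes, std::size_t alignment)
  {
    if (bytes <= pool_free_bytes_) {
      return allocate_from_pool(bytes, alignment);  // pooled path (elided)
    }
    // Fallback: fresh pinned allocation outside the pool. A real
    // implementation must remember the origin to deallocate correctly.
    return fallback_.allocate(bytes, alignment);
  }

 private:
  void* allocate_from_pool(std::size_t bytes, std::size_t alignment);
  rmm::mr::pinned_host_memory_resource fallback_{};
  std::size_t pool_free_bytes_{};
};
```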
The allocator uses a stream from the global stream pool to initialize the pool and perform synchronous operations (`allocate`/`deallocate`). Users of the resource don't need to be aware of this implementation detail, as these operations synchronize before they are completed.

Checklist