Add default pinned pool that falls back to new pinned allocations #15665

Merged
merged 44 commits into rapidsai:branch-24.06 on May 20, 2024

Conversation

Contributor

@vuule vuule commented May 6, 2024

Description

Issue #15612

Adds a pooled pinned memory resource that is created on the first call to get_host_memory_resource or set_host_memory_resource.
The pool has a fixed size: 0.5% of the device memory capacity, capped at 100MB. At 100MB, the pool takes ~30ms to initialize. The pool size can be overridden with the environment variable LIBCUDF_PINNED_POOL_SIZE.
If an allocation cannot be satisfied from the pool, a new pinned allocation is performed instead.
The allocator uses a stream from the global stream pool to initialize the pool and to perform synchronous operations (allocate/deallocate). Users of the resource don't need to be aware of this implementation detail, as these operations synchronize before they complete.
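
A minimal sketch of the fallback behavior described above; the helper names (pool_try_allocate, allocate_pinned) are illustrative, not the classes added by this PR:

#include <cuda_runtime_api.h>

#include <cstddef>

// Hypothetical pool allocation path: returns nullptr when the request cannot be
// served from the fixed-size pinned pool (stubbed out here for illustration).
void* pool_try_allocate(std::size_t /*bytes*/) { return nullptr; }

// Try the pool first; if the request does not fit, fall back to a brand-new pinned allocation.
void* allocate_pinned(std::size_t bytes)
{
  if (auto* ptr = pool_try_allocate(bytes)) { return ptr; }  // fast path: suballocate from the pool
  void* ptr = nullptr;
  if (cudaHostAlloc(&ptr, bytes, cudaHostAllocDefault) != cudaSuccess) { return nullptr; }
  return ptr;
}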

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added Performance Performance related issue non-breaking Non-breaking change labels May 6, 2024
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 6, 2024
@vuule vuule added the feature request New feature or request label May 6, 2024
/**
* @brief Get the global stream pool.
*/
cuda_stream_pool& global_cuda_stream_pool();
Contributor Author

I had to expose the pool to get a stream from it without forking.
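
For illustration, a sketch of how the exposed pool is used; the header path and the get_stream() accessor are assumptions about the existing cuda_stream_pool interface:

#include <cudf/detail/utilities/stream_pool.hpp>

void init_pinned_pool_on_pool_stream()
{
  // Grab one of the pool's streams for the pinned resource's internal setup work.
  auto stream = cudf::detail::global_cuda_stream_pool().get_stream();
  // ... run the pool's allocate/initialize on `stream` ...
  stream.synchronize();  // synchronize so callers never observe the internal stream
}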


size_t free{}, total{};
cudaMemGetInfo(&free, &total);
// 0.5% of the total device memory, capped at 100MB
Contributor Author

Allocating a 100MB pool takes ~30ms on my system. This is smaller than CUDA runtime init (~60ms) and cuFile/kvikio init (~180ms), so IMO this won't significantly impact user experience.

@vuule vuule marked this pull request as ready for review May 8, 2024 18:34
@vuule vuule requested a review from a team as a code owner May 8, 2024 18:35
@vuule vuule requested a review from bdice May 8, 2024 18:35
/**
* @brief Configure the size of the default host memory resource.
*
* Must be called before any other function in this header.
Contributor

@ttnghia ttnghia May 13, 2024

This is a dangerous requirement, and it may not be satisfied. How about making the static function re-configurable?

Contributor

@ttnghia ttnghia May 13, 2024

To do so:

  1. Declare the static variable outside of function scope (but in an anonymous namespace, so it is still static within this TU). In addition, it can be a smart pointer.
  2. host_mr initializes it with a std::nullopt size if it is nullptr; otherwise it just dereferences the current pointer and returns.
  3. The user can specify a size parameter to recompute and overwrite that static variable with a new mr.
  4. All of these operations should be thread safe (a sketch follows below).
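
A minimal sketch of this re-configurable approach, assuming a placeholder pinned_pool_mr type; the real resource type, naming, and locking in libcudf may differ:

#include <cstddef>
#include <memory>
#include <mutex>
#include <optional>

namespace {

// Placeholder for the real pooled pinned resource type.
struct pinned_pool_mr {
  explicit pinned_pool_mr(std::size_t size) : size_{size} {}
  std::size_t size_;
};

std::mutex mr_mutex;                         // guards the static state below
std::unique_ptr<pinned_pool_mr> default_mr;  // created lazily, replaceable

constexpr std::size_t default_pool_size = std::size_t{100} * 1024 * 1024;  // 100MB cap from the description

}  // namespace

pinned_pool_mr& host_mr(std::optional<std::size_t> size = std::nullopt)
{
  std::scoped_lock lock{mr_mutex};
  if (default_mr == nullptr || size.has_value()) {
    // (Re)create the resource with the requested size, or the default if none was given.
    default_mr = std::make_unique<pinned_pool_mr>(size.value_or(default_pool_size));
  }
  return *default_mr;
}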

Contributor Author

To clarify, the issue with calling config after get/set is that it would have no effect.

Allowing this opens another can of worms, e.g. what is the intended effect of calling config after set?

Contributor

@ttnghia ttnghia May 14, 2024

If we don't allow this, let's add a validity check to prevent it from being accidentally misused. It sounds unsafe to just rely on an assumption.

Contributor Author

@abellina what behavior do you suggest when config is called after the first resource use? I'm not sure if we should throw or just warn.

Contributor

I think we should throw; I agree with @ttnghia that we should do something in that case.

Contributor Author

Added a mechanism to throw if config is called after the default resource has already been created.
@abellina it might be good to test your branch with this change.

Contributor

@abellina abellina left a comment

@vuule thank you for adding the configuration API. I have a local branch, which I will PR once this goes in, that pipes that API through our JNI layer so we can set it from the plugin.

  1. I tested this as-is and saw both pinned allocations (once for the default pinned pool, and once for ours).
  2. With the config set to 0, I only see one allocation (ours).
  3. I re-verified that the default resource works with a 0-size allocation: I see cudaHostAlloc.
  4. I also verified that we use our resource if we set it, as it used to work before.

LGTM

cpp/src/io/utilities/config_utils.cpp (review thread outdated, resolved)
{
static rmm::mr::pinned_host_memory_resource default_mr{};
return default_mr;
static rmm::host_async_resource_ref mr_ref = make_default_pinned_mr(size);
Contributor

Perhaps this was asked before but I'm curious when/how this object is destroyed?
Is it destroyed automatically when the process ends i.e. after main() completes?
Are there any CUDA API calls in the destructor(s)?
Maybe this is ok for host memory resources.

Contributor Author

Great question. Currently the pool itself is not destroyed, as destroying it caused a segfault at the end of some tests, presumably because of the call to cudaFreeHost after main(). This is something I should revisit to verify what exactly the issue was.

Contributor Author

Yeah, we can't destroy a static pool resource object. Open to suggestions for avoiding the pool leak.
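
A sketch of the usual workaround, the "intentional leak": heap-allocate the pool behind a function-local static pointer and never delete it, so no destructor (and no cudaFreeHost) runs after main(). The pooled_pinned_resource type here is a placeholder, not the class in this PR:

#include <cstddef>

// Placeholder; the real destructor would call cudaFreeHost, which is unsafe once
// the CUDA runtime has shut down at process exit.
struct pooled_pinned_resource {
  explicit pooled_pinned_resource(std::size_t size) : size_{size} { /* cudaHostAlloc here */ }
  ~pooled_pinned_resource() { /* cudaFreeHost here */ }
  std::size_t size_;
};

pooled_pinned_resource& default_pinned_pool(std::size_t size)
{
  // Heap-allocate and never delete: the destructor never runs at process exit,
  // at the cost of a deliberate, one-time, process-lifetime leak.
  static auto* pool = new pooled_pinned_resource(size);
  return *pool;
}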

static rmm::mr::pinned_host_memory_resource default_mr{};
return default_mr;
static rmm::host_async_resource_ref* mr_ref = nullptr;
CUDF_EXPECTS(mr_ref == nullptr, "The default host memory resource has already been created");
Contributor

This does not work as intended:

  1. Call config_default_host_memory_resource.
  2. Call set_host_memory_resource(cudf_jni_resource), or skip this step; it fails either on the set_host_memory_resource call or when the first code path calls get_host_memory_resource.
  3. We see the Parquet read blowing up with "The default host memory resource has already been created".

I think I know why:

  1. config_default_host_memory_resource calls make_host_mr, which sets mr_ref to a non-null value.
  2. set_host_memory_resource(cudf_jni_resource) calls host_mr, whose mr_ref is NOT set yet, so it has to call make_host_mr, and this blows up. If you don't set a custom resource, get_host_memory_resource also calls host_mr and blows up the same way.

I think we need to go through host_mr in all code paths, so that mr_ref is also set when config_default_host_memory_resource is called. I think this implies adding an optional size to host_mr.
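
A sketch of routing everything through host_mr with an optional size; the host_resource type and the names below are illustrative, not the code that was ultimately merged:

#include <cstddef>
#include <optional>

namespace {

// Placeholder for the real pinned-pool resource.
struct host_resource {
  explicit host_resource(std::size_t /*size*/) {}
};

constexpr std::size_t default_size = std::size_t{100} * 1024 * 1024;  // 100MB cap from the description

}  // namespace

host_resource& host_mr(std::optional<std::size_t> size = std::nullopt)
{
  // The function-local static is created exactly once, on the first call, using
  // whatever size that first call provides (or the default if none was given).
  static host_resource mr{size.value_or(default_size)};
  return mr;
}

// config_default_host_memory_resource, get_host_memory_resource and
// set_host_memory_resource would all go through host_mr(), so whichever of them
// is called first creates the resource.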

Contributor

I think we may need to use some sort of out parameter passed to host_mr that says whether the resource was initialized by that call. If we do end up creating a resource, we would indicate that by setting the out parameter to true (e.g. did_initialize). Then in configure_default_host_memory_resource we could detect !did_initialize and throw there?
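
A small sketch of the out-parameter idea, reusing the placeholder host_resource from the sketch above; did_initialize and the exception are illustrative (the real code would presumably use CUDF_EXPECTS):

#include <cstddef>
#include <optional>
#include <stdexcept>

// Declaration only: did_initialize reports whether this particular call created the resource.
host_resource& host_mr(std::optional<std::size_t> size, bool* did_initialize);

void configure_default_host_memory_resource(std::size_t size)
{
  bool did_initialize = false;
  host_mr(size, &did_initialize);
  // If the resource already existed, this configuration call had no effect: throw.
  if (!did_initialize) {
    throw std::logic_error("The default host memory resource has already been created");
  }
}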

Contributor Author

Updated the logic; it seems to work in local tests (tested both cases this time 😅).
@abellina please rerun the tests and let me know. I'll also try to come up with unit tests before merging.

@abellina abellina self-requested a review May 15, 2024 14:40
Contributor

@abellina abellina left a comment

The new throw logic isn't working as expected; please see my comment.

@vuule vuule requested a review from abellina May 16, 2024 16:45
Contributor

@vyasr vyasr left a comment

Approving; thanks for the offline discussions.

@vuule
Contributor Author

vuule commented May 20, 2024

/merge

@rapids-bot rapids-bot bot merged commit 58f4526 into rapidsai:branch-24.06 May 20, 2024
71 checks passed
@vuule vuule deleted the perf-defaul-piined-pool branch May 20, 2024 22:28
rapids-bot bot pushed a commit that referenced this pull request May 21, 2024
This PR depends on #15665 and so it won't build until that PR merges.

Adds support for `cudf::io::config_host_memory_resource`, which is being worked on in #15665. In 24.06 we are going to disable the cuDF pinned pool and look into this more in 24.08.

We currently have a pinned pooled resource that has been set up to share pinned memory with other APIs we use from Java, so we wanted to prevent extra pinned memory from being created by default, and @vuule has added an API for us to call to accomplish this.

Authors:
  - Alessandro Bellina (https://github.com/abellina)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)

URL: #15745
pool_->deallocate_async(pool_begin_, pool_size_, stream_);
}

void* do_allocate_async(std::size_t bytes, std::size_t alignment, cuda::stream_ref stream)
Member

I'm late, but these do_ versions should probably be protected/private?
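
For reference, a sketch of the conventional std::pmr-style layering the comment alludes to (the resource in this PR may expose its interface differently): the public, non-virtual allocate() forwards to a private do_allocate() hook.

#include <cstddef>

class example_host_resource {
 public:
  void* allocate(std::size_t bytes, std::size_t alignment)
  {
    return do_allocate(bytes, alignment);  // public API delegates to the private hook
  }

 private:
  // Implementations override the hook; callers never invoke it directly.
  virtual void* do_allocate(std::size_t bytes, std::size_t alignment) = 0;
};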

Comment on lines +222 to +225
size_t free{}, total{};
CUDF_CUDA_TRY(cudaMemGetInfo(&free, &total));
// 0.5% of the total device memory, capped at 100MB
return std::min(total / 200, size_t{100} * 1024 * 1024);
Member

I'm late, but this should use rmm::percent_of_free_device_memory. That function only takes an integer percent. If you need a decimal percent, please file an issue. Or you can just use 1% and then divide by 2.

Member

Or at least use rmm::available_device_memory().

Contributor Author

Yes, we could use rmm::available_device_memory() to get the memory capacity. I'll address this in 24.08.
If there's a plan to add percent_of_total_device_memory, that would be even better.
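
For reference, a sketch of what the size computation could look like with rmm::available_device_memory(), assuming it returns a {free, total} pair as suggested above (this is not the code in this PR):

#include <rmm/cuda_device.hpp>

#include <algorithm>
#include <cstddef>

std::size_t default_pinned_pool_size()
{
  auto const [free, total] = rmm::available_device_memory();
  (void)free;  // only the total capacity is needed here
  // 0.5% of the total device memory, capped at 100MB.
  return std::min(total / 200, std::size_t{100} * 1024 * 1024);
}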

Member

It already exists.

Labels

  • 3 - Ready for Review (Ready for review by team)
  • feature request (New feature or request)
  • libcudf (Affects libcudf (C++/CUDA) code.)
  • non-breaking (Non-breaking change)
  • Performance (Performance related issue)

7 participants