Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fallback resource adaptor #1665

Draft
wants to merge 18 commits into
base: branch-24.10
Choose a base branch
from

Conversation

madsbk
Copy link
Member

@madsbk madsbk commented Sep 2, 2024

New resource adaptor that uses an alternate upstream resource when the primary throws a specified exception type.

The motivation here is to provide NO-OOM by using managed memory when the primary device resource runs out of memory.

Checklist

  • I am familiar with the Contributing Guidelines.
  • Python tests
  • C++ tests
  • The documentation is up to date with these changes.

@madsbk madsbk added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Sep 2, 2024
@github-actions github-actions bot added Python Related to RMM Python API cpp Pertains to C++ code labels Sep 2, 2024
@madsbk madsbk marked this pull request as ready for review September 2, 2024 15:18
@madsbk madsbk requested review from a team as code owners September 2, 2024 15:18
@madsbk madsbk requested review from harrism and bdice September 2, 2024 15:18
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's pre-empt #1661

Comment on lines 48 to 50
template <typename PrimaryUpstream,
typename AlternateUpstream,
typename ExceptionType = rmm::out_of_memory>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: We are moving away from having templated adaptors and instead just using type-erased resource_ref objects (see #1661). To avoid introducing a new such class here, can we immediately move to the new model:

template<typename ExceptionType = rmm::out_of_memory>
class failure_alternate_resource_adaptor final : public device_memory_resource {
   ...
   failure_alternate_resource_adaptor(device_async_resource_ref primary, device_async_resource_ref alternate) : ... {}
   

Or, if we can't type-erase yet, we should at least accept resource refs as an alternate constructor, and store resource_refs rather than templated type-specific usptreams.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in cbd7f43, which rely on implicit type conversion from device_memory_resource* to device_async_resource_ref

@harrism
Copy link
Member

harrism commented Sep 3, 2024

Devil's advocating: couldn't we implement this using failure_callback_resource_adaptor? The callback just needs to know about the alternate MR.

@madsbk
Copy link
Member Author

madsbk commented Sep 3, 2024

Devil's advocating: couldn't we implement this using failure_callback_resource_adaptor? The callback just needs to know about the alternate MR.

Not as-is. The callback function in failure_callback_resource_adaptor returns a boolean thus there is no direct way for the callback to return the alternate allocation back to failure_callback_resource_adaptor . Another issue is how to handle the deallocation, an allocation allocated by the alternate resource cannot be deallocated by the primary resource.

It is possible to write a new resource that can handle both failure_callback_resource_adaptor and failure_alternate_resource_adaptor but I think this is a case were two simple resources are better than a complex one.

@madsbk madsbk requested a review from wence- September 3, 2024 06:31
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly doc comments. However also needs C++ tests.

Comment on lines 94 to 95
* @throws `exception_type` if the requested allocation could not be fulfilled
* by the primary or the alternate upstream resource.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ExceptionType. But actually I don't think you can say what type of exception will be thrown by alternate_upstream if it fails to allocate. It could be a CUDA error, or it could be rmm::out_of_memory, or some other exception.

The alternate upstream should document what exceptions it may throw.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

python/rmm/rmm/_lib/memory_resource.pyx Outdated Show resolved Hide resolved
python/rmm/rmm/_lib/memory_resource.pyx Outdated Show resolved Hide resolved
python/rmm/rmm/_lib/memory_resource.pyx Outdated Show resolved Hide resolved
@madsbk madsbk requested a review from a team as a code owner September 3, 2024 11:08
@github-actions github-actions bot added the CMake label Sep 3, 2024
@madsbk madsbk requested a review from harrism September 3, 2024 11:08
@madsbk
Copy link
Member Author

madsbk commented Sep 3, 2024

Mostly doc comments. However also needs C++ tests.

Added C++ tests

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, good C++ tests.

tests/mr/device/failure_alternate_mr_tests.cpp Outdated Show resolved Hide resolved
Co-authored-by: Mark Harris <[email protected]>
@madsbk madsbk mentioned this pull request Sep 4, 2024
4 tasks
* @tparam ExceptionType The type of exception that this adaptor should respond to.
*/
template <typename ExceptionType = rmm::out_of_memory>
class failure_alternate_resource_adaptor final : public device_memory_resource {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: fallback_resource_adapater feels like a more concise name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, will rename.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@wence- wence- changed the title Failure alternate resource adaptor Fallback resource adaptor Sep 6, 2024
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with request to rename test files appropriately.

tests/mr/device/failure_alternate_mr_tests.cpp Outdated Show resolved Hide resolved
tests/CMakeLists.txt Outdated Show resolved Hide resolved
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're still discussing the use cases of this feature offline -- for now, I'm requesting changes so that we don't merge this with typos in the API and filename name.

include/rmm/mr/device/fallback_resource_adapater.hpp Outdated Show resolved Hide resolved
include/rmm/mr/device/fallback_resource_adapater.hpp Outdated Show resolved Hide resolved
@madsbk madsbk marked this pull request as draft September 11, 2024 13:30
@madsbk
Copy link
Member Author

madsbk commented Sep 11, 2024

Let's put this on hold until we get some more use cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cpp Pertains to C++ code improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Related to RMM Python API
Projects
Status: Review
Development

Successfully merging this pull request may close these issues.

5 participants