Add Statistics Resource Adaptor and cython bindings to tracking_resource_adaptor and statistics_resource_adaptor #626
Please update the changelog in order to start CI tests. View the gpuCI docs here.
If you must have a reset method, why not just reset the peak and total, but not the current allocations? This way the current values can never disagree with the map. The ideal usage should be to reset after freeing all outstanding allocations, in which case resetting the peak and total would make all things zero.
I had considered this but I don't think it would work. Since the peak value is just
I see what you are saying, but in our test suite it's difficult to free all GPU memory between tests due to dataset fixtures and shared data between tests. When using this memory resource we created a pytest plugin that will:
This allows us to determine the memory usage per test (important for running tests in parallel), and find memory leaks on a per test basis if the current allocation count doesn't return to 0. With all that in mind, I would prefer to keep the reset functionality in there somewhere to allow our pytest plugin to work correctly. But I'm open to other ideas. There are more complex designs that could certainly work, I have just avoided them to keep the design simple and not impact performance.
Python / Cython lgtm, thanks!
FWIW, I think tests should start with a new memory resource each time the fixture is created. gtests do this. Therefore the memory is completely freed up between tests. This enables tests to be completely independent and not interfere with each other. I don't feel OK with a current usage that can be negative.
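To make the concern about negative usage concrete, here is a minimal sketch using a toy counter. `ByteCounter`, `reset_all`, and `reset_peak_only` are hypothetical names for illustration only and are not part of RMM; in particular, having the peak-only reset drop the peak to the current value is an assumption, not something the thread specifies.

```python
class ByteCounter:
    """Toy stand-in for an adaptor's byte counters (hypothetical, not the RMM API)."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    def allocate(self, nbytes):
        self.current += nbytes
        self.peak = max(self.peak, self.current)

    def deallocate(self, nbytes):
        self.current -= nbytes

    def reset_all(self):
        # Zeroes every counter, even if allocations are still outstanding.
        self.current = 0
        self.peak = 0

    def reset_peak_only(self):
        # Assumption: the peak restarts from the live total, `current` is untouched.
        self.peak = self.current


def run(reset):
    counter = ByteCounter()
    counter.allocate(1024)    # shared fixture data that outlives the reset
    reset(counter)
    counter.allocate(256)     # allocation made by the "test"
    counter.deallocate(256)
    counter.deallocate(1024)  # fixture is finally freed
    return counter.current, counter.peak


full = run(ByteCounter.reset_all)           # current ends up negative
partial = run(ByteCounter.reset_peak_only)  # current returns to zero
```

With a full reset, the later free of the fixture data drives the current byte count below zero. With the peak-only reset the current count stays consistent, but the reported peak includes the fixture's bytes.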
Co-authored-by: Rong Ou <[email protected]>
Got it. I'll see if I can work on another design which works for the cuML team that doesn't allow negative usage numbers.
I was hesitant to create a new MR for each test since some of our tests configure the MR directly and I wasn't sure the impact this would have. Additionally, some of the resources held between tests may be resized within a test which would not be tracked by a new MR. I'll do some experimentation and see what impact creating a new MR for each test has on the results we are looking for. Thanks for the feedback.
@mdemoret-nv you may want to look at what we do in libcudf gtests as well.
…racking-resource-adaptor
…racking-resource-adaptor
@harrism I incorporated your feedback and have updated the design to remove the Instead, I changed the design to store an internal "stack" of tracked values that you can then push/pop as needed. This doesn't eliminate the possibility of getting negative current allocations (which is technically possible with any I've tested this in the cuML pytest suite and it works well and does what we need in order to find memory leaks and track memory usage per test. It would be great to get this re-reviewed. In addition, I have a few outstanding questions I could use your input on:
Let me know if you have any questions. Thanks for your feedback so far.
I don't like the whole stack thing, it's overcomplicated. If you want to reset the tracking without resetting the upstream memory resource, then delete the tracking_resource_adaptor, and create a new one with the same upstream as the old one.
Keep the tracking simple. No need for a struct of counts, a stack, or any of that.
 * @brief Returns an allocation_counts struct for this adaptor containing the
 * total current, peak, and total number of bytes and allocation count for
 * this adaptor regardless of any push/popped allocation_counts. Note: Because
 * its possible to change memory resources at any time while maintaining the
 * same upstream memory resource, its possible to have a negative allocation
 * bytes or count if the number of deallocate() calls is greater than the
 * number of allocate().
I don't understand the note, or the reason for a stack. You can't change the upstream for a resource_adaptor; you would have to create a new resource adaptor. You can't have more deallocations than allocations without an exception. This seems way overcomplicated.
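The reviewer's alternative, dropping the adaptor and building a fresh one over the same upstream, can be sketched with toy classes. `ToyUpstream` and `ToyTrackingAdaptor` are hypothetical stand-ins, not RMM types, and raising on an unknown pointer models the reviewer's statement rather than verified library behavior.

```python
class ToyUpstream:
    """Hypothetical upstream resource that hands out integer 'pointers'."""

    def __init__(self):
        self._next_ptr = 1

    def allocate(self, nbytes):
        ptr = self._next_ptr
        self._next_ptr += 1
        return ptr

    def deallocate(self, ptr, nbytes):
        pass  # nothing to release in this toy model


class ToyTrackingAdaptor:
    """Hypothetical adaptor: the upstream is fixed at construction time."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.outstanding = {}  # ptr -> size in bytes

    def allocate(self, nbytes):
        ptr = self.upstream.allocate(nbytes)
        self.outstanding[ptr] = nbytes
        return ptr

    def deallocate(self, ptr, nbytes):
        # Assumption: freeing an untracked pointer is reported as an error,
        # so deallocations can never outnumber allocations silently.
        if ptr not in self.outstanding:
            raise KeyError(f"pointer {ptr} is not tracked by this adaptor")
        del self.outstanding[ptr]
        self.upstream.deallocate(ptr, nbytes)


upstream = ToyUpstream()
first = ToyTrackingAdaptor(upstream)
p = first.allocate(64)
first.deallocate(p, 64)

try:
    first.deallocate(p, 64)  # second free of the same pointer
    double_free_raised = False
except KeyError:
    double_free_raised = True

# "Resetting" the tracking: a new adaptor over the same upstream.
second = ToyTrackingAdaptor(upstream)
q = second.allocate(32)
```

The second adaptor starts with an empty map while the upstream keeps its own state, which is the reset behavior the reviewer describes without any stack of counters.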
I'm very sorry for the delay in reviewing @mdemoret-nv. Holidays and then life (outside work) have gotten in the way for the past week+.
Should we push this to 0.18 to avoid changing |
auto allocated_bytes = found->second.allocation_size;

if (allocated_bytes != bytes) {
  // Don't throw but log an error. Throwing in a descructor (or any noexcept) will call
// Don't throw but log an error. Throwing in a descructor (or any noexcept) will call
// Don't throw but log an error. Throwing in a destructor (or any noexcept) will call
I still think statistics and leak detection should be in two separate memory resource adaptors. Moving to 21.08
…nto a new adaptor statistics_resource_adapter
The latest changes make me happy! Nice separation of concerns. Just a couple of comments.
@@ -81,6 +81,7 @@ endif(CUDA_STATIC_RUNTIME)

target_link_libraries(rmm INTERFACE rmm::Thrust)
target_link_libraries(rmm INTERFACE spdlog::spdlog_header_only)
target_link_libraries(rmm INTERFACE dl)
Is this needed? Quick google shows that dladdr now lives in libc rather than libdl?
With stack traces enabled, this was needed to compile the tests (original comment). Keith and I briefly discussed this here: #626 (comment).
Can you send me the link where you saw that dladdr has moved? All I am seeing from this link is:
Link with -ldl.
Never mind, the docs I found were not for Linux; they were for Solaris and something called illumos. As I said, it was a quick google.
Have you tested that all is well in libcudf, cuML, etc. when this library is linked here? Note that the other target_link_libraries for RMM are all header-only, which is why this one has me worried (RMM is a header-only library).
@mdemoret reports cuML builds and tests fine against this PR.
Nice work!
Need to update the title and description before merging.
The following is the new description for this PR since it has changed so much during development. Leaving the PR description (aka the first comment) in place to preserve history, for now. New description: This PR updates the C++
These two MRs can be used separately or together to track memory allocations, check for memory leaks, and identify incorrect deallocations. While both MRs can track the current number of allocations, they have different areas of focus.
The updated description needs to be in the first (original) comment or our merge scripts won't copy the right description into the changelog.
@gpucibot merge
Closes #622 and Closes #623

This PR updates the C++ tracking_resource_adaptor with stack trace information and also adds a new MR, statistics_resource_adaptor. Summary of all changes:

- tracking_resource_adaptor changes:
  - rmm.mr.TrackingResourceAdaptor which wraps all available methods
  - tracking_resource_adaptor to correctly log stack trace information with capture_stacks=True
- statistics_resource_adaptor memory resource:
  - rmm.mr.StatisticsResourceAdaptor which wraps all available methods

These two MRs can be used separately or together to track memory allocations, check for memory leaks, and identify incorrect deallocations. While both MRs can track the current number of allocations, they have different areas of focus.

The tracking_resource_adaptor is designed more towards identifying and fixing memory leaks, and will log stack trace information for every memory allocation. This MR will have significant performance impacts since it logs a large amount of information for every allocation.

The statistics_resource_adaptor is a lightweight MR that adds simple counters to track the allocated bytes and allocation count. This MR will have significantly less of a performance impact but cannot identify the cause of memory leaks, only that they exist. This MR is also great at tracking peak memory usage and can be helpful in identifying areas that require large amounts of memory or helping developers measure memory usage reductions during optimization.
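The split between the two adaptors can be illustrated with a small pure-Python model. This is only a sketch: `ToyUpstream`, `ToyStatistics`, and `ToyTracker` are hypothetical classes that mimic the described behavior and do not call the real rmm.mr.StatisticsResourceAdaptor or rmm.mr.TrackingResourceAdaptor; only the `capture_stacks` flag name is taken from the description.

```python
import traceback


class ToyUpstream:
    """Hypothetical upstream resource that hands out integer 'pointers'."""

    def __init__(self):
        self._next_ptr = 1

    def allocate(self, nbytes):
        ptr = self._next_ptr
        self._next_ptr += 1
        return ptr

    def deallocate(self, ptr, nbytes):
        pass


class ToyStatistics:
    """Lightweight counters only: cheap, but cannot say where a leak came from."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.current_bytes = 0
        self.peak_bytes = 0
        self.total_bytes = 0
        self.current_count = 0

    def allocate(self, nbytes):
        ptr = self.upstream.allocate(nbytes)
        self.current_bytes += nbytes
        self.total_bytes += nbytes
        self.current_count += 1
        self.peak_bytes = max(self.peak_bytes, self.current_bytes)
        return ptr

    def deallocate(self, ptr, nbytes):
        self.upstream.deallocate(ptr, nbytes)
        self.current_bytes -= nbytes
        self.current_count -= 1


class ToyTracker:
    """Per-allocation record with an optional call stack: heavier, pinpoints leaks."""

    def __init__(self, upstream, capture_stacks=False):
        self.upstream = upstream
        self.capture_stacks = capture_stacks
        self.outstanding = {}  # ptr -> (size in bytes, stack text or None)

    def allocate(self, nbytes):
        ptr = self.upstream.allocate(nbytes)
        stack = "".join(traceback.format_stack()) if self.capture_stacks else None
        self.outstanding[ptr] = (nbytes, stack)
        return ptr

    def deallocate(self, ptr, nbytes):
        del self.outstanding[ptr]
        self.upstream.deallocate(ptr, nbytes)


# Used together: the tracker wraps the statistics layer, which wraps the upstream.
stats = ToyStatistics(ToyUpstream())
tracker = ToyTracker(stats, capture_stacks=True)

a = tracker.allocate(100)
b = tracker.allocate(50)
tracker.deallocate(a, 100)  # `b` is deliberately left outstanding
```

After this run the statistics layer reports that 50 bytes in one allocation are still live and that the peak was 150 bytes, while the tracker's `outstanding` map holds the size and captured stack for `b`, which is the information needed to find the leak.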