Add Statistics Resource Adaptor and cython bindings to `tracking_resource_adaptor` and `statistics_resource_adaptor` #626

mdemoret-nv · 2020-11-10T00:31:31Z

Closes #622 and Closes #623

This PR updates the C++ tracking_resource_adaptor with stack trace information and also adds a new MR, statistics_resource_adaptor. Summary of all changes:

tracking_resource_adaptor changes:
- Added Cython wrapper rmm.mr.TrackingResourceAdaptor which wraps all available methods
- Updated tracking_resource_adaptor to correctly log stack trace information with capture_stacks=True
Added statistics_resource_adaptor memory resource:
- This MR will keep track of the current, peak and total allocated bytes and number of allocations
- Added Cython wrapper rmm.mr.StatisticsResourceAdaptor which wraps all available methods

These two MR can be used separately and together to track memory allocations, check for memory leaks, and identify incorrect deallocations. While both MRs can track the current number of allocations, they have different areas of focus.

The tracking_resource_adaptor is designed more towards identifying and fixing memory leaks, and will log stack trace information for every memory allocation. This MR will have significant performance impacts since it logs a large amount of information for every allocation.

The statistics_resource_adaptor is a lightweight MR that adds simple counters to track the allocated bytes and allocation count. This MR will have significantly less of a performance impact but cannot identify the cause of memory leaks, only that they exist. This MR is also great at tracking peak memory usage and can be helpful in identifying areas that require large amounts of memory or helping developers measure memory usage reductions during optimization.

…rce_adaptor

GPUtester · 2020-11-10T00:31:57Z

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

harrism · 2020-11-10T04:06:50Z

o track usage per test in pytest (the main requirement from cuML) I added a reset_allocation_counts() function which will reset all values back to 0. This has a side effect of making it possible to get negative values for current_bytes and current_count (which needed to be changed to ssize_t)
1. I'm not sure how the rmm team will feel about allowing negative numbers for count/bytes totals. I experimented with a couple other designs (such as a push/pop system) that increased the complexity of the design and made it difficult to track the peak allocation correctly.

2. Technically, negative values were possible before by passing the incorrect `bytes` value to `deallocate()` (which would wrap the `size_t` value)

If you must have a reset method, why not just reset the peak and total, but not the current allocations? This way the current values can never disagree with the map. The ideal usage should be to reset after freeing all outstanding allocations, in which case resetting the peak and total would make all things zero.

mdemoret-nv · 2020-11-10T04:27:08Z

If you must have a reset method, why not just reset the peak and total, but not the current allocations? This way the current values can never disagree with the map.

I had considered this but I don't think it would work. Since the peak value is just peak = max(peak, current), if the current wasn't reset too, the peak won't be reset either since it will just take the current value next allocation.

The ideal usage should be to reset after freeing all outstanding allocations, in which case resetting the peak and total would make all things zero.

I see what you are saying, but in our test suite it's difficult to free all GPU memory between tests due to dataset fixtures and shared data between tests. When using this memory resource we created a pytest plugin that will:

Reset the Tracked MR
Run a single test
Garbage Collect
Get the Tracked MR counts

This allows us to determine the memory usage per test (important for running tests in parallel), and find memory leaks on a per test basis if the current allocation count doesn't return to 0.

With all that in mind, I would prefer to keep the reset functionality in there somewhere to allow our pytest plugin to work correctly. But I'm open to other ideas. There are more complex designs that could certainly work, I have just avoided them to keep the design simple and not impact performance.

kkraus14

Python / Cython lgtm, thanks!

harrism · 2020-11-10T10:31:09Z

FWIW, I think tests should start with a new memory resource each time the fixture is created. gtests do this. Therefore the memory is completely freed up between tests. This enables tests to be completely independent and not interfere with each other.

I don't feel OK with a current usage that can be negative.

include/rmm/mr/device/tracking_resource_adaptor.hpp

Co-authored-by: Rong Ou <[email protected]>

mdemoret-nv · 2020-11-10T18:13:35Z

I don't feel OK with a current usage that can be negative.

Got it. I'll see if I can work on another design which works for the cuML team that doesn't allow negative usage numbers.

FWIW, I think tests should start with a new memory resource each time the fixture is created. gtests do this. Therefore the memory is completely freed up between tests. This enables tests to be completely independent and not interfere with each other.

I was hesitant to create a new MR for each test since some of our tests configure the MR directly and I wasn't sure the impact this would have. Additionally, some of the resources held between tests may be resized within a test which would not be tracked by a new MR.

I'll do some experimentation and see what impact creating a new MR for each test has on the results we are looking for.

Thanks for the feedback.

harrism · 2020-11-11T00:47:43Z

@mdemoret-nv you may want to look at what we do in libcudf gtests as well.

…racking-resource-adaptor

mdemoret-nv · 2020-12-01T21:39:35Z

@harrism I incorporated your feedback and have updated the design to remove the reset() function which was the main driver for negative current usage values. Initially, I tried your suggestion to create a new memory resource before each test but that proved problematic because many of the tests in cuML will improve performance by caching datasets between tests. This causes Python segfaults since the dataset device_buffer will outlive the device_memory_resource it stores a pointer to. (There are a few other situations this can happen in as well, but there is no need to go into that). No matter what I did to try and change the lifetime of various python objects, our test suite always seemed to crash when creating a new memory resource per test.

Instead, I changed the design to store an internal "stack" of tracked values that you can then push/pop as needed. This doesn't eliminate the possibility of getting negative current allocations (which is technically possible with any device_memory_resource that has an "upstream") but it does significantly reduce the possibility. The methods get_total_allocation_counts() and get_allocated_bytes() should now work as you would expect and return the total values summed over the entire stack.

I've tested this in the cuML pytest suite and it works well and does what we need in order to find memory leaks and track memory usage per test. It would be great to get this re-reviewed. In addition, I have a few outstanding questions I could use your input on:

Are you ok with the names push_allocation_counts() and pop_allocation_counts()? I didn't think push() and pop() were descriptive enough.
I broke the string generation section from log_outstanding_allocations() into a new method get_outstanding_allocations_str() which allows getting the stack traces from python as well. Could use some input on your naming conventions here as well.
I also added C++ name demangling to stack_trace::operator<<() which is why there are some new diffs. This might be out of the scope of this PR. I can break it up into a separate one if needed.
Finally, I'm not sure what the difference between a MemoryResource and a ResourceAdaptor is. Can you clarify if the python wrapper should be called TrackingMemoryResource or TrackingResourceAdaptor?

Let me know if you have any questions. Thanks for your feedback so far.

harrism

I don't like the whole stack thing, it's overcomplicated. If you want to reset the tracking without resetting the upstream memory resource, then delete the tracking_resource_adaptor, and create a new one with the same upstream as the old one.

Keep the tracking simple. No need for a struct of counts, a stack, or any of that.

include/rmm/mr/device/tracking_resource_adaptor.hpp

harrism · 2020-12-03T10:36:08Z

include/rmm/mr/device/tracking_resource_adaptor.hpp

+   * @brief Returns an allocation_counts struct for this adaptor containing the
+   * total current, peak, and total number of bytes and allocation count for
+   * this adaptor regardless of any push/popped allocation_counts. Note: Because
+   * its possible to change memory resources at any time while maintaining the
+   * same upstream memory resource, its possible to have a negative allocation
+   * bytes or count if the number of deallocate() calls is greater than the
+   * number of allocate().


This is not brief. :) @brief should be one line. Separate detailed docs from @brief with a single blank line.

include/rmm/mr/device/tracking_resource_adaptor.hpp

harrism · 2020-12-03T10:41:10Z

include/rmm/mr/device/tracking_resource_adaptor.hpp

+   * this adaptor regardless of any push/popped allocation_counts. Note: Because
+   * its possible to change memory resources at any time while maintaining the
+   * same upstream memory resource, its possible to have a negative allocation


I don't understand the note, or the reason for a stack. You can't change the upstream for a resource_adaptor, you would have to create a new resource adaptor. You can't have more deallocations than allocations without an exception. This seems way overcomplicated.

harrism · 2020-12-03T10:44:59Z

I'm very sorry for the delay in reviewing @mdemoret-nv. Holidays and then life (outside work) have gotten in the way for the past week+.

kkraus14 · 2020-12-03T16:52:41Z

Should we push this to 0.18 to avoid changing tracking_resource_adaptor right before release?

harrism · 2021-05-27T00:10:50Z

include/rmm/mr/device/tracking_resource_adaptor.hpp

+        auto allocated_bytes = found->second.allocation_size;
+
+        if (allocated_bytes != bytes) {
+          // Don't throw but log an error. Throwing in a descructor (or any noexcept) will call


Suggested change

// Don't throw but log an error. Throwing in a descructor (or any noexcept) will call

// Don't throw but log an error. Throwing in a destructor (or any noexcept) will call

harrism · 2021-05-27T00:11:56Z

I still think statistics and leak detection should be in two separate memory resource adaptors. Moving to 21.08

…nto a new adaptor statistics_resource_adapter

harrism

The latest changes make me happy! Nice separation of concerns. Just a couple of comments.

include/rmm/mr/device/tracking_resource_adaptor.hpp

harrism · 2021-06-02T11:29:45Z

CMakeLists.txt

@@ -81,6 +81,7 @@ endif(CUDA_STATIC_RUNTIME)

 target_link_libraries(rmm INTERFACE rmm::Thrust)
 target_link_libraries(rmm INTERFACE spdlog::spdlog_header_only)
+target_link_libraries(rmm INTERFACE dl)


Is this needed? Quick google shows that dladdr now lives in libc rather than libdl?

With stack traces enabled, this was needed to compile the tests (original comment). Keith and I briefly discussed this here: #626 (comment).

Can you send me the link where you saw that dladdr has moved? All I am seeing from this link is:

Link with -ldl.

Nevermind, the docs I found were not for linux -- Solaris and something called illumos. As I said, it was a quick google.

Have you tested that all is well in libcudf, cuML, etc. when this library is linked here? Note that the other target_link_libraries for RMM are all header-only, which is why this one has me worried (RMM is a header-only library).

@mdemoret reports cuML builds and tests fine against this PR.

harrism

Nice work!

harrism · 2021-06-03T22:05:22Z

Need to update the title and description before merging.

mdemoret-nv · 2021-06-07T18:43:32Z

The following is the new description for this PR since it has changed so much during development. Leaving the PR description (aka the first comment) in place to preserve history, for now. New description:

This PR updates the C++ tracking_resource_adaptor with stack trace information and also adds a new MR, statistics_resource_adaptor. Summary of all changes:

tracking_resource_adaptor changes:
- Added Cython wrapper rmm.mr.TrackingResourceAdaptor which wraps all available methods
- Updated tracking_resource_adaptor to correctly log stack trace information with capture_stacks=True
Added statistics_resource_adaptor memory resource:
- This MR will keep track of the current, peak and total allocated bytes and number of allocations
- Added Cython wrapper rmm.mr.StatisticsResourceAdaptor which wraps all available methods

These two MR can be used separately and together to track memory allocations, check for memory leaks, and identify incorrect deallocations. While both MRs can track the current number of allocations, they have different areas of focus.

The tracking_resource_adaptor is designed more towards identifying and fixing memory leaks, and will log stack trace information for every memory allocation. This MR will have significant performance impacts since it logs a large amount of information for every allocation.

The statistics_resource_adaptor is a lightweight MR that adds simple counters to track the allocated bytes and allocation count. This MR will have significantly less of a performance impact but cannot identify the cause of memory leaks, only that they exist. This MR is also great at tracking peak memory usage and can be helpful in identifying areas that require large amounts of memory or helping developers measure memory usage reductions during optimization.

harrism · 2021-06-07T22:44:01Z

The updated description needs to be in the first (original) comment or our merge scripts won't copy the right description into the changelog.

harrism · 2021-06-08T12:01:58Z

@gpucibot merge

Adding additional tracking info and cython bindings to tracking_resou…

c5e91d9

…rce_adaptor

mdemoret-nv requested review from a team as code owners November 10, 2020 00:31

mdemoret-nv requested review from rongou and cwharris November 10, 2020 00:31

mdemoret-nv added 3 commits November 9, 2020 17:32

Adding PR to CHANGELOG

b31bb2f

Style cleanup

533fe2a

Fixing incorrect Cython class name

cffe9bf

kkraus14 approved these changes Nov 10, 2020

View reviewed changes

jrhemstad requested changes Nov 10, 2020

View reviewed changes

rongou approved these changes Nov 10, 2020

View reviewed changes

include/rmm/mr/device/tracking_resource_adaptor.hpp Outdated Show resolved Hide resolved

Apply suggestions from code review

e1218fa

Co-authored-by: Rong Ou <[email protected]>

mdemoret-nv added 4 commits November 19, 2020 14:08

Merge remote-tracking branch 'upstream/branch-0.17' into enh-extend-t…

25d5da3

…racking-resource-adaptor

Removed the reset() method, added ability to push/pull

976dabb

Merge remote-tracking branch 'upstream/branch-0.17' into enh-extend-t…

39d5e22

…racking-resource-adaptor

Style cleanup

ddd296a

harrism added the 3 - Ready for review Ready for review by team label Nov 26, 2020

mdemoret-nv mentioned this pull request Dec 1, 2020

[REVIEW] Fixing Incorrect Deallocation Size and Count Bugs rapidsai/raft#109

Merged

harrism requested changes Dec 3, 2020

View reviewed changes

github-actions bot removed the inactive-30d label May 25, 2021

kkraus14 approved these changes May 25, 2021

View reviewed changes

harrism reviewed May 27, 2021

View reviewed changes

mdemoret-nv added 4 commits June 1, 2021 21:10

Merge branch 'branch-21.08' into enh-extend-tracking-resource-adaptor

b5bbab4

Separated the statistics portion from the tracking_resource_adaptor i…

c7395e4

…nto a new adaptor statistics_resource_adapter

Getting tests to pass

1872ee2

Style cleanup

6b20fb8

mdemoret-nv changed the base branch from branch-21.06 to branch-21.08 June 2, 2021 04:53

harrism requested changes Jun 2, 2021

View reviewed changes

mdemoret-nv added 2 commits June 2, 2021 21:58

Improving method comment.

b281f9e

Style cleanup.

cab217d

harrism approved these changes Jun 3, 2021

View reviewed changes

harrism requested a review from jrhemstad June 3, 2021 05:13

jrhemstad approved these changes Jun 3, 2021

View reviewed changes

harrism removed 3 - Ready for review Ready for review by team 5 - Merge After Dependencies Depends on another PR: do not merge out of order labels Jun 3, 2021

mdemoret-nv changed the title ~~Add additional tracking info and cython bindings to tracking_resource_adaptor~~ Add Statistics Resource Adaptor and cython bindings to tracking_resource_adaptor and statistics_resource_adaptor Jun 7, 2021

galipremsagar approved these changes Jun 8, 2021

View reviewed changes

rapids-bot bot merged commit 2f85111 into rapidsai:branch-21.08 Jun 8, 2021

quasiben mentioned this pull request Aug 2, 2021

gpuCI follow-up dask/dask#7975

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Statistics Resource Adaptor and cython bindings to `tracking_resource_adaptor` and `statistics_resource_adaptor` #626

Add Statistics Resource Adaptor and cython bindings to `tracking_resource_adaptor` and `statistics_resource_adaptor` #626

mdemoret-nv commented Nov 10, 2020 •

edited

Loading

GPUtester commented Nov 10, 2020

harrism commented Nov 10, 2020

mdemoret-nv commented Nov 10, 2020

kkraus14 left a comment

harrism commented Nov 10, 2020 •

edited

Loading

mdemoret-nv commented Nov 10, 2020

harrism commented Nov 11, 2020

mdemoret-nv commented Dec 1, 2020

harrism left a comment

harrism Dec 3, 2020

harrism Dec 3, 2020

harrism commented Dec 3, 2020

kkraus14 commented Dec 3, 2020

harrism May 27, 2021

harrism commented May 27, 2021 •

edited

Loading

harrism left a comment

harrism Jun 2, 2021

mdemoret-nv Jun 2, 2021

harrism Jun 2, 2021

harrism Jun 2, 2021

harrism Jun 3, 2021

harrism left a comment

harrism commented Jun 3, 2021

mdemoret-nv commented Jun 7, 2021

harrism commented Jun 7, 2021

harrism commented Jun 8, 2021

	// Don't throw but log an error. Throwing in a descructor (or any noexcept) will call
	// Don't throw but log an error. Throwing in a destructor (or any noexcept) will call

Add Statistics Resource Adaptor and cython bindings to tracking_resource_adaptor and statistics_resource_adaptor #626

Add Statistics Resource Adaptor and cython bindings to tracking_resource_adaptor and statistics_resource_adaptor #626

Conversation

mdemoret-nv commented Nov 10, 2020 • edited Loading

GPUtester commented Nov 10, 2020

harrism commented Nov 10, 2020

mdemoret-nv commented Nov 10, 2020

kkraus14 left a comment

Choose a reason for hiding this comment

harrism commented Nov 10, 2020 • edited Loading

mdemoret-nv commented Nov 10, 2020

harrism commented Nov 11, 2020

mdemoret-nv commented Dec 1, 2020

harrism left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

harrism commented Dec 3, 2020

kkraus14 commented Dec 3, 2020

Choose a reason for hiding this comment

harrism commented May 27, 2021 • edited Loading

harrism left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

harrism left a comment

Choose a reason for hiding this comment

harrism commented Jun 3, 2021

mdemoret-nv commented Jun 7, 2021

harrism commented Jun 7, 2021

harrism commented Jun 8, 2021

Add Statistics Resource Adaptor and cython bindings to `tracking_resource_adaptor` and `statistics_resource_adaptor` #626

Add Statistics Resource Adaptor and cython bindings to `tracking_resource_adaptor` and `statistics_resource_adaptor` #626

mdemoret-nv commented Nov 10, 2020 •

edited

Loading

harrism commented Nov 10, 2020 •

edited

Loading

harrism commented May 27, 2021 •

edited

Loading