-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add NVTX ranges for all CUB algorithms #1657
Conversation
|
||
#pragma once | ||
|
||
#include <cub/config.cuh> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: This could be a quite useful facility also for other libraries, do we want to move this within cccl
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With your permission, sure! The scope of the PR is CUB though, so let's maybe move it in a second step? Maybe @gevtushenko let's me do the same work for thrust, then I could move it at this occasion.
/ok to test |
00260d0
to
f9de904
Compare
minor suggestion/note: This likely doesn't matter, but technically there is some extra performance overhead here because when you call See more here: https://nvidia.github.io/NVTX/doxygen-cpp/index.html#REGISTERED_MESSAGE |
25c2423
to
c963082
Compare
f16a826
to
21e6697
Compare
@jrhemstad addressed. |
CUB_DETAIL_NVTX_RANGE_SCOPE_IF(d_temp_storage, GetName()); | ||
//return SortPairsNoNVTX<KeyT, ValueT, BeginOffsetIteratorT, EndOffsetIteratorT>( | ||
return SortPairs<KeyT, ValueT, BeginOffsetIteratorT, EndOffsetIteratorT>( // FIXME(bgruber): bug on purpose to validate unit tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a bug on purpose to see if the CI fires
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this one and a few others got it, looks like this:
208/327 Test #208: cub.cpp17.test.device_segmented_sort_keys.lid_0 .....................Subprocess aborted***Exception: 0.22 sec
Recursion detected
b9585ee
to
9ea100e
Compare
// The purpose of this test is to verify that CUB can use NVTX without any additional dependencies. It is built as part | ||
// of the unit tests, but can also be built standalone: | ||
|
||
// Compile (from current directory): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: did you verify that this builds in CI with the right command line options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. So I considered hooking up this test as a special minimal build analogous to the header checks like we discussed offline. However, I noticed the header checks also link against the cub target in the same way as the non-Catch2 unit tests do. So adding this standalone test as a non-Catch2 test should be equally minimal.
For reference here are the compile and link command lines for building VERBOSE=1 make cub.cpp17.test.nvtx_standalone
. Compile:
/usr/local/cuda/bin/nvcc
-forward-unknown-to-host-compiler
-ccbin=clang++-17
-DCUB_DETAIL_DEBUG_ENABLE_SYNC
-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA
-DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP
-D_CCCL_NO_SYSTEM_HEADER
-I/home/bgruber/dev/cccl/cub/test
-I/home/bgruber/dev/cccl/cub/cub/cmake/../..
-I/home/bgruber/dev/cccl/libcudacxx/lib/cmake/libcudacxx/../../../include
-I/home/bgruber/dev/cccl/thrust/thrust/cmake/../..
-O2
-g
-DNDEBUG
-std=c++17
"--generate-code=arch=compute_89,code=[compute_89,sm_89]"
-Xcompiler=-Werror
-Xcompiler=-Wall
-Xcompiler=-Wextra
-Xcompiler=-Winit-self
-Xcompiler=-Woverloaded-virtual
-Xcompiler=-Wcast-qual
-Xcompiler=-Wpointer-arith
-Xcompiler=-Wunused-local-typedef
-Xcompiler=-Wvla
-Xcompiler=-Wgnu
-Xcompiler=-Wno-gnu-line-marker
-Xcompiler=-Wno-gnu-zero-variadic-macro-arguments
-Xcompiler=-Wno-unused-function
-Wreorder
-Xcudafe=--display_error_number
-Xcudafe=--promote_warnings
-Wno-deprecated-gpu-targets
-MD
-MT
cub/test/CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o
-MF
CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o.d
-x
cu
-c
/home/bgruber/dev/cccl/cub/test/test_nvtx_standalone.cu
-o
CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o
Link:
/usr/bin/clang++-17
CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o
-o
../../bin/cub.cpp17.test.nvtx_standalone
-lcudadevrt
-lcudart_static
-lrt
-lpthread
-ldl
-L"/usr/local/cuda/targets/x86_64-linux/lib/stubs"
-L"/usr/local/cuda/targets/x86_64-linux/lib"
However, the test compiles and runs equally well if I just compile it with:
nvcc test_nvtx_standalone.cu -I../../cub -I../../thrust -I../../libcudacxx/include -o nvtx_standalone
Fixes: NVIDIA#719 Co-authored-by: Michael Schellenberger Costa <[email protected]>
c:\cccl\cub\cub\detail\nvtx3.hpp(807): warning C4459: declaration of 'd' hides global declaration C:/cccl/thrust/examples/cuda/global_device_vector.cu(30): note: see declaration of 'd'
Because @gevtushenko asked again about whether NVTX requires any new link-time dependencies, like Linux nvcc: Linux nvc++: Linux clang CUDA: Windows MSVC: All executables run and produce the expected output in Nsight Systems. |
Description
Add NVTX ranges to all CUB algorithms, so they can be visualized in a profiler such as Nsight Systems.
Here is how this could look like in Nsight Systems:
Fixes: #719
Checklist
Open questions
CCCL
(like in Feature Request: Add NVTX Ranges to Thrust/CUB algorithms #719) orCUB
/Thrust
etc.? @jrhemstad saysCCCL
.