Add NVTX ranges for all CUB algorithms #1657

bernhardmgruber · 2024-04-22T12:57:00Z

Description

Add NVTX ranges to all CUB algorithms, so they can be visualized in a profiler such as Nsight Systems.

Here is how this could look like in Nsight Systems:

Fixes: #719

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
Inject recursion guard from unit tests

Open questions

Should we emit a range for a CUB algorithm call that only returns the amount of needed temporary storage? @gevtushenko says no.
Call the NVTX domain for all ranges (the row name in Nsight Systems) CCCL (like in Feature Request: Add NVTX Ranges to Thrust/CUB algorithms #719) or CUB/Thrust etc.? @jrhemstad says CCCL.
Does the dependency on nvtx3 introduce a link-time dependency and therefore require to specify additional build options? Answer: no.

cub/cub/device/device_reduce.cuh

cub/cub/detail/nvtx.cuh

miscco · 2024-04-22T13:16:07Z

cub/cub/detail/nvtx.cuh

+
+#pragma once
+
+#include <cub/config.cuh>


Question: This could be a quite useful facility also for other libraries, do we want to move this within cccl?

With your permission, sure! The scope of the PR is CUB though, so let's maybe move it in a second step? Maybe @gevtushenko let's me do the same work for thrust, then I could move it at this occasion.

copy-pr-bot · 2024-04-22T13:54:13Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cub/cub/device/device_reduce.cuh

miscco · 2024-04-22T14:18:02Z

/ok to test

cub/cub/detail/nvtx.cuh

cub/cub/device/device_reduce.cuh

cub/cub/detail/nvtx.cuh

jrhemstad · 2024-04-22T23:11:09Z

minor suggestion/note: This likely doesn't matter, but technically there is some extra performance overhead here because when you call nvtxPushRange("message") it deep copies the string every time. In order to avoid that deep-copy, you can use a "registered" message that copies the string once and just uses a handle everywhere instead.

See more here: https://nvidia.github.io/NVTX/doxygen-cpp/index.html#REGISTERED_MESSAGE

cub/cub/detail/nvtx.cuh

libcudacxx/include/cuda/std/detail/libcxx/include/optional

bernhardmgruber · 2024-04-24T11:10:55Z

you can use a "registered" message that copies the string once and just uses a handle everywhere instead.

@jrhemstad addressed.

cub/cub/detail/nvtx.cuh

bernhardmgruber · 2024-04-24T17:24:57Z

cub/cub/device/device_segmented_sort.cuh

+    CUB_DETAIL_NVTX_RANGE_SCOPE_IF(d_temp_storage, GetName());
+    //return SortPairsNoNVTX<KeyT, ValueT, BeginOffsetIteratorT, EndOffsetIteratorT>(
+    return SortPairs<KeyT, ValueT, BeginOffsetIteratorT, EndOffsetIteratorT>( // FIXME(bgruber): bug on purpose to validate unit tests


That's a bug on purpose to see if the CI fires

And this one and a few others got it, looks like this:

208/327 Test #208: cub.cpp17.test.device_segmented_sort_keys.lid_0 .....................Subprocess aborted***Exception: 0.22 sec Recursion detected

cub/cub/detail/nvtx.cuh

libcudacxx/include/cuda/std/detail/libcxx/include/optional

cub/cub/detail/nvtx.cuh

cub/cub/device/device_merge_sort.cuh

cub/cub/device/device_partition.cuh

miscco · 2024-04-26T11:02:22Z

cub/test/test_nvtx_standalone.cu

+// The purpose of this test is to verify that CUB can use NVTX without any additional dependencies. It is built as part
+// of the unit tests, but can also be built standalone:
+
+// Compile (from current directory):


Question: did you verify that this builds in CI with the right command line options?

Right. So I considered hooking up this test as a special minimal build analogous to the header checks like we discussed offline. However, I noticed the header checks also link against the cub target in the same way as the non-Catch2 unit tests do. So adding this standalone test as a non-Catch2 test should be equally minimal.

For reference here are the compile and link command lines for building VERBOSE=1 make cub.cpp17.test.nvtx_standalone. Compile:

/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -ccbin=clang++-17 -DCUB_DETAIL_DEBUG_ENABLE_SYNC -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -D_CCCL_NO_SYSTEM_HEADER -I/home/bgruber/dev/cccl/cub/test -I/home/bgruber/dev/cccl/cub/cub/cmake/../.. -I/home/bgruber/dev/cccl/libcudacxx/lib/cmake/libcudacxx/../../../include -I/home/bgruber/dev/cccl/thrust/thrust/cmake/../.. -O2 -g -DNDEBUG -std=c++17 "--generate-code=arch=compute_89,code=[compute_89,sm_89]" -Xcompiler=-Werror -Xcompiler=-Wall -Xcompiler=-Wextra -Xcompiler=-Winit-self -Xcompiler=-Woverloaded-virtual -Xcompiler=-Wcast-qual -Xcompiler=-Wpointer-arith -Xcompiler=-Wunused-local-typedef -Xcompiler=-Wvla -Xcompiler=-Wgnu -Xcompiler=-Wno-gnu-line-marker -Xcompiler=-Wno-gnu-zero-variadic-macro-arguments -Xcompiler=-Wno-unused-function -Wreorder -Xcudafe=--display_error_number -Xcudafe=--promote_warnings -Wno-deprecated-gpu-targets -MD -MT cub/test/CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o -MF CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o.d -x cu -c /home/bgruber/dev/cccl/cub/test/test_nvtx_standalone.cu -o CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o

Link:

/usr/bin/clang++-17 CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o -o ../../bin/cub.cpp17.test.nvtx_standalone -lcudadevrt -lcudart_static -lrt -lpthread -ldl -L"/usr/local/cuda/targets/x86_64-linux/lib/stubs" -L"/usr/local/cuda/targets/x86_64-linux/lib"

However, the test compiles and runs equally well if I just compile it with:
nvcc test_nvtx_standalone.cu -I../../cub -I../../thrust -I../../libcudacxx/include -o nvtx_standalone

Fixes: NVIDIA#719 Co-authored-by: Michael Schellenberger Costa <[email protected]>

c:\cccl\cub\cub\detail\nvtx3.hpp(807): warning C4459: declaration of 'd' hides global declaration C:/cccl/thrust/examples/cuda/global_device_vector.cu(30): note: see declaration of 'd'

bernhardmgruber · 2024-04-29T19:27:47Z

Because @gevtushenko asked again about whether NVTX requires any new link-time dependencies, like -ldl, I tested a few further scenarios:

Linux nvcc:
nvcc test_nvtx_standalone.cu -I../../cub -I../../thrust -I../../libcudacxx/include -o nvtx_standalone
nvcc will implicitly add -ldl to the link command line for g++.

Linux nvc++:
/opt/nvidia/hpc_sdk/Linux_x86_64/2024/compilers/bin/nvc++ -v test_nvtx_standalone.cu -I../../cub -I../../thrust -I../../libcudacxx/include -o nvtx_standalone
nvc++ will implicitly add -ldl to the link command line for acclnk

Linux clang CUDA:
clang++-17 test_nvtx_standalone.cu --cuda-gpu-arch=sm_89 -lcudart_static -ldl -lrt -pthread -L/usr/local/cuda/lib64 -I../../cub -I../../thrust -I../../libcudacxx/include -o nvtx_standalone
Linking dl is explicitly required as specified by the clang CUDA-mode documentation:

Windows MSVC:
nvcc -v test_nvtx_standalone.cu -I../../cub -I../../thrust -I../../libcudacxx/include -o nvtx_standalone
Works.

All executables run and produce the expected output in Nsight Systems.

bernhardmgruber added the cub For all items related to CUB label Apr 22, 2024

miscco requested changes Apr 22, 2024

View reviewed changes

miscco reviewed Apr 22, 2024

View reviewed changes

bernhardmgruber force-pushed the nvtx branch from 7c58c66 to a74fc2d Compare April 22, 2024 13:54

bernhardmgruber force-pushed the nvtx branch from a74fc2d to 665854f Compare April 22, 2024 13:59

bernhardmgruber commented Apr 22, 2024

View reviewed changes

cub/cub/device/device_reduce.cuh Outdated Show resolved Hide resolved

bernhardmgruber force-pushed the nvtx branch 4 times, most recently from 00260d0 to f9de904 Compare April 22, 2024 16:57

jrhemstad reviewed Apr 22, 2024

View reviewed changes