Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add NVTX ranges for all CUB algorithms #1657

Merged
merged 5 commits into from
Apr 29, 2024

Conversation

bernhardmgruber
Copy link
Contributor

@bernhardmgruber bernhardmgruber commented Apr 22, 2024

Description

Add NVTX ranges to all CUB algorithms, so they can be visualized in a profiler such as Nsight Systems.

Here is how this could look like in Nsight Systems:
image

Fixes: #719

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • Inject recursion guard from unit tests

Open questions

  • Should we emit a range for a CUB algorithm call that only returns the amount of needed temporary storage? @gevtushenko says no.
  • Call the NVTX domain for all ranges (the row name in Nsight Systems) CCCL (like in Feature Request: Add NVTX Ranges to Thrust/CUB algorithms #719) or CUB/Thrust etc.? @jrhemstad says CCCL.
  • Does the dependency on nvtx3 introduce a link-time dependency and therefore require to specify additional build options? Answer: no.

@bernhardmgruber bernhardmgruber added the cub For all items related to CUB label Apr 22, 2024
cub/cub/device/device_reduce.cuh Outdated Show resolved Hide resolved
cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved
cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved
cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved
cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved
cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved

#pragma once

#include <cub/config.cuh>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: This could be a quite useful facility also for other libraries, do we want to move this within cccl?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With your permission, sure! The scope of the PR is CUB though, so let's maybe move it in a second step? Maybe @gevtushenko let's me do the same work for thrust, then I could move it at this occasion.

Copy link

copy-pr-bot bot commented Apr 22, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@miscco
Copy link
Collaborator

miscco commented Apr 22, 2024

/ok to test

cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved
cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved
cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved
cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved
@jrhemstad
Copy link
Collaborator

minor suggestion/note: This likely doesn't matter, but technically there is some extra performance overhead here because when you call nvtxPushRange("message") it deep copies the string every time. In order to avoid that deep-copy, you can use a "registered" message that copies the string once and just uses a handle everywhere instead.

See more here: https://nvidia.github.io/NVTX/doxygen-cpp/index.html#REGISTERED_MESSAGE

@bernhardmgruber
Copy link
Contributor Author

you can use a "registered" message that copies the string once and just uses a handle everywhere instead.

@jrhemstad addressed.

cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved
Comment on lines 2973 to 2975
CUB_DETAIL_NVTX_RANGE_SCOPE_IF(d_temp_storage, GetName());
//return SortPairsNoNVTX<KeyT, ValueT, BeginOffsetIteratorT, EndOffsetIteratorT>(
return SortPairs<KeyT, ValueT, BeginOffsetIteratorT, EndOffsetIteratorT>( // FIXME(bgruber): bug on purpose to validate unit tests
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a bug on purpose to see if the CI fires

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one and a few others got it, looks like this:

208/327 Test #208: cub.cpp17.test.device_segmented_sort_keys.lid_0 .....................Subprocess aborted***Exception:   0.22 sec
Recursion detected

cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved
cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved
cub/cub/detail/nvtx.cuh Outdated Show resolved Hide resolved
cub/cub/device/device_merge_sort.cuh Show resolved Hide resolved
cub/cub/device/device_partition.cuh Show resolved Hide resolved
// The purpose of this test is to verify that CUB can use NVTX without any additional dependencies. It is built as part
// of the unit tests, but can also be built standalone:

// Compile (from current directory):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: did you verify that this builds in CI with the right command line options?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. So I considered hooking up this test as a special minimal build analogous to the header checks like we discussed offline. However, I noticed the header checks also link against the cub target in the same way as the non-Catch2 unit tests do. So adding this standalone test as a non-Catch2 test should be equally minimal.

For reference here are the compile and link command lines for building VERBOSE=1 make cub.cpp17.test.nvtx_standalone. Compile:

/usr/local/cuda/bin/nvcc
-forward-unknown-to-host-compiler
-ccbin=clang++-17
-DCUB_DETAIL_DEBUG_ENABLE_SYNC
-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA
-DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP
-D_CCCL_NO_SYSTEM_HEADER
-I/home/bgruber/dev/cccl/cub/test
-I/home/bgruber/dev/cccl/cub/cub/cmake/../..
-I/home/bgruber/dev/cccl/libcudacxx/lib/cmake/libcudacxx/../../../include
-I/home/bgruber/dev/cccl/thrust/thrust/cmake/../..
-O2
-g
-DNDEBUG
-std=c++17
"--generate-code=arch=compute_89,code=[compute_89,sm_89]"
-Xcompiler=-Werror
-Xcompiler=-Wall
-Xcompiler=-Wextra
-Xcompiler=-Winit-self
-Xcompiler=-Woverloaded-virtual
-Xcompiler=-Wcast-qual
-Xcompiler=-Wpointer-arith
-Xcompiler=-Wunused-local-typedef
-Xcompiler=-Wvla
-Xcompiler=-Wgnu
-Xcompiler=-Wno-gnu-line-marker
-Xcompiler=-Wno-gnu-zero-variadic-macro-arguments
-Xcompiler=-Wno-unused-function
-Wreorder
-Xcudafe=--display_error_number
-Xcudafe=--promote_warnings
-Wno-deprecated-gpu-targets
-MD
-MT
cub/test/CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o
-MF
CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o.d
-x
cu
-c
/home/bgruber/dev/cccl/cub/test/test_nvtx_standalone.cu
-o
CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o

Link:

/usr/bin/clang++-17
CMakeFiles/cub.cpp17.test.nvtx_standalone.dir/test_nvtx_standalone.cu.o
-o
../../bin/cub.cpp17.test.nvtx_standalone
-lcudadevrt
-lcudart_static
-lrt
-lpthread
-ldl
-L"/usr/local/cuda/targets/x86_64-linux/lib/stubs"
-L"/usr/local/cuda/targets/x86_64-linux/lib"

However, the test compiles and runs equally well if I just compile it with:
nvcc test_nvtx_standalone.cu -I../../cub -I../../thrust -I../../libcudacxx/include -o nvtx_standalone

Fixes: NVIDIA#719

Co-authored-by: Michael Schellenberger Costa <[email protected]>
c:\cccl\cub\cub\detail\nvtx3.hpp(807): warning C4459: declaration of 'd' hides global declaration
C:/cccl/thrust/examples/cuda/global_device_vector.cu(30): note: see declaration of 'd'
@bernhardmgruber bernhardmgruber requested a review from a team as a code owner April 29, 2024 07:40
@bernhardmgruber bernhardmgruber merged commit dd6f124 into NVIDIA:main Apr 29, 2024
586 checks passed
@bernhardmgruber bernhardmgruber deleted the nvtx branch April 29, 2024 13:44
@bernhardmgruber
Copy link
Contributor Author

Because @gevtushenko asked again about whether NVTX requires any new link-time dependencies, like -ldl, I tested a few further scenarios:

Linux nvcc:
nvcc test_nvtx_standalone.cu -I../../cub -I../../thrust -I../../libcudacxx/include -o nvtx_standalone
nvcc will implicitly add -ldl to the link command line for g++.

Linux nvc++:
/opt/nvidia/hpc_sdk/Linux_x86_64/2024/compilers/bin/nvc++ -v test_nvtx_standalone.cu -I../../cub -I../../thrust -I../../libcudacxx/include -o nvtx_standalone
nvc++ will implicitly add -ldl to the link command line for acclnk

Linux clang CUDA:
clang++-17 test_nvtx_standalone.cu --cuda-gpu-arch=sm_89 -lcudart_static -ldl -lrt -pthread -L/usr/local/cuda/lib64 -I../../cub -I../../thrust -I../../libcudacxx/include -o nvtx_standalone
Linking dl is explicitly required as specified by the clang CUDA-mode documentation:

Windows MSVC:
nvcc -v test_nvtx_standalone.cu -I../../cub -I../../thrust -I../../libcudacxx/include -o nvtx_standalone
Works.

All executables run and produce the expected output in Nsight Systems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cub For all items related to CUB
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Feature Request: Add NVTX Ranges to Thrust/CUB algorithms
3 participants