
nvprof does not detect kernel launches #371

Closed
mkarikom opened this issue Aug 17, 2020 · 1 comment
mkarikom commented Aug 17, 2020

nvprof runs without error and CUDA.jl behaves as expected, but nvprof does not record any kernel launches or API activity.

Julia environment:

(@v1.4) pkg> status
Status `~/.julia/environments/v1.4/Project.toml`
  [c52e3926] Atom v0.12.19
  [052768ef] CUDA v1.2.1
  [e5e0dc1b] Juno v0.8.3
  [14b8a8f1] PkgTemplates v0.7.8
  [295af30f] Revise v2.7.3

nvprof version:

(base) au@a1:~$ nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2019 NVIDIA Corporation
Release version 10.1.243 (21)

Hardware and library support:

(@v1.4) pkg> test CUDA
    Testing CUDA
Status `/tmp/jl_c4Y0CF/Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.1
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [864edb3b] DataStructures v0.17.20
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.9.4
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v5.0.0
  [61eb1bfa] GPUCompiler v0.5.5
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [856f044c] MKL_jll v2020.2.254+0
  [1914dd2f] MacroTools v0.5.5
  [872c559c] NNlib v0.7.4
  [77ba4419] NaNMath v0.3.4
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [2a0f44e3] Base64 
  [ade2ca70] Dates 
  [8ba89e20] Distributed 
  [b77e0a4c] InteractiveUtils 
  [76f85450] LibGit2 
  [8f399da3] Libdl 
  [37e2e46d] LinearAlgebra 
  [56ddb016] Logging 
  [d6f4376e] Markdown 
  [44cfe95a] Pkg 
  [de0858da] Printf 
  [3fa0cd96] REPL 
  [9a3f8284] Random 
  [ea8e919c] SHA 
  [9e88b42a] Serialization 
  [6462fe0b] Sockets 
  [2f01184e] SparseArrays 
  [10745b16] Statistics 
  [8dfed614] Test 
  [cf7118a7] UUIDs 
  [4ec0a83e] Unicode 
┌ Info: System information:
│ CUDA toolkit 10.2.89, artifact installation
│ CUDA driver 10.2.0
│ NVIDIA driver 440.100.0
│ 
│ Libraries: 
│ - CUBLAS: 10.2.2
│ - CURAND: 10.1.2
│ - CUFFT: 10.1.2
│ - CUSOLVER: 10.3.0
│ - CUSPARSE: 10.3.1
│ - CUPTI: 12.0.0
│ - NVML: 10.0.0+440.100
│ - CUDNN: 8.0.1 (for CUDA 10.2.0)
│ - CUTENSOR: 1.2.0 (for CUDA 10.2.0)
│ 
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 
│ 1 device(s):
└ - GeForce GTX 1080 Ti (sm_61, 8.982 GiB / 10.913 GiB available)
[ Info: Testing using 1 device(s): 1. GeForce GTX 1080 Ti (UUID ad1d87a4-88f9-0a82-edf0-3931aa888c68)
[ Info: Skipping the following tests: cutensor, device/wmma
                                     |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                        (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                   (2) |     2.95 |   0.00 |  0.0 |       0.00 |   135.00 |   0.05 |  1.7 |     160.93 |   868.43 |
apiutils                         (3) |     0.71 |   0.00 |  0.0 |       0.00 |   135.00 |   0.03 |  4.4 |      90.76 |   876.82 |
curand                           (2) |     0.25 |   0.00 |  0.0 |       0.00 |   141.00 |   0.00 |  0.0 |      28.72 |   876.84 |
codegen                          (6) |    16.40 |   0.26 |  1.6 |       0.00 |   175.00 |   0.94 |  5.7 |    1818.62 |  1049.93 |
broadcast                        (5) |    34.20 |   0.33 |  1.0 |       0.00 |   149.00 |   1.51 |  4.4 |    3419.01 |   998.46 |
cufft                            (9) |    35.26 |   0.31 |  0.9 |     144.16 |   303.00 |   1.93 |  5.5 |    4372.68 |  1208.13 |
cusparse                         (2) |    50.29 |   0.29 |  0.6 |       4.46 |   209.00 |   2.43 |  4.8 |    6067.47 |  1276.17 |
iterator                         (2) |     2.03 |   0.00 |  0.0 |       1.25 |   211.00 |   0.07 |  3.3 |     227.10 |  1276.34 |
memory                           (2) |     1.43 |   0.00 |  0.0 |       0.00 |   209.00 |   0.36 | 25.2 |     110.34 |  1276.36 |
array                            (4) |    56.83 |   0.33 |  0.6 |       5.20 |   155.00 |   2.73 |  4.8 |    6732.05 |  1109.42 |
nvml                             (4) |     0.46 |   0.00 |  0.0 |       0.00 |   155.00 |   0.00 |  0.0 |      49.10 |  1113.06 |
nvtx                             (4) |     0.46 |   0.00 |  0.0 |       0.00 |   155.00 |   0.03 |  5.7 |      73.85 |  1113.19 |
pointer                          (4) |     0.10 |   0.00 |  0.0 |       0.00 |   155.00 |   0.00 |  0.0 |       6.40 |  1113.26 |
nnlib                            (2) |     3.23 |   0.16 |  5.0 |       0.00 |   253.00 |   0.13 |  4.2 |     411.07 |  1408.39 |
random                           (4) |     4.63 |   0.00 |  0.0 |       0.02 |   155.00 |   0.17 |  3.6 |     492.31 |  1116.78 |
cublas                           (7) |    68.23 |   0.38 |  0.6 |      11.12 |   211.00 |   3.34 |  4.9 |    8936.22 |  1277.85 |
cudnn                            (8) |    68.74 |   0.32 |  0.5 |       0.60 |   261.00 |   2.81 |  4.1 |    7509.71 |  1547.84 |
cusolver                         (3) |    68.11 |   0.36 |  0.5 |    1128.68 |   321.00 |   3.46 |  5.1 |    8741.84 |  1404.84 |
cudadrv/context                  (3) |     0.65 |   0.00 |  0.0 |       0.00 |   321.00 |   0.00 |  0.0 |      32.95 |  1526.70 |
utils                            (8) |     1.21 |   0.00 |  0.0 |       0.00 |   261.00 |   0.05 |  4.4 |     141.72 |  1547.84 |
cudadrv/devices                  (3) |     0.31 |   0.00 |  0.0 |       0.00 |   321.00 |   0.01 |  4.8 |      39.47 |  1526.70 |
cudadrv/errors                   (8) |     0.18 |   0.00 |  0.0 |       0.00 |   261.00 |   0.00 |  0.0 |      22.18 |  1547.84 |
threading                        (7) |     2.10 |   0.00 |  0.1 |       4.69 |   221.00 |   0.06 |  3.0 |     198.21 |  1291.28 |
cudadrv/events                   (3) |     0.14 |   0.00 |  0.0 |       0.00 |   321.00 |   0.00 |  0.0 |      14.38 |  1526.70 |
cudadrv/module                   (3) |     0.37 |   0.00 |  0.0 |       0.00 |   321.00 |   0.02 |  4.4 |      46.39 |  1526.70 |
cudadrv/occupancy                (3) |     0.11 |   0.00 |  0.0 |       0.00 |   321.00 |   0.00 |  0.0 |       8.27 |  1526.70 |
cudadrv/execution                (8) |     0.90 |   0.00 |  0.0 |       0.00 |   261.00 |   0.04 |  4.2 |     107.33 |  1547.84 |
cudadrv/profile                  (3) |     0.25 |   0.00 |  0.0 |       0.00 |   321.00 |   0.00 |  0.0 |      48.15 |  1526.70 |
cudadrv/version                  (3) |     0.01 |   0.00 |  0.0 |       0.00 |   321.00 |   0.00 |  0.0 |       0.07 |  1526.70 |
cudadrv/stream                   (8) |     0.20 |   0.00 |  0.0 |       0.00 |   261.00 |   0.00 |  0.0 |      23.67 |  1547.84 |
cudadrv/memory                   (7) |     1.85 |   0.00 |  0.0 |       0.00 |   213.00 |   0.08 |  4.2 |     206.28 |  1292.60 |
statistics                       (2) |    13.65 |   0.00 |  0.0 |       0.00 |   253.00 |   0.61 |  4.5 |    1656.35 |  1452.93 |
device/array                     (8) |     3.35 |   0.00 |  0.0 |       0.00 |   261.00 |   0.13 |  4.0 |     355.86 |  1547.84 |
cusolver/cusparse                (3) |     6.51 |   0.00 |  0.0 |       0.19 |   387.00 |   0.19 |  2.9 |     583.47 |  1614.06 |
device/pointer                   (2) |     5.85 |   0.00 |  0.0 |       0.00 |   253.00 |   0.20 |  3.4 |     640.21 |  1459.73 |
gpuarrays/math                   (3) |     1.97 |   0.00 |  0.0 |       0.00 |   387.00 |   0.07 |  3.5 |     245.62 |  1620.85 |
texture                          (4) |    17.47 |   0.00 |  0.0 |       0.08 |   159.00 |   0.92 |  5.3 |    2414.18 |  1118.00 |
gpuarrays/input output           (2) |     2.79 |   0.00 |  0.0 |       0.00 |   253.00 |   0.22 |  8.0 |     535.03 |  1467.05 |
gpuarrays/interface              (4) |     1.74 |   0.00 |  0.0 |       0.00 |   159.00 |   0.06 |  3.2 |     191.04 |  1118.62 |
gpuarrays/value constructors     (3) |     3.92 |   0.00 |  0.0 |       0.00 |   389.00 |   0.12 |  3.0 |     367.18 |  1631.52 |
gpuarrays/uniformscaling         (4) |     5.93 |   0.00 |  0.0 |       0.01 |   187.00 |   0.21 |  3.6 |     626.72 |  1134.04 |
gpuarrays/indexing               (8) |    13.93 |   0.00 |  0.0 |       0.13 |   261.00 |   0.60 |  4.3 |    1715.34 |  1547.84 |
gpuarrays/iterator constructors  (2) |    10.05 |   0.00 |  0.0 |       0.02 |   253.00 |   0.46 |  4.6 |    1423.34 |  1539.64 |
gpuarrays/conversions            (4) |     3.72 |   0.00 |  0.0 |       0.01 |   183.00 |   0.18 |  4.8 |     596.07 |  1143.72 |
gpuarrays/constructors           (2) |     1.20 |   0.00 |  0.3 |       0.04 |   253.00 |   0.00 |  0.0 |      72.61 |  1541.23 |
gpuarrays/fft                    (8) |     5.85 |   0.00 |  0.0 |       6.01 |   339.00 |   0.26 |  4.5 |     769.86 |  1726.85 |
forwarddiff                      (9) |    63.56 |   0.20 |  0.3 |       0.00 |   305.00 |   0.89 |  1.4 |    2826.93 |  1373.03 |
gpuarrays/base                   (2) |    12.76 |   0.00 |  0.0 |      17.61 |   277.00 |   0.90 |  7.1 |    1878.78 |  1610.77 |
gpuarrays/random                 (4) |    14.56 |   0.00 |  0.0 |       0.02 |   183.00 |   0.42 |  2.9 |    1243.95 |  1203.35 |
examples                         (6) |    96.05 |   0.00 |  0.0 |       0.00 |   175.00 |   0.06 |  0.1 |      29.75 |  1056.34 |
gpuarrays/linear algebra         (3) |    49.42 |   0.01 |  0.0 |       1.43 |   383.00 |   1.42 |  2.9 |    4547.99 |  1810.02 |
execution                        (5) |   106.79 |   0.00 |  0.0 |       0.15 |   219.00 |   0.93 |  0.9 |    2890.57 |  1240.08 |
device/intrinsics                (7) |    73.27 |   0.00 |  0.0 |       0.01 |   747.00 |   1.35 |  1.8 |    4934.04 |  1470.76 |
gpuarrays/broadcasting           (9) |    54.74 |   0.00 |  0.0 |       1.19 |   297.00 |   2.27 |  4.1 |    7386.15 |  1506.79 |
gpuarrays/mapreduce essentials   (8) |    83.10 |   0.01 |  0.0 |       3.19 |   351.00 |   3.44 |  4.1 |   11834.02 |  1962.21 |
gpuarrays/mapreduce derivatives  (2) |   125.38 |   0.01 |  0.0 |       3.06 |   309.00 |   3.78 |  3.0 |   14215.29 |  1942.89 |

Test Summary: | Pass  Broken  Total
  Overall     | 8008       2   8010
    SUCCESS
    Testing CUDA tests passed 

Script tested: scratch.jl (adapted from the CUDA.jl mapreduce tests):

using Pkg
Pkg.activate("./")

using CUDA

function mapreduce_gpu(f::Function, op::Function, A::CuArray{T, N}) where {T, N}
    OT = Int
    v0 = 0

    out = CuArray{OT}(undef, (1,))
    @cuda threads=64 reduce_kernel(f, op, v0, A, out)
    Array(out)[1]
end

function reduce_kernel(f, op, v0::T, A, result) where {T}
    tmp_local = @cuStaticSharedMem(T, 64)
    acc = v0

    # Loop sequentially over chunks of input vector
    i = threadIdx().x
    while i <= length(A)
        element = f(A[i])
        acc = op(acc, element)
        i += blockDim().x
    end

    return
end

A = rand(1:10, 100)
dA = CuArray(A)

mapreduce(identity, +, A)

Result of running scratch.jl in the REPL:

julia> include("/mnt/evo512/insync/Software_a1/testCUDA/scratch.jl")
 Activating new environment at `~/Project.toml`
502

Result of running nvprof on scratch.jl:

(base) au@a1:~$ nvprof --profile-from-start off julia /mnt/evo512/insync/Software_a1/testCUDA/scratch.jl 
 Activating new environment at `~/~/Project.toml`
==275468== NVPROF is profiling process 275468, command: julia /mnt/evo512/insync/Software_a1/testCUDA/scratch.jl
==275468== Profiling application: julia /mnt/evo512/insync/Software_a1/testCUDA/scratch.jl
==275468== Profiling result:
No kernels were profiled.
No API activities were profiled.

The expected result is something along the lines of the CUDA.jl "Introduction to profiling" documentation:

==2574== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  247.61ms         1  247.61ms  247.61ms  247.61ms  ptxcall_gpu_add1__1
      API calls:   99.54%  247.83ms         1  247.83ms  247.83ms  247.83ms  cuEventSynchronize
                    0.46%  1.1343ms         1  1.1343ms  1.1343ms  1.1343ms  cuLaunchKernel
                    0.00%  4.9490us         1  4.9490us  4.9490us  4.9490us  cuEventRecord
                    0.00%  4.4190us         1  4.4190us  4.4190us  4.4190us  cuEventCreate
                    0.00%     960ns         2     480ns     358ns     602ns  cuCtxGetCurrent
mkarikom added the bug label Aug 17, 2020
maleadt (Member) commented Aug 18, 2020

If you use `--profile-from-start off`, you need to activate the profiler again using `CUDA.@profile`.
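As a minimal sketch of what that looks like (assuming the `mapreduce_gpu` function from the script above; only the region inside `CUDA.@profile` is captured when nvprof is started with `--profile-from-start off`):

```julia
using CUDA

A  = rand(1:10, 100)
dA = CuArray(A)

# Warm-up call so kernel compilation is not part of the profile.
mapreduce_gpu(identity, +, dA)

# With `nvprof --profile-from-start off julia scratch.jl`, the profiler
# only records activity inside this delimited region.
CUDA.@profile mapreduce_gpu(identity, +, dA)
```

Alternatively, dropping `--profile-from-start off` from the nvprof invocation profiles the whole process without any changes to the script.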

maleadt closed this as completed Aug 18, 2020
maleadt removed the bug label Aug 18, 2020