Assertion "(cuStreamQuery(0)) == (CUDA_ERROR_NOT_READY)" #298

osayamenja · 2024-06-28T02:10:07Z

Running gdrcopy_pplat fails with Assertion "(cuStreamQuery(0)) == (CUDA_ERROR_NOT_READY)" failed at pplat.cu:257.

The complete log follows.

GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0001:00:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0002:00:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0003:00:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0004:00:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0005:00:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0006:00:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0007:00:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0008:00:00
selecting device 0
device ptr: 0x7f6bb7a00000
gpu alloc fn: cuMemAlloc
map_d_ptr: 0x7f6bdc010000
info.va: 7f6bb7a00000
info.mapped_size: 65536
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer: 0x7f6bdc010000
CPU does gdr_copy_to_mapping and GPU writes back via cuMemHostAlloc'd buffer.
Running 1000 iterations with data size 4 bytes.
Assertion "(cuStreamQuery(0)) == (CUDA_ERROR_NOT_READY)" failed at pplat.cu:257

In case it is useful, I have GPUDirect Async configured and nvidia-peermem activated.

osayamenja · 2024-06-28T02:18:39Z

Every other test works fine. Results are attached below.

gdrcopy_sanity

Total: 28, Passed: 28, Failed: 0, Waived: 0

gdrcopy_copybw

GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0001:00:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0002:00:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0003:00:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0004:00:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0005:00:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0006:00:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0007:00:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0008:00:00
selecting device 0
testing size: 131072
rounded size: 131072
gpu alloc fn: cuMemAlloc
device ptr: 7f8293a00000
map_d_ptr: 0x7f82b2423000
info.va: 7f8293a00000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7f82b2423000
writing test, size=131072 offset=0 num_iters=10000
write BW: 8680.15MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 379.824MB/s
unmapping buffer
unpinning buffer
closing gdrdrv

gdrcopy_copylat

GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0001:00:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0002:00:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0003:00:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0004:00:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0005:00:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0006:00:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0007:00:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0008:00:00
selecting device 0
device ptr: 0x7f68b4000000
allocated size: 16777216
gpu alloc fn: cuMemAlloc

map_d_ptr: 0x7f68e1000000
info.va: 7f68b4000000
info.mapped_size: 16777216
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer: 0x7f68e1000000

gdr_copy_to_mapping num iters for each size: 10000
WARNING: Measuring the API invocation overhead as observed by the CPU. Data might not be ordered all the way to the GPU internal visibility.
Test 			 Size(B) 	 Avg.Time(us)
gdr_copy_to_mapping 	        1 	      0.1021
gdr_copy_to_mapping 	        2 	      0.1021
gdr_copy_to_mapping 	        4 	      0.1020
gdr_copy_to_mapping 	        8 	      0.1027
gdr_copy_to_mapping 	       16 	      0.1028
gdr_copy_to_mapping 	       32 	      0.1020
gdr_copy_to_mapping 	       64 	      0.1037
gdr_copy_to_mapping 	      128 	      0.1152
gdr_copy_to_mapping 	      256 	      0.1187
gdr_copy_to_mapping 	      512 	      0.1374
gdr_copy_to_mapping 	     1024 	      0.1998
gdr_copy_to_mapping 	     2048 	      0.2580
gdr_copy_to_mapping 	     4096 	      0.4537
gdr_copy_to_mapping 	     8192 	      0.9071
gdr_copy_to_mapping 	    16384 	      1.8081
gdr_copy_to_mapping 	    32768 	      3.6079
gdr_copy_to_mapping 	    65536 	      7.2086
gdr_copy_to_mapping 	   131072 	     14.4026
gdr_copy_to_mapping 	   262144 	     28.7971
gdr_copy_to_mapping 	   524288 	     57.6994
gdr_copy_to_mapping 	  1048576 	    115.3423
gdr_copy_to_mapping 	  2097152 	    230.9106
gdr_copy_to_mapping 	  4194304 	    462.4430
gdr_copy_to_mapping 	  8388608 	    925.5537
gdr_copy_to_mapping 	 16777216 	   1851.2054

gdr_copy_from_mapping num iters for each size: 100
Test 			 Size(B) 	 Avg.Time(us)
gdr_copy_from_mapping 	        1 	      1.0830
gdr_copy_from_mapping 	        2 	      1.9810
gdr_copy_from_mapping 	        4 	      2.0370
gdr_copy_from_mapping 	        8 	      1.9580
gdr_copy_from_mapping 	       16 	      0.3330
gdr_copy_from_mapping 	       32 	      0.3730
gdr_copy_from_mapping 	       64 	      0.7690
gdr_copy_from_mapping 	      128 	      0.7300
gdr_copy_from_mapping 	      256 	      1.1240
gdr_copy_from_mapping 	      512 	      1.5940
gdr_copy_from_mapping 	     1024 	      3.5380
gdr_copy_from_mapping 	     2048 	      6.2421
gdr_copy_from_mapping 	     4096 	     11.5121
gdr_copy_from_mapping 	     8192 	     20.8663
gdr_copy_from_mapping 	    16384 	     40.7785
gdr_copy_from_mapping 	    32768 	     81.3170
gdr_copy_from_mapping 	    65536 	    158.9489
gdr_copy_from_mapping 	   131072 	    323.6429
gdr_copy_from_mapping 	   262144 	    710.4047
gdr_copy_from_mapping 	   524288 	   1422.3003
gdr_copy_from_mapping 	  1048576 	   2838.1456
gdr_copy_from_mapping 	  2097152 	   5688.7214
gdr_copy_from_mapping 	  4194304 	  12608.9298
gdr_copy_from_mapping 	  8388608 	  28866.0632
gdr_copy_from_mapping 	 16777216 	  57983.8880
unmapping buffer
unpinning buffer
closing gdrdrv

gdrcopy_apiperf

GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0001:00:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0002:00:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0003:00:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0004:00:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0005:00:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0006:00:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0007:00:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0008:00:00
selecting device 0
device ptr: 0x7fde32000000
allocated size: 16777216
Size(B)	pin.Time(us)	map.Time(us)	get_info.Time(us)	unmap.Time(us)	unpin.Time(us)
65536	393.693800	4.849070	0.586010	7.779070	208.910540
Histogram of gdr_pin_buffer latency for 65536 bytes
[386.005000	-	772.010000]	85
[772.010000	-	1158.015000]	12
[1158.015000	-	1544.020000]	1
[1544.020000	-	1930.025000]	0
[1930.025000	-	2316.030000]	0
[2316.030000	-	2702.035000]	0
[2702.035000	-	3088.040000]	0
[3088.040000	-	3474.045000]	1
[3474.045000	-	3860.050000]	0
[3860.050000	-	4246.055000]	0

Size(B)	pin.Time(us)	map.Time(us)	get_info.Time(us)	unmap.Time(us)	unpin.Time(us)
131072	390.340740	4.922060	0.586010	5.587070	209.741560
Histogram of gdr_pin_buffer latency for 131072 bytes
[374.904000	-	749.808000]	58
[749.808000	-	1124.712000]	37
[1124.712000	-	1499.616000]	3
[1499.616000	-	1874.520000]	0
[1874.520000	-	2249.424000]	0
[2249.424000	-	2624.328000]	1
[2624.328000	-	2999.232000]	0
[2999.232000	-	3374.136000]	0
[3374.136000	-	3749.040000]	0
[3749.040000	-	4123.944000]	0

Size(B)	pin.Time(us)	map.Time(us)	get_info.Time(us)	unmap.Time(us)	unpin.Time(us)
262144	384.364700	5.000060	0.579010	5.766080	205.910470
Histogram of gdr_pin_buffer latency for 262144 bytes
[359.904000	-	719.808000]	15
[719.808000	-	1079.712000]	11
[1079.712000	-	1439.616000]	33
[1439.616000	-	1799.520000]	35
[1799.520000	-	2159.424000]	2
[2159.424000	-	2519.328000]	0
[2519.328000	-	2879.232000]	1
[2879.232000	-	3239.136000]	1
[3239.136000	-	3599.040000]	1
[3599.040000	-	3958.944000]	0

Size(B)	pin.Time(us)	map.Time(us)	get_info.Time(us)	unmap.Time(us)	unpin.Time(us)
524288	385.165720	5.195100	0.588020	6.491070	205.447430
Histogram of gdr_pin_buffer latency for 524288 bytes
[361.104000	-	722.208000]	53
[722.208000	-	1083.312000]	42
[1083.312000	-	1444.416000]	2
[1444.416000	-	1805.520000]	1
[1805.520000	-	2166.624000]	1
[2166.624000	-	2527.728000]	0
[2527.728000	-	2888.832000]	0
[2888.832000	-	3249.936000]	0
[3249.936000	-	3611.040000]	0
[3611.040000	-	3972.144000]	0

Size(B)	pin.Time(us)	map.Time(us)	get_info.Time(us)	unmap.Time(us)	unpin.Time(us)
1048576	400.635850	5.586050	0.584000	7.938140	210.336570
Histogram of gdr_pin_buffer latency for 1048576 bytes
[362.405000	-	724.810000]	96
[724.810000	-	1087.215000]	1
[1087.215000	-	1449.620000]	1
[1449.620000	-	1812.025000]	0
[1812.025000	-	2174.430000]	0
[2174.430000	-	2536.835000]	0
[2536.835000	-	2899.240000]	0
[2899.240000	-	3261.645000]	1
[3261.645000	-	3624.050000]	0
[3624.050000	-	3986.455000]	0

Size(B)	pin.Time(us)	map.Time(us)	get_info.Time(us)	unmap.Time(us)	unpin.Time(us)
2097152	391.923760	8.034110	0.573000	13.988180	208.904540
Histogram of gdr_pin_buffer latency for 2097152 bytes
[386.905000	-	773.810000]	92
[773.810000	-	1160.715000]	4
[1160.715000	-	1547.620000]	1
[1547.620000	-	1934.525000]	1
[1934.525000	-	2321.430000]	0
[2321.430000	-	2708.335000]	0
[2708.335000	-	3095.240000]	0
[3095.240000	-	3482.145000]	0
[3482.145000	-	3869.050000]	1
[3869.050000	-	4255.955000]	0

Size(B)	pin.Time(us)	map.Time(us)	get_info.Time(us)	unmap.Time(us)	unpin.Time(us)
4194304	396.802860	10.452100	0.576010	19.818270	209.164510
Histogram of gdr_pin_buffer latency for 4194304 bytes
[388.105000	-	776.210000]	98
[776.210000	-	1164.315000]	1
[1164.315000	-	1552.420000]	0
[1552.420000	-	1940.525000]	0
[1940.525000	-	2328.630000]	0
[2328.630000	-	2716.735000]	0
[2716.735000	-	3104.840000]	0
[3104.840000	-	3492.945000]	0
[3492.945000	-	3881.050000]	0
[3881.050000	-	4269.155000]	0

Size(B)	pin.Time(us)	map.Time(us)	get_info.Time(us)	unmap.Time(us)	unpin.Time(us)
8388608	397.254870	14.905130	0.584010	31.263530	213.712470
Histogram of gdr_pin_buffer latency for 8388608 bytes
[370.704000	-	741.408000]	8
[741.408000	-	1112.112000]	14
[1112.112000	-	1482.816000]	69
[1482.816000	-	1853.520000]	6
[1853.520000	-	2224.224000]	0
[2224.224000	-	2594.928000]	2
[2594.928000	-	2965.632000]	0
[2965.632000	-	3336.336000]	0
[3336.336000	-	3707.040000]	0
[3707.040000	-	4077.744000]	0

Size(B)	pin.Time(us)	map.Time(us)	get_info.Time(us)	unmap.Time(us)	unpin.Time(us)
16777216	396.702820	25.480310	0.573010	54.088660	209.703560
Histogram of gdr_pin_buffer latency for 16777216 bytes
[379.205000	-	758.410000]	72
[758.410000	-	1137.615000]	20
[1137.615000	-	1516.820000]	5
[1516.820000	-	1896.025000]	1
[1896.025000	-	2275.230000]	1
[2275.230000	-	2654.435000]	0
[2654.435000	-	3033.640000]	0
[3033.640000	-	3412.845000]	0
[3412.845000	-	3792.050000]	0
[3792.050000	-	4171.255000]	0

closing gdrdrv

pakmarkthub · 2024-06-28T02:35:35Z

Hi @osayamenja,

Did you have NVCCFLAGS or CPPFLAGS set to something related to nvcc? This issue means that the CUDA kernel was not launched correctly. In most cases, the cause is a missing CUDA binary for your GPU architecture. For example, your NVCCFLAGS might contain --gpu-code=sm_90. A CUDA binary compiled that way can run on Hopper but not on Volta. You can try the compile flags below and then rerun the newly built binary.

make clean
NVCCFLAGS="--gpu-architecture=compute_70 --gpu-code=compute_70,sm_70" make -j4

osayamenja · 2024-06-28T02:46:07Z

Hey @pakmarkthub, thanks for the quick response! I installed gdrcopy using the deb packages, so I am not using make to run the tests; I run the installed binary. That said, I tried rerunning gdrcopy_pplat with NVCCFLAGS exported, but the error persists. Let me know if you need any extra information.

pakmarkthub · 2024-06-28T02:55:16Z

If you use the deb packages, the binary should be compiled with the correct flags. Some requests / questions:

The gdrcopy_test package should specify the CUDA version it was compiled with. Please make sure that that CUDA version and the CUDA library on your system are compatible.
Can you make sure that you can run a simple CUDA kernel on your system? You can use this https://github.com/NVIDIA/cuda-samples. Please try Samples/0_Introduction/vectorAdd (compile and run it).

osayamenja · 2024-06-28T03:04:58Z

Running gdrcopy_test returns gdrcopy_test: command not found
Running the CUDA example you suggested works fine, see results below. I also want to mention, this may not be a high-priority issue (yet), since the downstream application I am using, NVSHMEM, works fine with gdrcopy so far.

vectorAdd

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

osayamenja · 2024-06-28T03:08:37Z

Just in case you need these.

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

nvidia-smi

Fri Jun 28 03:08:08 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000001:00:00.0 Off |                    0 |
| N/A   37C    P0              43W / 300W |    103MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000002:00:00.0 Off |                    0 |
| N/A   41C    P0              43W / 300W |      5MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000003:00:00.0 Off |                    0 |
| N/A   38C    P0              42W / 300W |      5MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000004:00:00.0 Off |                    0 |
| N/A   39C    P0              42W / 300W |      5MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           On  | 00000005:00:00.0 Off |                    0 |
| N/A   37C    P0              41W / 300W |      5MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           On  | 00000006:00:00.0 Off |                    0 |
| N/A   39C    P0              42W / 300W |      5MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           On  | 00000007:00:00.0 Off |                    0 |
| N/A   38C    P0              42W / 300W |      5MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           On  | 00000008:00:00.0 Off |                    0 |
| N/A   41C    P0              42W / 300W |      5MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                       
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1848      G   /usr/lib/xorg/Xorg                           33MiB |
|    1   N/A  N/A      1848      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A      1848      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A      1848      G   /usr/lib/xorg/Xorg                            4MiB |
|    4   N/A  N/A      1848      G   /usr/lib/xorg/Xorg                            4MiB |
|    5   N/A  N/A      1848      G   /usr/lib/xorg/Xorg                            4MiB |
|    6   N/A  N/A      1848      G   /usr/lib/xorg/Xorg                            4MiB |
|    7   N/A  N/A      1848      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

pakmarkthub · 2024-06-28T03:55:43Z

Thank you for the additional information. GDRCopy does not rely on CUDA. gdrcopy_pplat is just a benchmark application. It needs to launch a CUDA kernel because we are measuring a ping-pong latency between CPU and GPU.

Can you try compiling from source? You can still use libgdrapi.so as well as the gdrdrv driver from the deb packages. What you need to compile is just the test applications, but it might be easier to compile the whole project.

Regarding "Running gdrcopy_test returns gdrcopy_test: command not found":

No, I meant the gdrcopy-tests deb package you are using, not a command. For example, you probably downloaded something like https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.2/ubuntu20_04/x64/gdrcopy-tests_2.4-1_amd64.Ubuntu20_04+cuda12.2.deb. That one is a gdrcopy-tests deb package compiled with CUDA 12.2 on Ubuntu 20.04 for x86-64; it is just an example. Can you check the CUDA version that the gdrcopy-tests deb package you are using was compiled with?
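The version check requested here can be roughed out in shell. This is a sketch under stated assumptions: the helper names pkg_cuda_version and version_le are hypothetical, the package file name is assumed to end in +cudaX.Y.deb as in the example URL, and sort -V from GNU coreutils is assumed to be available. The driver side can be read from the "CUDA Version:" field in the nvidia-smi header.

```shell
# pkg_cuda_version FILE: print the CUDA version encoded in a package file
# name that ends in +cudaX.Y.deb (prints nothing if the pattern is absent).
pkg_cuda_version() {
    printf '%s\n' "$1" | sed -n 's/.*+cuda\([0-9][0-9]*\.[0-9][0-9]*\)\.deb$/\1/p'
}

# version_le A B: succeed if dotted version A is less than or equal to B.
# Relies on GNU sort -V (an assumption about the host system).
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

For example, `version_le "$(pkg_cuda_version gdrcopy-tests_2.4-1_amd64.Ubuntu20_04+cuda12.2.deb)" 12.2` succeeds, whereas `version_le 12.5 12.2` fails, which corresponds to the nvcc release (12.5) and the driver-reported CUDA version (12.2) posted earlier in this thread. A failure is only a reason to look closer, since CUDA's compatibility rules can allow some newer-toolkit builds to run on older drivers.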

osayamenja · 2024-06-28T06:03:05Z

I installed gdrcopy following the README instructions, meaning the script automatically detected my CUDA toolkit and ubuntu version, which is 20.04. I will try recompiling from source and get back soon, thanks for your effort and quick response!
