Support Turing Architecture #9

bayedieng · 2024-08-13T08:30:17Z

When I build and launch the executable, I get the following error:

Device count: 1
[LibreCuda Debug]: rm_alloc failed with status: 25
[CUDA ERROR] at file /home/bdieng/src/LibreCuda/src/main.cpp:31: LIBRECUDA_ERROR_UNKNOWN

Perhaps there are hardware or driver requirements that I don't meet? Here is my GPU information:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:07:00.0 Off |                  N/A |
| 35%   30C    P8             20W /  260W |      16MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1555      G   /usr/lib/xorg/Xorg                              9MiB |
|    0   N/A  N/A      1672      G   /usr/bin/gnome-shell                            4MiB |
+-----------------------------------------------------------------------------------------+

mikex86 · 2024-08-13T15:44:05Z

This unfortunately isn't a whole lot of information. I've added better debugging output for failed RM_ALLOCs and RM_CTRLs.
If you find the time, please pull the latest changes, rerun, and post the output again.

bayedieng · 2024-08-13T17:54:23Z

Thanks, sure.

Device count: 1
[LibreCuda Debug]: /home/bdieng/src/LibreCuda/src/librecuda.cpp:284: RM_ALLOC failed with status rm_alloc(fd_ctl, ctx->device->compute_class, root, gpfifo, 0, nullptr, 0, nullptr)
[CUDA ERROR] at file /home/bdieng/src/LibreCuda/src/main.cpp:31: LIBRECUDA_ERROR_UNKNOWN

mikex86 · 2024-08-13T18:37:05Z

Ah, well, it's because only Ampere and Ada GPUs are currently supported.
I unfortunately don't have access to a Turing GPU, but I have attempted a naive port and you can try to run the demo to see if it crashes or not.
I have created a branch called "turing". Feel free to pull the latest changes from that branch and see if it works.

bayedieng · 2024-08-13T19:02:39Z

That seems to fix it; however, the program now hangs when libreCuStreamAwait is called.

Device count: 1
Num functions: 3
  function "write_float_ptr"
  function "write_float_value"
  function "write_float_sum"
A value: 314
B value: 0.31415
Dst value (pre exec): 0.00000
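
A hang like this is easier to report when the wait is bounded. The following is a minimal sketch, assuming completion is signalled through a counter that the host can read; `wait_for_value` and its parameters are hypothetical names and this is not how libreCuStreamAwait is implemented. It polls until the counter reaches a target value or a timeout expires, so a stalled queue produces a `false` return instead of an indefinite hang.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper (not part of LibreCuda): poll `counter` until it
// reaches `target`, giving up once `timeout` has elapsed.
// Returns true if the target was reached, false on timeout.
inline bool wait_for_value(const std::atomic<std::uint32_t>& counter,
                           std::uint32_t target,
                           std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (counter.load(std::memory_order_acquire) < target) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // the signal never arrived within the time budget
        }
        // Back off briefly so the polling loop does not spin at full speed.
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return true;
}
```

Calling `wait_for_value(counter, 1, std::chrono::milliseconds(500))` on a counter that is never incremented returns `false` after roughly half a second.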

mikex86 · 2024-08-13T19:08:38Z

Hmm, then it doesn't work yet. I expected something along these lines... I'll see what I can do

mikex86 · 2024-08-13T19:27:30Z

Can you go to cmdqueue.cpp, line 149, and comment out the block as follows:

    LIBRECUDA_ERR_PROPAGATE(startExecution(COMPUTE));
    LIBRECUDA_ERR_PROPAGATE(awaitExecution());

    // setup copy queue
    /* BEGIN COMMENT HERE
    {
        LIBRECUDA_ERR_PROPAGATE(
                enqueue(makeNvMethod(4, NVC6C0_SET_OBJECT, 1), {get_dma_copy_type(ctx->device->compute_class)})
        );
        timelineCtr++;
    }

    LIBRECUDA_ERR_PROPAGATE(startExecution(DMA));
    LIBRECUDA_ERR_PROPAGATE(awaitExecution());
    END COMMENT HERE */

    // allocate kernargs page
    {
        LIBRECUDA_ERR_PROPAGATE(
   ...

What happens with these changes will help me debug what is going on.
With these changes, the code has a slight chance of working correctly...
If it does, I will know more.

bayedieng · 2024-08-13T20:47:43Z

It still hangs with that block commented out:

Device count: 1
Num functions: 3
  function "write_float_ptr"
  function "write_float_value"
  function "write_float_sum"
A value: 314
B value: 0.31415
Dst value (pre exec): 0.00000

For what it's worth, the program does exit successfully when libreCuStreamAwait isn't called:

Device count: 1
Num functions: 3
  function "write_float_ptr"
  function "write_float_value"
  function "write_float_sum"
A value: 314
B value: 0.31415
Dst value (pre exec): 0.00000
Dst value (post exec): 0.00000

mikex86 · 2024-08-13T21:06:44Z

What happens WITH libreCuStreamAwait but no libreCuLaunchKernel?

bayedieng · 2024-08-14T00:24:53Z

It manages to execute without issue:

Device count: 1
Num functions: 3
  function "write_float_ptr"
  function "write_float_value"
  function "write_float_sum"
A value: 314
B value: 0.31415
Dst value (pre exec): 0.00000
Dst value (post exec): 0.00000

mikex86 · 2024-08-14T00:40:15Z

Right, that was expected. If you find the time, can you run with libreCuLaunchKernel again, but replace libreCuStreamAwait with std::this_thread::sleep_for(std::chrono::milliseconds(100));?

You will need:

#include <chrono>
#include <thread>

With this I'm trying to figure out if the issue is exclusively semaphore related, or if the FIFO is also screwed...
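
The reasoning behind this experiment can be sketched in code. This is a minimal illustration under stated assumptions: `wait_fixed`, `diagnose`, and the `Diagnosis` enum are hypothetical names that do not exist in LibreCuda, and the interpretation of each outcome is an inference from the discussion here, not a confirmed diagnosis. The idea is to wait a fixed time instead of blocking on the completion signal, then check whether the destination value changed.

```cpp
#include <chrono>
#include <thread>

// Hypothetical outcome labels for the experiment (not part of LibreCuda).
enum class Diagnosis {
    SemaphoreOnly,   // kernel ran, so only completion signalling looks broken
    FifoAlsoSuspect  // kernel never ran, so command submission is suspect too
};

// Stand-in for libreCuStreamAwait in the experiment: sleep for a fixed time
// rather than waiting on the stream's completion signal.
inline void wait_fixed(std::chrono::milliseconds delay) {
    std::this_thread::sleep_for(delay);
}

// Compare the destination value before and after the fixed wait.
// A changed value means the kernel executed even though the await hung.
inline Diagnosis diagnose(float dst_before, float dst_after) {
    return (dst_before != dst_after) ? Diagnosis::SemaphoreOnly
                                     : Diagnosis::FifoAlsoSuspect;
}
```

In the demo, `dst_before` would be the "Dst value (pre exec)" reading and `dst_after` the "Dst value (post exec)" reading taken after `wait_fixed(std::chrono::milliseconds(100))`.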

bayedieng · 2024-08-14T13:18:29Z

Builds fine:

Device count: 1
Num functions: 3
  function "write_float_ptr"
  function "write_float_value"
  function "write_float_sum"
A value: 314
B value: 0.31415
Dst value (pre exec): 0.00000
Dst value (post exec): 0.00000

mikex86 · 2024-08-14T13:20:13Z

Yikes, this is sort of the worst-case scenario I expected. I don't think I can debug this without access to a Turing GPU, so I will have to put this on ice for now.

bayedieng · 2024-08-14T13:31:21Z

I was interested in interfacing with CUDA without NVIDIA's proprietary runtime as well at some point. I might submit a PR when I have time to delve into it.

bayedieng changed the title ~~RM Alloc Failure~~ Turing Architecture Unsupported Aug 13, 2024

bayedieng changed the title ~~Turing Architecture Unsupported~~ Support Turing Architecture Aug 13, 2024
