Support Turing Architecture #9

Open
bayedieng opened this issue Aug 13, 2024 · 13 comments

@bayedieng

bayedieng commented Aug 13, 2024

When I build and launch the executable, I get the following error:

Device count: 1
[LibreCuda Debug]: rm_alloc failed with status: 25
[CUDA ERROR] at file /home/bdieng/src/LibreCuda/src/main.cpp:31: LIBRECUDA_ERROR_UNKNOWN

Perhaps there are hardware or driver requirements that my system doesn't meet? Here is my GPU information:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:07:00.0 Off |                  N/A |
| 35%   30C    P8             20W /  260W |      16MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1555      G   /usr/lib/xorg/Xorg                              9MiB |
|    0   N/A  N/A      1672      G   /usr/bin/gnome-shell                            4MiB |
+-----------------------------------------------------------------------------------------+
@mikex86
Owner

mikex86 commented Aug 13, 2024

This unfortunately isn't a whole lot of information. I've added better debugging output for failed RM_ALLOCs and RM_CTRLs.
If you find the time, feel free to pull the latest changes, rerun, and post the output again.

@bayedieng
Author

Thanks, sure.

Device count: 1
[LibreCuda Debug]: /home/bdieng/src/LibreCuda/src/librecuda.cpp:284: RM_ALLOC failed with status rm_alloc(fd_ctl, ctx->device->compute_class, root, gpfifo, 0, nullptr, 0, nullptr)
[CUDA ERROR] at file /home/bdieng/src/LibreCuda/src/main.cpp:31: LIBRECUDA_ERROR_UNKNOWN

@mikex86
Copy link
Owner

mikex86 commented Aug 13, 2024

Ah, well, it's because only Ampere and Ada GPUs are currently supported.
I unfortunately don't have access to a Turing GPU, but I have attempted a naive port and you can try to run the demo to see if it crashes or not.
I have created a branch called "turing". Feel free to pull latest changes from said branch and see if it works.
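For readers wondering where the RTX 2080 Ti falls: it is a Turing part (CUDA compute capability 7.5), while Ampere covers 8.x up to 8.7 and Ada is 8.9. The following is only an illustrative sketch of that mapping. `archName` and `supportedAtTimeOfThread` are hypothetical helpers, not LibreCUDA functions, and LibreCUDA itself appears to key off a compute class (`ctx->device->compute_class` in the log above) rather than the compute capability.

```cpp
#include <string>

// Hypothetical helper, not part of LibreCUDA: maps a CUDA compute capability
// (major, minor) to NVIDIA's architecture name for the cases relevant here.
inline std::string archName(int major, int minor) {
    if (major == 7 && minor == 5) return "Turing";  // e.g. RTX 2080 Ti
    if (major == 8 && minor == 9) return "Ada";
    if (major == 8) return "Ampere";                // 8.0, 8.6, 8.7
    return "other";
}

// Assumption taken from the comment above: only Ampere and Ada were supported
// on the default branch when this thread was written.
inline bool supportedAtTimeOfThread(int major, int minor) {
    const std::string arch = archName(major, minor);
    return arch == "Ampere" || arch == "Ada";
}
```

Calling `supportedAtTimeOfThread(7, 5)` returns false, which matches the RM_ALLOC failure reported for the Turing card.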

@bayedieng bayedieng changed the title RM Alloc Failure Turing Architecture Unsupported Aug 13, 2024
@bayedieng bayedieng changed the title Turing Architecture Unsupported Support Turing Architecture Aug 13, 2024
@bayedieng
Author

That seems to fix it; however, the program now hangs when libreCuStreamAwait is called.

Device count: 1
Num functions: 3
  function "write_float_ptr"
  function "write_float_value"
  function "write_float_sum"
A value: 314
B value: 0.31415
Dst value (pre exec): 0.00000
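A hang like this is easier to report when the wait is bounded. Below is a minimal sketch, not LibreCUDA's implementation, of polling a counter with a timeout so that a stuck wait turns into a reportable failure. `waitUntilAtLeast` is a hypothetical name, and the `std::atomic` counter is only a stand-in for whatever memory the stream await actually polls, which this thread does not show.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper, not a LibreCUDA API: polls `counter` until it reaches
// `target` or `timeout` elapses. Returns true if the target was reached.
// Counter wraparound is not handled in this sketch.
inline bool waitUntilAtLeast(const std::atomic<std::uint32_t>& counter,
                             std::uint32_t target,
                             std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (counter.load(std::memory_order_acquire) < target) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // timed out: the caller can report an error instead of hanging
        }
        // Back off briefly so the polling loop does not spin at full speed.
        std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
    return true;
}
```

A caller could log the last observed counter value when this returns false, which would give the maintainer more to go on than a silent hang.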

@mikex86
Owner

mikex86 commented Aug 13, 2024

Hmm, then it doesn't work yet. I expected something along these lines... I'll see what I can do

@mikex86
Owner

mikex86 commented Aug 13, 2024

Can you go into cmdqueue.cpp, line 149, and comment out the copy queue setup as follows:

    LIBRECUDA_ERR_PROPAGATE(startExecution(COMPUTE));
    LIBRECUDA_ERR_PROPAGATE(awaitExecution());

    // setup copy queue
    /* BEGIN COMMENT HERE
    {
        LIBRECUDA_ERR_PROPAGATE(
                enqueue(makeNvMethod(4, NVC6C0_SET_OBJECT, 1), {get_dma_copy_type(ctx->device->compute_class)})
        );
        timelineCtr++;
    }

    LIBRECUDA_ERR_PROPAGATE(startExecution(DMA));
    LIBRECUDA_ERR_PROPAGATE(awaitExecution());
    END COMMENT HERE */

    // allocate kernargs page
    {
        LIBRECUDA_ERR_PROPAGATE(
   ...

What happens with these changes will help me debug what is going on.
With these changes, the code has a slight chance of working correctly...
If it does, I will know more.

@bayedieng
Author

bayedieng commented Aug 13, 2024

It still hangs with that block commented out:

Device count: 1
Num functions: 3
  function "write_float_ptr"
  function "write_float_value"
  function "write_float_sum"
A value: 314
B value: 0.31415
Dst value (pre exec): 0.00000

For what it's worth, the program does exit successfully when libreCuStreamAwait isn't called:

Device count: 1
Num functions: 3
  function "write_float_ptr"
  function "write_float_value"
  function "write_float_sum"
A value: 314
B value: 0.31415
Dst value (pre exec): 0.00000
Dst value (post exec): 0.00000

@mikex86
Owner

mikex86 commented Aug 13, 2024

What happens WITH libreCuStreamAwait but no libreCuLaunchKernel?

@bayedieng
Author

It manages to execute without issue:

Device count: 1
Num functions: 3
  function "write_float_ptr"
  function "write_float_value"
  function "write_float_sum"
A value: 314
B value: 0.31415
Dst value (pre exec): 0.00000
Dst value (post exec): 0.00000

@mikex86
Owner

mikex86 commented Aug 14, 2024

Right, that was expected. If you find the time, can you run with libreCuLaunchKernel again, but replace the libreCuStreamAwait call with std::this_thread::sleep_for(std::chrono::milliseconds(100));?

You will need these includes:

#include <chrono>
#include <thread>

With this I'm trying to figure out if the issue is exclusively semaphore related, or if the FIFO is also screwed...
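The logic of this experiment can be sketched as follows. This is an illustration under stated assumptions, not code from LibreCUDA: `diagnoseAfterLaunch` and the `Diagnosis` enum are hypothetical, and `readDst` stands in for however the demo reads back the destination value. The idea is that if the destination changes after a fixed sleep, the kernel ran and the completion signal is the suspect; if it stays the same, the launch itself probably never executed.

```cpp
#include <chrono>
#include <thread>

// Hypothetical classification of the two outcomes discussed above.
enum class Diagnosis {
    KernelRanSignalSuspect,  // result changed: FIFO works, await/semaphore is suspect
    KernelDidNotRun          // result unchanged: the launch path is also suspect
};

// Hypothetical helper: call it right after the kernel launch, in place of the
// stream await. It sleeps for `delay`, then compares the destination value
// with the value observed before the launch.
template <typename ReadDst>
Diagnosis diagnoseAfterLaunch(ReadDst readDst, float before,
                              std::chrono::milliseconds delay =
                                  std::chrono::milliseconds(100)) {
    std::this_thread::sleep_for(delay);  // fixed wait instead of awaiting a signal
    return readDst() != before ? Diagnosis::KernelRanSignalSuspect
                               : Diagnosis::KernelDidNotRun;
}
```

The 100 ms delay is the value suggested in the comment above; a fixed sleep is only a diagnostic and is not a substitute for real synchronization.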

@bayedieng
Author

Builds fine:

Device count: 1
Num functions: 3
  function "write_float_ptr"
  function "write_float_value"
  function "write_float_sum"
A value: 314
B value: 0.31415
Dst value (pre exec): 0.00000
Dst value (post exec): 0.00000

@mikex86
Owner

mikex86 commented Aug 14, 2024

Yikes, this is sort of the worst-case scenario I expected. I don't think I can debug this without access to a Turing GPU... I will have to put this on ice for now...

@bayedieng
Author

I was also interested in interfacing with CUDA without NVIDIA's proprietary runtime at some point. I might submit a PR when I have time to delve into it.
