cuCtxCreate triggers NV_ERR_INVALID_ADDRESS with H100 CC-mode on #9921

Closed
derpsteb opened this issue Jan 25, 2024 · 11 comments
Labels
area: gpu Issue related to sandboxed GPU access type: bug Something isn't working

Comments

@derpsteb
Contributor

derpsteb commented Jan 25, 2024

Description

Hey everyone,
we are currently trying to use nvproxy with an H100 GPU that has its confidential computing mode enabled. However, when trying to create a context on the GPU, libcuda ends up in an endless loop. You can find the two syscalls that loop highlighted here. When running inside gVisor, uvm_validate_va_range returns NV_ERR_INVALID_ADDRESS. The other logfiles in the libcuda-debug repo are stacktraces and process memory mappings taken while executing natively. The backtrace at the mmap is actually the same inside gVisor and natively.

To get to this point we had to apply a few patches to gVisor. You can find them here. I am not very familiar with gVisor, so those patches may already be faulty. Please review them before you dive deep into any debugging on your side.

Do you have any idea what the problem could be here? I would also appreciate any hints describing your dev setup for developing nvproxy, since that could help our efforts right now. We already built a custom strace that decodes the ioctl cmds.

Steps to reproduce

  • set up the host with CUDA toolkit 12.2.2 (comes with CUDA driver 535.104.05) and the NVIDIA Container Toolkit. Set the H100 to devtools CC mode.
  • git clone [email protected]:derpsteb/gvisor.git
  • git checkout h100-cc-mode
  • make copy TARGETS=runsc DESTINATION=bin/ && sudo cp ./bin/runsc /usr/local/bin
  • git clone [email protected]:derpsteb/libcuda-debug.git && cd libcuda-debug
  • git checkout gvisor-bugreport
  • cd cuMemory && make
  • verify that the binary works natively: ./cuMemory. It is expected that cuInit takes a while.
  • verify that the binary works in a normal docker container: docker run -ti --gpus=all -v $(realpath ./cuMemory):/cuMemory nvcr.io/nvidia/cuda:12.2.2-devel-ubuntu22.04 /cuMemory
  • this will loop, so be prepared to smash ctrl+c or run pkill: docker run -ti --runtime=runsc --gpus=all -v $(realpath ./cuMemory):/cuMemory nvcr.io/nvidia/cuda:12.2.2-devel-ubuntu22.04 /cuMemory

runsc version

self-built from: https://github.com/derpsteb/gvisor/commit/1bc75a2811864d41b24396079a4995646d172015

docker version (if using docker)

Client: Docker Engine - Community
 Version:           25.0.0
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        e758fe5
 Built:             Thu Jan 18 17:09:49 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.0
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.6
  Git commit:       615dfdf
  Built:            Thu Jan 18 17:09:49 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.27
  GitCommit:        a1496014c916f9e62104b33d1bb5bd03b0858e59
 runc:
  Version:          1.1.11
  GitCommit:        v1.1.11-0-g4bccb38
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

uname

Linux guest 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

No response

repo state (if built from source)

release-20231218.0-13-g1bc75a281

runsc debug logs (if available)

https://gist.github.com/derpsteb/0533a5b9acd1bf21938cf0245dbbd0cb

Because it's so long, I also shortened the looping section at the end; you can spot it by searching for the mmap with length `0x3ab000`.
@derpsteb derpsteb added the type: bug Something isn't working label Jan 25, 2024
@ayushr2
Collaborator

ayushr2 commented Jan 25, 2024

Thanks for the detailed report!

Unfortunately, I don't have access to an H100 right now. I tried reproducing this on a T4 (I needed to substitute compute_90 -> compute_86 in https://github.com/derpsteb/libcuda-debug/blob/main/cuMemory/Makefile), but it works fine there.

I believe the nvtrace at https://github.com/derpsteb/libcuda-debug/blob/gvisor-bugreport/cuMemory/cuCtxCreate_nvtrace.log is from inside the container? The address range base=0x7fb87e200000, length=0x3ab000 does seem to be mapped inside the container (because the application makes the preceding mmap(2) call with the same arguments). But the application address space is different from the sentry address space. Since the sentry is making the ioctl(2) call to the host, the sentry address space is evaluated by the host driver.

nvproxy passes those arguments to the host without any translation (since UVM_VALIDATE_VA_RANGE is handled with uvmIoctlSimple[nvgpu.UVM_VALIDATE_VA_RANGE_PARAMS]). The host driver then checks for this VMA in the sentry's address space. So we need to see if this address is mapped on the sandbox process itself.

Could you try to repro this and, apart from the nvtrace, also grab the /proc/<sandbox-pid>/maps output? You can usually find <sandbox-pid> with ps aux | grep runsc-sandbox.
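
For reference, here is a rough sketch of how one could check whether a given range is fully covered by a process's mappings. The helper itself is just an illustration (not part of any gVisor tooling); the base/length values are the ones from the log above, and the pid argument would be the runsc-sandbox pid:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// covered reports whether every byte of [base, base+length) is covered by
// the mappings in /proc/<pid>/maps. It relies on the maps file listing VMAs
// in ascending address order.
func covered(pid int, base, length uint64) (bool, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/maps", pid))
	if err != nil {
		return false, err
	}
	defer f.Close()

	cur, end := base, base+length
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each line starts with "start-end ...", addresses in hex.
		addrs := strings.SplitN(strings.Fields(sc.Text())[0], "-", 2)
		lo, _ := strconv.ParseUint(addrs[0], 16, 64)
		hi, _ := strconv.ParseUint(addrs[1], 16, 64)
		if lo <= cur && cur < hi {
			cur = hi
			if cur >= end {
				return true, nil
			}
		}
	}
	return false, sc.Err()
}

func main() {
	pid, _ := strconv.Atoi(os.Args[1])
	// base/length taken from the UVM_VALIDATE_VA_RANGE discussed above.
	ok, err := covered(pid, 0x7fb87e200000, 0x3ab000)
	if err != nil {
		panic(err)
	}
	fmt.Println("range fully mapped:", ok)
}
```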

@derpsteb
Contributor Author

Awesome. Thanks for looking into this so quickly!

The log you referenced is from the native case, i.e. no gVisor, no sandbox. Sorry for not labeling this more clearly. We weren't able to run nvtrace inside gVisor at first, probably because it uses ptrace, but it seems like implementing pwrite fixed that; it works now.

The logs you requested:

  • runsc memory map: This is created by running the payload, waiting until it loops, opening a second terminal and executing ps aux | grep runsc-sandbox and cat /proc/$PID/maps.
  • nvtrace inside gvisor: This was created by running docker run -ti --runtime=runsc --gpus=all -v /home/guest/libcuda-debug/:/libcuda-debug -v /home/guest/nvtrace:/nvtrace nvcr.io/nvidia/cuda:12.2.2-devel-ubuntu22.04-gdb /nvtrace/nvtrace /libcuda-debug/cuMemory/cuMemory | tee nvtrace.log. The memory maps included here are of the tracee.

We also added a log to the uvm_validate function in the kernel driver to print the arguments it sees. These are the addresses:

va_range->node.start 0x00000000ca1b630a
params 0x000000005509d7de
params->base 0x0000000017812cbe
params->length 0x00000000ca44aa17

@derpsteb
Contributor Author

derpsteb commented Feb 2, 2024

Another thing I hadn't realized so far: the second mmap returns "success". In the native case there is a mapping afterwards. Inside gVisor, the mmap call is also successful, but the mapping is missing.

nvtrace doesn't print the mmap return value, but I checked by (a) looking at strace/gVisor's strace and (b) adding prints to the kernel code, to make sure we are not hitting this case inside the driver. We aren't.

We also wondered whether UVM mmap translation might break permission expectations, since gVisor changes the permissions of the call during translation. But the previous mmap is mapped with the same permissions as in the native case, which works. So it seems like gVisor is behaving correctly here.

@ayushr2
Collaborator

ayushr2 commented Feb 5, 2024

Thanks for the logs. The runsc-sandbox process mappings do not show a mapping for 0x7ec0a0000000, which explains the NV_ERR_INVALID_ADDRESS.

@nixprime Do you know why application mmap(2) of /dev/nvidia-uvm is not being reflected in sentry process mappings? AFAICT looking at https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/devices/nvproxy/uvm_mmap.go, it should be passed through?

Similarly, none of the application mmaps of /dev/nvidiactl are reflected in the sentry address space.

@nixprime
Member

nixprime commented Feb 5, 2024

Application mmap()s of /dev/nvidiactl and /dev/nvidia-uvm aren't expected to create mappings in runsc's address space, only in that of the application (on the systrap platform, this means one of runsc's subprocesses); in fact, there should currently be no way to create a mapping in runsc's address space (see uvmFDMemmapFile.MapInternal()).

uvm_api_validate_va_range() does not search VMAs; instead it searches a data structure internal to the uvm_va_space_t, which is populated by (among other things) mmap() of /dev/nvidia-uvm. When runsc causes host mmap() to be invoked by the application process, the range passed to that mmap() call gets registered in the uvm_va_space_t; when runsc issues UVM_VALIDATE_VA_RANGE, it should be able to observe the registered range despite being in a different process, because we force UVM_INIT_FLAGS_MULTI_PROCESS_SHARING_MODE in nvproxy.uvmInitialize().
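
As an illustration of what that forcing amounts to (this is just a sketch, not the actual nvproxy code; the params layout and flag value are assumptions that should be checked against the driver's uvm_ioctl.h):

```go
package uvmsketch

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// Illustrative layout of the UVM_INITIALIZE params; the authoritative
// definition lives in the driver's uvm_ioctl.h.
type uvmInitializeParams struct {
	Flags    uint64
	RMStatus uint32
	Pad      uint32
}

// Assumed value of UVM_INIT_FLAGS_MULTI_PROCESS_SHARING_MODE; verify against
// the driver version in use.
const uvmInitFlagsMultiProcessSharingMode = 0x2

// initializeShared forces multi-process sharing mode before forwarding
// UVM_INITIALIZE to the host /dev/nvidia-uvm FD, so that VA ranges
// registered by the application process are visible to UVM ioctls issued by
// a different process (the sentry).
func initializeShared(hostFD int, cmd uintptr, p *uvmInitializeParams) error {
	p.Flags |= uvmInitFlagsMultiProcessSharingMode
	if _, _, errno := unix.Syscall(unix.SYS_IOCTL, uintptr(hostFD), cmd, uintptr(unsafe.Pointer(p))); errno != 0 {
		return errno
	}
	return nil
}
```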

Can you also collect /proc/[pid]/maps from subprocesses of runsc after the call to mmap() that maps the range that is failing UVM_VALIDATE_VA_RANGE? I'm not sure if confidential compute affects any of this, or if relevant code might be missing from the open-source driver.

@derpsteb
Contributor Author

derpsteb commented Feb 6, 2024

Thanks for getting back on this.
I think we have figured out our problem here. gVisor's MemoryManager internally merges mappings that are adjacent to each other. Because of this sentry-internal merging, the mmap calls from the sentry to the host kernel have the wrong lengths, and the workload doesn't see the correct mappings. By stopping the sentry from merging mappings for nvidia-uvm we were able to get our example to run. We will clean up our patch, do some more testing, and eventually open a PR :)
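
For anyone following along, here is a toy sketch of the effect (the addresses and lengths are illustrative only, not taken from our logs): two back-to-back mmaps of nvidia-uvm get coalesced, so the sentry issues a single host mmap with the combined length instead of the two ranges the application created:

```go
package main

import "fmt"

// vma is a toy model of a sentry-side mapping backed by /dev/nvidia-uvm.
type vma struct {
	base, length uint64
}

// merge coalesces adjacent vmas, mimicking how a memory manager may merge
// neighbouring mappings with compatible attributes.
func merge(vmas []vma) []vma {
	var out []vma
	for _, v := range vmas {
		if n := len(out); n > 0 && out[n-1].base+out[n-1].length == v.base {
			out[n-1].length += v.length // coalesce with the previous vma
			continue
		}
		out = append(out, v)
	}
	return out
}

func main() {
	// Two back-to-back application mmaps (illustrative values only).
	app := []vma{
		{base: 0x7ec0a0000000, length: 0x200000},
		{base: 0x7ec0a0200000, length: 0x3ab000},
	}
	// After merging, only one host mmap with the combined length is issued,
	// so the host driver sees different base/length values than the ones
	// the application expects.
	for _, v := range merge(app) {
		fmt.Printf("host mmap: base=%#x length=%#x\n", v.base, v.length)
	}
}
```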

@ayushr2
Collaborator

ayushr2 commented Feb 9, 2024

> I would also appreciate any hints describing your dev setup for developing nvproxy, since that could help our efforts right now. We already built a custom strace that decodes the ioctl cmds.

I had been using https://github.com/geohot/cuda_ioctl_sniffer to "sniff" ioctls made to the Nvidia devices. I would run this on the target GPU binary directly on the host (without gVisor) and collect the output. I have also built a parser (written in Go) against the nvproxy package. It parses the output of the sniffer, figures out which commands/ioctls/classes are not implemented in nvproxy, and prints out the diff of what needs to be implemented.

However, it seems like https://github.com/geohot/cuda_ioctl_sniffer is not actively maintained, and the output of the sniffer is a little garbled (I have had to patch it in various places to make it palatable for the parser). Also, it seems the sniffer segfaults with R550+ drivers (soon to be released).

I think a more sustainable path forward would be to build this in-house and add it to the tools/gpu directory. If you look at the nvproxy driver, it branches in 4 places (to make the necessary translations per the Nvidia driver ABI):

  1. Frontend ioctl (ioctl(2) syscalls made to /dev/nvidia# and /dev/nvidiactl)
  2. Control command (IOC_NR(request)=NV_ESC_RM_CONTROL in frontend ioctl)
  3. Allocation class (IOC_NR(request)=NV_ESC_RM_ALLOC in frontend ioctl)
  4. UVM ioctl (ioctl(2) syscalls made to /dev/nvidia-uvm)

So we could have an LD-preloadable binary like nvtrace which intercepts ioctl syscalls and prints info about the above 4. I could then tweak my parser to consume this output instead and upstream it. So please consider upstreaming nvtrace (or whatever useful tooling you have)!
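
As a rough sketch of the decoding such a tool would have to do (the escape constants and the example request word below are illustrative assumptions; verify them against the driver headers before relying on them):

```go
package main

import "fmt"

// iocNR extracts the command number from a Linux ioctl request word
// (bits 0-7 of the encoded request).
func iocNR(request uint32) uint32 { return request & 0xff }

// Frontend escape codes; the values are assumptions based on the open
// kernel driver's nv_escape.h and should be verified against the driver
// version being traced.
const (
	nvEscRMControl = 0x2a // NV_ESC_RM_CONTROL
	nvEscRMAlloc   = 0x2b // NV_ESC_RM_ALLOC
)

// classify maps an intercepted frontend ioctl onto the branch points listed
// above; UVM ioctls (on /dev/nvidia-uvm) would be classified separately,
// based on which device the FD refers to.
func classify(request uint32) string {
	switch nr := iocNR(request); nr {
	case nvEscRMControl:
		return "control command: decode the cmd field of the control params"
	case nvEscRMAlloc:
		return "allocation class: decode the hClass field of the alloc params"
	default:
		return fmt.Sprintf("frontend ioctl nr=%#x", nr)
	}
}

func main() {
	// Hypothetical request word whose low byte is NV_ESC_RM_CONTROL.
	fmt.Println(classify(0xc020462a))
}
```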

cc @luiscape

@thundergolfer
Contributor

Thanks @ayushr2, we have our own fork of the sniffer with minor patches: https://github.com/modal-labs/cuda_ioctl_sniffer. Having this tool in tools/gpu would be great!

We haven't invested in internal tooling that much at all yet, but nvtrace sounds like something to build for sure.

For now I'll lean on our fork of the sniffer to sort out these H100 compatibility issues.

@derpsteb
Contributor Author

derpsteb commented Feb 9, 2024

Thanks a lot for the writeup. Would love to contribute our tooling here. We are currently aligning internally on this.

Nvtrace uses ptrace to intercept syscalls and then parses the args. It's a modified xfstrace. It has worked very well for us.

@ayushr2
Collaborator

ayushr2 commented Feb 9, 2024

@thundergolfer I sent our sniffer patch to modal-labs/cuda_ioctl_sniffer#1.

@derpsteb Awesome! Looking forward to it! We could also just pull in the necessary components from a https://github.com/edgelesssys repository if it resides there; I will just write my parser against it, and the parser will be linked into the nvproxy package to provide accurate info.

If nvtrace is written in Go, one benefit of having it here would be that you could easily integrate it with packages like pkg/abi/nvgpu:nvgpu, which has a lot of the Nvidia driver constants.

@ayushr2 ayushr2 added the area: gpu Issue related to sandboxed GPU access label Apr 26, 2024
@AC-Dap
Contributor

AC-Dap commented Jun 18, 2024

Just a quick update on the tooling @ayushr2 described above: we now have a simple tool to intercept Nvidia ioctl calls in tools/ioctl_sniffer. Right now it simply runs a GPU binary unsandboxed and reports any ioctls/commands/classes that nvproxy doesn't support, but we plan to expand its functionality in the next few months.

Hope this tool can be of use!
