libcuda.so works once but then disappears from WSL2 after restart with an external GPU #5604

Closed
eteq opened this issue Jul 17, 2020 · 7 comments

@eteq

eteq commented Jul 17, 2020

Environment

Windows build number: 10.0.20170.0
Your Distribution version: Ubuntu 18.04
Whether the issue is on WSL 2 and/or WSL 1: WSL 2

Also relevant: the computer I'm testing on is a laptop with only integrated graphics built in, connected via Thunderbolt 3 to an external GPU enclosure with an NVIDIA graphics card.

Steps to reproduce

  1. Connect the external GPU enclosure to the laptop (this all runs flawlessly on the Windows side, with drivers correctly recognizing the external GPU and using it when requested).
  2. Open WSL and run ls /usr/lib/wsl/lib

Expected behavior

I should see libcuda.so in /usr/lib/wsl/lib, and cuda should work as per the new WSL 2 integration.

Actual behavior

Instead I see:
libd3d12.so libd3d12core.so libdirectml.so libdxcore.so

Note: I do see the correct behavior if I re-install the Nvidia drivers linked from the CUDA pages (451.67), and CUDA applications then run happily in WSL. This makes me confident it's not a driver or hardware problem. But after I restart (and maybe even after just disconnecting the external GPU?) the libcuda libraries are gone, and I can't find a way to get them back except by re-installing the drivers.
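
For anyone who wants to check this repeatedly (for example after each restart or reconnect), the manual ls can be scripted. This is only a rough sketch: the list of expected names is an assumption based on the files mentioned in this report, and the function name is made up for illustration.

```python
from pathlib import Path

# Assumption: these names are illustrative. This report only mentions
# libcuda.so and the DirectX-related libraries seen in /usr/lib/wsl/lib.
EXPECTED = ("libcuda.so", "libd3d12.so", "libdxcore.so")


def missing_libs(lib_dir="/usr/lib/wsl/lib", expected=EXPECTED):
    """Return the expected library names that are not present in lib_dir."""
    root = Path(lib_dir)
    # A missing directory is treated the same as an empty one.
    present = {p.name for p in root.iterdir()} if root.is_dir() else set()
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    gone = missing_libs()
    print("missing:", ", ".join(gone) if gone else "none")
```

With the directory in the failing state described above, this would be expected to report libcuda.so as missing.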

@therealkenc
Collaborator

therealkenc commented Jul 17, 2020

What you describe could be #5506 if you have multiple distros and aren't watching which one is fired up first.

Failing that, have a look at the dmesg output on the failing run, and please post the error messages that look mount/filesystem related.

[ed,rethink] You might try a wsl --shutdown or two. That has cured /usr/lib/wsl/lib mount problems for some; it might or might not help with your Thunderbolt scenario. Worth a try anyway. Having the GPU plugged in at Windows boot probably wouldn't hurt either, at least as an experiment. There is no udev-like mechanism on the WSL side (that I know of) to manage mounting that lib directory when it becomes necessary. The working premise here is that your drivers are probably fine (why wouldn't they be) and the problem is related to WSL startup.
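
If the dmesg output is long, something like the following can narrow it down before posting. It is a sketch only: the keyword list is a guess at what mount/filesystem related messages might contain (not a list of known WSL error strings), and reading dmesg may require elevated privileges on some setups.

```python
import subprocess

# Assumption: these keywords are a guess at what "mount/filesystem related"
# messages contain; adjust them to whatever shows up on your system.
KEYWORDS = ("mount", "9p", "dxg", "/usr/lib/wsl")


def filter_lines(text, keywords=KEYWORDS):
    """Return the lines of text that contain any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [
        line
        for line in text.splitlines()
        if any(k in line.lower() for k in lowered)
    ]


if __name__ == "__main__":
    try:
        result = subprocess.run(["dmesg"], capture_output=True, text=True)
        output = result.stdout
    except OSError:
        output = ""  # dmesg is not available on this system
    print("\n".join(filter_lines(output)))
```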

@eteq
Author

eteq commented Jul 22, 2020

Thanks for the suggestions @therealkenc. I don't think this is a dupe of #5506, because I tried wsl --shutdown and then tried starting up a different distro first. The result was libd3d12.so libd3d12core.so libdirectml.so libdxcore.so in the secondary distro and nothing in /usr/lib/wsl/lib in my default (Ubuntu) distro. Then if I did another wsl --shutdown and started Ubuntu first, I got the same so's, but back in Ubuntu and not in the secondary distro (which looks to me like the #5506 behavior as stated). But regardless of all of that, I never saw libcuda.so in any of the distros.

I also tried doing a wsl --shutdown after plugging in the external GPU, before, etc. (I think in all the permutations), as well as having the GPU plugged in at boot. None of those brought back libcuda. The only thing that has worked so far is re-installing the Nvidia drivers (and then only until the next boot).

Is there any way to manually trigger the /usr/lib/wsl/lib mounting process? Even just a way to manually set up some symlinks might work as a temporary measure if I know where to point them on the windows-side...

@therealkenc
Collaborator

Okay, thanks for the additional info, appreciated. Yeah, the relationship to #5506 was a lark, which barely survived my [ed,rethink]. Still, it is "weird" that the drivers don't mount after your enclosure is unplugged (presumably unplugged even once). One good place to ask is over in the NVIDIA CUDA on WSL forum. I don't personally have an external GPU to give you a confirm, but maybe some others using Thunderbolt will chime in with a me2.

Is there any way to manually trigger the /usr/lib/wsl/lib mounting process?

That's not a bad question. That directory is C:\Windows\System32\lxss\lib which you could in principle bind mount or even outright copy over. But there is also /usr/lib/wsl/drivers and I haven't looked under the hood enough to know where those come from. The files themselves might or might not be sufficient. They might, but only if you also see a live /dev/dxg as a minimum barrier to entry. The NVIDIA guys might have a temporary work-around.
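
As a quick sanity check before trying any bind mount or copy, the paths mentioned above can be probed from inside the distro. This is a minimal sketch: it only reports whether each path exists, and whether these paths are sufficient for CUDA to work is not confirmed anywhere in this thread.

```python
import os

# Assumption: these are simply the paths named in the comment above.
PATHS = ("/dev/dxg", "/usr/lib/wsl/lib", "/usr/lib/wsl/drivers")


def path_status(paths=PATHS):
    """Map each path to True if it exists (following symlinks), else False."""
    return {p: os.path.exists(p) for p in paths}


if __name__ == "__main__":
    for path, ok in path_status().items():
        print(f"{'present' if ok else 'MISSING':8} {path}")
```

If /dev/dxg is reported as missing, copying the library files over is unlikely to help on its own, per the "minimum barrier to entry" point above.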

@eteq
Author

eteq commented Jul 24, 2020

Update: after updating to the latest Insider dev build (20175) and re-installing the CUDA drivers, the CUDA libs now seem to survive disconnection of the eGPU. This makes me realize the original problem may have been that the libs disappear when a new update is installed; I think the first time I encountered this, a new update had been installed at the same time.

So I cannot reproduce this now, but I suspect it should be left open at least until the next update gets installed since I'm still not sure what the underlying cause was?

@therealkenc
Collaborator

Yeah we let 'em float open for an unspecified period as a place for me2s to land if it seems plausible they might. Yours satisfies that plausibility.

@eteq
Author

eteq commented Aug 5, 2020

The problem has resurfaced, although with (possibly?) a subtly different cause: /usr/lib/wsl/lib/libcuda.* are still there, but they seem to be pointing to (absent) alternative drivers. To be more concrete:

  1. I run the deviceQuery sample that's included in the CUDA toolkit (anything that's linked to libcuda will do, but that's just the easiest example). It yields cudaGetDeviceCount returned 35 -> CUDA driver version is insufficient for CUDA runtime version (which is what led me down the path to submitting this issue in the first place).
  2. I try again using strace ./deviceQuery
  3. I see the following: openat(AT_FDCWD, "/usr/lib/wsl/drivers/nv_dispi.inf_amd64_edab19158bdd0d0a/libcuda.so.1.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

Previously (shortly after the most recent driver install, when ./deviceQuery worked), strace successfully found a very similar file (/usr/lib/wsl/drivers/nv_dispi.inf_amd64_9e1e1f7307267df4/libcuda.so.1.1). I don't really understand how /usr/lib/wsl gets populated, but I speculate that for some reason a new driver installation has happened and that it is missing the relevant so's.
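
To see which driver directories actually contain the file that strace is looking for, a scan along these lines may help. It is a sketch under assumptions: the function name is made up, and the default root and file name are taken from the strace output above rather than from any documented layout.

```python
from pathlib import Path


def dirs_containing(drivers_root="/usr/lib/wsl/drivers",
                    filename="libcuda.so.1.1"):
    """Return sorted names of subdirectories of drivers_root holding filename."""
    root = Path(drivers_root)
    if not root.is_dir():
        return []  # nothing mounted there (or not running under WSL)
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and (d / filename).exists()
    )


if __name__ == "__main__":
    found = dirs_containing()
    if found:
        print("libcuda.so.1.1 found in:", ", ".join(found))
    else:
        print("no driver directory contains libcuda.so.1.1")
```

Comparing the result with the directory name in the failing openat call would show whether the loader is being pointed at a directory that lacks the library.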

Contributor

This issue has been automatically closed since it has not had any activity for the past year. If you're still experiencing this issue please re-file this as a new issue or feature request.

Thank you!
