This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Unable to use GPU features because my driver version can't be parsed from nvidia-smi #5056

Closed
Burke9077 opened this issue Jan 20, 2020 · 4 comments
Labels
0.22.1 release bug P3 Severity-Low/Effort-hard

Comments

@Burke9077

Description

I've been unable to use GPU features. Golem disables them because it cannot parse the driver version reported by nvidia-smi: the patch component of my version, "440.33.01", has a leading zero.

Logs below.

Golem Version:

GOLEM Version: 0.22.0
Protocol Version: 32

Golem-Messages version (leave empty if unsure):

golem_messages Version: 3.14.1

Electron version (if used):

Not used, error in initial startup

OS [e.g. Windows 10 Pro]:

Ubuntu 18.04
system: Linux, release: 5.3.0-26-generic, version: #28~18.04.1-Ubuntu SMP Wed Dec 18 16:40:14 UTC 2019, machine: x86_64

Branch (if launched from source):

Not from source

Mainnet/Testnet:

Mainnet

Priority label is set to the lowest by default. To setup higher priority please change the label
P0 label is set for Severity-Critical/Effort-easy
P1 label is set for Severity-Critical/Effort-hard
P2 label is set for Severity-Low/ Effort-easy
P3 label is set for Severity-Low/Effort-hard

P2

I would call this a low-severity/easy-effort issue. It does block me from continuing, but it doesn't seem to be something many people are experiencing, and I can probably work around it by installing a different driver version.

Description of the issue:

A clear and concise description of what went wrong, in which component, when and where.

After installing Golem, I got an error message in the logs showing that my GPU is disabled because the patch component of the driver version reported by nvidia-smi contains a leading zero, which the program does not expect.

Output in the logs:

2020-01-20 02:32:18 INFO     golem.environments.environmentsmanager Adding environment BLENDER supported=<SupportStatus ok ({})> 
2020-01-20 02:32:18 WARNING  apps.core.nvgpu                     NVGPU Docker environment is not supported: RuntimeError("Unable to parse nvidia-smi output: Invalid leading zero in patch: '440.33.01'",) 
2020-01-20 02:32:18 INFO     golem.environments.environmentsmanager Adding environment BLENDER_NVGPU supported=<SupportStatus err ({<UnsupportReason.ENVIRONMENT_UNSUPPORTED: 'environment_unsupported'>: 'BLENDER_NVGPU'})> 
2020-01-20 02:32:18 INFO     golem.environments.environmentsmanager Adding environment WASM supported=<SupportStatus ok ({})> 
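
The warning above reads like the output of a strict Semantic Versioning check, which forbids leading zeros in numeric components. The following is a minimal sketch of such a check, written only to illustrate why "440.33.01" is rejected. The function name `strict_parse` is hypothetical and this is not Golem's actual parsing code.

```python
import re

# Hypothetical strict checker, NOT Golem's code. It mimics the SemVer rule
# that numeric components must not carry leading zeros.
def strict_parse(version: str) -> tuple:
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"Expected major.minor.patch, got {version!r}")
    numbers = []
    for name, part in zip(("major", "minor", "patch"), parts):
        if not re.fullmatch(r"[0-9]+", part):
            raise ValueError(f"Invalid {name} in {version!r}")
        # "0" alone is fine; "01" is what trips the strict rule.
        if len(part) > 1 and part.startswith("0"):
            raise ValueError(f"Invalid leading zero in {name}: {version!r}")
        numbers.append(int(part))
    return tuple(numbers)


try:
    strict_parse("440.33.01")
except ValueError as exc:
    print(exc)  # Invalid leading zero in patch: '440.33.01'
```

NVIDIA driver versions are not guaranteed to follow Semantic Versioning, so a strict check of this kind can reject valid driver versions.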

Nvidia-smi output:

Mon Jan 20 02:41:08 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:2D:00.0  On |                  N/A |
|  0%   35C    P8    22W / 250W |    610MiB / 11016MiB |     10%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1443      G   /usr/lib/xorg/Xorg                            26MiB |
|    0      1489      G   /usr/bin/gnome-shell                          58MiB |
|    0      3690      G   /usr/lib/xorg/Xorg                           231MiB |
|    0      4107      G   /usr/bin/gnome-shell                         145MiB |
|    0      6556      G   ...uest-channel-token=15026858408320960458   144MiB |
+-----------------------------------------------------------------------------+
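
For reference, the driver version can be pulled out of a banner like the one above with a small regular expression. This is a sketch under the assumption that the banner keeps the "Driver Version: X.Y.Z" layout shown here. The helper name `driver_version_from_banner` is hypothetical, and how Golem itself reads the version is not shown in this issue.

```python
import re
from typing import Optional

# Matches the "Driver Version: 440.33.01" field of the nvidia-smi banner.
_DRIVER_RE = re.compile(r"Driver Version:\s*([0-9]+(?:\.[0-9]+)*)")


def driver_version_from_banner(smi_output: str) -> Optional[str]:
    """Return the driver version string as printed, or None if not found."""
    match = _DRIVER_RE.search(smi_output)
    return match.group(1) if match else None


banner = "| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |"
print(driver_version_from_banner(banner))  # 440.33.01
```

Running `nvidia-smi --query-gpu=driver_version --format=csv,noheader` prints the same string without the table, which avoids parsing the banner at all.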

Actual result:

What is the observed behavior and/or result in this issue

I'm unable to use my GPU for the network.

Screenshots:

If applicable, add screenshots to help explain your problem.

Text output above.

Steps To Reproduce

Install an NVIDIA driver whose version has a patch component beginning with a zero (for example, 440.33.01), then start Golem.

Proposed Solution?

(Optional: What could be a solution for that issue)

Evaluate the patch version differently.
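
One way to evaluate the version differently is to treat it as a dotted sequence of integers instead of a strict semantic version, so that "01" simply becomes 1. The sketch below assumes that approach. The names `parse_driver_version` and `meets_minimum` are hypothetical, as is any minimum version passed to them, and none of this is taken from the Golem source.

```python
import re
from typing import Tuple


def parse_driver_version(version: str) -> Tuple[int, ...]:
    """Parse '440.33.01' into (440, 33, 1), tolerating leading zeros."""
    text = version.strip()
    if not re.fullmatch(r"[0-9]+(\.[0-9]+)*", text):
        raise ValueError(f"Unrecognized driver version: {version!r}")
    # int() drops leading zeros, so "01" and "1" compare as equal.
    return tuple(int(part) for part in text.split("."))


def meets_minimum(version: str, minimum: str) -> bool:
    """Compare two versions component by component (tuple ordering)."""
    return parse_driver_version(version) >= parse_driver_version(minimum)


print(parse_driver_version("440.33.01"))  # (440, 33, 1)
```

Tuple comparison keeps the ordering that matters for a minimum-version check while no longer rejecting zero-padded components.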

@Burke9077 added the P3 Severity-Low/Effort-hard, brass, and bug labels on Jan 20, 2020
@djohnlee

I had the same issue. Downgraded the nVidia driver to 430 (sudo apt install nvidia-driver-430), rebooted, and it was resolved. "golemcli envs show" shows the GPU environments enabled.

Linux Mint 19.3 (ubuntu Bionic-based)
5.0.0-32-generic #34~18.04.2-Ubuntu

@MathiasBras

I just tested this with the new 0.22.1 release, and it is still an issue:
2020-01-30 17:52:56 WARNING apps.core.nvgpu NVGPU Docker environment is not supported: RuntimeError("Unable to parse nvidia-smi output: Invalid leading zero in patch: '440.48.02'",)

@MathiasBras

MathiasBras commented Feb 1, 2020

> I had the same issue. Downgraded the nVidia driver to 430 (sudo apt install nvidia-driver-430), rebooted, and it was resolved. "golemcli envs show" shows the GPU environments enabled.
>
> Linux Mint 19.3 (ubuntu Bionic-based)
> 5.0.0-32-generic #34~18.04.2-Ubuntu

Looks like your fix "worked" for me too, although I seem to have another problem, which may be specific to my configuration:
2020-01-31 17:44:19 WARNING golem.docker.manager Docker: pulling image 'golemfactory/nvgpu:1.7'
2020-01-31 17:45:29 WARNING golem.docker.manager Docker: pulling image 'golemfactory/blender_nvgpu:1.7'
2020-01-31 17:45:59 WARNING golem.task.taskthread Task computing error 400 Client Error: Bad Request ("Unknown runtime specified nvidia")
2020-01-31 17:45:59 WARNING golem.task Failed to compute benchmark 400 Client Error: Bad Request ("Unknown runtime specified nvidia")
2020-01-31 17:45:59 ERROR golem.task.benchmarkmanager Unable to run BLENDER_NVGPU benchmark: 400 Client Error: Bad Request ("Unknown runtime specified nvidia")
2020-01-31 17:45:59 CRITICAL golem.client Can't start network. Giving up.

I fixed the above with the script here https://github.com/golemfactory/golem/pull/4608
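
For anyone hitting the same "Unknown runtime specified nvidia" error: Docker returns it when no runtime named nvidia is registered with the daemon. The sketch below checks whether a daemon configuration file declares one. It assumes the runtime is registered through the "runtimes" key of Docker's daemon.json (commonly /etc/docker/daemon.json on Linux). The helper name `has_nvidia_runtime` and the sample configuration are illustrative, and what the script in the linked pull request actually does is not shown in this thread.

```python
import json


def has_nvidia_runtime(daemon_json_text: str) -> bool:
    """Return True if the daemon.json text declares a runtime named 'nvidia'."""
    if not daemon_json_text.strip():
        return False  # an empty file declares nothing
    config = json.loads(daemon_json_text)
    return "nvidia" in config.get("runtimes", {})


# Illustrative configuration; the runtime path is an assumption.
sample = """
{
  "runtimes": {
    "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
  }
}
"""
print(has_nvidia_runtime(sample))  # True
print(has_nvidia_runtime("{}"))    # False
```

A runtime can also be registered through dockerd command-line flags, so a False result from the file alone is not conclusive.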

@letol

letol commented Feb 19, 2020

Same problem here with the 0.22.1 release. I found a possible definitive solution here that doesn't require downgrading the NVIDIA driver, but I don't have enough time to experiment with the source code. I'm leaving it here as a possible reference.


7 participants