Add MIG support #102
Thanks for reporting this. Would you be able to get any relevant information from the raw output?
The issue is that the NVML binding we currently depend on does not support MIG. So I'll have to re-implement the low-level library on our own, apart from the existing binding.
I realized that there is now an official NVIDIA Python NVML binding, nvidia-ml-py (pynvml): https://pypi.org/project/nvidia-ml-py/ (actively maintained), which has MIG support. So we'll have to switch to this.
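For reference, here is a minimal sketch (not gpustat's actual code) of how MIG children could be enumerated with the official bindings. The NVML function names (`nvmlDeviceGetMigMode`, `nvmlDeviceGetMaxMigDeviceCount`, `nvmlDeviceGetMigDeviceHandleByIndex`, etc.) are real bindings; the script structure and error handling are simplified assumptions.

```python
# Sketch: enumerate MIG child devices of each MIG-enabled GPU with nvidia-ml-py.
import pynvml

pynvml.nvmlInit()
try:
    for index in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        try:
            current_mode, _pending = pynvml.nvmlDeviceGetMigMode(handle)
        except pynvml.NVMLError_NotSupported:
            current_mode = pynvml.NVML_DEVICE_MIG_DISABLE
        if current_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
            continue
        # Walk the MIG devices that belong to this physical GPU.
        for mig_index in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)):
            try:
                mig_handle = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, mig_index)
            except pynvml.NVMLError_NotFound:
                break  # no more MIG devices on this GPU
            name = pynvml.nvmlDeviceGetName(mig_handle)
            memory = pynvml.nvmlDeviceGetMemoryInfo(mig_handle)
            print(index, mig_index, name, memory.used, memory.total)
finally:
    pynvml.nvmlShutdown()
```

This is only the general shape of what NVML exposes for MIG; memory info is available per MIG device, while utilization is a separate question (see below).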
I've tried to use this library here https://github.com/instadeepai/gpustat/tree/fix-add-MIG-support but some functions do not support MIG for the moment.
The API documentation and user guide say that utilization metrics are not supported for MIG devices.
It looks like DCGM is the only way to go. Can you try some command-line tools (nvidia-smi or dcgmi) to see if it's possible to get some numbers like GPU utilization as you want?
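A rough sketch of the kind of command-line check asked for above, wrapped in Python. The nvidia-smi query fields are standard; whether `utilization.gpu` comes back as a number or as "[N/A]" on a MIG-enabled GPU is an assumption to verify on the actual machine.

```python
# Query basic per-GPU fields via nvidia-smi and inspect what MIG mode reports.
import subprocess

output = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,utilization.gpu,memory.used,memory.total",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
print(output)
```

For per-instance utilization, DCGM (e.g. dcgmi dmon with profiling field IDs) is the usual suggestion, as noted above.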
The official implementation of the NVML Python Bindings (nvidia-ml-py) added MIG support, but the new release only ships the v2 process-query wrapper:

```python
# Added in 2.285
def nvmlDeviceGetComputeRunningProcesses_v2(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
    ret = fn(handle, byref(c_count), None)
    ...

def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v2(handle);  # no v1 version now!!! incompatible with old drivers
```

And if we want to keep the simple install instruction, we need to handle this incompatibility with old drivers.
@XuehaiPan Thanks; we will be using the official Python bindings, which I have already implemented but will push quite soon. I wasn't aware that there is such a backward incompatibility around `nvmlDeviceGetComputeRunningProcesses_v2`. So we must check this carefully with "old" GPU cards or "old" NVIDIA drivers; I wonder what the exact setup is that breaks. Also, we might need to work around this. One possible way is to monkey-patch the function.
On Ubuntu 16.04 LTS, the highest supported version of the NVIDIA driver is the 430 series (430.64 here):

```
$ cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

$ apt list --installed | grep nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

nvidia-430/now 430.64-0ubuntu0~gpu16.04.2 amd64 [installed,upgradable to: 430.64-0ubuntu0~gpu16.04.6]
nvidia-cuda-doc/xenial,xenial,now 7.5.18-0ubuntu1 all [installed]
nvidia-cuda-gdb/xenial,now 7.5.18-0ubuntu1 amd64 [installed]
nvidia-opencl-dev/xenial,now 7.5.18-0ubuntu1 amd64 [installed]
nvidia-opencl-icd-430/now 430.64-0ubuntu0~gpu16.04.2 amd64 [installed,upgradable to: 430.64-0ubuntu0~gpu16.04.6]
nvidia-prime/xenial,now 0.8.2 amd64 [installed,automatic]
nvidia-settings/xenial,now 361.42-0ubuntu1 amd64 [installed,upgradable to: 418.56-0ubuntu0~gpu16.04.1]
```

Although Ubuntu 16.04 LTS reached the end of its five-year LTS window on April 30th, 2021, it is still widely used in industry and research laboratories due to poor IT services :(.
```
$ pip3 install ipython nvidia-ml-py==11.450.51
$ ipython3
Python 3.9.6 (default, Jun 28 2021, 08:57:49)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.26.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pynvml import *

In [2]: nvmlInit()

In [3]: nvmlSystemGetDriverVersion()
Out[3]: b'430.64'

In [4]: handle = nvmlDeviceGetHandleByIndex(0)

In [5]: nvmlDeviceGetComputeRunningProcesses(handle)
Out[5]: []

In [6]: nvmlDeviceGetGraphicsRunningProcesses(handle)
Out[6]: [<pynvml.nvmlFriendlyObject at 0x7fb2a4d1c400>]

In [7]: list(map(str, nvmlDeviceGetGraphicsRunningProcesses(handle)))
Out[7]: ["{'pid': 1876, 'usedGpuMemory': 17580032}"]
```
```
$ pip3 install ipython nvidia-ml-py==11.450.129
$ ipython3
Python 3.9.6 (default, Jun 28 2021, 08:57:49)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.26.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pynvml import *

In [2]: nvmlInit()

In [3]: nvmlSystemGetDriverVersion()
Out[3]: b'430.64'

In [4]: handle = nvmlDeviceGetHandleByIndex(0)

In [5]: nvmlDeviceGetComputeRunningProcesses(handle)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/site-packages/pynvml.py in _nvmlGetFunctionPointer(name)
    719         try:
--> 720             _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
    721             return _nvmlGetFunctionPointer_cache[name]

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/ctypes/__init__.py in __getattr__(self, name)
    386             raise AttributeError(name)
--> 387         func = self.__getitem__(name)
    388         setattr(self, name, func)

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/ctypes/__init__.py in __getitem__(self, name_or_ordinal)
    391     def __getitem__(self, name_or_ordinal):
--> 392         func = self._FuncPtr((name_or_ordinal, self))
    393         if not isinstance(name_or_ordinal, int):

AttributeError: /usr/lib/nvidia-430/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

During handling of the above exception, another exception occurred:

NVMLError_FunctionNotFound                Traceback (most recent call last)
<ipython-input-4-ef8a5a47bcb8> in <module>
----> 1 nvmlDeviceGetComputeRunningProcesses(handle)

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/site-packages/pynvml.py in nvmlDeviceGetComputeRunningProcesses(handle)
   2093
   2094 def nvmlDeviceGetComputeRunningProcesses(handle):
-> 2095     return nvmlDeviceGetComputeRunningProcesses_v2(handle);
   2096
   2097 def nvmlDeviceGetGraphicsRunningProcesses_v2(handle):

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/site-packages/pynvml.py in nvmlDeviceGetComputeRunningProcesses_v2(handle)
   2061     # first call to get the size
   2062     c_count = c_uint(0)
-> 2063     fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
   2064     ret = fn(handle, byref(c_count), None)
   2065

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/site-packages/pynvml.py in _nvmlGetFunctionPointer(name)
    721             return _nvmlGetFunctionPointer_cache[name]
    722         except AttributeError:
--> 723             raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
    724         finally:
    725             # lock is always freed

NVMLError_FunctionNotFound: Function Not Found
```
So I think the driver version is old, not the graphics card. BTW, it is recommended to install NVIDIA drivers from the official binaries (although gpustat can still support such legacy drivers). With the old NVIDIA drivers, however, can you try the following?
I guess v1 will still work but v2 will raise an error, as you already showed in the stack trace. In my environment with a recent version of the NVIDIA driver, both work.
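The snippet referred to above was not preserved; here is a hedged reconstruction of the kind of probe meant, using pynvml's internal `_nvmlGetFunctionPointer` helper (an underscore-prefixed internal, so subject to change between binding versions).

```python
# Check which process-query symbols the installed driver's libnvidia-ml exports.
import pynvml

pynvml.nvmlInit()
for symbol in ("nvmlDeviceGetComputeRunningProcesses",
               "nvmlDeviceGetComputeRunningProcesses_v2"):
    try:
        pynvml._nvmlGetFunctionPointer(symbol)
        print(symbol, "-> exported")
    except pynvml.NVMLError_FunctionNotFound:
        print(symbol, "-> not exported by this driver")
pynvml.nvmlShutdown()
```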
@XuehaiPan So I think falling back to the v1 function for old drivers will be the best option to make obtaining process information work in either case.
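A minimal sketch of what such a fallback could look like (not the code that actually landed in gpustat): define a v1 wrapper around the old two-field process-info struct and monkey-patch the public name so callers try v2 first. `c_nvmlProcessInfo_v1_t` and the wrapper names are hypothetical; `_nvmlGetFunctionPointer` and `_nvmlCheckReturn` are pynvml internals.

```python
# Sketch only: fall back to the v1 NVML entry point when the driver does not
# export nvmlDeviceGetComputeRunningProcesses_v2 (e.g. driver 430.xx).
from ctypes import Structure, byref, c_uint, c_ulonglong

import pynvml
from pynvml import NVML_ERROR_INSUFFICIENT_SIZE, NVML_SUCCESS, NVMLError


class c_nvmlProcessInfo_v1_t(Structure):
    # v1 layout: only pid and usedGpuMemory (no MIG instance ids).
    _fields_ = [
        ("pid", c_uint),
        ("usedGpuMemory", c_ulonglong),
    ]


def nvmlDeviceGetComputeRunningProcesses_v1(handle):
    # First call queries the required buffer size.
    c_count = c_uint(0)
    fn = pynvml._nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses")
    ret = fn(handle, byref(c_count), None)
    if ret == NVML_SUCCESS:
        return []  # no running compute processes
    if ret == NVML_ERROR_INSUFFICIENT_SIZE:
        # Second call with an oversized buffer, as pynvml itself does.
        c_count.value = c_count.value * 2 + 5
        proc_array = (c_nvmlProcessInfo_v1_t * c_count.value)()
        ret = fn(handle, byref(c_count), proc_array)
        pynvml._nvmlCheckReturn(ret)
        return [proc_array[i] for i in range(c_count.value)]
    raise NVMLError(ret)


_original = pynvml.nvmlDeviceGetComputeRunningProcesses


def nvmlDeviceGetComputeRunningProcesses_with_fallback(handle):
    try:
        return _original(handle)  # v2 path on new bindings / new drivers
    except pynvml.NVMLError_FunctionNotFound:
        return nvmlDeviceGetComputeRunningProcesses_v1(handle)


# Monkey-patch so the rest of the code can keep calling the usual name.
pynvml.nvmlDeviceGetComputeRunningProcesses = nvmlDeviceGetComputeRunningProcesses_with_fallback
```

The important design point is that the v1 struct has a different size from the v2 one, so the fallback must use its own ctypes layout rather than reusing the struct shipped with the new bindings.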
@XuehaiPan @wookayin Correct me if I am wrong, but as per my understanding the only support here is w.r.t. memory usage and MIG-profile-related info, not the utilization stats. Last I checked, DCGM was the only way to get utilization stats for MIG-enabled devices.
Hello @wookayin,
When using gpustat with MIG (Multi-Instance GPU) in Kubernetes, we are not able to get metrics.
When running gpustat we get the main GPU name but no metrics about RAM.
This is due to the lack of permission on the root GPU.
We could get information about MIG instances when listing a MIG-enabled GPU. This could give more information, like RAM, but not compute utilization, since it is not yet implemented in NVML.
This leads to issues in Ray when getting metrics on GPUs.
A PR will follow.