
Add MIG support #102

Open
armandpicard opened this issue Jun 22, 2021 · 15 comments

@armandpicard

armandpicard commented Jun 22, 2021

When using gpustat with MIG (Multi-Instance GPU) in Kubernetes, we are not able to get metrics.

When running gpustat we get the main GPU name but no metrics about RAM.

[screenshot: gpustat output showing the GPU name without memory metrics]

This is due to the lack of permission on the root GPU.
We could get information about the MIG devices when listing a MIG-enabled GPU. That would give more information, such as RAM, but not compute utilization, since that is not yet implemented in NVML.
This leads to issues in Ray when getting GPU metrics.
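For reference, a rough sketch of that enumeration idea, assuming the official nvidia-ml-py bindings (>= 11.450.129) that expose the MIG query functions; this is only an illustration, not the code of the PR:

# Rough sketch: enumerate MIG devices under each MIG-enabled GPU and report
# their memory usage, which NVML can answer even though utilization cannot
# be queried yet (assumes nvidia-ml-py >= 11.450.129).
import pynvml

pynvml.nvmlInit()
try:
    for gpu_index in range(pynvml.nvmlDeviceGetCount()):
        gpu = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        try:
            mig_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)
        except pynvml.NVMLError:
            mig_count = 0  # MIG not supported or not enabled on this GPU
        for mig_index in range(mig_count):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, mig_index)
            except pynvml.NVMLError:
                continue  # this MIG slot is not populated
            mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
            print(f"GPU {gpu_index} / MIG {mig_index}: "
                  f"{mem.used // 1024**2} / {mem.total // 1024**2} MiB")
finally:
    pynvml.nvmlShutdown()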

A PR will follow

@wookayin
Owner

wookayin commented Jul 2, 2021

Thanks for reporting this. Would you be able to get any relevant information from the raw nvidia-smi command? If you have any good reference/documentation about it, that would also be helpful.

@wookayin
Owner

wookayin commented Jul 29, 2021

The issue is that the pynvml library we are relying on is not aware of MIG. One dirty but quick workaround would be parsing nvidia-smi output, but this doesn't seem right.

So I'll have to either re-implement the low-level library on our own, apart from pynvml, using NVIDIA's official NVML C API, or simply fork pynvml and add MIG support. The other pynvml binding also lacks MIG support for now. This will be non-trivial work, but I'll be happy to work on it.

@wookayin
Owner

wookayin commented Jul 29, 2021

I realized that there is now an official NVIDIA Python NVML binding, nvidia-ml-py: https://pypi.org/project/nvidia-ml-py/ (actively maintained), which has MIG support. So we'll have to switch to this pynvml library; then adding MIG support won't be difficult to implement (although I don't have any A100/MIG GPU available). Please stay tuned!

@armandpicard
Author

I've tried to use this library here: https://github.com/instadeepai/gpustat/tree/fix-add-MIG-support, but some functions do not support MIG for the moment, like nvmlDeviceGetUtilizationRates (https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g540824faa6cef45500e0d1dc2f50b321).

@wookayin
Owner

wookayin commented Jul 29, 2021

The API documentation and user guide say:

On MIG-enabled GPUs, querying device utilization rates is not currently supported.

It looks like DCGM is the only way to go. Can you try some command-line tools (nvidia-smi or dcgmi) to see if it's possible to get the numbers you want, such as GPU utilization?


@XuehaiPan
Contributor

XuehaiPan commented Aug 4, 2021

The official implementation of the NVML Python bindings (nvidia-ml-py) added MIG support in nvidia-ml-py>=11.450.129. But this will cause NVMLError_FunctionNotFound at _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2") with old NVIDIA drivers:

# Added in 2.285
def nvmlDeviceGetComputeRunningProcesses_v2(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
    ret = fn(handle, byref(c_count), None)

    ...

def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v2(handle);  # no v1 version now!!! incompatible with old drivers

And gpustat will not be able to gather process info correctly on old machines if we arbitrarily update the nvidia-ml-py version in gpustat's requirements.

If we want to keep the simple install instruction pip3 install gpustat, it could be hard to determine which version of nvidia-ml-py should be installed as a dependency before gpustat is installed.

@wookayin
Owner

wookayin commented Aug 4, 2021

@XuehaiPan Thanks; we will be using the official Python bindings, which I have already implemented but will push quite soon. I wasn't aware that there is such a backward incompatibility around nvmlDeviceGetComputeRunningProcesses_v2. (See #105 as well)

So we must check this carefully with "old" GPU cards or "old" NVIDIA drivers --- I wonder what the exact setup is that breaks. Also, we might need to work around this. One possible way is to monkey-patch _nvmlGetFunctionPointer_cache; I will try a bit more and keep you posted.
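To make the idea concrete, a minimal sketch of such a monkey patch (not gpustat's actual code), assuming nvidia-ml-py >= 11.450.129 is installed:

# Hypothetical workaround sketch: if the loaded libnvidia-ml.so does not
# export the _v2 symbol (old driver), alias it to the v1 function pointer in
# pynvml's cache so that nvmlDeviceGetComputeRunningProcesses() stops raising
# NVMLError_FunctionNotFound.
import pynvml

pynvml.nvmlInit()
try:
    pynvml._nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
except pynvml.NVMLError_FunctionNotFound:
    v1 = pynvml._nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses")
    pynvml._nvmlGetFunctionPointer_cache["nvmlDeviceGetComputeRunningProcesses_v2"] = v1

Note that this only restores the v1 call path; as the discussion below shows, the v1 and v2 process-info structs also differ, so the struct layout would have to be handled separately.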

@XuehaiPan
Contributor

XuehaiPan commented Aug 4, 2021

So we must check this carefully with "old" GPU cards or "old" NVIDIA drivers --- I wonder what the exact setup is that breaks. Also, we might need to work around this.

On Ubuntu 16.04 LTS, the highest supported version of the NVIDIA driver is nvidia-430:

$ cat /etc/*-release  
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

$ apt list --installed | grep nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

nvidia-430/now 430.64-0ubuntu0~gpu16.04.2 amd64 [installed,upgradable to: 430.64-0ubuntu0~gpu16.04.6]
nvidia-cuda-doc/xenial,xenial,now 7.5.18-0ubuntu1 all [installed]
nvidia-cuda-gdb/xenial,now 7.5.18-0ubuntu1 amd64 [installed]
nvidia-opencl-dev/xenial,now 7.5.18-0ubuntu1 amd64 [installed]
nvidia-opencl-icd-430/now 430.64-0ubuntu0~gpu16.04.2 amd64 [installed,upgradable to: 430.64-0ubuntu0~gpu16.04.6]
nvidia-prime/xenial,now 0.8.2 amd64 [installed,automatic]
nvidia-settings/xenial,now 361.42-0ubuntu1 amd64 [installed,upgradable to: 418.56-0ubuntu0~gpu16.04.1]

Although Ubuntu 16.04 LTS reached the end of its five-year LTS window on April 30, 2021, it is still widely used in industry and research laboratories due to poor IT services :(.

nvidia-ml-py==11.450.51 works fine on Ubuntu 16.04, but it does not have binding functions for MIG support.

$ pip3 install ipython nvidia-ml-py==11.450.51

$ ipython3                             
Python 3.9.6 (default, Jun 28 2021, 08:57:49) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.26.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pynvml import *

In [2]: nvmlInit()

In [3]: nvmlSystemGetDriverVersion()
Out[3]: b'430.64'

In [4]: handle = nvmlDeviceGetHandleByIndex(0)

In [5]: nvmlDeviceGetComputeRunningProcesses(handle)
Out[5]: []

In [6]: nvmlDeviceGetGraphicsRunningProcesses(handle)
Out[6]: [<pynvml.nvmlFriendlyObject at 0x7fb2a4d1c400>]

In [7]: list(map(str, nvmlDeviceGetGraphicsRunningProcesses(handle)))
Out[7]: ["{'pid': 1876, 'usedGpuMemory': 17580032}"]

nvidia-ml-py>=11.450.129 has added binding functions for MIG support, but it raises errors when querying the PIDs with old drivers. Users would have to downgrade nvidia-ml-py manually to handle this, or upgrade the NVIDIA driver (privileges required).

$ pip3 install ipython nvidia-ml-py==11.450.129

$ ipython3
Python 3.9.6 (default, Jun 28 2021, 08:57:49) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.26.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pynvml import *

In [2]: nvmlInit()

In [3]: nvmlSystemGetDriverVersion()
Out[3]: b'430.64'

In [4]: handle = nvmlDeviceGetHandleByIndex(0)

In [5]: nvmlDeviceGetComputeRunningProcesses(handle)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/site-packages/pynvml.py in _nvmlGetFunctionPointer(name)
    719         try:
--> 720             _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
    721             return _nvmlGetFunctionPointer_cache[name]

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/ctypes/__init__.py in __getattr__(self, name)
    386             raise AttributeError(name)
--> 387         func = self.__getitem__(name)
    388         setattr(self, name, func)

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/ctypes/__init__.py in __getitem__(self, name_or_ordinal)
    391     def __getitem__(self, name_or_ordinal):
--> 392         func = self._FuncPtr((name_or_ordinal, self))
    393         if not isinstance(name_or_ordinal, int):

AttributeError: /usr/lib/nvidia-430/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

During handling of the above exception, another exception occurred:

NVMLError_FunctionNotFound                Traceback (most recent call last)
<ipython-input-4-ef8a5a47bcb8> in <module>
----> 1 nvmlDeviceGetComputeRunningProcesses(handle)

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/site-packages/pynvml.py in nvmlDeviceGetComputeRunningProcesses(handle)
   2093 
   2094 def nvmlDeviceGetComputeRunningProcesses(handle):
-> 2095     return nvmlDeviceGetComputeRunningProcesses_v2(handle);
   2096 
   2097 def nvmlDeviceGetGraphicsRunningProcesses_v2(handle):

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/site-packages/pynvml.py in nvmlDeviceGetComputeRunningProcesses_v2(handle)
   2061     # first call to get the size
   2062     c_count = c_uint(0)
-> 2063     fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
   2064     ret = fn(handle, byref(c_count), None)
   2065 

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/site-packages/pynvml.py in _nvmlGetFunctionPointer(name)
    721             return _nvmlGetFunctionPointer_cache[name]
    722         except AttributeError:
--> 723             raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
    724     finally:
    725         # lock is always freed

NVMLError_FunctionNotFound: Function Not Found

@wookayin
Owner

wookayin commented Aug 4, 2021

So I think it is the driver version that is old, not the graphics card. BTW, it is recommended to install NVIDIA drivers from the official binaries. (Although gpustat can still support such legacy drivers.)

With the old nvidia drivers, however, can you try the following?

import pynvml

pynvml.nvmlInit()
v1 = pynvml._nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses")
v2 = pynvml._nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")

I guess v1 will still work but v2 will raise as you already showed in the stacktrace. In my environments with a recent version of nvidia driver, both work.

@XuehaiPan
Contributor

XuehaiPan commented Aug 4, 2021

I guess v1 will still work but v2 will raise as you already showed in the stacktrace. In my environments with a recent version of nvidia driver, both work.

| v1 | NVIDIA 430.64 | NVIDIA 470.57.02 |
| --- | --- | --- |
| nvidia-ml-py==11.450.51 | works, but without CI ID / GI ID | works, but without CI ID / GI ID |
| nvidia-ml-py>=11.450.129 | no exceptions in Python, but gets wrong results (subscript out of range in the C library) | no exceptions in Python, but gets wrong results (subscript out of range in the C library) |

| v2 | NVIDIA 430.64 | NVIDIA 470.57.02 |
| --- | --- | --- |
| nvidia-ml-py==11.450.51 | function not found | no exceptions in Python, but gets wrong results (subscript out of range in the C library) |
| nvidia-ml-py>=11.450.129 | function not found | works, with correct CI ID / GI ID |

@wookayin
Owner

wookayin commented Aug 5, 2021

@XuehaiPan Can you elaborate on the meaning of CI ID / GI ID?
Got it. Compute Instance (CI) ID and GPU Instance (GI) ID.

So I think falling back to the v1 function for old drivers will be the best option to make obtaining process information work in either case.
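A minimal sketch of that fallback, purely for illustration (the helper name get_compute_processes and the _ProcessInfoV1 struct are hypothetical, not gpustat's implementation), assuming nvidia-ml-py >= 11.450.129; the v1 struct layout is spelled out explicitly to avoid the wrong-results case from the table above:

# Hypothetical sketch: try the v2 query first; on old drivers that lack the
# _v2 symbol, call the v1 symbol directly with the v1 struct layout
# (pid + usedGpuMemory only, no GI/CI IDs).
from ctypes import Structure, byref, c_uint, c_ulonglong

import pynvml


class _ProcessInfoV1(Structure):
    _fields_ = [("pid", c_uint), ("usedGpuMemory", c_ulonglong)]


def get_compute_processes(handle):
    try:
        # New bindings route this to nvmlDeviceGetComputeRunningProcesses_v2.
        return pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    except pynvml.NVMLError_FunctionNotFound:
        fn = pynvml._nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses")
        count = c_uint(0)
        ret = fn(handle, byref(count), None)  # first call only queries the count
        if ret == pynvml.NVML_SUCCESS:
            return []  # no running compute processes
        if ret != pynvml.NVML_ERROR_INSUFFICIENT_SIZE:
            raise pynvml.NVMLError(ret)
        count.value = count.value * 2 + 5  # oversize the buffer, as pynvml does
        procs = (_ProcessInfoV1 * count.value)()
        pynvml._nvmlCheckReturn(fn(handle, byref(count), procs))
        return [{"pid": p.pid, "usedGpuMemory": p.usedGpuMemory}
                for p in procs[:count.value]]

This should return process info on either driver, though on old drivers it naturally loses the GI/CI IDs, which the v1 API simply does not report.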

@tengzi-will

This comment was marked as off-topic.

@wookayin

This comment was marked as off-topic.

@wookayin wookayin modified the milestones: 1.0, 1.1 Sep 4, 2022
@wookayin wookayin modified the milestones: 1.1, 1.2 Mar 2, 2023
@starry91

starry91 commented Mar 9, 2023

The official implementation of NVML Python Bindings (nvidia-ml-py) added MIG support since nvidia-ml-py>=11.450.129

@XuehaiPan @wookayin Correct me if I am wrong, but as per my understanding the only support here is w.r.t. memory usage and MIG-profile-related info, not the utilization stats. Last I checked, DCGM was the only way to get utilization stats for MIG-enabled devices.

@diricxbart

Hello @wookayin,
would you have an update on this topic?
I will need MIG support for this in the coming weeks and would like to know if we should start looking into other solutions...
I could assist with validation if needed, based on an NVIDIA A100.
