[BUG] Dask-CUDA does not work with Merlin/NVTabular #363
@jperez999 - Do you think this line is actually necessary? We already have:

if not HAS_GPU:
    cuda = None
Besides the above, I was looking at the code in more detail and I see the following block (core/merlin/core/compat/__init__.py, lines 102 to 105 in 6e52b48). This creates a new context on a GPU only to query its memory size, and a CUDA context should never be created before Dask initializes the cluster. Also note core/merlin/core/compat/__init__.py, lines 57 to 60 in 6e52b48: the PyNVML code will NOT create a CUDA context and is safe to run before Dask. Is there a reason why you're using the Numba code block to query GPU memory instead of always using PyNVML for that?
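The PyNVML path can be sketched as follows (an illustrative helper, not the actual merlin.core.compat code; total_gpu_memory is a hypothetical name, and the snippet falls back to None when pynvml or a GPU is unavailable):

```python
# Query total GPU memory via NVML, which does not create a CUDA context
# and is therefore safe to call before Dask-CUDA initializes its workers.
def total_gpu_memory(device_index=0):
    try:
        import pynvml
        pynvml.nvmlInit()
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
            return pynvml.nvmlDeviceGetMemoryInfo(handle).total
        finally:
            pynvml.nvmlShutdown()
    except Exception:
        return None  # no NVML bindings or no GPU on this machine

print(total_gpu_memory())
```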
As pointed out by @oliverholworthy in #274 (comment), cuda.is_available() is used in merlin.core.compat to check for CUDA support. Unfortunately, this is a known problem for Dask-CUDA, because the check can create a CUDA context in the parent process before the cluster starts. This most likely means that Merlin/NVTabular has not worked properly with Dask-CUDA for more than six months now. For example, the following code will produce an OOM error for 32GB V100s:
You will also see an error if you don't import any Merlin/NVTabular code but use the offending cuda.is_available() command directly.

Meanwhile, the code works fine if you don't use the offending command or import code that also imports
merlin.core.compat.
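Neither of the two code blocks referenced above survived in this copy of the issue. As a hedged illustration, the offending command in isolation looks like this (guarded so the snippet also runs on machines without Numba or a GPU driver):

```python
# Calling numba.cuda.is_available() directly can create a CUDA context in
# the current process as a side effect; that is the same side effect that
# breaks a LocalCUDACluster started afterwards.
try:
    from numba import cuda
    available = cuda.is_available()  # may initialize CUDA here
except Exception:
    available = None                 # Numba not installed / no driver
print("cuda available:", available)
```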