Failing to create metadata from key/shuffle/payload in local_cudf_merge benchmark (GPU version) #375
Comments
Thanks @aamirshafi for the detailed report. I notice you're running ucx-py 0.15 but the rest of RAPIDS (cudf, dask-cuda, rmm, etc.) on 0.14. Could you try running the same version of all the packages? If possible, I would suggest the 0.15 nightlies. Apart from that, a few more comments/suggestions:
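A quick way to confirm the versions line up is to print what each package reports. This is a minimal sketch; the exact package list is an assumption based on the stack mentioned above, and note that ucx-py imports as ucp:

```python
# Sketch: print the version each RAPIDS-related package reports, so a
# mismatched release series (e.g. 0.14 vs 0.15) stands out immediately.
import importlib

for pkg in ("ucp", "cudf", "rmm", "dask_cuda", "dask", "distributed"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{pkg}: not installed")
```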
As for the OFED version, we're trying to figure out how we will support it in the future, but we have no clear path ahead yet. To help us with that, would you mind sharing the output of …
Thanks @pentschev for your suggestions. It turns out that I did not have dask-cudf installed, which was causing the failure in my case. We can close this issue. Since dask-cudf has been archived in favor of cudf, I was not sure whether dask-cudf still needs to be installed. Apparently it is still required.
Dask-cuDF is still maintained. It just got migrated into the cuDF source tree.
Nice catch, I totally missed that as well.
As @jakirkham wrote, dask-cudf is still a required piece for cuDF+Dask; only the GitHub repo was deprecated and the code moved into cuDF's repository.
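For anyone hitting the same failure, a quick import check makes the missing piece obvious. A minimal sketch; the suggestion to install from the rapidsai conda channel is an assumption about the usual setup:

```python
# Sketch: verify dask-cudf is importable before running the benchmark.
# If this raises, install dask-cudf alongside cudf (e.g. from the
# rapidsai conda channel, matching the cudf version you already have).
try:
    import dask_cudf
    print("dask_cudf", dask_cudf.__version__)
except ImportError as err:
    raise SystemExit(f"dask-cudf is missing: {err}")
```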
Original report from @aamirshafi:
I am seeing errors while running local_cudf_merge.py using the UCX communication protocol of Dask Distributed on a cluster of GPUs connected over InfiniBand. The scenario shown below uses three nodes of the cluster.
I am able to run a custom version of local_cupy_transpose using numpy and cupy.
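For context, the transpose check that does work looks roughly like the following. This is a hypothetical reconstruction, not the exact script; the array size, chunking, and scheduler address are assumptions:

```python
# Rough sketch of a CuPy-backed transpose test over the same UCX cluster.
# x + x.T forces chunks to be exchanged between workers, exercising the
# UCX transports without involving cuDF at all.
import cupy
import dask.array as da
from dask.distributed import Client

client = Client("ucx://10.3.1.6:8786")

rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.random_sample((20000, 20000), chunks=(5000, 5000))
result = (x + x.T).sum().compute()
print("sum:", result)
```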
The following commands are used to run the scheduler, one worker, and the client on different nodes.
Any help fixing this issue is appreciated.
Scheduler:
LD_LIBRARY_PATH=/opt/gdrcopy2.0/lib64 UCXPY_NON_BLOCKING_MODE=1 UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc_x,sm,gdr_copy,cuda_copy,cuda_ipc,self UCX_CUDA_IPC_CACHE=n UCX_MEMTYPE_CACHE=n dask-scheduler --interface ib0 --protocol ucx
Worker:
LD_LIBRARY_PATH=/opt/gdrcopy2.0/lib64:/home/shafi.16/sw/miniconda3/lib UCXPY_NON_BLOCKING_MODE=1 UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc,sm,gdr_copy,cuda_copy,cuda_ipc,self UCX_CUDA_IPC_CACHE=n UCX_MEMTYPE_CACHE=n dask-worker ucx://10.3.1.6:8786 --no-nanny
Client (GPU version; output shows the error):
LD_LIBRARY_PATH=/opt/gdrcopy2.0/lib64 UCXPY_NON_BLOCKING_MODE=1 UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc_x,sm,gdr_copy,cuda_copy,cuda_ipc,self UCX_CUDA_IPC_CACHE=n UCX_MEMTYPE_CACHE=n python local_cudf_merge.py -t gpu --scheduler-addr ucx://10.3.1.6:8786
Client (CPU version of the benchmark):
LD_LIBRARY_PATH=/opt/gdrcopy2.0/lib64 UCXPY_NON_BLOCKING_MODE=1 UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc_x,sm,gdr_copy,cuda_copy,cuda_ipc,self UCX_CUDA_IPC_CACHE=n UCX_MEMTYPE_CACHE=n python local_cudf_merge.py -t cpu --scheduler-addr ucx://10.3.1.6:8786
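One extra data point that can help diagnose transport problems: dumping the UCX library version and the configuration ucx-py actually picked up from the UCX_* variables above. A sketch, assuming it is run on a worker node with the same environment variables set:

```python
# Sketch: report the UCX library version and the effective UCX configuration
# as seen by ucx-py, to confirm UCX_TLS / UCX_NET_DEVICES took effect.
import ucp

print("UCX version:", ucp.get_ucx_version())
for key, value in sorted(ucp.get_config().items()):
    print(f"{key} = {value}")
```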
Environment Details: