Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing to create metadata from key/shuffle/payload in local_cudf_merge benchmark (GPU version) #375

Closed
aamirshafi opened this issue Aug 15, 2020 · 4 comments

Comments

@aamirshafi
Copy link

I am seeing errors while running local_cudf_merge.py using the UCX communication device of Dask Distributed on a cluster of GPUs connected by InfiniBand. The scenario shown below uses three nodes of the cluster.

I am able to run a custom version of local_cupy_transpose using numpy and cupy.

The following commands are used to run scheduler, 1 worker, and the client on different nodes.

Any help to fix this issue is appreciated.

Scheduler:

LD_LIBRARY_PATH=/opt/gdrcopy2.0/lib64 UCXPY_NON_BLOCKING_MODE=1 UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc_x,sm,gdr_copy,cuda_copy,cuda_ipc,self UCX_CUDA_IPC_CACHE=n UCX_MEMTYPE_CACHE=n dask-scheduler --interface ib0 --protocol ucx

Worker:

LD_LIBRARY_PATH=/opt/gdrcopy2.0/lib64:/home/shafi.16/sw/miniconda3/lib UCXPY_NON_BLOCKING_MODE=1 UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc,sm,gdr_copy,cuda_copy,cuda_ipc,self UCX_CUDA_IPC_CACHE=n UCX_MEMTYPE_CACHE=n dask-worker ucx://10.3.1.6:8786 --no-nanny

Client (output shows error):

LD_LIBRARY_PATH=/opt/gdrcopy2.0/lib64 UCXPY_NON_BLOCKING_MODE=1 UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc_x,sm,gdr_copy,cuda_copy,cuda_ipc,self UCX_CUDA_IPC_CACHE=n UCX_MEMTYPE_CACHE=n python local_cudf_merge.py -t gpu --scheduler-addr ucx://10.3.1.6:8786

get_cluster_options
get_cluster_options: else
returning cluster object
inside run()
calling get_random_ddf with build
get_random_ddf
generate_chunk
chunk type is build
xdf.DataFrame
start= 0
stop= 4
local_size= 4
parts_array [0]
suffle_array [0 0 0 0]
shuffle [0 0 0 0]
payload [0 3 1 2]
calling new_dd_object
graph =  {('generate-data-845b9508fbd08ec4c46725ff5029af24', 0): (<function generate_chunk at 0x2b83aaef7320>, 0, 1000000, 1, 'build', 0.3, True)}
name =  generate-data-845b9508fbd08ec4c46725ff5029af24
meta =     key  shuffle  payload
0    0        0        0
1    1        0        2
2    2        0        3
3    3        0        1
divisions =  [None, None]
Traceback (most recent call last):
  File "local_cudf_merge.py", line 579, in <module>
    main(parse_args())
  File "local_cudf_merge.py", line 250, in main
    took_list.append(run(client, args, n_workers, write_profile=None))
  File "local_cudf_merge.py", line 196, in run
    args.chunk_size, n_workers, args.frac_match, "build", args
  File "local_cudf_merge.py", line 151, in get_random_ddf
    ddf = new_dd_object(graph, name, meta, divisions)
  File "/home/shafi.16/sw/miniconda3/envs/dask/lib/python3.7/site-packages/dask/dataframe/core.py", line 6311, in new_dd_object
    return get_parallel_type(meta)(dsk, name, meta, divisions)
  File "/home/shafi.16/sw/miniconda3/envs/dask/lib/python3.7/site-packages/dask/dataframe/core.py", line 119, in __init__
    meta = make_meta(meta)
  File "/home/shafi.16/sw/miniconda3/envs/dask/lib/python3.7/site-packages/dask/utils.py", line 507, in __call__
    return meth(arg, *args, **kwargs)
  File "/home/shafi.16/sw/miniconda3/envs/dask/lib/python3.7/site-packages/dask/dataframe/utils.py", line 364, in make_meta_object
    raise TypeError("Don't know how to create metadata from {0}".format(x))
TypeError: Don't know how to create metadata from    key  shuffle  payload
0    0        0        0
1    1        0        2
2    2        0        3
3    3        0        1

Client (CPU Version of cuDF):

$ LD_LIBRARY_PATH=/opt/gdrcopy2.0/lib64 UCXPY_NON_BLOCKING_MODE=1 UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc_x,sm,gdr_copy,cuda_copy,cuda_ipc,self UCX_CUDA_IPC_CACHE=n UCX_MEMTYPE_CACHE=n python local_cudf_merge.py -t cpu --scheduler-addr ucx://10.3.1.6:8786

get_cluster_options
get_cluster_options: else
returning cluster object
inside run()
calling get_random_ddf with build
get_random_ddf
generate_chunk
chunk type is build
xdf.DataFrame
start= 0
stop= 4
local_size= 4
parts_array [0]
suffle_array [0 0 0 0]
shuffle [0 0 0 0]
payload [1 2 0 3]
calling new_dd_object
graph =  {('generate-data-845b9508fbd08ec4c46725ff5029af24', 0): (<function generate_chunk at 0x2b5525646320>, 0, 1000000, 1, 'build', 0.3, False)}
name =  generate-data-845b9508fbd08ec4c46725ff5029af24
meta =     key  shuffle  payload
0    0        0        3
1    1        0        1
2    2        0        2
3    3        0        0
.. .. ..
.. .. ..
divisions =  [None, None]
called new_dd_object
calling get_random_ddf with other
get_random_ddf
generate_chunk
chunk type is other
xdf.DataFrame
calling new_dd_object
graph =  {('generate-data-ec53664f0c43a372ef95ed87984423df', 0): (<function generate_chunk at 0x2b5525646320>, 0, 1000000, 1, 'other', 0.3, False)}
name =  generate-data-ec53664f0c43a372ef95ed87984423df
meta =     key  payload
0    0        0
1    2        1
2    1        3
3    3        2
divisions =  [None, None]
called new_dd_object
merge called
Merge benchmark
-------------------------------
backend        | dask
merge type     | cpu
rows-per-chunk | 1000000
protocol       | tcp
device(s)      | 0
rmm-pool       | True
frac-match     | 0.3
data-processed | 32.00 MB
===============================
Wall-clock     | Throughput
-------------------------------
1.56 s         | 20.53 MB/s
1.80 s         | 17.73 MB/s
1.60 s         | 20.04 MB/s
===============================
(w1,w2)     | 25% 50% 75% (total nbytes)
-------------------------------

Environment Details:

$ conda info
     active environment : dask
    active env location : /home/shafi.16/sw/miniconda3/envs/dask
            shell level : 2
       user config file : /home/shafi.16/.condarc
 populated config files : 
          conda version : 4.8.4
    conda-build version : not installed
         python version : 3.7.7.final.0
       virtual packages : __cuda=11.0
                          __glibc=2.17
       base environment : /home/shafi.16/sw/miniconda3  (writable)
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /home/shafi.16/sw/miniconda3/pkgs
                          /home/shafi.16/.conda/pkgs
       envs directories : /home/shafi.16/sw/miniconda3/envs
                          /home/shafi.16/.conda/envs
               platform : linux-64
             user-agent : conda/4.8.4 requests/2.24.0 CPython/3.7.7 Linux/3.10.0-693.17.1.el7.x86_64 centos/7.6.1810 glibc/2.17
                UID:GID : 1235:1027
             netrc file : None
           offline mode : False

$ conda list
# packages in environment at /home/shafi.16/sw/miniconda3/envs/dask:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
appdirs                   1.4.4                    pypi_0    pypi
arrow-cpp                 0.15.0           py37h090bef1_2    conda-forge
attrs                     19.3.0                     py_0  
autoconf                  2.69            pl526hebd4dad_5  
automake                  1.16.2                  pl526_1    conda-forge
blas                      1.0                         mkl  
bokeh                     2.1.1            py37hc8dfbb8_0    conda-forge
boost-cpp                 1.70.0               ha2d47e9_1    conda-forge
brotli                    1.0.7             he1b5a44_1004    conda-forge
bzip2                     1.0.8                h516909a_2    conda-forge
c-ares                    1.16.1               h516909a_0    conda-forge
ca-certificates           2020.6.24                     0  
certifi                   2020.6.20                py37_0  
click                     7.1.2              pyh9f0ad1d_0    conda-forge
cloudpickle               1.5.0                      py_0    conda-forge
cudatoolkit               10.2.89              h6bb024c_0    nvidia
cudf                      0.14.0                   py37_0    rapidsai
cudnn                     7.6.5                cuda10.2_0  
cupy                      7.7.0            py37h940342b_0    conda-forge
cupy-cuda102              7.7.0                    pypi_0    pypi
cython                    0.29.21          py37h3340039_0    conda-forge
cytoolz                   0.10.1           py37h516909a_0    conda-forge
dask                      2.23.0                     py_0    conda-forge
dask-core                 2.23.0                     py_0    conda-forge
dask-cuda                 0.14.1                   py37_0    rapidsai
dask-mpi                  2.0.0+14.g9564954.dirty          pypi_0    pypi
decorator                 4.4.2                    pypi_0    pypi
distributed               2.21.0+20.g9b7eaa2.dirty          pypi_0    pypi
dlpack                    0.3                  he1b5a44_1    conda-forge
double-conversion         3.1.5                he1b5a44_2    conda-forge
fastavro                  0.24.0           py37h8f50634_0    conda-forge
fastrlock                 0.5                      pypi_0    pypi
freetype                  2.10.2               he06d7ca_0    conda-forge
fsspec                    0.8.0                      py_0    conda-forge
gflags                    2.2.2             he1b5a44_1004    conda-forge
glog                      0.4.0                h49b9bf7_3    conda-forge
grpc-cpp                  1.23.0               h18db393_0    conda-forge
heapdict                  1.0.1                      py_0    conda-forge
icu                       58.2                 he6710b0_3  
importlib-metadata        1.7.0                    py37_0  
importlib_metadata        1.7.0                         0  
iniconfig                 1.0.1                      py_0  
intel-openmp              2020.1                      217  
jinja2                    2.11.2             pyh9f0ad1d_0    conda-forge
jpeg                      9d                   h516909a_0    conda-forge
lcms2                     2.11                 hbd6801e_0    conda-forge
ld_impl_linux-64          2.33.1               h53a641e_7  
libcudf                   0.14.0               cuda10.2_0    rapidsai
libedit                   3.1.20191231         h14c3975_1  
libevent                  2.1.10               hcdb4288_1    conda-forge
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.3.0               h24d8f2e_15    conda-forge
libgomp                   9.3.0               h24d8f2e_15    conda-forge
libhwloc                  2.1.0                h3c4fd83_0    conda-forge
libllvm9                  9.0.1                h4a3c616_1  
libnvstrings              0.14.0               cuda10.2_0    rapidsai
libpng                    1.6.37               hed695b0_2    conda-forge
libprotobuf               3.8.0                h8b12597_0    conda-forge
librmm                    0.14.0               cuda10.2_0    rapidsai
libstdcxx-ng              9.1.0                hdf63c60_0  
libtiff                   4.1.0                hc7e4089_6    conda-forge
libtool                   2.4.6             h516909a_1003    conda-forge
libwebp-base              1.1.0                h516909a_3    conda-forge
libxml2                   2.9.10               he19cac6_1  
llvm-openmp               8.0.1                hc9558a2_0    conda-forge
llvmlite                  0.33.0           py37hc6ec683_1  
locket                    0.2.0                      py_2    conda-forge
lz4-c                     1.8.3             he1b5a44_1001    conda-forge
m4                        1.4.18               h4e445db_0  
make                      4.3                  h516909a_0    conda-forge
mako                      1.1.3                    pypi_0    pypi
markupsafe                1.1.1            py37h8f50634_1    conda-forge
mkl                       2020.1                      217  
mkl-service               2.3.0            py37he904b0f_0  
mkl_fft                   1.1.0            py37h23d657b_0  
mkl_random                1.1.1            py37h0573a6f_0  
more-itertools            8.4.0                      py_0  
mpi4py                    3.1.0a0                  pypi_0    pypi
msgpack-python            1.0.0            py37h99015e2_1    conda-forge
nccl                      2.7.8.1              hc6a2c23_0    conda-forge
ncurses                   6.2                  he6710b0_1  
numba                     0.50.1           py37h0da4684_1    conda-forge
numpy                     1.19.1           py37hbc911f0_0  
numpy-base                1.19.1           py37hfa32c7d_0  
nvstrings                 0.14.0                   py37_0    rapidsai
olefile                   0.46                       py_0    conda-forge
openmp                    8.0.1                         0    conda-forge
openssl                   1.1.1g               h7b6447c_0  
packaging                 20.4                       py_0  
pandas                    0.25.3           py37hb3f55d8_0    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
partd                     1.1.0                      py_0    conda-forge
perl                      5.26.2               h14c3975_0  
pillow                    7.2.0            py37h718be6c_1    conda-forge
pip                       20.2.2                   py37_0  
pkg-config                0.29.2            h516909a_1006    conda-forge
pluggy                    0.13.1                   py37_0  
psutil                    5.7.2            py37h8f50634_0    conda-forge
py                        1.9.0                      py_0  
pyarrow                   0.15.0           py37h8b68381_1    conda-forge
pycuda                    2019.1.2                 pypi_0    pypi
pynvml                    8.0.4                      py_1    conda-forge
pyparsing                 2.4.7                      py_0  
pytest                    6.0.1            py37hc8dfbb8_0    conda-forge
pytest-asyncio            0.12.0           py37hc8dfbb8_1    conda-forge
python                    3.7.7                hcff3b4d_5  
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
pytools                   2020.3.1                 pypi_0    pypi
pytz                      2020.1             pyh9f0ad1d_0    conda-forge
pyyaml                    5.3.1            py37h8f50634_0    conda-forge
re2                       2020.04.01           he1b5a44_0    conda-forge
readline                  8.0                  h7b6447c_0  
rmm                       0.14.0                   py37_0    rapidsai
setuptools                49.6.0           py37hc8dfbb8_0    conda-forge
six                       1.15.0                     py_0  
snappy                    1.1.8                he1b5a44_3    conda-forge
sortedcontainers          2.2.2              pyh9f0ad1d_0    conda-forge
spdlog                    1.7.0                hc9558a2_2    conda-forge
sqlite                    3.32.3               h62c20be_0  
tblib                     1.6.0                      py_0    conda-forge
thrift-cpp                0.12.0            hf3afdfd_1004    conda-forge
tk                        8.6.10               hbc83047_0  
toml                      0.10.1                     py_0  
toolz                     0.10.0                     py_0    conda-forge
tornado                   6.0.4            py37h8f50634_1    conda-forge
typing_extensions         3.7.4.2                    py_0    conda-forge
ucr-py                    0.1                      pypi_0    pypi
ucx                       1.8.0+gf6ec8d4      cuda10.2_20    rapidsai
ucx-proc                  1.0.0                       gpu    conda-forge
ucx-py                    0.15.0a0+167.gc44a451          pypi_0    pypi
uriparser                 0.9.3                he1b5a44_1    conda-forge
wheel                     0.34.2                   py37_0  
xz                        5.2.5                h7b6447c_0  
yaml                      0.2.5                h516909a_0    conda-forge
zict                      2.0.0                    pypi_0    pypi
zipp                      3.1.0                      py_0  
zlib                      1.2.11               h7b6447c_3  
zstd                      1.4.4                h3b9ef0a_2    conda-forge
$ nvidia-smi 
Sat Aug 15 13:29:27 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   48C    P0    37W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
@pentschev
Copy link
Member

Thanks @aamirshafi for the detailed report. I notice you're running ucx-py 0.15 but the rest of rapids (cudf, dask-cuda, rmm, etc.) on 0.14. Could you try the same version of all the packages, if possible I would suggest 0.15 nightlies.

Apart from that, there are a few more comments/suggestions:

  1. We don't currently support or test rc_x at all, I'm not sure what's its status, so I'd suggest rc;
  2. There are issues with sm in UCX 1.8 (see Disable UCP_FEATURE_WAKEUP in non-blocking mode ucx-py#565), so that doesn't work at the moment;
  3. We've recently found out that there's no ABI compatibility guarantees between OFED versions, which makes it impossible for us to support it at the moment and current packages may not work properly with IB, which is forcing us to drop support entirely for conda packages in 0.15, that will be done in the upcoming days and the only proper solution at the moment is to build UCX from source so that it builds against the system's OFED;
  4. dask-cuda has several improvements that are important for performance, I'd suggest dask-cuda-worker instead of dask-cuda.

As per OFED version, we're trying to figure out how we will support it in the future, but we have no clear path ahead yet. To help us with that, would you mind sharing the output of ofed_info in your system?

@aamirshafi
Copy link
Author

Thanks @pentschev for your suggestions.

It turns out that I did not have dask-cudf installed that was causing the failure in my case. We can close this issue. Since dask-cudf has been archived in favor of cudf, I was not sure if there is a need to install dask-cudf. Apparently it is still required.

@jakirkham
Copy link
Member

Dask-cuDF is still maintained. It just got migrated into the cuDF source tree.

@pentschev
Copy link
Member

It turns out that I did not have dask-cudf installed that was causing the failure in my case.

Nice catch, I totally missed that as well.

Since dask-cudf has been archived in favor of cudf, I was not sure if there is a need to install dask-cudf. Apparently it is still required.

As @jakirkham wrote, dask-cudf is still a required piece for cuDF+Dask, only the GitHub repo was deprecated and the code moved into cuDF's repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants