
[BUG] Unexpected worker restarting when using UCX + RMM + XGB Dask API #337

Closed
daxiongshu opened this issue Jul 2, 2020 · 7 comments

@daxiongshu

I have a worker-restart problem when using UCX + RMM + the XGB Dask API with a dataset that isn't very big (7 GB). The DGX Station has 4x V100 GPUs, which should be more than enough. Data can be downloaded here.

Library versions:

dask-core                 2.19.0                     py_0    conda-forge
dask-cuda                 0.14.1                   py37_0    rapidsai
dask-cudf                 0.14.0                   py37_0    rapidsai
dask-labextension         2.0.1                    pypi_0    pypi
dask-xgboost              0.2.0.dev28      cuda10.1py37_0    rapidsai
xgboost                   1.1.0dev.rapidsai0.14  cuda10.1py37_0    rapidsai
rmm                       0.14.0                   py37_0    rapidsai
cudf                      0.14.0                   py37_0    rapidsai

The error messages are a bit random. The three most common are:

WARNING:root:UCX failed closing worker 0x7f77e0002480 (probably already closed): UCXError('Comm Error "[Send shutdown] ep: 0x7f77e0002480, tag: 0xf85cdf408a666753, close_after_n_recv: 174": Input/output error')
distributed.scheduler - CRITICAL - Tried writing to closed comm: {'op': 'lost-data', 'key': 'tuple-38449c0f-0ac3-4e97-83e2-1a949bab964b'}
distributed.nanny - WARNING - Restarting worker

Two observations:

  1. If rmm_pool_size="31GB" is removed, no error occurs.
  2. If the data is down-sampled to 10% (700 MB), no error occurs, even with rmm_pool_size="31GB".

Code:

import os, time
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"

import cudf, cupy, time, rmm, dask, dask_cudf, dask_cuda
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import xgboost as xgb
print('XGB Version',xgb.__version__)
print('cudf Version',cudf.__version__)
print('dask Version',dask.__version__)
print('dask_cuda Version',dask_cuda.__version__)
print('dask_cudf Version',dask_cudf.__version__)

cluster = LocalCUDACluster(protocol="ucx", 
                           rmm_pool_size="31GB",
                           enable_tcp_over_ucx=True, enable_nvlink=True)
client = Client(cluster)
client

path = '/raid/data/recsys/train_split'
train = dask_cudf.read_parquet(f'{path}/train-preproc-fold-*.parquet') # 10 parquet files, 700MB each.
cols_drop = ['timestamp','engage_time','b_user_id','a_user_id',
             'a_account_creation','b_account_creation','domains','tweet_id',
             'links','hashtags0', 'hashtags1', 'fold']
train = train.drop(cols_drop,axis=1)
label_names = ['reply', 'retweet', 'retweet_comment', 'like']
features = [c for c in train.columns if c not in label_names]

Y_train = train[label_names]
train = train.drop(label_names,axis=1)
for col in train.columns:
    if train[col].dtype=='bool':
        train[col] = train[col].astype('int8')
train, Y_train= dask.persist(train,Y_train)

xgb_parms = { 
    'max_depth':8, 
    'learning_rate':0.1, 
    'subsample':0.8,
    'colsample_bytree':0.3, 
    'eval_metric':'logloss',
    'objective':'binary:logistic',
    'tree_method':'gpu_hist',
    'predictor' : 'gpu_predictor'
}

NROUND = 10
VERBOSE_EVAL = 50

name = label_names[0]
print('#'*25);print('###',name);print('#'*25)

start = time.time(); print('Creating DMatrix...')
dtrain = xgb.dask.DaskDMatrix(client,data=train,label=Y_train[name])
print('Took %.1f seconds'%(time.time()-start))

start = time.time(); print('Training...')
model = xgb.dask.train(client, xgb_parms, 
                       dtrain=dtrain,
                       num_boost_round=NROUND,
                       verbose_eval=VERBOSE_EVAL) 
print('Took %.1f seconds'%(time.time()-start))
@pentschev
Member

Thanks for reporting this @daxiongshu. Could you be more specific about what you mean by "worker restart"? Does it mean that a worker crashes and restarts during the workflow, or that it happens when you're shutting down the cluster to restart it? Does the workflow complete with successful results, or not at all? If it completes, is this more of an annoyance, or does it have a very negative effect in some respect?

@daxiongshu
Author

daxiongshu commented Jul 2, 2020

Sorry for leaving out the key part. Yes, it's definitely this one:

it happens during the workflow that a worker crashes and restarts

The worker restarts at the line model = xgb.dask.train(client, xgb_parms,. It happens immediately after this line is executed, so I don't think any training happens at all. The training never completes and the program just hangs there with no results.

@pentschev
Member

@daxiongshu I've tried your code and data and I get OOM errors:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorMemoryAllocation: out of memory
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorMemoryAllocation: out of memory
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorMemoryAllocation: out of memory
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorMemoryAllocation: out of memory

Do you see those too?

What happens here is that you're allocating the entire GPU's memory to Dask's RMM pool, leaving no memory for xgboost. When I decrease the pool size to 16GB it completes, even though there are some endpoint-closing errors at the end, which are known issues with UCX-Py at the moment.
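
For reference, a minimal sketch of that reduced-pool setup against the reproducer above (16GB is just the value I tested with; any size that leaves real headroom for xgboost's own allocations should behave the same):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Same cluster as in the reproducer, but with a smaller RMM pool so that
# xgboost still has free GPU memory to allocate outside the pool.
cluster = LocalCUDACluster(protocol="ucx",
                           rmm_pool_size="16GB",   # was "31GB"
                           enable_tcp_over_ucx=True,
                           enable_nvlink=True)
client = Client(cluster)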

By default xgboost won't use the same RMM pool, which is what ends up causing the OOM errors; the solution would be to have xgboost allocate its memory through RMM. I don't know if that's possible today, but I'm hoping @kkraus14 knows, or can point us to someone who does.
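
If that does become possible, the usage would presumably look something like the sketch below. Treat it as an assumption: the use_rmm global configuration flag only exists in later xgboost releases (1.2+) built with the RMM plugin, not in the 1.1.0dev build listed above, and in a Dask setup the same configuration would have to be in effect in each worker process.

import rmm
import xgboost as xgb

# Assumption: requires an xgboost build (>= 1.2) compiled with the RMM plugin.
# Create a per-process RMM pool, then tell xgboost to allocate device memory
# through RMM instead of making its own CUDA allocations.
rmm.reinitialize(pool_allocator=True, initial_pool_size=16 * 1024**3)
xgb.set_config(use_rmm=True)
# DMatrix construction and training would then draw from the same pool as cudf.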

@daxiongshu
Author

Thank you so much! Yes, I should have mentioned it earlier. At one point I was wondering whether XGB uses RMM's memory allocator. If xgb could use the RMM pool, that would be great for my current application.

@kkraus14

kkraus14 commented Jul 3, 2020

@daxiongshu would you mind opening an issue on the xgboost github for discussion? cc @RAMitchell @trivialfis

Closing this, as it has been determined not to be a dask-cuda issue.

@kkraus14 kkraus14 closed this as completed Jul 3, 2020
@quasiben
Member

quasiben commented Jul 6, 2020

@daxiongshu did you end up opening an xgb issue? If so, can you link it here?

@daxiongshu
Author

Here it is: dmlc/xgboost#5861. Thank you all.
