
[BUG] Unexpected worker restarting when using UCX + RMM + XGB Dask API #337

Closed
daxiongshu opened this issue Jul 2, 2020 · 7 comments

@daxiongshu

I have a worker-restart problem when using UCX + RMM + the XGB Dask API with a dataset that isn't very big (7 GB). The DGX Station has 4x V100 GPUs, which should be more than enough. Data can be downloaded here.

Library versions:

dask-core                 2.19.0                     py_0    conda-forge
dask-cuda                 0.14.1                   py37_0    rapidsai
dask-cudf                 0.14.0                   py37_0    rapidsai
dask-labextension         2.0.1                    pypi_0    pypi
dask-xgboost              0.2.0.dev28      cuda10.1py37_0    rapidsai
xgboost                   1.1.0dev.rapidsai0.14  cuda10.1py37_0    rapidsai
rmm                       0.14.0                   py37_0    rapidsai
cudf                      0.14.0                   py37_0    rapidsai

The error messages are a bit random. The three most common are:

WARNING:root:UCX failed closing worker 0x7f77e0002480 (probably already closed): UCXError('Comm Error "[Send shutdown] ep: 0x7f77e0002480, tag: 0xf85cdf408a666753, close_after_n_recv: 174": Input/output error')
distributed.scheduler - CRITICAL - Tried writing to closed comm: {'op': 'lost-data', 'key': 'tuple-38449c0f-0ac3-4e97-83e2-1a949bab964b'}
distributed.nanny - WARNING - Restarting worker

Two observations:

  1. If rmm_pool_size="31GB" is removed, no error occurs.
  2. If the data is down-sampled to 10% (700 MB), no error occurs, even with rmm_pool_size="31GB".

Code:

import os, time
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"

import cudf, cupy, time, rmm, dask, dask_cudf, dask_cuda
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import xgboost as xgb
print('XGB Version',xgb.__version__)
print('cudf Version',cudf.__version__)
print('dask Version',dask.__version__)
print('dask_cuda Version',dask_cuda.__version__)
print('dask_cudf Version',dask_cudf.__version__)

cluster = LocalCUDACluster(protocol="ucx", 
                           rmm_pool_size="31GB",
                           enable_tcp_over_ucx=True, enable_nvlink=True)
client = Client(cluster)
client

path = '/raid/data/recsys/train_split'
train = dask_cudf.read_parquet(f'{path}/train-preproc-fold-*.parquet') # 10 parquet files, 700MB each.
cols_drop = ['timestamp','engage_time','b_user_id','a_user_id',
             'a_account_creation','b_account_creation','domains','tweet_id',
             'links','hashtags0', 'hashtags1', 'fold']
train = train.drop(cols_drop,axis=1)
label_names = ['reply', 'retweet', 'retweet_comment', 'like']
features = [c for c in train.columns if c not in label_names]

Y_train = train[label_names]
train = train.drop(label_names,axis=1)
for col in train.columns:
    if train[col].dtype=='bool':
        train[col] = train[col].astype('int8')
train, Y_train= dask.persist(train,Y_train)

xgb_parms = { 
    'max_depth':8, 
    'learning_rate':0.1, 
    'subsample':0.8,
    'colsample_bytree':0.3, 
    'eval_metric':'logloss',
    'objective':'binary:logistic',
    'tree_method':'gpu_hist',
    'predictor' : 'gpu_predictor'
}

NROUND = 10
VERBOSE_EVAL = 50

name = label_names[0]
print('#'*25);print('###',name);print('#'*25)

start = time.time(); print('Creating DMatrix...')
dtrain = xgb.dask.DaskDMatrix(client,data=train,label=Y_train[name])
print('Took %.1f seconds'%(time.time()-start))

start = time.time(); print('Training...')
model = xgb.dask.train(client, xgb_parms, 
                       dtrain=dtrain,
                       num_boost_round=NROUND,
                       verbose_eval=VERBOSE_EVAL) 
print('Took %.1f seconds'%(time.time()-start))
@pentschev
Member

Thanks for reporting this @daxiongshu. Could you be more specific about what you mean by "worker restart"? Does it mean that a worker crashes and restarts during the workflow, or that it happens when you're shutting down the cluster to restart it? Does the workflow complete with successful results, or not at all? If it completes, is this more of an annoyance, or does it have a very negative effect in some respect?

@daxiongshu
Author

daxiongshu commented Jul 2, 2020

Sorry for leaving out the key part. Yes, it's definitely this one:

it happens during the workflow that a worker crashes and restarts

The worker restarts at the line model = xgb.dask.train(client, xgb_parms,. It happens immediately after this line is executed, so I don't think any training happens at all. The training never completes and the program just hangs there with no results.

@pentschev
Member

@daxiongshu I've tried your code and data and I get OOM errors:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorMemoryAllocation: out of memory
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorMemoryAllocation: out of memory
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorMemoryAllocation: out of memory
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorMemoryAllocation: out of memory

Do you see those too?

What happens here is that you're allocating the entire GPU's memory to Dask's RMM pool, leaving no memory for xgboost. When I decrease the pool size to 16GB it completes, even though there are some endpoint-closing errors at the end, which are known issues with UCX-Py at the moment.
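
For reference, a minimal sketch of that reduced-pool setup against the reproducer above (16GB is just the value I tested with; any size that leaves real headroom for xgboost's own allocations should behave the same):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Same cluster as in the reproducer, but with a smaller RMM pool so that
# xgboost still has free GPU memory to allocate outside the pool.
cluster = LocalCUDACluster(protocol="ucx",
                           rmm_pool_size="16GB",   # was "31GB"
                           enable_tcp_over_ucx=True,
                           enable_nvlink=True)
client = Client(cluster)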

By default xgboost won't use the same RMM pool, which is what ends up causing the OOM errors; the solution would be to have xgboost allocate its memory through RMM. I don't know if that's possible today, but I'm hoping @kkraus14 knows, or can point us to someone who does.
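
If that does become possible, the usage would presumably look something like the sketch below. Treat it as an assumption: the use_rmm global configuration flag only exists in later xgboost releases (1.2+) built with the RMM plugin, not in the 1.1.0dev build listed above, and in a Dask setup the same configuration would have to be in effect in each worker process.

import rmm
import xgboost as xgb

# Assumption: requires an xgboost build (>= 1.2) compiled with the RMM plugin.
# Create a per-process RMM pool, then tell xgboost to allocate device memory
# through RMM instead of making its own CUDA allocations.
rmm.reinitialize(pool_allocator=True, initial_pool_size=16 * 1024**3)
xgb.set_config(use_rmm=True)
# DMatrix construction and training would then draw from the same pool as cudf.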

@daxiongshu
Author

Thank you so much! Yes, I should have mentioned it earlier. At one point I was wondering whether XGB uses RMM's memory allocator. If xgb could use the RMM pool, that would be great for my current application.

@kkraus14

kkraus14 commented Jul 3, 2020

@daxiongshu would you mind opening an issue on the xgboost github for discussion? cc @RAMitchell @trivialfis

Closing this, as it has been determined not to be a dask-cuda issue.

@kkraus14 kkraus14 closed this as completed Jul 3, 2020
@quasiben
Member

quasiben commented Jul 6, 2020

@daxiongshu did you end up opening an xgb issue? If so, can you link it here?

@daxiongshu
Author

Here it is: dmlc/xgboost#5861. Thank you all.
