[BUG] Unexpected worker restarting when using UCX + RMM + XGB Dask API #337
Thanks for reporting this @daxiongshu. Could you be more specific about what you mean by "worker restart"? Does a worker crash and restart during the workflow, or does this happen when you're shutting down the cluster to restart it? Does the workflow complete with successful results, or not at all? If it completes, is this more of an annoyance, or does it have a seriously negative effect in some respect?
Sorry for missing the key part. Yes, definitely that one.
The worker restarts at this line
@daxiongshu I've tried your code and data and I get OOM errors:
Do you see those too? What happens here is that you're allocating the entire GPU memory to Dask's RMM pool, leaving no memory for XGBoost. When I decrease the pool size to 16GB it completes, although there are some endpoint-closing errors at the end, which are known issues with UCX-Py at the moment. By default XGBoost won't use the same RMM pool, and that ends up causing the OOM errors; the solution would be to replace XGBoost's memory allocator with RMM. I don't know if that's possible today, but I'm hoping @kkraus14 would know, or could point us to someone who does.
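To make the workaround above concrete, here is a minimal sketch of a dask-cuda cluster that uses UCX but caps the RMM pool below total GPU memory, leaving headroom for XGBoost's own allocations. The parameter values (and the `parse_size` helper) are illustrative assumptions, not the reporter's exact configuration:

```python
def parse_size(size: str) -> int:
    """Convert a human-readable size string like '16GB' into bytes.

    Hypothetical helper for illustration; dask-cuda accepts the
    string form directly.
    """
    units = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}
    for unit, factor in units.items():
        if size.upper().endswith(unit):
            return int(float(size[: -len(unit)]) * factor)
    return int(size)


def make_cluster(pool_size: str = "16GB"):
    """Sketch: requires CUDA GPUs plus dask-cuda and ucx-py installed.

    Uses an RMM pool smaller than total GPU memory (e.g. 16GB on a
    32GB V100), so XGBoost's separate allocations still fit.
    """
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    cluster = LocalCUDACluster(
        protocol="ucx",           # UCX transport between workers
        enable_tcp_over_ucx=True,
        rmm_pool_size=pool_size,  # leave headroom for XGBoost
    )
    return Client(cluster)
```

With `rmm_pool_size="31GB"` on a 32GB GPU, essentially all device memory is pre-reserved by the pool, so XGBoost's first allocation fails.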
Thank you so much! Yes, I should have said it earlier. I was wondering at one point whether XGB uses RMM's memory allocator. If XGB could use the RMM pool, that would be great for my current application.
@daxiongshu would you mind opening an issue on the xgboost GitHub for discussion? cc @RAMitchell @trivialfis. Closing this, as it is resolved not to be an issue with dask-cuda.
@daxiongshu did you end up opening an XGBoost issue? If so, can you link it here?
Here it is: dmlc/xgboost#5861. Thank you all.
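For later readers: the linked XGBoost issue led to optional RMM support in XGBoost. A sketch of using it, under the assumption of an RMM-enabled XGBoost build; the `use_rmm` global configuration flag comes from newer XGBoost releases and does not apply to older versions:

```python
def train_with_rmm(params, dtrain):
    """Sketch: route XGBoost's GPU allocations through RMM.

    Assumes XGBoost was built with RMM support and that an RMM pool
    has already been initialized (e.g. by dask-cuda's rmm_pool_size).
    """
    import xgboost as xgb

    # `use_rmm=True` asks XGBoost to allocate device memory via RMM,
    # so it shares the existing pool instead of competing with it.
    with xgb.config_context(use_rmm=True):
        return xgb.train(params, dtrain, num_boost_round=100)
```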
I have this worker restart problem when using UCX + RMM + the XGBoost Dask API with a fairly small dataset (7GB). The DGX Station has 4x V100 GPUs, which should be more than enough. Data can be downloaded here
Library versions:
Error messages are a bit random. The most common three are:
Two observations:

1. If rmm_pool_size="31GB" is deleted, no error occurs.
2. The error happens with rmm_pool_size="31GB".

Code: