One of our customers is running out of memory during XGBoost training:
The customer trains XGBoost models in a distributed setup, and when "max_depth" is set to a high value, the training routine easily runs out of memory.
The problem is that a given "max_depth" value works fine for some of their datasets, but the same value causes an OOM on others.
We would like to set a larger "max_depth" in most cases for better model accuracy, but we also need to make sure OOM does not happen, so this is a pain point.
Can we add params like "xgboost_train_worker_cpu_memory_usage_threshold" and "xgboost_train_worker_GPU_memory_usage_threshold", so that each XGBoost training worker tracks its memory usage and, when usage exceeds the threshold, stops increasing the model depth, finalizes the model, and stops training?
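To make the request concrete, here is a rough sketch of the kind of check I have in mind, written as a plain Python object whose `after_iteration` method follows the shape of XGBoost's `TrainingCallback.after_iteration` hook (returning `True` asks training to stop). The names `MemoryThresholdStop`, `peak_rss_bytes`, and `threshold_bytes` are hypothetical and not part of XGBoost. Note the assumptions: this only stops adding boosting rounds once the threshold is crossed, it cannot cap depth growth inside a tree that is currently being built, it reads peak (not current) host memory, and it does not cover GPU memory.

```python
import sys
from typing import Callable, Optional


def peak_rss_bytes() -> int:
    """Peak resident set size of this process in bytes (Unix only)."""
    import resource  # not available on Windows

    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak if sys.platform == "darwin" else peak * 1024


class MemoryThresholdStop:
    """Hypothetical guard: request a stop once memory exceeds a threshold."""

    def __init__(
        self,
        threshold_bytes: int,
        read_memory: Callable[[], int] = peak_rss_bytes,
    ) -> None:
        self.threshold_bytes = threshold_bytes
        self.read_memory = read_memory  # injectable so it can be faked in tests
        self.stopped_at: Optional[int] = None

    def after_iteration(self, model, epoch: int, evals_log) -> bool:
        # Called once per boosting round; True means "stop training now".
        if self.read_memory() > self.threshold_bytes:
            self.stopped_at = epoch
            return True
        return False
```

To plug this into `xgboost.train(..., callbacks=[...])`, the class would presumably need to subclass `xgboost.callback.TrainingCallback`. In a distributed run, every worker would also have to agree on the stop decision, which this sketch does not handle.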
WeichenXu123 changed the title from "[FR] Xgboost training run out of memory" to "[FR] Xgboost training run out of memory in some cases, can we add memory threshold config to prevent OOM ?" on Jul 8, 2023
Related ticket: #9342