[FR] Xgboost training run out of memory in some cases, can we add memory threshold config to prevent OOM ? #9372

WeichenXu123 · 2023-07-08T10:15:48Z

One of our customers is facing an issue where XGBoost training runs out of memory:

The customer trains an XGBoost model in a distributed way, and when "max_depth" is set to a high value, the training routine easily runs out of memory.

They found that a specific "max_depth" value works fine for some of their datasets, but the same "max_depth" causes OOM on other datasets.

We would like to set a larger "max_depth" in most cases for better model accuracy, but we also need to prevent OOM, so this is a pain point.

Can we add params like "xgboost_train_worker_cpu_memory_usage_threshold" and "xgboost_train_worker_GPU_memory_usage_threshold", so that each XGBoost training worker tracks its memory usage and, when usage exceeds the threshold, stops increasing the model depth, finalizes the model, and stops training?
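To make the request concrete, here is a rough sketch of the kind of check such a threshold could drive. It is not XGBoost code and does not use any XGBoost API: the function names, the `budget_bytes` argument, and the memory model are all assumptions for illustration. The model assumes a histogram is cached for every node of a full binary tree, with `max_bin` bins per feature and 16 bytes per bin (one gradient/hessian pair stored as two doubles), so the estimate grows exponentially with depth and linearly with the number of features.

```python
def estimate_histogram_bytes(depth, n_features, max_bin=256, bytes_per_bin=16):
    """Rough upper bound on memory needed to cache one histogram per node
    for a full binary tree with levels 0..depth.

    Assumption (not taken from XGBoost internals): each node stores
    n_features * max_bin bins, each bin holding a gradient/hessian pair.
    """
    n_nodes = 2 ** (depth + 1) - 1
    return n_nodes * n_features * max_bin * bytes_per_bin


def largest_depth_within_budget(requested_depth, n_features, budget_bytes, max_bin=256):
    """Return the largest depth <= requested_depth whose estimated
    histogram memory stays within budget_bytes (hypothetical threshold)."""
    depth = 0
    while (
        depth < requested_depth
        and estimate_histogram_bytes(depth + 1, n_features, max_bin) <= budget_bytes
    ):
        depth += 1
    return depth


if __name__ == "__main__":
    one_gib = 1 << 30
    # Same requested depth and budget, different feature counts.
    for n_features in (100, 1000):
        depth = largest_depth_within_budget(16, n_features, one_gib)
        print(f"{n_features} features: depth capped at {depth}")
```

Under this model the same requested depth and the same budget give different caps for datasets with different feature counts, which is one plausible reason a fixed "max_depth" works on one dataset and fails on another. Real memory usage also depends on the data size, the tree method, and the device, so the numbers are only indicative.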

Related ticket: #9342

trivialfis · 2023-07-08T10:34:20Z

Same issue as #9342. I'm looking into it; it might take some time.

WeichenXu123 changed the title ~~[FR] Xgboost training run out of memory~~ [FR] Xgboost training run out of memory in some cases, can we add memory threshold config to prevent OOM ? Jul 8, 2023

This was referenced Aug 2, 2023

[WIP] Bound the size of the histogram cache. #9432

Closed

Unify the code path between local and distributed training. #9433

Merged

Bound the size of the histogram cache. #9440

Merged

trivialfis closed this as completed in #9440 Aug 7, 2023

