[Train] Force GBDTTrainer to use distributed loading for Ray Datasets (ray-project#31079)

Signed-off-by: amogkam <[email protected]>
Closes ray-project#31068

xgboost_ray has two modes for data loading:

1. A centralized mode, where the driver first loads all of the data and then partitions it for the remote training actors to load.
2. A distributed mode, where the remote training actors load their data partitions directly.

When using Ray Datasets with xgboost_ray, we should always do distributed data loading (option 2). However, this is no longer the case after ray-project#30575 was merged. That PR adds an __iter__ method to Ray Datasets, which makes isinstance(dataset, Iterable) return True. Ray Dataset inputs therefore enter this if statement: https://github.com/ray-project/xgboost_ray/blob/v0.1.12/xgboost_ray/matrix.py#L943-L949, so xgboost_ray concludes that Ray Datasets are not distributed and falls back to option 1 for loading (see the sketch below). This centralized loading leads to excessive object spilling and ultimately crashes large-scale xgboost training.

In this PR, we force distributed data loading when using the AIR GBDTTrainers. In a follow-up, we should clean up the distributed-detection logic directly in xgboost-ray, remove input formats that are no longer supported, and then publish a new release.

Signed-off-by: tmynn <[email protected]>
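The detection problem comes down to how Python's collections.abc.Iterable recognizes any class that defines __iter__. Below is a minimal sketch of that behavior; the FakeDataset classes are hypothetical stand-ins, not actual Ray Dataset code.

```python
# Minimal sketch (not xgboost_ray code) of why adding __iter__ flips the
# isinstance check that xgboost_ray uses to classify inputs.
from collections.abc import Iterable


class FakeDatasetBefore:
    """Stand-in for a Ray Dataset before ray-project#30575: no __iter__."""


class FakeDatasetAfter:
    """Stand-in for a Ray Dataset after ray-project#30575: defines __iter__."""

    def __iter__(self):
        return iter([])


# collections.abc.Iterable has a __subclasshook__ that only checks for the
# presence of __iter__, so the second check returns True and such an input
# ends up on the centralized-loading path in xgboost_ray.
print(isinstance(FakeDatasetBefore(), Iterable))  # False
print(isinstance(FakeDatasetAfter(), Iterable))   # True
```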
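For illustration only, here is a hedged sketch of the intent of the fix, not the literal AIR change: explicitly request distributed loading when handing a Ray Dataset to xgboost_ray instead of relying on type-based autodetection. It assumes RayDMatrix accepts a distributed flag (as in the matrix.py linked above); the data path and label column name are hypothetical.

```python
import ray
from xgboost_ray import RayDMatrix, RayParams, train

# Hypothetical dataset; any Ray Dataset source would do.
dataset = ray.data.read_parquet("s3://bucket/training-data")

dmatrix = RayDMatrix(
    dataset,
    label="target",     # hypothetical label column
    distributed=True,   # force option 2: actors load their partitions directly
)

bst = train(
    {"objective": "binary:logistic"},
    dmatrix,
    ray_params=RayParams(num_actors=4),
)
```

Forcing the flag sidesteps the Iterable-based autodetection entirely, which is the same effect the AIR GBDTTrainers achieve with this change.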