Distributed XGBoost is now part of Wormhole. Checkout this Link for usage examples, build and job submissions.
- The distributed version is built on Rabit:Reliable Allreduce and Broadcast Library
- Rabit is a portable library that provides fault-tolerance for Allreduce calls for distributed machine learning
- This makes xgboost portable and fault-tolerant against node failures
- Rabit handles all the fault tolerant and communications efficiently, we only use platform specific command to start programs
- The Hadoop version does not rely on Mapreduce to do iterations
- You can expect xgboost not suffering the drawbacks of iterative MapReduce program
- The design choice was made because Allreduce is very natural and efficient for distributed tree building
- In current version of xgboost, the distributed version is only adds several lines of Allreduce synchronization code
- The multi-threading nature of xgboost is inheritated in distributed mode
- This means xgboost efficiently use all the threads in one machine, and communicates only between machines
- Remember to run on xgboost process per machine and this will give you maximum speedup
- For more information about rabit and how it works, see the Rabit's Tutorial
- Column-based solver split data by column, each node work on subset of columns, it uses exactly the same algorithm as single node version.
- Row-based solver split data by row, each node work on subset of rows,
it uses an approximate histogram count algorithm, and will only examine subset of
potential split points as opposed to all split points.
- This is the mode used by current hadoop version, since usually data was stored by rows in many industry system