Lightweight framework for distributed TensorFlow training based on dmlc/rabit


tf-collective-all-reduce

Lightweight framework for distributed machine learning training. It uses Rabit as the communication layer and borrows Horovod's concepts for the TensorFlow optimizer wrapper.
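The Horovod-style wrapper idea can be illustrated with a small, framework-free sketch: each worker computes local gradients, the wrapper averages them across workers with an allreduce, and only then applies the update, so every replica stays in sync. The names below (`allreduce_mean`, `DistributedOptimizer`) are illustrative, not tf-collective-all-reduce's actual API.

```python
# Hypothetical sketch of a Horovod-style "allreduce, then apply" optimizer
# wrapper; names are illustrative, not the library's real API.

def allreduce_mean(grads_per_worker):
    """Average each gradient component across all workers (simulated allreduce)."""
    n_workers = len(grads_per_worker)
    n_params = len(grads_per_worker[0])
    return [
        sum(worker[i] for worker in grads_per_worker) / n_workers
        for i in range(n_params)
    ]

class DistributedOptimizer:
    """Wraps a base update rule: gradients are averaged across workers
    before the update is applied, keeping all replicas identical."""

    def __init__(self, lr=0.1):
        self.lr = lr

    def apply_gradients(self, params, averaged_grads):
        # Plain SGD step on the already-averaged gradients.
        return [p - self.lr * g for p, g in zip(params, averaged_grads)]

# Two workers with different local gradients for the same two parameters.
local_grads = [[1.0, 3.0], [3.0, 5.0]]               # worker 0, worker 1
avg = allreduce_mean(local_grads)                    # [2.0, 4.0]
opt = DistributedOptimizer(lr=0.5)
new_params = opt.apply_gradients([10.0, 10.0], avg)  # [9.0, 8.0]
```

Because every worker applies the same averaged gradients to the same starting parameters, no parameter server is needed; this is the property the real wrapper provides on top of Rabit's allreduce.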

Installation

git clone https://github.com/criteo/tf-collective-all-reduce
python3.6 -m venv tf_env
. tf_env/bin/activate
pip install tensorflow==1.12.2
pushd tf-collective-all-reduce
  ./install.sh
  pip install -e .
popd

Prerequisites

tf-collective-all-reduce only supports Python ≥3.6

Run tests

pip install -r tests-requirements.txt
pytest -s

Local run with dmlc-submit

../dmlc-core/tracker/dmlc-submit --cluster local --num-workers 2 python examples/simple/simple_allreduce.py
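The command above launches two workers that perform a collective allreduce: each rank contributes a tensor, and every rank receives the elementwise reduction. A minimal simulation of an allreduce-sum across two workers (this is a conceptual sketch, not the contents of examples/simple/simple_allreduce.py):

```python
# Simulated allreduce-sum across workers: each rank contributes a tensor
# and every rank ends up holding the same elementwise sum. Conceptual
# sketch only, not the script shipped in the repo.

def allreduce_sum(tensors):
    """Elementwise sum of one tensor per worker; every worker gets the result."""
    reduced = [sum(vals) for vals in zip(*tensors)]
    return [list(reduced) for _ in tensors]  # one identical copy per rank

worker_inputs = [[1, 2, 3], [10, 20, 30]]  # rank 0 and rank 1
results = allreduce_sum(worker_inputs)
# after the allreduce, every rank holds [11, 22, 33]
```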

Run on a Hadoop cluster with tf-yarn

Run collective_all_reduce_example

cd examples/tf-yarn
python collective_all_reduce_example.py
