support bootstrap allreduce/broadcast #98

Merged
merged 95 commits into from
Aug 28, 2019
Changes from 79 commits
ed06620
support run rabit tests as xgboost subproject using xgboost/dmlc-core
Jun 5, 2019
dddcac7
support tracker config set/get
chenqin Jun 11, 2019
2fac91b
remove redudant printf
chenqin Jun 12, 2019
f5a9727
remove redudant printf
chenqin Jun 12, 2019
acc011d
Merge branch 'master' of https://github.com/chenqin/rabit
chenqin Jun 13, 2019
e391238
add c++0x declaration
chenqin Jun 18, 2019
a9d7331
log allreduce/broadcast caller, engine should track caller stack for
chenqin Jun 19, 2019
2a28e5e
tracker support binary config format
chenqin Jun 19, 2019
2c322f3
Revert "tracker support binary config format"
chenqin Jun 19, 2019
19ce2de
remove caller, prototype fetch allreduce/broadcast results from resbuf
chenqin Jun 21, 2019
4d0a1e1
store cached allreduce/broadcast seq_no to tracker
chenqin Jun 23, 2019
35bcd61
allow restore all caches from other nodes
chenqin Jun 23, 2019
68135c8
try new rabit collective cache, todo: recv_link seems down
chenqin Jun 24, 2019
87c62ea
link up cache restore with main recovery
Jun 24, 2019
05d5f01
cleanup load cache state
chenqin Jun 25, 2019
2f95628
update cache api
Jun 26, 2019
1178bac
pass test.mk
chenqin Jun 26, 2019
435f681
have a working tests
chenqin Jun 27, 2019
22571a2
try to unify check into actionsummary
chenqin Jun 27, 2019
809181c
more logging to debug distributed hist three method issue
Jun 28, 2019
289d916
update rabit interface to support caller signature matching
chenqin Jun 28, 2019
c033576
splite seq_counter from cur_cache_seq to different variables
chenqin Jun 28, 2019
95330c6
still see issue with inf loop
chenqin Jun 28, 2019
73a6746
support debug print caller as well as allreduce op
chenqin Jun 28, 2019
337b447
cleanup
Jun 29, 2019
ee234c4
remove get/set cache from model_recover, adding recover in
chenqin Jul 1, 2019
ef235c4
clarify rabit cache strategy, cache is set only by successful collective
chenqin Jul 2, 2019
98380bc
revert caller logs
Jul 2, 2019
db343f9
fix lint error
Jul 2, 2019
0353257
adding tests
Jul 2, 2019
9206f87
fix engine mpi signature
Jul 3, 2019
87c9063
support getcache by ref
Jul 4, 2019
c19e9d3
allow result buffer presiet to filestream
chenqin Jul 6, 2019
2bd993a
add loging
chenqin Jul 8, 2019
1e82c33
try fix checkpoint failure recovery case
Jul 9, 2019
a98e6a1
use int64_t to avoid overflow caused seq fault
chenqin Jul 11, 2019
2c139bc
try avoid int overflow
chenqin Jul 11, 2019
9ca3e45
try fix checkpoint failure recovery case
Jul 9, 2019
0fbe3ea
try avoid seqno overflow to negative by offseting specifial flag value
Jul 11, 2019
74de682
fix cache seq assert error
Jul 11, 2019
aa2153f
remove loging, handle edge case
Jul 12, 2019
0a522a7
add extensive log to checkpoint state with different seq no
Jul 12, 2019
53316b7
fix lint errors
chenqin Jul 17, 2019
3ef4a68
clean up comments before merge back to master
Jul 17, 2019
c3d6535
Merge pull request #2 from chenqin/test
chenqin Jul 17, 2019
70f5b33
add logs to allreduce/broadcast/checkpoint
Jul 18, 2019
b71f662
use unsinged int 32 and give seq no larger range
Jul 18, 2019
b5410e0
address remove allreduce dropseq code segment
Jul 18, 2019
31428dc
Merge branch 'master' into test
chenqin Jul 19, 2019
dda159a
Merge pull request #3 from chenqin/test
chenqin Jul 19, 2019
e098c55
using caller signature to filter bootstrapallreduces
chenqin Jul 20, 2019
2e2f687
remove get/set cache from empty
chenqin Jul 21, 2019
9514bef
apply signature to reducer
chenqin Jul 21, 2019
ae5540f
apply signature to broadcast
chenqin Jul 21, 2019
4502cfd
add key to broadcat log
chenqin Jul 21, 2019
f0d011d
fix broadcast signature
chenqin Jul 21, 2019
ab8cd44
fix default _line value for non linux system
chenqin Jul 21, 2019
70fff5b
adding comments, remove sleep(1)
chenqin Jul 21, 2019
93fdd8d
fix osx build issue
Jul 22, 2019
bc95cfb
try fix mpi
Jul 22, 2019
7b6f46e
fix doc
Jul 22, 2019
c232ff5
fix engine_empty api
Jul 22, 2019
73af820
logging, adding more logs, restore immutable assertion
Jul 22, 2019
aac6abc
print unsinged int with ud
chenqin Jul 23, 2019
96f9ab3
fix lint
chenqin Jul 23, 2019
043c61f
rename seqtype to kSeq and KCache indicating it's usage
chenqin Jul 23, 2019
f65b217
Merge branch 'test' of github.com:chenqin/rabit into test
chenqin Jul 23, 2019
985c7aa
Merge pull request #4 from chenqin/test
chenqin Jul 23, 2019
d32d6db
comment allreduce/broadcast log
chenqin Jul 30, 2019
cfa2b76
allow tests run on arm
Jul 30, 2019
4d0ff27
enable flag to turn on / off cache
Jul 30, 2019
779fb24
add log info alert if user choose to enable rabit bootstrap cache
Jul 30, 2019
cdf6dc8
add rabit_debug setting so user can use config to turn on
chenqin Jul 31, 2019
964336e
log flags when user turn on rabit_debug
Aug 1, 2019
a9055f6
force rabit restart if tracker assign -1 rank
chenqin Aug 6, 2019
f07061c
use OPENMP to vecotrize reducer
chenqin Aug 7, 2019
1dc61f3
address comment
Aug 8, 2019
0f08403
Revert "address comment"
chenqin Aug 12, 2019
968bf0e
fix checkpoint size print 0
chenqin Aug 12, 2019
e72ac34
per feedback, remove DISABLEOPEMP, address race condition
Aug 15, 2019
a0f45a3
- remove openmp from this pr
Aug 15, 2019
b3ed40b
add default value of signature macros
Aug 15, 2019
cd0a25b
remove openmp from cmake file
Aug 15, 2019
fe4e2ff
Update src/allreduce_robust.cc
chenqin Aug 15, 2019
c6fcfdf
Update src/allreduce_robust.cc
chenqin Aug 15, 2019
1b62f71
run test with cmake
chenqin Aug 17, 2019
7e873dc
Merge branch 'master' of github.com:chenqin/rabit
chenqin Aug 18, 2019
b7979de
remove openmp
chenqin Aug 23, 2019
3d34ae3
fix cmake based tests
Aug 23, 2019
94ffe0f
use cmake test fix darwin .dylib issue
Aug 23, 2019
391fba4
move around rabit_signature definition due to windows build
chenqin Aug 25, 2019
a49819a
misc, add c++ check in CMakeFile
Aug 15, 2019
d3cd57e
per feedback
Aug 26, 2019
40133e9
resolve CMake file
Aug 26, 2019
45eb4f4
update rabit version
Aug 26, 2019
9 changes: 7 additions & 2 deletions CMakeLists.txt
@@ -1,11 +1,16 @@
-cmake_minimum_required(VERSION 3.0)
+cmake_minimum_required(VERSION 3.3)

-project(rabit VERSION 0.2.0)
+project(rabit VERSION 0.3)
+find_package(OpenMP)

 option(RABIT_BUILD_TESTS "Build rabit tests" OFF)
 option(RABIT_BUILD_MPI "Build MPI" OFF)
 option(RABIT_BUILD_DMLC "Include DMLC_CORE in build" ON)

+if(OpenMP_CXX_FOUND OR OPENMP_FOUND)
+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
+endif()
+
 add_library(rabit src/allreduce_base.cc src/allreduce_robust.cc src/engine.cc src/c_api.cc)
 add_library(rabit_base src/allreduce_base.cc src/engine_base.cc src/c_api.cc)
 add_library(rabit_empty src/engine_empty.cc src/c_api.cc)
63 changes: 22 additions & 41 deletions Makefile
@@ -2,61 +2,25 @@ OS := $(shell uname)

 RABIT_BUILD_DMLC = 0

-ifeq ($(RABIT_BUILD_DMLC),1)
-DMLC=dmlc-core
-else
-DMLC=../dmlc-core
-endif
-
 export WARNFLAGS= -Wall -Wextra -Wno-unused-parameter -Wno-unknown-pragmas -std=c++11
-export CFLAGS = -O3 $(WARNFLAGS) -I $(DMLC)/include -I include/
+export CFLAGS = -O3 $(WARNFLAGS)
 export LDFLAGS =-Llib

 #download mpi
 #echo $(shell scripts/mpi.sh)

 MPICXX=./mpich/bin/mpicxx

-ifeq ($(OS), Darwin)
-ifndef CC
-export CC = gcc-4.9
-endif
-ifndef CXX
-export CXX = g++-4.9
-endif
-else
-ifeq ($(OS), FreeBSD)
-ifndef CXX
-export CXX = g++6
-endif
-export LDFLAGS= -Llib -Wl,-rpath=/usr/local/lib/gcc6
-else
-# linux defaults
-ifndef CC
-export CC = gcc
-endif
-ifndef CXX
-export CXX = g++
-endif
-LDFLAGS +=-lrt
-endif
-endif
+export CXX = g++
+

 #----------------------------
 # Settings for power and arm arch
 #----------------------------
 ARCH := $(shell uname -a)
-ifneq (,$(filter $(ARCH), powerpc64le ppc64le ))
-USE_SSE=0
+ifneq (,$(filter $(ARCH), armv6l armv7l powerpc64le ppc64le aarch64))
+CFLAGS += -march=native
 else
-USE_SSE=1
-endif
-
-ifndef USE_SSE
-USE_SSE = 1
-endif
-
-ifeq ($(USE_SSE), 1)
 CFLAGS += -msse2
 endif

@@ -71,6 +35,23 @@ ifndef LINT_LANG
 LINT_LANG="all"
 endif

+OPENMP_FLAGS =
+ifeq ($(USE_OPENMP), 1)
+OPENMP_FLAGS = -fopenmp
+else
+OPENMP_FLAGS = -DDISABLE_OPENMP
+endif
+CFLAGS += $(OPENMP_FLAGS)
+
+
+ifeq ($(RABIT_BUILD_DMLC),1)
+DMLC=dmlc-core
+else
+DMLC=../dmlc-core
+endif
+
+CFLAGS += -I $(DMLC)/include -I include/
+
 # build path
 BPATH=.
 # objectives that makes up rabit library
41 changes: 36 additions & 5 deletions include/rabit/internal/engine.h
@@ -54,20 +54,36 @@ class IEngine {
 * will be called by the function before performing Allreduce in order to initialize the data in sendrecvbuf.
 * If the result of Allreduce can be recovered directly, then prepare_func will NOT be called
 * \param prepare_arg argument used to pass into the lazy preprocessing function
+* \param is_bootstrap if this allreduce is needed to bootstrap failed node
+* \param _file caller file name used to generate unique cache key
+* \param _line caller line number used to generate unique cache key
+* \param _caller caller function name used to generate unique cache key
 */
 virtual void Allreduce(void *sendrecvbuf_,
 size_t type_nbytes,
 size_t count,
 ReduceFunction reducer,
 PreprocFunction prepare_fun = NULL,
-void *prepare_arg = NULL) = 0;
+void *prepare_arg = NULL,
+bool is_bootstrap = false,
+const char* _file = _FILE,
+const int _line = _LINE,
+const char* _caller = _CALLER) = 0;
 /*!
 * \brief broadcasts data from root to every other node
 * \param sendrecvbuf_ buffer for both sending and receiving data
 * \param size the size of the data to be broadcasted
 * \param root the root worker id to broadcast the data
+* \param is_bootstrap if this broadcast is needed to bootstrap failed node
+* \param _file caller file name used to generate unique cache key
+* \param _line caller line number used to generate unique cache key
+* \param _caller caller function name used to generate unique cache key
 */
-virtual void Broadcast(void *sendrecvbuf_, size_t size, int root) = 0;
+virtual void Broadcast(void *sendrecvbuf_, size_t size, int root,
+bool is_bootstrap = false,
+const char* _file = _FILE,
+const int _line = _LINE,
+const char* _caller = _CALLER) = 0;
 /*!
 * \brief explicitly re-initialize everything before calling LoadCheckPoint
 * call this function when IEngine throws an exception,
@@ -204,6 +220,10 @@ enum DataType {
 * will be called by the function before performing Allreduce, to initialize the data in sendrecvbuf_.
 * If the result of Allreduce can be recovered directly, then prepare_func will NOT be called
 * \param prepare_arg argument used to pass into the lazy preprocessing function.
+* \param is_bootstrap if this allreduce is needed to bootstrap failed node
+* \param _file caller file name used to generate unique cache key
+* \param _line caller line number used to generate unique cache key
+* \param _caller caller function name used to generate unique cache key
 */
 void Allreduce_(void *sendrecvbuf,
 size_t type_nbytes,
@@ -212,8 +232,11 @@ void Allreduce_(void *sendrecvbuf,
 mpi::DataType dtype,
 mpi::OpType op,
 IEngine::PreprocFunction prepare_fun = NULL,
-void *prepare_arg = NULL);
-
+void *prepare_arg = NULL,
+bool is_bootstrap = false,
+const char* _file = _FILE,
+const int _line = _LINE,
+const char* _caller = _CALLER);
 /*!
 * \brief handle for customized reducer, used to handle customized reduce
 * this class is mainly created for compatiblity issues with MPI's customized reduce
@@ -239,12 +262,20 @@ class ReduceHandle {
 * will be called by the function before performing Allreduce in order to initialize the data in sendrecvbuf_.
 * If the result of Allreduce can be recovered directly, then prepare_func will NOT be called
 * \param prepare_arg argument used to pass into the lazy preprocessing function
+* \param is_bootstrap if this allreduce is needed to bootstrap failed node
+* \param _file caller file name used to generate unique cache key
+* \param _line caller line number used to generate unique cache key
+* \param _caller caller function name used to generate unique cache key
 */
 void Allreduce(void *sendrecvbuf,
 size_t type_nbytes,
 size_t count,
 IEngine::PreprocFunction prepare_fun = NULL,
-void *prepare_arg = NULL);
+void *prepare_arg = NULL,
+bool is_bootstrap = false,
+const char* _file = _FILE,
+const int _line = _LINE,
+const char* _caller = _CALLER);
 /*! \return the number of bytes occupied by the type */
 static int TypeSize(const MPI::Datatype &dtype);
