support bootstrap allreduce/broadcast #98

Merged
merged 95 commits into from
Aug 28, 2019


chenqin
Contributor

@chenqin chenqin commented Jul 2, 2019

The goal of this PR is to implement an immutable cache in rabit that helps a failed worker recover allreduce calls that were not synced during bootstrap time. Examples of such calls are:

  • distributed xgboost loads the dataset and syncs the number of columns before loadcheckpoint.
  • histogram initialization in the fast hist algorithm, which only runs before the first iteration checkpoint completes

It is specifically designed to help a recovered node catch up with the rest of the nodes, with minimal overhead in non-recovery mode.

  • setcache only writes locally, after the allreduce/broadcast completes
  • getcache sends/receives two integers in a single allreduce call to decide whether all nodes are running in sync
  • when nodes are in sync, reading the cache is skipped; only recovering nodes in bootstrap do a one-time cache rebuild from the nearest nodes
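
The in-sync check described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the PR only states that two integers are exchanged in one allreduce call, so the choice of contributing `(n, -n)` and max-reducing to obtain both the largest and smallest entry count is an assumption, and `SimulatedMaxAllreduce` and `AllNodesInSync` are hypothetical names. The collective call is simulated in-process over a vector of per-node values.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for one max-allreduce over two integers per node.
// On a real cluster this would be a single collective call; here it is
// simulated by reducing over a vector that holds every node's contribution.
std::pair<int, int> SimulatedMaxAllreduce(
    const std::vector<std::pair<int, int>>& contributions) {
  assert(!contributions.empty());
  std::pair<int, int> result = contributions.front();
  for (const auto& c : contributions) {
    result.first = std::max(result.first, c.first);
    result.second = std::max(result.second, c.second);
  }
  return result;
}

// Assumption: each node contributes (n, -n), where n is the number of entries
// in its local cache. After a max-reduce, the first slot holds the largest n
// and the negated second slot holds the smallest n. If they match, every node
// has the same number of entries and the cache does not need to be read.
bool AllNodesInSync(const std::vector<int>& entries_per_node) {
  if (entries_per_node.empty()) return true;
  std::vector<std::pair<int, int>> contributions;
  for (int n : entries_per_node) contributions.emplace_back(n, -n);
  const std::pair<int, int> reduced = SimulatedMaxAllreduce(contributions);
  const int max_entries = reduced.first;
  const int min_entries = -reduced.second;
  return max_entries == min_entries;
}
```

With this layout, a restarted node that reports zero entries makes the minimum differ from the maximum, which is the signal that a cache rebuild is needed.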

In addition, this PR also includes:

  • rabit_cache flag to opt in to this feature
  • rabit_debug flag to opt in to detailed logging of allreduce/broadcast/loadcheckpoint/checkpoint operations for debugging

design doc

Chen Qin and others added 29 commits June 10, 2019 14:00
call involving all nodes with unique cache key. If all nodes call
getcache at the same time, rabit keeps running the collective call. If some
nodes call getcache while others do not, we backfill the cache from the nodes
with the most entries
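
The backfill rule in this commit message can be illustrated with a small in-process sketch. It is not the rabit implementation: the `Cache` alias, `NodeWithMostEntries`, and `BackfillCaches` are hypothetical names, each node's cache is modelled as a plain `std::map`, and the network transfer is replaced by a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: one key/value cache per node.
using Cache = std::map<std::string, std::string>;

// Returns the index of the node holding the most entries (first one on ties).
std::size_t NodeWithMostEntries(const std::vector<Cache>& nodes) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < nodes.size(); ++i) {
    if (nodes[i].size() > nodes[best].size()) best = i;
  }
  return best;
}

// Copies missing entries from the fullest node into every other node.
// Entries a node already holds are left untouched, which matches the
// "immutable cache" idea: a key is written once and never overwritten.
void BackfillCaches(std::vector<Cache>* nodes) {
  if (nodes->empty()) return;
  const Cache source = (*nodes)[NodeWithMostEntries(*nodes)];
  for (Cache& cache : *nodes) {
    cache.insert(source.begin(), source.end());  // skips keys already present
  }
}
```

In this sketch a freshly restarted node starts with an empty map and ends up with the same entries as the fullest node after one call.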
@chenqin chenqin marked this pull request as ready for review July 2, 2019 22:46
@chenqin
Contributor Author

chenqin commented Aug 19, 2019

@hcho3

@hcho3
Contributor

hcho3 commented Aug 20, 2019

@chenqin Can we simply throw error when the user is not using Linux and enables bootstrapping? We can add non-Linux support later.

@chenqin
Contributor Author

chenqin commented Aug 22, 2019

@chenqin Can we simply throw error when the user is not using Linux and enables bootstrapping? We can add non-Linux support later.

Per our offline conversation, it still works even in environments that do not support _buildin_func, because we also use type_size and count as part of the signature.


If a signature collision arises as we change the implementation or add a new tree method, then:

  1. we can explicitly change the default value to avoid the collision

  2. if we forget to do so, it will trigger an assertion error

So it is still safe.
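
The signature and collision behaviour discussed above can be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: `MakeCacheKey` and `ImmutableCache` are hypothetical names, the key layout (caller name, then type_size, then count) is assumed, and a collision is reported through a `false` return value where the PR describes an assertion error.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical key layout: caller name + type_size + count. The caller name
// may be empty on platforms where it cannot be obtained; type_size and count
// still distinguish most calls from one another.
std::string MakeCacheKey(const std::string& caller, std::size_t type_size,
                         std::size_t count) {
  return caller + "::" + std::to_string(type_size) + "x" +
         std::to_string(count);
}

class ImmutableCache {
 public:
  // Stores the payload the first time a key is seen. Returns true when the
  // key is new or already holds the same payload, and false when the key
  // already holds a different payload (a signature collision). The stored
  // payload is never overwritten.
  bool Set(const std::string& key, const std::string& payload) {
    auto result = entries_.emplace(key, payload);
    return result.second || result.first->second == payload;
  }

  std::size_t size() const { return entries_.size(); }

 private:
  std::map<std::string, std::string> entries_;
};
```

Two calls that share a caller name, element size, and element count but carry different data map to the same key in this sketch, which is the case where an explicit distinguishing value would be needed.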

@chenqin
Contributor Author

chenqin commented Aug 26, 2019

Are we comfortable merging this back to master? @hcho3

@CodingCat
Member

I have left all my comments here. The PR has improved a lot since I requested changes, but GitHub doesn't allow me to change my opinion, and I am not 100% confident in reviewing all the technical details here. I will defer to @hcho3 for the final decision.

@chenqin
Contributor Author

chenqin commented Aug 27, 2019

Asking around: @tqchen @CodingCat @hcho3 @trivialfis, can we merge this?

@hcho3 hcho3 merged commit 5797dcb into dmlc:master Aug 28, 2019
@hcho3
Contributor

hcho3 commented Aug 28, 2019

Done

@chenqin
Contributor Author

chenqin commented Sep 5, 2019

removed is_bootstrap parameter in upcoming pr
69789ec
