support bootstrap allreduce/broadcast #98
This reverts commit 2a28e5e.
Call involving all nodes with a unique cache key. If all nodes call getcache at the same time, we let rabit run the collective call. If some nodes call getcache while others do not, we backfill the cache from the nodes with the most entries.
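As a rough illustration of the backfill rule described above, here is a small Python sketch that models each node's cache as a plain dictionary. The names `backfill_caches` and `node_caches` are hypothetical and not part of rabit; the actual implementation is in C++ and exchanges entries over the network. The sketch also assumes that, because entries are immutable, the node with the most entries holds a superset of what the other nodes have cached.

```python
from typing import Dict, List

Cache = Dict[str, bytes]


def backfill_caches(node_caches: List[Cache]) -> List[Cache]:
    """Return new caches in which every node holds the fullest node's entries.

    Assumption for this sketch: cache entries are immutable, so the node with
    the most entries is a valid source for every other node.
    """
    if not node_caches:
        return []
    # Pick the node with the most cached entries as the backfill source.
    source = max(node_caches, key=len)
    filled = []
    for cache in node_caches:
        merged = dict(cache)  # copy, so the input caches are left untouched
        for key, value in source.items():
            # Only add missing keys; existing entries are never overwritten.
            merged.setdefault(key, value)
        filled.append(merged)
    return filled
```

For example, if one node holds two entries, a restarted node holds none, and a third holds one, all three end up with the same two entries after the backfill.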
@chenqin Can we simply throw an error when the user is not on Linux and enables bootstrapping? We can add non-Linux support later.
Per our offline conversation, it still works even in environments that do not support _buildin_func, because we also use type_size and count as part of the signature. If there is a signature collision as we change the implementation or add a new tree method we could do So it is still safe
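To make the point about the signature concrete, here is a minimal Python sketch of a cache key built from the caller name plus type_size and count. The function `make_signature`, the `"unknown"` fallback, and the colon-separated format are assumptions for illustration only; the comment above does not specify how rabit actually encodes the key.

```python
from typing import Optional


def make_signature(caller: Optional[str], type_size: int, count: int) -> str:
    """Build a hypothetical cache key for one collective call.

    `caller` is None when the environment cannot report the calling
    function; type_size and count still distinguish most calls.
    """
    name = caller if caller else "unknown"
    return f"{name}:{type_size}:{count}"
```

Even without a caller name, `make_signature(None, 4, 10)` and `make_signature(None, 8, 10)` differ, which is why two calls collide only when they also share the same element size and count.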
Are we comfortable merging this back to master? @hcho3
So I have left all my comments here; the PR has improved a lot since I put
Going to ask around: @tqchen @CodingCat @hcho3 @trivialfis can we merge this?
Done
Removed the is_bootstrap parameter in an upcoming PR.
The goal of this PR is to implement an immutable cache in rabit to help a failed worker recover allreduces that were not synced at bootstrap time. Examples of such are
It is specifically designed to help a recovered node catch up with the rest of the nodes, with minimal overhead in non-recovery mode.
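The idea of an immutable cache that lets a recovered node skip work already done can be sketched as follows. This is a simplified, single-process Python model under stated assumptions: `ImmutableCache` and `cached_collective` are hypothetical names, and `compute` stands in for the real collective call, which in rabit involves all nodes.

```python
from typing import Callable, Dict, Optional


class ImmutableCache:
    """Write-once key/value store: an entry cannot be changed after it is set."""

    def __init__(self) -> None:
        self._entries: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        if key in self._entries:
            raise KeyError(f"cache entry {key!r} is already set")
        self._entries[key] = bytes(value)

    def get(self, key: str) -> Optional[bytes]:
        return self._entries.get(key)

    def __len__(self) -> int:
        return len(self._entries)


def cached_collective(cache: ImmutableCache, key: str,
                      compute: Callable[[], bytes]) -> bytes:
    """Return the cached result for `key`, running `compute` only on a miss."""
    hit = cache.get(key)
    if hit is not None:
        return hit  # recovery path: reuse the stored result
    result = compute()  # normal path: run the (stand-in) collective once
    cache.put(key, result)
    return result
```

In the normal path the only extra cost is a dictionary lookup and one store per key, which matches the stated aim of minimal overhead in non-recovery mode.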
In addition, this PR also includes
design doc