
Mpicc #135

Closed
wants to merge 12 commits into from

Conversation

snoweye

@snoweye snoweye commented Feb 16, 2020

  • MPI C implementations, with an example tested against OpenMPI and CI on Linux and macOS.
  • An application can be found in pbdXGB, with CI tested on Linux (OpenMPI) and Windows (MS-MPI).

@snoweye snoweye mentioned this pull request Feb 16, 2020
@trivialfis
Member

Thanks for the PR and sorry for the wait! Feel free to ping me in the future ;-). Is it possible to completely replace the original C++ implementation?

@snoweye
Author

snoweye commented Mar 3, 2020

Yes, the implementation is a complete replacement, except for a few differences:

  • handle_ and htype_ change from a pointer to a class that points to an MPI struct, to a pointer that points to the MPI struct directly, if my understanding is correct; and
  • handle_ may not be freed automatically (a potential memory leak), unlike in C++ where ReduceHandle::~ReduceHandle does it.

I can get rid of the C++ and simplify the unnecessary steps wherever possible. However, this has not been discussed, as in issue #127. Would others agree with the changes?

@trivialfis @hcho3 @chenqin @thvasilo @tqchen @wrathematics

@trivialfis
Member

@snoweye I think there's an implementation with an MPI backend here. Can we just replace that without any architectural change? Feel free to point out the issues here.

@snoweye
Author

snoweye commented Mar 7, 2020

It is difficult without architectural changes; it would likely end up rewriting/redefining most of what the deprecated MPI C++ bindings did, if keeping the MPI namespace is required. This PR keeps the MPI namespace where no MPI is needed, but switches to the C bindings where MPI or speed is needed.

@snoweye
Author

snoweye commented Mar 10, 2020

I have another branch, mpi_replace, that replaces the MPI C++ code. This should work, though I have not tested it yet.

However, it is not completely free of architectural changes, because I have trouble getting rid of these lines regarding ReduceFunction. Do you have any idea how to resolve or bypass it? Either redefine it, or cast it correctly to the function pointer when the ReduceHandle is initialized.

@trivialfis

@snoweye
Author

snoweye commented Mar 25, 2020

@trivialfis Would you mind taking a quick look at the new changes in the mpi_replace branch to see whether they are what you expect? I will do the testing if there is an OK to go ahead.

@hcho3
Contributor

hcho3 commented Jul 25, 2020

@trivialfis Can we merge this for now, given that it doesn't modify existing code but adds new functionality? We could even mark it as "experimental". On a related note, we could consider adding MPI to the XGBoost testing pipeline.

@trivialfis
Member

No. I don't want to add a new API to rabit. Allow me to explain a little. Right now I have a huge headache over rabit's model recovery functionality. It saves the model and the model only, so anything in DMatrix is not checkpointed. That's why it doesn't work for dask, because we require DMatrix to handle itself. There's also unclear behaviour around repeated calls before training. Adding more APIs will only make model recovery more difficult to work with.

@hcho3
Contributor

hcho3 commented Nov 5, 2020

Recent developments of Rabit have been moved into the XGBoost repository. See discussion in dmlc/xgboost#5995.

@hcho3 hcho3 closed this Nov 5, 2020