Add option to fail gracefully on Rabit::Init failure #51

Closed
ebernhardson

When running xgboost4j-spark, the process in which
rabit is initialized has other things going on, such
as caches of data that are expensive to recompute. On
the rare occasion that rabit fails to connect to the
tracker, it calls exit(-1), which throws away
everything that was going on in the application.

It would be nice to provide explicit failure
handling everywhere, but changing the external API
to that extent would be quite a large breaking change.
This patch takes a very targeted approach, changing
only the Init call to return a boolean indicating
success. The behavior is disabled by default and must
be enabled through the initialization parameters.
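
The approach described above could look roughly like the following. This is only a sketch, not the code from the patch: `InitSketch`, `ConnectFn`, and the flag string `graceful_init=1` are hypothetical names chosen for illustration, and the tracker connection is replaced by a caller-supplied stand-in.

```cpp
#include <cstdlib>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical opt-in flag; the real parameter name is an assumption here.
const std::string kGracefulInitFlag = "graceful_init=1";

// Stand-in for the tracker connection step; returns true on success.
using ConnectFn = std::function<bool()>;

// Returns true when the connection succeeds. On failure, returns false only
// if the caller opted in; otherwise keeps the existing exit(-1) behavior.
bool InitSketch(const std::vector<std::string>& params,
                const ConnectFn& connect) {
  bool graceful = false;
  for (const std::string& p : params) {
    if (p == kGracefulInitFlag) graceful = true;
  }

  if (connect()) return true;

  if (graceful) {
    // Opt-in path: report the failure and let the caller decide what to do,
    // so the host process (and its caches) survives.
    return false;
  }

  // Default path: unchanged behavior, terminate the whole process.
  std::cerr << "failed to connect to tracker" << std::endl;
  std::exit(-1);
}
```

A caller that opts in would check the return value, for example `if (!InitSketch({"graceful_init=1"}, connect)) { /* retry or fail the task */ }`, while callers that pass no flag see the same behavior as before.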

ebernhardson commented Dec 5, 2017

This isn't tested yet; I'm mostly throwing the idea out there. Before I spend more time on this, I wanted to see whether this is something desirable, and whether this is a reasonable direction or something else would be more appropriate.


hcho3 commented Nov 5, 2020

Recent developments of Rabit have been moved into the XGBoost repository. See discussion in dmlc/xgboost#5995.

hcho3 closed this on Nov 5, 2020