-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[breaking] Change DMatrix construction to be distributed #9623
Conversation
include/xgboost/data.h
Outdated
@@ -559,8 +559,9 @@ class DMatrix { | |||
* | |||
* \param uri The URI of input. | |||
* \param silent Whether print information during loading. | |||
* \param data_split_mode In distributed mode, split the input according this mode; otherwise, | |||
* it's just an indicator on how the input was split beforehand. | |||
* \param data_split_mode In distributed mode, if the data split mode is row, split the input by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove the row splitting as well. I can help with that if it doesn't conflict with your PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that's what we want to do? In the text readers there is logic to optimize for reading a portion of the file, so in terms of performance it's probably not that far off from reading a whole file. This behavior has been around since the beginning of xgboost, I wonder how many people actually rely on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for removing the old code, could you please help fix the lint errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix it in a separated PR for backporting: #9634 .
Currently when we load a DMatrix from text files, we assume each worker has access to the whole dataset. Data is split by row or column depending on the
data_split_mode
parameter. This is different from vertical federated learning, where each worker only has access to its own subset of data.With this change we now assume each worker only loads its own subset of data. For column-wise split, after data loading the columns are re-indexed to reflect the global view.
In support of #9619