Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[breaking] Change DMatrix construction to be distributed #9623

Merged
merged 3 commits into from
Oct 10, 2023

Conversation

rongou
Copy link
Contributor

@rongou rongou commented Oct 2, 2023

Currently when we load a DMatrix from text files, we assume each worker has access to the whole dataset. Data is split by row or column depending on the data_split_mode parameter. This is different from vertical federated learning, where each worker only has access to its own subset of data.

With this change we now assume each worker only loads its own subset of data. For column-wise split, after data loading the columns are re-indexed to reflect the global view.

In support of #9619

@@ -559,8 +559,9 @@ class DMatrix {
*
* \param uri The URI of input.
* \param silent Whether print information during loading.
* \param data_split_mode In distributed mode, split the input according this mode; otherwise,
* it's just an indicator on how the input was split beforehand.
* \param data_split_mode In distributed mode, if the data split mode is row, split the input by
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove the row splitting as well. I can help with that if it doesn't conflict with your PR.

Copy link
Contributor Author

@rongou rongou Oct 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that's what we want to do? In the text readers there is logic to optimize for reading a portion of the file, so in terms of performance it's probably not that far off from reading a whole file. This behavior has been around since the beginning of xgboost, I wonder how many people actually rely on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for removing the old code, could you please help fix the lint errors?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix it in a separated PR for backporting: #9634 .

@rongou rongou changed the title Change column-split DMatrix construction to be distributed Change DMatrix construction to be distributed Oct 5, 2023
@trivialfis trivialfis changed the title Change DMatrix construction to be distributed [breaking] Change DMatrix construction to be distributed Oct 8, 2023
@trivialfis trivialfis merged commit 0ecb4de into dmlc:master Oct 10, 2023
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants