[breaking] Change DMatrix construction to be distributed #9623

rongou · 2023-10-02T19:32:11Z

Currently, when we load a DMatrix from text files, we assume each worker has access to the whole dataset, and the data is split by row or column depending on the data_split_mode parameter. This differs from vertical federated learning, where each worker only has access to its own subset of data.

With this change, we assume each worker loads only its own subset of data. For a column-wise split, the columns are re-indexed after loading to reflect the global view.
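
As a rough illustration of the re-indexing described above (a minimal sketch, not the code in this PR): assuming each worker can learn how many columns every other worker loaded, for example through an allgather, a worker's global column offset is the sum of the column counts on all lower-ranked workers. The names `GlobalColumnOffset` and `ReindexColumns` are hypothetical helpers introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of xgboost: returns the first global column
// index owned by `rank`, given the number of columns each worker loaded.
std::size_t GlobalColumnOffset(std::vector<std::size_t> const& local_cols,
                               std::size_t rank) {
  assert(rank < local_cols.size());
  // Sum the column counts of all lower-ranked workers.
  auto end = local_cols.begin() + static_cast<std::ptrdiff_t>(rank);
  return std::accumulate(local_cols.begin(), end, std::size_t{0});
}

// Hypothetical helper: shifts worker-local column indices into the global
// index space by adding the worker's offset to each index.
std::vector<std::size_t> ReindexColumns(std::vector<std::size_t> local_indices,
                                        std::size_t offset) {
  for (auto& idx : local_indices) {
    idx += offset;
  }
  return local_indices;
}
```

With local column counts {3, 2, 4}, the worker with rank 1 would map its local columns 0 and 1 to global columns 3 and 4.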

In support of #9619

trivialfis · 2023-10-03T21:21:24Z

include/xgboost/data.h

@@ -559,8 +559,9 @@ class DMatrix {
   *
   * \param uri The URI of input.
   * \param silent Whether print information during loading.
-   * \param data_split_mode In distributed mode, split the input according this mode; otherwise,
-   *                        it's just an indicator on how the input was split beforehand.
+   * \param data_split_mode In distributed mode, if the data split mode is row, split the input by


Let's remove the row splitting as well. I can help with that if it doesn't conflict with your PR.

Are you sure that's what we want to do? The text readers have logic to optimize for reading a portion of the file, so in terms of performance it's probably not that far off from reading a whole file. This behavior has been around since the beginning of xgboost; I wonder how many people actually rely on it.

Thank you for removing the old code. Could you please help fix the lint errors?

I will fix it in a separate PR for backporting: #9634.

Change column-split DMatrix construction to be distributed

47dd8f0

trivialfis reviewed Oct 3, 2023


remove splitting code for row split

3f8f70b

rongou changed the title ~~Change column-split DMatrix construction to be distributed~~ Change DMatrix construction to be distributed Oct 5, 2023

trivialfis changed the title ~~Change DMatrix construction to be distributed~~ [breaking] Change DMatrix construction to be distributed Oct 8, 2023

Merge remote-tracking branch 'upstream/master' into no-slicecol

8df1f50

trivialfis approved these changes Oct 10, 2023


trivialfis merged commit 0ecb4de into dmlc:master Oct 10, 2023
25 checks passed


trivialfis Oct 3, 2023

rongou Oct 3, 2023

rongou Oct 5, 2023

trivialfis Oct 8, 2023

trivialfis Oct 8, 2023
