In-memory inputs for column split and vertical federated learning #9619

Open

rongou opened this issue Sep 28, 2023 · 7 comments

Comments

@rongou
Contributor

rongou commented Sep 28, 2023

We've recently added support for column-wise data split (feature parallelism) and vertical federated learning (#8424), but the Python interface is limited to text inputs and numpy arrays (#9365). We'd like to support other in-memory formats such as scipy sparse matrices, pandas data frames, cuDF, and CuPy.
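For reference, here is a rough sketch of what the requested support could look like for a couple of these formats. The data_split_mode=COL spelling is taken from this issue; the exact Python enum name and location are assumptions and may differ in the final API:

```python
import numpy as np
import pandas as pd
import scipy.sparse
import xgboost as xgb
from xgboost.core import DataSplitMode  # assumed location of the enum

y = np.random.rand(100)

# In-memory formats we'd like to accept with a column-wise split:
X_pandas = pd.DataFrame(np.random.rand(100, 4))
X_sparse = scipy.sparse.random(100, 4, density=0.1, format="csr")

# Hypothetical: same DMatrix entry point, column-wise split requested.
d_pandas = xgb.DMatrix(X_pandas, label=y, data_split_mode=DataSplitMode.COL)
d_sparse = xgb.DMatrix(X_sparse, label=y, data_split_mode=DataSplitMode.COL)
```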

One question is the meaning of passing in data_split_mode=COL. There are potentially two interpretations:

  • We assume each worker has access to the full dataset; passing in data_split_mode=COL would load the whole DMatrix and then split it by column according to the size of the cluster. The columns are split evenly into world_size slices, with each worker's rank determining which slice it gets. This is the approach currently used by the text inputs for feature-parallel distributed training, but not for vertical federated learning.
  • We assume each worker only has access to a subset of the total number of columns, with column indices starting from 0 on every worker. The whole DMatrix is a union of all the columns from all the workers, with column indices re-indexed starting from worker 0. This is the approach currently used for vertical federated learning.

Now that we want to support more in-memory inputs, it probably makes more sense to standardize on the second approach, since it seems wasteful to construct a DMatrix in memory and then slice it by column.
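A minimal numpy illustration of the difference between the two interpretations (ranks, world size, and shapes are hypothetical placeholders):

```python
import numpy as np

world_size, rank = 2, 1

# Interpretation 1: every worker loads the full matrix and XGBoost
# slices it evenly by column according to the worker's rank.
X_full = np.random.rand(100, 8)                # full global feature matrix
cols_per_worker = X_full.shape[1] // world_size
start = rank * cols_per_worker
X_slice = X_full[:, start:start + cols_per_worker]  # columns 4-7 on rank 1

# Interpretation 2: each worker only ever sees its own columns,
# locally indexed from 0; the global DMatrix is the rank-ordered union.
X_local = np.random.rand(100, 4)  # this worker's columns, local indices 0-3
# Globally, rank 1's local column j corresponds to global column 4 + j.
```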

@rongou
Contributor Author

rongou commented Sep 28, 2023

@trivialfis

@rongou
Contributor Author

rongou commented Sep 28, 2023

Helps with #9472

@trivialfis
Member

trivialfis commented Sep 28, 2023

Let's focus on the federated learning use case and remove the data splitting in XGB entirely.

@rongou
Contributor Author

rongou commented Sep 29, 2023

Sounds good. We'll standardize on the second approach, i.e., each worker only provides its own set of columns that are 0-indexed, and the global DMatrix is the union of all worker columns, re-indexed based on worker ranks.
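In other words, under this convention a worker's local column index maps to a global index offset by the column counts of all lower-ranked workers. A small sketch (not library code; the column counts are hypothetical):

```python
# Columns held by workers 0, 1, and 2 respectively.
local_cols = [2, 3, 4]

def global_index(rank: int, local_j: int) -> int:
    """Map a worker's local column index to the global, re-indexed column."""
    return sum(local_cols[:rank]) + local_j

assert global_index(0, 1) == 1  # worker 0, local column 1 -> global 1
assert global_index(1, 0) == 2  # worker 1, local column 0 -> global 2
assert global_index(2, 3) == 8  # worker 2, local column 3 -> global 8
```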

@trivialfis
Member

trivialfis commented Sep 29, 2023 via email

@rongou
Contributor Author

rongou commented Oct 2, 2023

@trivialfis another question is about labels, weights, and other metadata. When doing column-split distributed training (non-federated), we assume this data is available on every worker. When loading data, do we also assume this information is loaded into every worker? If not, we'd have to broadcast it from, say, worker 0.
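If every worker does load labels and weights itself, the construction stays simple; a sketch under that assumption (the file names, the rank variable, and the enum location are hypothetical):

```python
import numpy as np
import xgboost as xgb
from xgboost.core import DataSplitMode  # assumed location of the enum

rank = 0  # hypothetical: this worker's rank from the cluster/tracker setup

# Labels and weights are assumed to be replicated on every worker,
# while feature columns are not.
y = np.load("labels.npy")
w = np.ones_like(y)
local_X = np.load(f"features_rank{rank}.npy")

dtrain = xgb.DMatrix(local_X, label=y, weight=w,
                     data_split_mode=DataSplitMode.COL)
```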

@trivialfis
Member

> When loading data, do we also assume this information is loaded into every worker?

I think this is a fair assumption.
