Vertical Federated Learning with Secure Features (secure inference and encrypted training) RFC #9987
Comments
One question for @rongou
In that case, may I assume the local worker will no longer perform evaluation on the gradient histogram and no longer needs to synchronize the split entry?
Right, the clients (passive parties) only need to calculate the gradient histograms and send them back to the server (active party), and the active party would then do the split finding. There is no longer local vs global split finding since the active party has access to all the histograms, so it can do it in one step.
Thank you for sharing, then we can start removing code in the evaluator. Is there any plan on when we can start the change?
I think this is an additional mode of operation. For people who don't need encryption, the current approach for vertical federated learning is much more efficient. |
It will be an additional training mode, specifically designed for label protection by adding homomorphic encryption. It should not change current behavior. The current vertical training mode will continue to be the "mainstream" for users who do not have the needs. We need a flag / switch to select the particular vertical mode between these two. |
Subject: Proposal for Integrating SecretFlow/HEU as an HE Alternative in XGBoost

Hi XGBoost Community, I'm a developer from the SecretFlow community, and we would like to suggest the integration of SecretFlow/HEU (https://github.com/secretflow/heu) as an alternative homomorphic encryption (HE) solution to the SEAL library in Secure XGBoost. HEU is an open-source library dedicated to high-performance HE, offering several benefits for XGBoost:
For more information on HEU, you can visit our documentation here: https://www.secretflow.org.cn/docs/heu/main/en-US. We are ready to work with the XGBoost community and contribute HEU support, creating a collaborative effort with both the XGBoost and Nvidia teams for a secure and high-performance Secure XGBoost implementation. We anticipate submitting a PR that will integrate HEU into XGBoost and are looking forward to engaging with the community on this initiative. Thank you for considering our proposal. Kind regards,
Hi @da-niao-dan |
Well, we believe that SecretFlow itself has integrated a whole SecureBoost workflow; how will the two implementations differ from each other?
Hi, Thank you for your interest in the secure implementations provided by SecretFlow and XGBoost. SecretFlow indeed includes a SecureBoost workflow, and I've had the opportunity to contribute significantly to its development. To our knowledge, the Secure Gradient Boosting (SGB) within SecretFlow is one of the fastest open-source implementations available. It features a comprehensive homomorphic encryption (HE) pipeline and includes optimizations tailored to HE calculations. That said, the current SGB in SecretFlow encompasses only a limited subset of the functionalities offered by XGBoost. Expanding SGB's capabilities would involve a considerable effort, much like reinventing existing features, especially because many functionalities outside of HE and communication should be consistent between XGBoost and SGB in SecretFlow. By integrating our most efficient HE optimization techniques directly into XGBoost, along with potential enhancements in security aspects, we aim to achieve the most robust version of secure XGBoost. This integration strategy not only streamlines development but also benefits the broader open-source community by leveraging XGBoost's established framework. We believe that this collaborative approach will ultimately yield a secure XGBoost implementation that excels in both memory usage and communication efficiency, setting a new standard for federated learning. Best regards,
@rongou @trivialfis, I added a section for the Design for Secure Inference following our discussions. It is independent from the encrypted training part, but is definitely meaningful as @trivialfis raised. Feel free to comment. |
The secure boost approach sounds good, you might be able to implement the lookup table as a partial |
With the proposed training approach, the active party would have the full model, right? But for inference, you assume it won't have the cut values. So how does one go from training to inference? Actually inference is part of the training process for evaluations. |
The active party will not have the real "full model" (one that can be directly applied to a sample with all features). Instead, it will have a full tree model, but with feature index + feature bin index for each split. The actual cut value will need to be retrieved by the feature owner from the lookup table according to the bin index. This is a tricky part because it deviates from our standard xgb model format - at inference time, all parties need to be present for full parsing.
This is beyond what the SecureBoost paper is doing, right? Agreed it's a bit tricky since we don't have a single global model any more. |
It's part of the secureboost solution, although independent from the encryption part. |
Ah I see it does talk about the lookup tables, which need to be part of the model on each passive party. |
Hello, I see the vertical pipeline PR is already merged. We are now ready to work on the encryption part, right? |
Hi, the merged PR is only "Part 1" of the whole secure pipeline; we have three other PRs necessary to get the pipeline working. And I see your solution is Paillier based, which is the same as our choice, so that is good. With the "Processor interface", it will be easy to integrate the pipeline with any given encryption library. Details of the other 3 PRs:
Major implementation done following the RFC, closing as complete.
Motivation
Current XGBoost has introduced support for federated learning, both horizontal and vertical. However, its capability for supporting secure features is limited. With only the basic arithmetic operations - addition and multiplication - supported by common homomorphic encryption (HE) schemes (Paillier, BFV/BGV, or CKKS), the current horizontal and vertical pipelines cannot be integrated with HE, because the server and/or clients need to perform operations that HE schemes do not support, including division and argmax.
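As a concrete illustration (this is just the standard XGBoost split-gain formula, not anything new to this RFC), the split finder has to maximize

$$
\mathcal{L}_{\text{split}} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
$$

over all candidate splits, which involves both division and an argmax; additively homomorphic schemes such as Paillier only provide ciphertext addition (and multiplication by plaintext constants), so these steps must run in plaintext at the party holding the decryption key.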
It will be useful to implement a variation of the current vertical federated learning pipeline in XGBoost to provide a solution with secure features.
Secure Pattern
With the basic HE operation of addition (and potentially multiplication by plaintext), a feasible solution for vertical FL is "SecureBoost", in which all unsupported operations are performed on the label-owner (server) side. The main purpose of this pipeline is to protect the label information from leaking to passive clients, and to protect each party's feature values from leaking to the others. However, in this case, passive parties will partially expose their cumulative histogram information to the active party (although one of the added features will be to minimize this potential leakage).
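A minimal sketch of this pattern, assuming the open-source `phe` (python-paillier) package purely for illustration; the RFC does not mandate a particular HE library, and all numbers below are toy values:

```python
from phe import paillier  # pip install phe

pub, priv = paillier.generate_paillier_keypair(n_length=2048)

# --- Active party (label owner): compute and encrypt per-sample gradient pairs ---
grad = [0.3, -0.7, 0.1, 0.9]      # toy first-order gradients
hess = [0.21, 0.21, 0.09, 0.09]   # toy second-order gradients
enc_grad = [pub.encrypt(g) for g in grad]
enc_hess = [pub.encrypt(h) for h in hess]

# --- Passive party: knows only the bin index of its own feature per sample, never the labels ---
bins = [0, 1, 1, 2]
num_bins = 3

def encrypted_histogram(enc_values, bins, num_bins, pub):
    """Sum ciphertexts per bin; ciphertext addition is all Paillier needs here."""
    hist = [pub.encrypt(0) for _ in range(num_bins)]
    for b, v in zip(bins, enc_values):
        hist[b] = hist[b] + v
    return hist

enc_g_hist = encrypted_histogram(enc_grad, bins, num_bins, pub)
enc_h_hist = encrypted_histogram(enc_hess, bins, num_bins, pub)

# --- Active party: decrypt the gathered histograms and do split finding in plaintext ---
g_hist = [priv.decrypt(c) for c in enc_g_hist]   # ~ [0.3, -0.6, 0.9]
h_hist = [priv.decrypt(c) for c in enc_h_hist]   # ~ [0.21, 0.3, 0.09]
```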
The main difference from our current HE solution for horizontal deep learning pipelines is:
Other secure patterns for XGBoost, including horizontal and vertical schemes, will mostly need more sophisticated HE support, so we leave those for future work.
Goals
Non-Goals
Assumptions
Similar assumptions as our current vertical federated learning scheme:
Risks
The current XGBoost codebase is fairly complicated and hard to modify. Some code refactoring needs to happen to implement a different vertical pipeline. Care must be taken to not break existing functionality, or make regular training harder.
Design for Encrypted Vertical Training - Part 1: alternative vertical pipeline
Our current vertical federated learning is based on “feature parallel” distributed learning.
As shown above, both the server (label owner) and the clients (stat collectors) need to perform operations that are not supported by basic HE solutions.
Therefore, conceptually, we need to "move" the server-client split up, such that the clients only perform gradient collection, and the server performs the full tree construction.
In this case, the local-best and global-best searches will just be two argmax steps on the server side.
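As a rough sketch of what that reduces to once the active party holds all per-feature histograms in plaintext (the gain formula is the standard XGBoost one; the constant factor 1/2 and the gamma penalty are omitted since they do not change the argmax; variable names are illustrative):

```python
import numpy as np

def best_split(g_hist, h_hist, reg_lambda=1.0):
    """g_hist, h_hist: (num_features, num_bins) gradient / hessian histograms,
    already gathered and decrypted on the active party."""
    G = g_hist.sum(axis=1, keepdims=True)
    H = h_hist.sum(axis=1, keepdims=True)
    GL = np.cumsum(g_hist, axis=1)   # left-child sums for every candidate cut
    HL = np.cumsum(h_hist, axis=1)
    GR, HR = G - GL, H - HL          # right-child sums
    gain = GL**2 / (HL + reg_lambda) + GR**2 / (HR + reg_lambda) - G**2 / (H + reg_lambda)
    feat, cut = np.unravel_index(np.argmax(gain), gain.shape)
    return int(feat), int(cut), float(gain[feat, cut])

# Toy example: two features with three bins each.
g = np.array([[0.3, -0.6, 0.9], [1.1, -0.2, -0.3]])
h = np.array([[0.21, 0.30, 0.09], [0.40, 0.10, 0.10]])
print(best_split(g, h))   # -> (feature index, cut/bin index, gain)
```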
This alternative vertical distributed learning pipeline will be an additional solution and will not break or replace our existing pipeline. Once this solution is implemented, HE components can be added on top. This benefits the wider user community and reduces the risks of directly injecting HE components through a refactoring of XGBoost's code base.
Feature Parallelism - Alternative Design
Current XGBoost already has a vertical / `col` data split for distributed training. The change is for the particular step of finding the best split:
Design for Encrypted Vertical Training - Part 2: adding encryption via a processor interface
With the alternative design implemented, we have two options to add HE secure features:
With the above considerations, Option 2 (interface + gRPC) serves our needs better, so we choose this path.
The design of processor interface is:
The flow chart with local handler-side encryption is shown below; this way, XGBoost will not have a dependency on any encryption libraries:
The flow chart with XGBoost-side encryption is shown below; the benefit is more efficient encryption by avoiding cross-platform data conversions, at the cost of extra requirements on the XGBoost build:
Further details of the dashed lines: the interface performs pre- and post-processing between the buffer and the actual data, but never directly sends or receives messages for MPI-style collective calls like broadcast and allgather.
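A hypothetical sketch of what such an interface boils down to (class and method names here are illustrative only, not the actual XGBoost plugin API): the processor only converts between local histogram data and an opaque buffer before/after the collective call, while broadcast/allgather themselves stay inside XGBoost's communicator.

```python
import json
from abc import ABC, abstractmethod
from typing import List


class Processor(ABC):
    """Hypothetical pre/post-processing hook around a collective call."""

    @abstractmethod
    def before_collective(self, histogram: List[float]) -> bytes:
        """Turn the local histogram into an opaque buffer (e.g. serialize + encrypt)."""

    @abstractmethod
    def after_collective(self, buffers: List[bytes]) -> List[List[float]]:
        """Turn the gathered buffers back into histograms (e.g. decrypt + deserialize)."""


class PlaintextProcessor(Processor):
    """No-op processor: plaintext buffers, matching the existing vertical pipeline."""

    def before_collective(self, histogram):
        return json.dumps(histogram).encode()

    def after_collective(self, buffers):
        return [json.loads(buf.decode()) for buf in buffers]


# The caller (conceptually, XGBoost) still owns the communication:
#   buf = processor.before_collective(local_hist)
#   gathered = allgather(buf)              # performed by XGBoost's collective layer
#   histograms = processor.after_collective(gathered)
```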
Key points:
Design for Secure Evaluation and Validation during training - avoid leakage of feature cut value
Under vertical collaboration, sharing the feature "cut value" is a potential information leak - if other parties learn the precise cut value and which samples fall on each side, they may infer what the particular feature indicates and hence the characteristics of those samples.
To protect this information, during secure evaluation and validation, conceptually, the active party holds two pieces of information: which passive party owns the current split feature, and which bin (index) will be used to perform the split. It then forwards (again, conceptually) all samples to that party, which holds a lookup table mapping the bin index to the actual "cut value" used to perform the node split.
Hence, in this case, the "model" is saved in a distributed manner - the active party holds a "partial" xgb model with cut indices rather than cut values, while the passive parties hold the other part of the model, the lookup tables mapping cut indices to cut values. At model deployment, the partial models at all parties need to work collaboratively to make the final prediction.
In practice, for our current XGBoost implementation, we notice that with AllReduce()/AllGather producing the global histogram, all parties already have access to the "indexing" information for the global histogram. Therefore, there is no need to implement the lookup table to hide the "cut point index"; instead, we only aim to protect the "cut point value" under the current XGBoost design.
In this case, we can design our secure inference as follows:
An example of the models is (Party 1 owns feature 1-8, Party 2 owns feature 9-15):
Party 1: "split_indices": [14, 2, 7, 9, 5, ...] "split_conditions": [NaN, 0.5, -134, NaN, 768, ...]
Party 2: "split_indices": [14, 2, 7, 9, 5, ...] "split_conditions": [-2.28, NaN, NaN, 56, NaN, ...]
The collaborative inference will be made by running all models together. If we merge models collected from all parties, we can produce the exact same model as our existing vertical pipeline.
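A small sketch of that merge step, using the exact numbers from the example above; `NaN` marks the split conditions a party does not own:

```python
import math

split_indices = [14, 2, 7, 9, 5]                                 # shared tree structure
party1_conditions = [math.nan, 0.5, -134, math.nan, 768]         # Party 1 owns features 1-8
party2_conditions = [-2.28, math.nan, math.nan, 56, math.nan]    # Party 2 owns features 9-15

merged_conditions = [
    c1 if not math.isnan(c1) else c2
    for c1, c2 in zip(party1_conditions, party2_conditions)
]
print(merged_conditions)   # [-2.28, 0.5, -134, 56, 768], same as the non-secure global model
```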
Potential Considerations for Secure XGBoost
Implementing an alternative vertical scheme will likely have no adverse impact on user experience, since it only modifies the information flow without changing the underlying theory.
Testing with an encryption library, by directly injecting HE encryption / decryption into the pipeline, is necessary for benchmarking the performance and comparing it with NVFlare's native support. In the future, we shall choose between the two, balancing: 1) HE efficiency, and 2) the amount of refactoring of the XGBoost code base that produces little benefit for users who have no security needs.
Most related existing PRs
This work will mainly be an extension of XGBoost's vertical FL support.
The most related PRs to base upon and modify are:
`approx` tree method #8847

Task list to track the progress