
Vertical Federated Learning with Secure Features (secure inference and encrypted training) RFC #9987

Closed
ZiyueXu77 opened this issue Jan 15, 2024 · 19 comments


@ZiyueXu77

ZiyueXu77 commented Jan 15, 2024

Motivation

XGBoost recently introduced support for federated learning, both horizontal and vertical. However, its support for secure features is limited. Common homomorphic encryption (HE) schemes (Paillier, BFV/BGV, CKKS) provide only basic arithmetic operations - addition and, for some schemes, multiplication - so they cannot be integrated into the current horizontal and vertical pipelines: the server and/or clients need to perform operations that these HE schemes do not support, including division and argmax.
It will therefore be useful to implement a variation of the current vertical federated XGBoost that provides a solution with secure features.
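
For illustration, here is a minimal sketch of what such additive HE schemes do and do not support, using the python-paillier (phe) package (any Paillier implementation would behave similarly):

    # Minimal sketch of additive homomorphic encryption with python-paillier (phe).
    from phe import paillier

    public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

    g1 = public_key.encrypt(0.25)   # encrypted gradient from sample 1
    g2 = public_key.encrypt(-0.75)  # encrypted gradient from sample 2

    hist_sum = g1 + g2              # ciphertext + ciphertext: supported
    scaled = hist_sum * 3           # ciphertext * plaintext scalar: supported

    print(private_key.decrypt(hist_sum))  # -0.5
    # Dividing one ciphertext by another, or taking an argmax over ciphertexts,
    # is NOT possible, which is why those steps must stay on the label owner's side.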

Secure Pattern

With the basic HE operations of addition (and potentially multiplication), a feasible solution for vertical FL is "SecureBoost", in which all unsupported operations are performed on the label-owner (server) side. The main purpose of this pipeline is to prevent the label information from leaking to passive clients, and to prevent each party's feature values from leaking to others. Note that in this scheme, passive parties still partially expose their cumulative histogram information to the active party (although one of the added features will be to minimize this potential leakage).
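
As a rough sketch of this pattern (illustrative helper names only, not the actual implementation), the passive party adds ciphertexts into per-bin accumulators, while every HE-unsupported step stays with the active party:

    # Hedged sketch of the SecureBoost aggregation pattern; names are illustrative.
    # Passive party: bins encrypted g/h pairs using HE addition only.
    def build_encrypted_histogram(enc_g, enc_h, bin_index, n_bins, public_key):
        G = [public_key.encrypt(0) for _ in range(n_bins)]
        H = [public_key.encrypt(0) for _ in range(n_bins)]
        for i, b in enumerate(bin_index):  # b = this sample's bin for a local feature
            G[b] = G[b] + enc_g[i]         # ciphertext + ciphertext
            H[b] = H[b] + enc_h[i]
        return G, H
    # The encrypted per-bin sums go back to the active party, which decrypts them
    # and performs the HE-unsupported steps (gain computation, argmax) in the clear.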

The main difference from our current HE solution for horizontal deep learning pipelines is:

  • For horizontal applications, the aggregation server holds no data (and, in most FL schemes, performs no training); it mostly performs aggregation only. Hence, in theory, the server is the "minor contributor" from the model-training perspective, and clients are concerned about leakage to the server. In this case, the protection is against the server.
  • For the vertical XGBoost application, the label owner / active party holds the labels, which can be considered the most valuable asset of the whole process. The active party is therefore the "major contributor" from the model-training perspective, and it is concerned about leaking this information to passive clients. In this case, the protection is mainly against the passive clients.

Other secure patterns for XGBoost, covering both horizontal and vertical schemes, will mostly require more sophisticated HE support, so we defer them to future work.

Goals

  • Enhance XGBoost to support secure vertical federated learning.
  • Support using NVFlare to coordinate the learning process, but the design should be amenable to supporting other federated learning platforms.
  • Support any arbitrary encryption library, decoupled from xgboost via a secure interface.
  • Efficiency: training speed should be close to that of the alternative distributed training design.
  • Accuracy: should be close to that of the alternative vertical pipeline.

Non-Goals

  • Investigating more sophisticated HE methods that support division/argmax, so that broader schemes (horizontal, vertical) can be performed entirely within encrypted space. This will be improved upon in later iterations and on the NVFlare end.
  • We will not support data owners dropping out during the learning process.

Assumptions

Similar assumptions to our current vertical federated learning scheme:

  • Private set intersection (PSI) is already done.
  • A few honest-but-curious partners jointly train a model.
  • A reasonable network connection exists between each participant and a central party.

Risks

The current XGBoost codebase is fairly complicated and hard to modify. Some refactoring is needed to implement a different vertical pipeline. Care must be taken not to break existing functionality or make regular training harder.

Design for Encrypted Vertical Training - Part 1: alternative vertical pipeline

Our current vertical federated learning is based on “feature parallel” distributed learning.
[Figure: current feature-parallel vertical pipeline, showing the server/client split]
As shown, both the server (label owner) and the clients (stat collectors) need to perform operations that are not supported by base HE solutions.
Therefore, conceptually, we need to "move" the server-client split up, such that the client only performs gradient collection, and the server will perform the full tree construction.
[Figure: alternative pipeline with the split moved up: clients collect gradient histograms, server constructs the full tree]
In this case, the local best and global best are just two argmax steps on the server side.
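
Concretely, once the decrypted global histograms are on the server, both selections reduce to the standard gain computation plus argmax. A minimal sketch, assuming the usual XGBoost split gain with L2 regularization term lam (illustrative code, not the actual implementation):

    # Sketch: server-side split finding over decrypted histograms.
    def split_gain(GL, HL, GR, HR, lam=1.0):
        # Standard XGBoost gain: score(left) + score(right) - score(parent).
        parent = (GL + GR) ** 2 / (HL + HR + lam)
        return GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam) - parent

    def best_split(G_bins, H_bins, lam=1.0):
        G_total, H_total = sum(G_bins), sum(H_bins)
        best_gain, best_cut = float("-inf"), -1
        GL = HL = 0.0
        for cut in range(len(G_bins) - 1):   # candidate cut after each bin
            GL += G_bins[cut]
            HL += H_bins[cut]
            gain = split_gain(GL, HL, G_total - GL, H_total - HL, lam)
            if gain > best_gain:             # the argmax step
                best_gain, best_cut = gain, cut
        return best_gain, best_cut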

This alternative vertical distributed learning pipeline will be an additional solution and will not break or replace our existing pipeline. Once it is implemented, HE components can be added on top. This benefits the wider user community and reduces the risks of directly injecting HE components by refactoring XGBoost's code base.

Feature Parallelism - Alternative Design

Current XGBoost already has a vertical / column data split for distributed training:

  • Data loading and the prediction part stay unchanged.
  • Federated split / label constraint stays unchanged.
  • The label owner calculates the gradients and broadcasts them to the other workers.
  • The label owner broadcasts the split results to the others.

The change is for the particular step of finding the best split:

  • Each worker only computes the histograms based on the received gradients and sends them back to the label owner by performing an allgather.
  • The label owner then finds the best local, and then global, split (one round of this flow is sketched below).
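
Putting these steps together, one training round of the alternative pipeline looks roughly as follows (pseudocode; broadcast, receive_broadcast, and allgather stand in for xgboost's actual collective calls, and no encryption is involved yet):

    # Illustrative per-round flow of the alternative vertical pipeline.
    def train_one_round(worker):
        if worker.is_label_owner:
            g, h = worker.compute_gradients()       # label owner computes g/h pairs
            broadcast((g, h), root=worker.rank)     # ...and broadcasts them
        else:
            g, h = receive_broadcast()

        local_hist = worker.build_histograms(g, h)  # over this worker's own features
        all_hists = allgather(local_hist)           # histograms from every worker

        if worker.is_label_owner:
            split = find_best_split(all_hists)      # local + global best in one pass
            broadcast(split, root=worker.rank)      # split result back to the others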

Design for Encrypted Vertical Training - Part 2: adding encryption via a processor interface

With the alternative design implemented, we have two options to add HE secure features:

  • Add them directly to the xgboost library if the HE library has native C++ support - for example, using the SEAL/Paillier C++ API to encrypt the gradients before they are sent to workers. The drawback is that this makes xgboost dependent on a given HE library, which is not preferred. Update: SEAL is too expensive in either ciphertext size or computation time for this particular application; we need to adopt the Paillier encryption scheme instead.
  • Add the HE features via an interface, with a local gRPC handler providing encryption and communication for a given message from xgboost. In short, the interface receives a message from the xgboost/gRPC handler, performs the necessary processing, and sends the result back to xgboost. The main purpose is to satisfy the MPI message-communication requirements while controlling the actual contents. This decouples xgboost from any particular HE package and gives us the flexibility to use any solution with a Python binding.

With the above considerations, Option 2 (interface + gRPC) serves our needs better, so we choose this path.
The design of the processor interface is as follows:

  • xgboost calls the interface for data processing (serialization, etc.), providing the necessary information: g/h pairs, the histogram index matrix, and the participating row-index sets.
  • The interface performs the necessary processing and sends the results back to xgboost.
  • xgboost then forwards the message to the local gRPC handler.
  • The encryption / decryption / secure add can be performed either at the local gRPC handler, by reaching out to external encryption utils in Python, or at the processor plugin, by encryption utils in C++.
  • The interface performs post-processing after getting results from the server upon xgboost's communication calls (broadcast, allgather) to recover the proper information for xgboost (a sketch of such an interface follows).
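
A minimal sketch of what such a processor interface could look like, written in Python for readability (the class and method names here are hypothetical; the actual interface is defined by the implementation PRs and lives on the C++ side):

    # Hedged sketch of a processor/plugin interface; all names are illustrative.
    from abc import ABC, abstractmethod

    class ProcessorInterface(ABC):
        @abstractmethod
        def process_gpairs(self, gpairs):
            """Serialize (and optionally encrypt) g/h pairs into an opaque buffer
            that xgboost hands to the local gRPC handler for broadcast."""

        @abstractmethod
        def process_histograms(self, hist_index, row_sets):
            """Serialize per-bin aggregation inputs before the allgather call."""

        @abstractmethod
        def postprocess(self, buffer):
            """Recover plain results for xgboost after broadcast/allgather returns."""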

The flow chart with local handler-side encryption is shown below; this way, xgboost has no dependency on any encryption libraries:
[Figure: workflow with local gRPC handler-side encryption]

The flow chart with XGBoost-side encryption is shown below; the benefit is more efficient encryption by avoiding cross-platform data conversions, at the cost of extra requirements on the xgboost build:
[Figure: workflow with XGBoost-side encryption]

Further details of the dashed lines: the interface performs pre- and post-processing between the buffer and the actual data, but never directly sends / receives messages for MPI calls like broadcast and allgather.
[Figure: processor interface pre-/post-processing around broadcast and allgather]

Key points:

  • xgboost needs two modifications: the processor interface, and the integration code calling the interface at the correct locations.
  • Encryption / decryption / secure aggregation will be performed at the local gRPC handler on the NVFlare side.
  • Only encrypted messages go from local to external (controlled by NVFlare); clear-text information stays local.

Design for Secure Evaluation and Validation during training - avoid leakage of feature cut value

Under vertical collaboration, sharing the feature "cut value" is a potential information leak: if other parties learn the precise cut value and which samples fall on each side of it, they may infer what the particular feature indicates and hence the sample characteristics.

To protect this information during secure evaluation and validation, conceptually, the active party holds two pieces of information: which passive party owns the current split feature, and which bin (the index) is used to perform the split. It then (again, conceptually) forwards all samples to that party, who holds a lookup table mapping the bin index to the actual "cut value" used to perform the node split.

Hence, the "model" is saved in a distributed manner: the active party holds a "partial" xgb model with cut indices rather than cut values, while the passive parties hold the other part of the model, the lookup tables mapping cut indices to cut values. At model deployment, the models at all parties need to work collaboratively to make the final prediction.

In practice, for our current xgboost implementation, we notice that with AllReduce()/AllGather() producing the global histogram, all parties already have access to the "indexing" information of the global histogram. Therefore, there is no need to implement the "lookup" table to hide the "cut point index"; instead, we aim to protect only the "cut point value" under the current xgboost design.

In this case, we can design our secure inference as follows:

  • We do not sync the cut values across parties.
  • The best split is found from the global histogram and contains the feature index and the cut index.
  • The cut index is mapped back to a cut value at the passive party who owns the feature, and saved there. The other parties save a "NaN" value for this particular cut.

An example of the models (Party 1 owns features 1-8, Party 2 owns features 9-15):

    Party 1: "split_indices": [14, 2, 7, 9, 5, ...]  "split_conditions": [NaN, 0.5, -134, NaN, 768, ...]
    Party 2: "split_indices": [14, 2, 7, 9, 5, ...]  "split_conditions": [-2.28, NaN, NaN, 56, NaN, ...]
The collaborative inference is made by running all the models together. If we merge the models collected from all parties, we produce exactly the same model as our existing vertical pipeline, as sketched below.
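
To make the merging concrete, here is a minimal sketch (illustrative only) that fills each NaN condition with the value held by the feature's owner, reproducing the plain global model:

    # Illustrative merge of per-party partial models into one plain model.
    import math

    def merge_split_conditions(per_party_conditions):
        # One "split_conditions" list per party; for every node, exactly one
        # party (the feature owner) holds the real cut value.
        merged = []
        for values in zip(*per_party_conditions):
            real = [v for v in values if not math.isnan(v)]
            merged.append(real[0] if real else float("nan"))
        return merged

    party1 = [float("nan"), 0.5, -134, float("nan"), 768]
    party2 = [-2.28, float("nan"), float("nan"), 56, float("nan")]
    print(merge_split_conditions([party1, party2]))
    # [-2.28, 0.5, -134, 56, 768], the same values the existing vertical pipeline produces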

Potential Considerations for Secure XGBoost

Implementing an alternative vertical scheme will likely have no adverse impact on user experience, since it only modifies the information flow without changing the underlying theory.
Testing with an encryption library, by directly injecting HE encryption / decryption into the pipeline, is necessary for benchmarking the performance and comparing with NVFlare's native support. In the future, we shall choose between the two, balancing 1) HE efficiency against 2) refactoring of the XGBoost code base without producing much benefit for users who have no secure needs.

Most related existing PRs

This work will mainly be an extension of XGBoost's vertical FL support.
The most related PRs to base upon and modify are:

Task list to track the progress

@trivialfis
Member

One question for @rongou

Therefore conceptually, we need to "move" the server-client split up, such that client only performs gradient collection, and server will perform the full tree construction.

In that case, may I assume the local worker will no longer perform evaluation on gradient histogram and doesn't need to synchronize the split entry?

@rongou
Contributor

rongou commented Jan 17, 2024

One question for @rongou

Therefore conceptually, we need to "move" the server-client split up, such that client only performs gradient collection, and server will perform the full tree construction.

In that case, may I assume the local worker will no longer perform evaluation on gradient histogram and doesn't need to synchronize the split entry?

Right, the clients (passive parties) only need to calculate the gradient histograms and send them back to the server (active party); the active party then does the split finding. There is no longer local vs. global split finding, since the active party has access to all the histograms and can do it in one step.

@trivialfis
Member

Thank you for sharing. Then we can start removing code in the evaluator. Is there any plan for when we can start the change?

@rongou
Contributor

rongou commented Jan 17, 2024

I think this is an additional mode of operation. For people who don't need encryption, the current approach for vertical federated learning is much more efficient.

@ZiyueXu77
Author

It will be an additional training mode, specifically designed for label protection by adding homomorphic encryption. It should not change current behavior; the current vertical training mode will remain the "mainstream" for users who do not need encryption. We need a flag / switch to select between these two vertical modes.

@da-niao-dan

da-niao-dan commented Jan 30, 2024

Subject: Proposal for Integrating SecretFlow/HEU as an HE Alternative in XGBoost

Hi XGBoost Community,

I'm a developer from the SecretFlow community, and we would like to suggest the integration of SecretFlow/HEU (https://github.com/secretflow/heu) as an alternative homomorphic encryption (HE) solution to the SEAL library in Secure XGBoost.

HEU is an open-source library dedicated to high-performance HE, offering several benefits for XGBoost:

  1. Suitability: Partial homomorphic encryption (PHE) is more suitable for XGBoost's use case than fully homomorphic encryption (FHE). The encryption tasks required in XGBoost's histogram computations are well within the scope of PHE.

  2. Performance: PHE typically offers better performance than FHE in scenarios that don't exhibit SIMD characteristics, such as XGBoost's gradient summations. Algorithms such as BFV/BGV suit SIMD scenarios and require packing a large number of plaintexts (e.g., 4096) into one ciphertext to obtain relatively high performance; otherwise, the ciphertext expansion rate and performance of FHE are very poor, worse than PHE's.

  3. Optimization: HEU includes highly optimized versions of OU and Paillier algorithms, which could enhance the performance of SecureBoost implementations in XGBoost. For benchmarks, please see: https://www.secretflow.org.cn/docs/heu/main/en-US/getting_started/algo_choice.

For more information on HEU, you can visit our documentation here: https://www.secretflow.org.cn/docs/heu/main/en-US.

We are ready to work with the XGBoost community and contribute HEU support, creating a collaborative effort with both the XGBoost and Nvidia teams for a secure and high-performance Secure XGBoost implementation. We anticipate submitting a PR that will integrate HEU into XGBoost and are looking forward to engaging with the community on this initiative.

Thank you for considering our proposal.

Kind regards,
Dan
Developer, SecretFlow Community

@ZiyueXu77
Author

Hi @da-niao-dan
Thanks for the inputs!
Indeed, PHE schemes like Paillier satisfy the needs of the SecureBoost implementation. We will be happy to test it once we have the pipeline implemented in XGBoost. Will keep you posted.

@lidh15

lidh15 commented Feb 4, 2024

Well, we believe that SecretFlow itself has integrated a whole SecureBoost workflow; how will the two implementations differ from each other?
Which one will be better in terms of memory usage and communication cost, xgboost's federated learning or SecretFlow, in the future?

@da-niao-dan

Well, we believe that SecretFlow itself has integrated a whole SecureBoost workflow; how will the two implementations differ from each other? Which one will be better in terms of memory usage and communication cost, xgboost's federated learning or SecretFlow, in the future?

Hi,

Thank you for your interest in the secure implementations provided by SecretFlow and XGBoost.

SecretFlow indeed includes a SecureBoost workflow, and I've had the opportunity to contribute significantly to its development. To our knowledge, the Secure Gradient Boosting (SGB) within SecretFlow is one of the fastest open-source implementations available. It features a comprehensive homomorphic encryption (HE) pipeline and includes optimizations tailored to HE calculations.

That said, the current SGB in SecretFlow encompasses only a limited subset of the functionalities offered by XGBoost. Expanding SGB's capabilities would involve a considerable effort, much like reinventing existing features, especially because many functionalities outside of HE and communication should be consistent between XGBoost and SGB in SecretFlow.

By integrating our most efficient HE optimization techniques directly into XGBoost, along with potential enhancements in security aspects, we aim to achieve the most robust version of secure XGBoost. This integration strategy not only streamlines development but also benefits the broader open-source community by leveraging XGBoost's established framework.

We believe that this collaborative approach will ultimately yield a secure XGBoost implementation that excels in both memory usage and communication efficiency, setting a new standard for federated learning.

Best regards,
Dan

@ZiyueXu77 ZiyueXu77 changed the title Vertical Federated Learning with Secure Features RFC Vertical Federated Learning with Secure Features (secure inference and encrypted training) RFC Feb 22, 2024
@ZiyueXu77
Author

@rongou @trivialfis, I added a section for the design for secure inference following our discussions. It is independent of the encrypted training part, but is definitely meaningful, as @trivialfis raised. Feel free to comment.

@trivialfis
Member

The SecureBoost approach sounds good. You might be able to implement the lookup table as a partial RegTree by having a subclass; if that's feasible, we should be able to simplify the code a lot.

@rongou
Contributor

rongou commented Feb 23, 2024

With the proposed training approach, the active party would have the full model, right? But for inference, you assume it won't have the cut values. So how does one go from training to inference? Actually inference is part of the training process for evaluations.

@ZiyueXu77
Author

With the proposed training approach, the active party would have the full model, right? But for inference, you assume it won't have the cut values. So how does one go from training to inference? Actually inference is part of the training process for evaluations.

The active party will not have the real "full model" (one that can be directly used on a sample with all features). Instead, it will have a full tree model, but with feature index + feature bin index for each split. The actual cut value needs to be retrieved by the feature owner from the lookup table according to the bin index. This is the tricky part, because it deviates from our standard xgb model format: at inference time, all parties need to be present for full parsing.
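
For illustration, a rough sketch of what that collaborative parsing implies at inference time (owner_of and ask_owner are hypothetical helpers; the real mechanism depends on the deployment):

    # Illustrative tree traversal when cut values are distributed across parties.
    def traverse(tree, node_id, sample_id, features, my_party, owner_of, ask_owner):
        node = tree[node_id]
        if node.is_leaf:
            return node.value
        if owner_of(node.feature) == my_party:
            go_left = features[node.feature] < node.cut_value  # local cut value
        else:
            # The feature owner maps the stored bin index back to its private
            # cut value and returns only the branching decision for this sample.
            go_left = ask_owner(node.feature, node.bin_index, sample_id)
        child = node.left if go_left else node.right
        return traverse(tree, child, sample_id, features, my_party, owner_of, ask_owner)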

@rongou
Contributor

rongou commented Feb 23, 2024

This is beyond what the SecureBoost paper is doing, right? Agreed it's a bit tricky since we don't have a single global model any more.

@ZiyueXu77
Author

This is beyond what the SecureBoost paper is doing, right? Agreed it's a bit tricky since we don't have a single global model any more.

It's part of the SecureBoost solution, although independent of the encryption part.

@rongou
Contributor

rongou commented Feb 23, 2024

Ah I see it does talk about the lookup tables, which need to be part of the model on each passive party.

@da-niao-dan

Hello, I see the vertical pipeline PR is already merged. We are now ready to work on the encryption part, right?

@ZiyueXu77
Author

ZiyueXu77 commented Mar 13, 2024

Hello, I see the vertical pipeline PR is already merged. We are now ready to work on the encryption part, right?

Hi, the merged PR is only "Part 1" of the whole secure pipeline; three other PRs are needed to get the pipeline working. I see your solution is Paillier-based, which matches our choice, so that is good. With the "Processor interface", it will be easy to integrate the pipeline with any given encryption library.

Details of the other three PRs:

  • Secure evaluation/validation during training: this one is independent of encryption and is almost ready for merge.
  • Processor interface: this is the most relevant PR; please see the flow-chart details above. You may consider contributing your encryption scheme here in xgboost, or to our federated learning platform NVFlare (where we will perform encryption/decryption).
  • Integration code updates for the interface.

@ZiyueXu77
Author

Major implementation is done following the RFC; closing as complete.
