IBMFL Multi-Cloud and Hybrid Cloud Orchestrator automates the deployment and monitoring of aggregator and party process using federated learning library docker image on OpenShift clusters which are setup on different cloud data center regions. IBMFL OpenShift Orchestrator works on either single or multi clusters.
- Authentication using kubeconfig file to multiple OpenShift Clusters
- Creation and deployment of aggregator and party pods to OpenShift Clusters
- Network communication between aggregator and party process using OpenShift Routes
- Runs experiment by coordinating execution of training commands between aggregator and party process
- Support for multiple trials and parallelization of multiple experiments
- Logs and metrics capture for experiments
Please refer to examples/iter_avg_openshift
folder on how to use IBMFL OpenShift Orchestrator to run IBMFL experiments. In the example, follow instructions in README.md
and README_Multiple_Exp.md
to run single and multiple federated experiments respectively.
To run orchestrator on your laptop, you can install the OpenShift community edition like OKD (https://www.okd.io/) or minshift (https://www.okd.io/minishift/).
Please refer to config_openshift_sample.yml
file for various keys of config file. The config file consists of two main sections: cluster
and experiments
.
cluster
contains the openshift cluster configuration for aggregator and party pods.
key | description | mandatory | default_value |
---|---|---|---|
agg_pod:cpu |
cpu for aggregator pod | no | 2 |
agg_pod:mem |
memory for aggregator pod | no | 4Gi |
party_pod:cpu |
cpu for party pod | no | 2 |
party_pod:mem |
memory for party pod | no | 4Gi |
kube_config_location |
location of kube config file | no | ~/.kube/config |
-
experiments
contains the configuration for federated learning experiments.default
contains the default settings for all experiments in the list.
key | description | mandatory | default_value |
---|---|---|---|
exec_mode |
run experiments in sequential or parallel mode, options: [seq,parallel] |
no | seq |
image_name |
Name of IBMFL image installed in OpenShift image streams | no | ibmfl:latest |
commands:aggregator |
FL commands to execute from aggregator pod | no | ['START', 'TRAIN', 'EVAL', 'STOP'] |
Each item in the experiment list consist of following keys
key | description | mandatory | default_value |
---|---|---|---|
staging_dir |
local directory where experiment config and data files are stored | yes | |
name |
unique name for experiment | no | ibmfl |
num_trials |
number of trials for experiment | no | 1 |
cluster_list |
List of OpenShift clusters connection details | yes | |
data:pvc_name |
persistent volume claim name for COS bucket which holds the datasets and model artifacts | no |
For COS bucket access, please set up Persistent Volume (PV) and Persistent Volume Claim (PVC) in OpenShift Cluster and provide the PVC name in data:pvc_name
key.
Each item in the cluster list consist of following keys
key | description | mandatory | default_value |
---|---|---|---|
context_name |
context name of OpenShift Cluster as defined in kube config file | yes | |
namespace |
namespace of OpenShift Cluster as defined in kube config file | yes |
If multiple openshift clusters are configured, the aggregator pod will be deployed in the first cluster of the list and parties will be equally split among clusters.
- Orchestrator logs can be found at
error.log
andinfo.log
for debugging. - Orchestrator collects aggregator and party logs inside experiment folder to debug aggregator and party training processes.
- In the event orchestrator process gets stuck at pod creation step, you can use the follow commands to debug
oc get pods
to view pod statusoc logs -f [pod_name]
to view pod logsoc delete pod [pod_name]
to delete any hanging pods