This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Dynamic roles which can technically support any potential frameworks #144

Open
Jeffwan opened this issue Jul 14, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

Member

Jeffwan commented Jul 14, 2021

Every framework's implementation is pretty close, and I am thinking we actually don't need that many controllers/operators. If we can support custom roles, most popular frameworks can adapt to them.

The major challenge is letting the controller know how to construct the cluster spec environment. If there's a way to represent it in an annotation, label, etc., that might be feasible. I am also open to other options.

Jeffwan added the enhancement label Jul 14, 2021
Member

zw0610 commented Jul 14, 2021

I think it's a great idea to support custom roles. In my experience, there are many situations where we need to extend the definition of roles further. Moreover, we should not limit the customization to the pod environment. Instead, it might be a good idea to let users "decorate" the pod template for each customized role.

Without changing the architecture of the current kubeflow operator design too much, I would suggest the following approach:

  1. We can add a DecoratePod(template *corev1.PodTemplate, rtype commonv1.ReplicaType) method to PodReconcilerInterface and have ConstructPod (ReconcilerPod -> CreatePod -> ConstructPod) call DecoratePod just before it returns.
  2. When launching the manager, the user can specify a customization server address per ReplicaType, e.g. /opt/kubeflow/tf-operator.v1 --decorator CWorker,10.1.2.9:8080,PSX,/var/psx.sock, and this information will be registered in the manager.
  3. In the implementation of BasePodReconciler (which implements the base functionality of PodReconcilerInterface), DecoratePod does nothing to the template *corev1.PodTemplate if the user has not specified a decorator for the corresponding ReplicaType; otherwise it calls the registered decorator server to update the pod template.
  • If developers prefer to modify the source code and then recompile and redeploy the operator, they can simply override DecoratePod in DerivedPodReconciler or XXXJobPodReconciler so that it switches between decoration strategies based on the ReplicaType.
  • If developers prefer not to modify the existing operator code (for example, to add a customized role to tf-operator without recompiling it), they can deploy the corresponding decorator server, expose it, and specify its address in the launch arguments.
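
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the operator's actual code: ReplicaType and PodTemplate are simplified stand-ins for commonv1.ReplicaType and the Kubernetes pod template type, the Decorator function type replaces the call to a remote decorator server, and ParseDecoratorFlag is a hypothetical helper for the comma-separated --decorator value shown in step 2.

```go
package main

import (
	"fmt"
	"strings"
)

// ReplicaType is a stand-in for commonv1.ReplicaType (assumption).
type ReplicaType string

// PodTemplate is a simplified stand-in for the Kubernetes pod template;
// only the fields needed for this sketch are modeled (assumption).
type PodTemplate struct {
	Labels map[string]string
	Env    map[string]string
}

// Decorator mutates a pod template for one replica type. In the proposal
// this would be a call to a remote decorator server; here it is a local
// function so the sketch stays self-contained.
type Decorator func(tmpl *PodTemplate) error

// ParseDecoratorFlag (hypothetical) parses a value such as
// "CWorker,10.1.2.9:8080,PSX,/var/psx.sock" into replica type -> address.
func ParseDecoratorFlag(value string) (map[ReplicaType]string, error) {
	out := map[ReplicaType]string{}
	if strings.TrimSpace(value) == "" {
		return out, nil
	}
	parts := strings.Split(value, ",")
	if len(parts)%2 != 0 {
		return nil, fmt.Errorf("decorator flag needs type,address pairs, got %d fields", len(parts))
	}
	for i := 0; i < len(parts); i += 2 {
		rtype, addr := strings.TrimSpace(parts[i]), strings.TrimSpace(parts[i+1])
		if rtype == "" || addr == "" {
			return nil, fmt.Errorf("empty replica type or address in pair %d", i/2)
		}
		out[ReplicaType(rtype)] = addr
	}
	return out, nil
}

// BasePodReconciler holds the decorators registered when the manager starts.
type BasePodReconciler struct {
	decorators map[ReplicaType]Decorator
}

// DecoratePod leaves the template untouched unless a decorator has been
// registered for rtype, in which case that decorator updates the template.
func (r *BasePodReconciler) DecoratePod(tmpl *PodTemplate, rtype ReplicaType) error {
	dec, ok := r.decorators[rtype]
	if !ok {
		return nil
	}
	return dec(tmpl)
}
```

A derived reconciler could override DecoratePod directly (the first bullet), while the registry path corresponds to the second bullet, where each address returned by ParseDecoratorFlag would be wrapped in a Decorator that calls the external server.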

Member Author

Jeffwan commented Jul 29, 2021

Yeah. I am thinking about how we can insert the "clusterSpec" environment for different frameworks.

{
"worker": ["worker0.example.com:2222","worker1.example.com:2222","worker2.example.com:2222"],
"ps": ["ps0.example.com:2222","ps1.example.com:2222"]
}

Different frameworks have different settings for this part. The easiest way is to have some predefined templates in the code.

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker
  labels:
    # CustomJob can leverage this label to determine how it injects the environment.
    # We could even put the topology format here to further simplify the controller's work, but that would be bug-prone.
    framework: tensorflow
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
....
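
The "predefined templates selected by a label" idea can be sketched as below. This is a minimal illustration, not the controller's implementation: ClusterSpec, EnvBuilder, envBuilders and BuildEnv are hypothetical names, and the "pytorch" entry together with its "master" role convention is an assumption added only to show a second template. The variable names themselves (TF_CONFIG for TensorFlow; MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK for PyTorch's environment-variable initialization) are the ones those frameworks read.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"strconv"
)

// ClusterSpec maps a role name to the addresses of its replicas.
type ClusterSpec map[string][]string

// EnvBuilder renders the environment variables for one replica.
type EnvBuilder func(cluster ClusterSpec, role string, index int) (map[string]string, error)

// tensorflowEnv emits TF_CONFIG with the cluster and this replica's task.
func tensorflowEnv(cluster ClusterSpec, role string, index int) (map[string]string, error) {
	cfg := map[string]interface{}{
		"cluster": cluster,
		"task":    map[string]interface{}{"type": role, "index": index},
	}
	raw, err := json.Marshal(cfg)
	if err != nil {
		return nil, err
	}
	return map[string]string{"TF_CONFIG": string(raw)}, nil
}

// pytorchEnv assumes a "master" role whose first address is the rendezvous
// point, with all other replicas ranked after the masters (assumption).
func pytorchEnv(cluster ClusterSpec, role string, index int) (map[string]string, error) {
	masters := cluster["master"]
	if len(masters) == 0 {
		return nil, fmt.Errorf("pytorch template needs a master replica")
	}
	host, port, err := net.SplitHostPort(masters[0])
	if err != nil {
		return nil, err
	}
	rank := index
	if role != "master" {
		rank = len(masters) + index
	}
	world := 0
	for _, addrs := range cluster {
		world += len(addrs)
	}
	return map[string]string{
		"MASTER_ADDR": host,
		"MASTER_PORT": port,
		"WORLD_SIZE":  strconv.Itoa(world),
		"RANK":        strconv.Itoa(rank),
	}, nil
}

// envBuilders is the set of predefined templates, keyed by the value of the
// "framework" label on the job.
var envBuilders = map[string]EnvBuilder{
	"tensorflow": tensorflowEnv,
	"pytorch":    pytorchEnv,
}

// BuildEnv picks a template from the job's labels and renders the env vars.
func BuildEnv(labels map[string]string, cluster ClusterSpec, role string, index int) (map[string]string, error) {
	builder, ok := envBuilders[labels["framework"]]
	if !ok {
		return nil, fmt.Errorf("no cluster spec template for framework %q", labels["framework"])
	}
	return builder(cluster, role, index)
}
```

Keeping the templates in a map means a new framework needs only one more entry, at the cost of recompiling the controller whenever a template changes.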
