Dynamic roles which can technically support any potential frameworks #144

Jeffwan · 2021-07-14T04:29:10Z

Every framework's implementation is pretty close, and I am thinking we actually don't need that many controllers/operators. If we can support custom roles, most popular frameworks can adapt to it.

The major challenge is letting the controller know how to construct the cluster spec environment. If there's a way to represent it in an annotation, label, etc., that might be feasible. I am also open to other options.

zw0610 · 2021-07-14T06:36:42Z

I think it's a great idea to support custom roles. In my experience, there are many situations where we need to further extend the definition of roles. Moreover, we should not limit the customization to the pod environment. Instead, it might be a good idea to let users decorate the pod template for each customized role.

Without changing the architecture of the current kubeflow operator design too much, I would suggest the following approach:

We can add a DecoratePod(template *corev1.PodTemplate, rtype commonv1.ReplicaType) method to PodReconcilerInterface and let ConstructPod (ReconcilerPod -> CreatePod -> ConstructPod) call DecoratePod just before it returns.
When launching the manager, the user can specify a customization server address per ReplicaType, like /opt/kubeflow/tf-operator.v1 --decorator CWorker,10.1.2.9:8080,PSX,/var/psx.sock, and this information will be registered in the manager.
In the implementation of BasePodReconciler (which implements the base functionality of PodReconcilerInterface), DecoratePod does nothing to the template *corev1.PodTemplate if the user has not specified a decorator for the corresponding ReplicaType; otherwise it calls the registered decorator server to update the pod template.

If developers prefer to modify the source code and re-compile and re-deploy the operator, they can simply override the implementation of DecoratePod in DerivedPodReconciler or XXXJobPodReconciler so that it switches between decoration strategies based on the ReplicaType.
If developers prefer not to modify the existing operator code (for example, to add a customized role to tf-operator without re-compiling tf-operator), they can just deploy the corresponding decorator server, expose it, and specify its address in the launch arguments.
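
To make the proposal concrete, here is a minimal Go sketch of the flag parsing and the no-op-by-default DecoratePod. It is an illustration under assumptions, not the operator's code: PodTemplate is a simplified stand-in for corev1.PodTemplate, ReplicaType is modelled as a plain string type, ParseDecoratorFlag and the Decorator function type are hypothetical names, and the decorator is a local function where the proposal would call out to a remote decorator server.

```go
package main

import (
	"fmt"
	"strings"
)

// PodTemplate is a simplified stand-in for corev1.PodTemplate (assumption).
type PodTemplate struct {
	Env map[string]string
}

// ReplicaType stands in for commonv1.ReplicaType, modelled here as a string.
type ReplicaType string

// Decorator mutates a pod template for one replica type (hypothetical type).
// In the proposal this would be a call to the registered decorator server.
type Decorator func(tmpl *PodTemplate) error

// ParseDecoratorFlag (hypothetical helper) turns a value such as
// "CWorker,10.1.2.9:8080,PSX,/var/psx.sock" into a map from replica type
// to decorator server address.
func ParseDecoratorFlag(v string) (map[ReplicaType]string, error) {
	out := map[ReplicaType]string{}
	if v == "" {
		return out, nil
	}
	parts := strings.Split(v, ",")
	if len(parts)%2 != 0 {
		return nil, fmt.Errorf("expected type,address pairs, got %d fields", len(parts))
	}
	for i := 0; i < len(parts); i += 2 {
		out[ReplicaType(parts[i])] = parts[i+1]
	}
	return out, nil
}

// BasePodReconciler holds the decorators registered at manager start-up.
type BasePodReconciler struct {
	decorators map[ReplicaType]Decorator
}

// DecoratePod leaves the template untouched unless a decorator is
// registered for rtype, in which case it delegates to that decorator.
func (r *BasePodReconciler) DecoratePod(tmpl *PodTemplate, rtype ReplicaType) error {
	d, ok := r.decorators[rtype]
	if !ok {
		return nil
	}
	return d(tmpl)
}

func main() {
	addrs, err := ParseDecoratorFlag("CWorker,10.1.2.9:8080,PSX,/var/psx.sock")
	if err != nil {
		panic(err)
	}
	r := &BasePodReconciler{decorators: map[ReplicaType]Decorator{}}
	for rtype, addr := range addrs {
		addr := addr // capture the per-iteration value in the closure
		r.decorators[rtype] = func(tmpl *PodTemplate) error {
			tmpl.Env["DECORATED_BY"] = addr // placeholder for a remote call
			return nil
		}
	}
	tmpl := &PodTemplate{Env: map[string]string{}}
	_ = r.DecoratePod(tmpl, "Worker")  // no decorator registered: unchanged
	_ = r.DecoratePod(tmpl, "CWorker") // registered: template is updated
	fmt.Println(tmpl.Env)
}
```

A derived reconciler could override DecoratePod with compiled-in logic instead, which corresponds to the first of the two options described above.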

Jeffwan · 2021-07-29T03:33:19Z

Yeah. I am wondering how we can insert the "clusterSpec" environment for different frameworks?

{
"worker": ["worker0.example.com:2222","worker1.example.com:2222","worker2.example.com:2222"],
"ps": ["ps0.example.com:2222","ps1.example.com:2222"]
}

Different frameworks have different settings for this part. The easiest way is to have some predefined templates in the code.
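
As a rough sketch of what generating a cluster spec like the one above could look like, the following Go snippet builds the role-to-addresses map from replica counts and marshals it to JSON. BuildClusterSpec is a hypothetical helper, and the host naming pattern (job-role-index) and the port are assumptions for illustration only; a real controller would derive addresses from its own services.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BuildClusterSpec (hypothetical helper) returns, for each role, one
// "host:port" entry per replica. The <job>-<role>-<index> host pattern is
// an illustrative assumption, not the operator's actual naming scheme.
func BuildClusterSpec(job string, replicas map[string]int, port int) map[string][]string {
	spec := make(map[string][]string, len(replicas))
	for role, n := range replicas {
		hosts := make([]string, 0, n)
		for i := 0; i < n; i++ {
			hosts = append(hosts, fmt.Sprintf("%s-%s-%d:%d", job, role, i, port))
		}
		spec[role] = hosts
	}
	return spec
}

func main() {
	spec := BuildClusterSpec("multi-worker", map[string]int{"worker": 3, "ps": 2}, 2222)
	b, err := json.Marshal(spec) // encoding/json emits map keys in sorted order
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

The map itself is framework-neutral; what differs between frameworks is how it is rendered into environment variables, which is where predefined templates would come in.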

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker
  labels:
    # CustomJob can leverage this label to determine how it injects the environment.
    # We could even put the topology format here to further simplify the controller's work, but that would be bug-prone.
    framework: tensorflow
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
....
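
A minimal sketch of the label-driven idea, assuming the controller keeps a table of predefined environment templates keyed by the framework label value. The names EnvBuilder, builders, and EnvForJob are hypothetical; the only template shown renders TensorFlow's TF_CONFIG variable with its "cluster" and "task" keys.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ClusterSpec maps a role name to the addresses of its replicas.
type ClusterSpec map[string][]string

// EnvBuilder (hypothetical) renders the environment for a single replica.
type EnvBuilder func(spec ClusterSpec, role string, index int) (map[string]string, error)

// tensorflowEnv renders TF_CONFIG, the JSON document TensorFlow reads to
// discover the cluster layout and the current task.
func tensorflowEnv(spec ClusterSpec, role string, index int) (map[string]string, error) {
	cfg := map[string]interface{}{
		"cluster": spec,
		"task":    map[string]interface{}{"type": role, "index": index},
	}
	b, err := json.Marshal(cfg)
	if err != nil {
		return nil, err
	}
	return map[string]string{"TF_CONFIG": string(b)}, nil
}

// builders is the table of predefined templates, keyed by the value of the
// "framework" label. Other frameworks would register their own entries.
var builders = map[string]EnvBuilder{
	"tensorflow": tensorflowEnv,
}

// EnvForJob picks a template based on the job's "framework" label.
func EnvForJob(labels map[string]string, spec ClusterSpec, role string, index int) (map[string]string, error) {
	fw, ok := labels["framework"]
	if !ok {
		return nil, fmt.Errorf("job has no framework label")
	}
	build, ok := builders[fw]
	if !ok {
		return nil, fmt.Errorf("no environment template for framework %q", fw)
	}
	return build(spec, role, index)
}

func main() {
	spec := ClusterSpec{
		"worker": {"worker0.example.com:2222", "worker1.example.com:2222"},
		"ps":     {"ps0.example.com:2222"},
	}
	env, err := EnvForJob(map[string]string{"framework": "tensorflow"}, spec, "worker", 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(env["TF_CONFIG"])
}
```

Keeping the templates in code means an unknown label value fails fast with an error, whereas encoding the topology format in the label itself would push parsing and validation into the controller.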

Jeffwan added the enhancement (New feature or request) label Jul 14, 2021

Jeffwan mentioned this issue Aug 26, 2021

[Core Feature] Support Kubeflow training-operator flyteorg/flyte#1375

Closed

Jeffwan mentioned this issue Sep 6, 2021

[WIP]: add a GenericJob type and controller kubeflow/training-operator#1398

Closed
