
[feature] Support different PS/worker types #1369

Closed
gaocegege opened this issue Aug 17, 2021 · 19 comments

@gaocegege
Member

In some customer cases, users want to schedule one PS on a GPU machine and place the other PSes on CPU machines, like this:

  tfReplicaSpecs:
    PS-1:
      replicas: 3
      template:
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: gpu-type
                    operator: In
                    values:
                    - "true"
                topologyKey: topology.kubernetes.io/zone
          containers:
            - name: tensorflow
              image: xxx
    PS-2:
      replicas: 5
      template:
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: gpu-type
                    operator: In
                    values:
                    - "false"
                topologyKey: topology.kubernetes.io/zone
          containers:
            - name: tensorflow
              image: xxx

/cc @zw0610

@zw0610
Member

zw0610 commented Aug 17, 2021

I've implemented this feature for internal use. If the community believes this feature is desirable, I can push the changes to this repository.

@gaocegege
Member Author

It is widely needed in the industry, I think.

/cc @kubeflow/wg-training-leads

@johnugeorge
Member

Interesting. Can you give one example here? When would a user prefer GPU for some PSes and CPU for others?

@zw0610
Member

zw0610 commented Aug 17, 2021

Interesting. Can you give one example here? When would a user prefer GPU for some PSes and CPU for others?

I can offer two cases:

Case 1: Diversity on ParameterServer

For a PS/Worker distributed training job, we want to maximize the communication bandwidth between workers and parameter servers, so whenever possible we should co-locate workers and parameter servers on the same node. However, since node resources are limited, some parameter servers may not fit on those nodes. In this case, we need to configure different parameter servers with distinct pod affinities.
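
As a minimal sketch only (not part of the current TFJob API), distinct pod affinities per PS group could look like the following; the type names PS-colocated and PS-overflow and the training-role: worker pod label are hypothetical:

  tfReplicaSpecs:
    PS-colocated:   # hypothetical type: PSes co-located with workers
      replicas: 2
      template:
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: training-role   # hypothetical label carried by worker pods
                    operator: In
                    values:
                    - worker
                topologyKey: kubernetes.io/hostname
          containers:
            - name: tensorflow
              image: xxx
    PS-overflow:    # hypothetical type: remaining PSes with no co-location constraint
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: xxx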

Case 2: Diversity on Worker

When (tf/mpi/pytorch-)jobs are deployed in a cluster, fragmented resources are often left over, such as 2c4g on one node and 4c2g on another. With the traditional TFJob specification, these leftovers are effectively wasted. A diverse TFJob, however, can configure workers with multiple resource configurations, so any leftover fragment can be picked up by a worker of some diverse TFJob.
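
For illustration, a hedged sketch of such a diverse worker spec, with the type names Worker-2c4g and Worker-4c2g made up to mirror the fragment sizes above:

  tfReplicaSpecs:
    Worker-2c4g:   # hypothetical type sized for the 2-CPU / 4G fragments
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: xxx
              resources:
                requests:
                  cpu: "2"
                  memory: 4G
    Worker-4c2g:   # hypothetical type sized for the 4-CPU / 2G fragments
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: xxx
              resources:
                requests:
                  cpu: "4"
                  memory: 2G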

Of course, to run workers in a diverse TFJob, the TensorFlow optimizer should be chosen carefully so that it can perform asynchronous gradient updates as well as learning-rate adjustment for workers with different batch sizes.

@gaocegege
Member Author

When would a user prefer GPU for some PSes and CPU for others?

It's more about scheduling. For example, there is a GPU node with 2 GPUs, 64 CPUs, and 126G of memory. One worker uses 1 GPU, 25 CPUs, and 40G, so another PS can use 14 CPUs and 46G of memory on the same node.

Besides this, there are some CPU nodes with 16 CPUs and 32G of memory. The PSes on those CPU nodes should then use 16 CPUs and 32G.
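
As an illustration only (type names and replica counts are hypothetical), the two PS flavors above could then be declared with different resource requests:

  tfReplicaSpecs:
    PS-gpu-node:   # hypothetical type: fills the leftover capacity on the GPU node
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: xxx
              resources:
                requests:
                  cpu: "14"
                  memory: 46G
    PS-cpu-node:   # hypothetical type: sized for the 16-CPU / 32G CPU nodes
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: xxx
              resources:
                requests:
                  cpu: "16"
                  memory: 32G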

@gaocegege
Member Author

gaocegege commented Aug 17, 2021

Maybe we could wait until all-in-one is released.

@johnugeorge
Member

Got it. So the need arises from resource allocation.

@terrytangyuan
Member

This would require a customized optimizer and a dynamic batch-sizing algorithm though. I am curious to see whether there are any good practices around this from your internal experiments.

@zw0610
Member

zw0610 commented Aug 23, 2021

This would require a customized optimizer and a dynamic batch-sizing algorithm though. I am curious to see whether there are any good practices around this from your internal experiments.

Definitely, a new optimizer that can cope with this volatile environment is required. But for the diverse parameter server mode, a regular optimizer supporting asynchronous updates is sufficient; dynamic batch-size adjustment mainly concerns the diverse worker mode.

Regarding experiments, we probably need to provide this feature in tf-operator first, so that algorithm researchers have an environment in which to proceed.

@WEICHINLIN

Has this feature been implemented?
I want to know how it is implemented.

@zw0610
Member

zw0610 commented Nov 22, 2021

Has this feature been implemented? I want to know how it is implemented.

I only have a PoC implementation based on v0.5.3: https://github.com/zw0610/tf-operator/tree/diverse-worker

@gaocegege
Member Author

gaocegege commented Nov 22, 2021

Has this feature been implemented? I want to know how it is implemented.

Can you please describe your use scenario? Maybe there is another workaround.

@WEICHINLIN

My previous question was not phrased well.
I want to understand the purpose of this issue, including whether the YAML file above can actually be applied and what problem the issue aims to solve.
I didn't quite understand after reading the previous comments.

@gaocegege
Member Author

We want to support heterogeneous PSes/workers.

For example, some workers train on CPU while others use GPU.
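
A hedged sketch of what that could look like (the type names are hypothetical; nvidia.com/gpu is the standard extended resource name for NVIDIA GPUs):

  tfReplicaSpecs:
    Worker-cpu:   # hypothetical type: trains on CPU only
      replicas: 4
      template:
        spec:
          containers:
            - name: tensorflow
              image: xxx
    Worker-gpu:   # hypothetical type: requests one GPU per worker
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: xxx
              resources:
                limits:
                  nvidia.com/gpu: 1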

@WEICHINLIN

Does this mean that distributed training has both PS tasks and worker tasks, so that CPU and GPU can both contribute to training at the same time?

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot closed this as completed Apr 30, 2022
@Windfarer
Contributor

Any update on this issue? Heterogeneous workers have become increasingly important for training LLMs.

@gaocegege
Member Author

I do not think it is on the roadmap.

@andreyvelich
Member

@gaocegege @Windfarer I think we can implement this feature as part of the V2 APIs: #2171.

Users will be able to create a TrainingRuntime that uses a different Job template for every PS:

apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: tf-ps-diversity
spec:
  numNodes: 5
  replicatedJobs:
    - name: PS-GPU-V100
      template:
        ....
    - name: PS-GPU-H100
      template:
        ....
    - name: Worker
      template:
        ...
