[feature] Support different PS/worker types #1369

gaocegege · 2021-08-17T03:06:33Z

In some customer cases, users want to schedule one PS for one GPU machine, and place other PSes in CPU machines, like this:

  tfReplicaSpecs:
    PS-1:
      replicas: 3
      template:
        spec:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                 matchExpressions:
                 - key: gpu-type
                    operator: In
                    values:
                    - true
               topologyKey: topology.kubernetes.io/zone
          containers:
            - name: tensorflow
              image: xxx
    PS-2:
      replicas: 5
      template:
        spec:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                 matchExpressions:
                 - key: gpu-type
                    operator: In
                    values:
                    - false
               topologyKey: topology.kubernetes.io/zone
          containers:
            - name: tensorflow
              image: xxx

/cc @zw0610

zw0610 · 2021-08-17T03:08:17Z

I've implemented this feature for internal use. If the community believe this feature is desirable, let me push the changes to this repository.

gaocegege · 2021-08-17T03:11:40Z

It is widely needed in the industry, I think.

/cc @kubeflow/wg-training-leads

johnugeorge · 2021-08-17T03:27:20Z

Interesting. Can you give one example here? When would a user prefer GPU for some while CPU. for other PS?

zw0610 · 2021-08-17T03:40:57Z

Interesting. Can you give one example here? When would a user prefer GPU for some while CPU. for other PS?

I can offer two cases:

Case 1: Diversity on ParameterServer

For a PS/Worker distributed training task, as we wish to maximize the communication bandwidth between workers and parameter servers, once possible, we shall co-allocate workers and parameter servers. However, there might leave some parameter servers that cannot fit into node as the resource on a node is limited. In this case, we need to configure parameter servers with distinguished pod affinity.

Case 2: Diversity on Worker

When (tf/mpi/pytorch-)jobs are deployed in a cluster with fragmented resource left, such like 2c4g on this node or 4c2g on that node. With the traditional TFJob specification, these resources shall be permanently abandoned. However, a diverse TFJob can configure workers with multiple resource configuration so any left fragmented resources can be utilized by a worker in any diverse TFJobs.

Of source, to run a worker on a diverse TFJob, optimizer in TensorFlow should be carefully chosen to perform asynchronized gradient updating as well as learning rate adjusting for workers with various batch size.

gaocegege · 2021-08-17T03:41:08Z

When would a user prefer GPU for some while CPU. for other PS?

It's more about scheduling. For example, there is a GPU node with 2 GPU, 64 CPU, and 126G memory. One work uses 1 GPU, 25 CPU, and 40G. Then there can be another PS that uses 14 CPU, 46G memory.

Besides this, there are some CPU nodes with 16 CPU, 32G. Then PSes in CPU nodes should use 16 CPU, 32G.

gaocegege · 2021-08-17T06:24:15Z

�Maybe we could wait until all-in-one is released.

johnugeorge · 2021-08-17T09:12:19Z

Got it. Because of the resource allocation, need arises

terrytangyuan · 2021-08-17T14:54:35Z

This would require a customized optimizer and dynamic batch sizing algorithm though. I am curious to see if there's any good practices around this from your internal experiments.

zw0610 · 2021-08-23T03:03:29Z

This would require a customized optimizer and dynamic batch sizing algorithm though. I am curious to see if there's any good practices around this from your internal experiments.

Definitely, a new optimizer that can cope with this volatile environment is required. But diverse parameter server mode, a regular optimizer supporting asyncrhonized updating is sufficient. dynamic bs adjusting mainly comes from the diverse worker mode.

Regarding experiments, we probably need to provide this feature in tf-operator as the experiment environment for algorithm researchers to proceed.

WEICHINLIN · 2021-11-22T04:37:45Z

Has this function been implemented?
I want to know how to implement this function.

zw0610 · 2021-11-22T04:54:33Z

Has this function been implemented? I want to know how to implement this function.

I only have an implementation for poc based on v0.5.3: https://github.com/zw0610/tf-operator/tree/diverse-worker

gaocegege · 2021-11-22T06:21:03Z

Has this function been implemented? I want to know how to implement this function.

Can you please describe your use scenario? Maybe there is another workaround.

WEICHINLIN · 2021-11-22T08:47:02Z

My previous expression is not correct.
I want to understand the purpose of this ISSUE.
Including whether the above Yaml file can be executed, and the problem ISSUE want to solve.
I didn't quite understand after reading the previous comments.

gaocegege · 2021-11-22T10:37:23Z

We want to support heterogeneous PS/worker.

For example, some workers use CPU to train while some others use GPU.

WEICHINLIN · 2022-01-08T07:05:40Z

This means that distributed training has both ps task and work task, so that cpu and gpu can act on training at the same time?

stale · 2022-04-16T19:57:15Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Windfarer · 2024-07-29T01:06:53Z

Any update about this issue? Heterogeneous workers have become increasingly important in the current training of LLMs.

gaocegege · 2024-07-30T01:47:48Z

I do not think it is in the roadmap.

andreyvelich · 2024-07-30T16:25:11Z

@gaocegege @Windfarer I think, we can implement this feature as part of V2 APIs: #2171.

Users will be able to create TrainingRuntime using different Job Template for every PS:

apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: tf-ps-diversity
spec:
  numNodes: 5
  replicatedJobs:
    - name: PS-GPU-V100
       template:
       ....
    - name: PS-GPU-H100
       template:
       ....
    - name: Worker
       template:
       ...

gaocegege added the kind/feature label Aug 17, 2021

stale bot added the lifecycle/stale label Apr 16, 2022

stale bot closed this as completed Apr 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feature] Support different PS/worker types #1369

[feature] Support different PS/worker types #1369

gaocegege commented Aug 17, 2021

zw0610 commented Aug 17, 2021

gaocegege commented Aug 17, 2021

johnugeorge commented Aug 17, 2021

zw0610 commented Aug 17, 2021 •

edited

Loading

gaocegege commented Aug 17, 2021

gaocegege commented Aug 17, 2021 •

edited

Loading

johnugeorge commented Aug 17, 2021

terrytangyuan commented Aug 17, 2021

zw0610 commented Aug 23, 2021

WEICHINLIN commented Nov 22, 2021

zw0610 commented Nov 22, 2021

gaocegege commented Nov 22, 2021 •

edited

Loading

WEICHINLIN commented Nov 22, 2021

gaocegege commented Nov 22, 2021

WEICHINLIN commented Jan 8, 2022

stale bot commented Apr 16, 2022

Windfarer commented Jul 29, 2024

gaocegege commented Jul 30, 2024

andreyvelich commented Jul 30, 2024

[feature] Support different PS/worker types #1369

[feature] Support different PS/worker types #1369

Comments

gaocegege commented Aug 17, 2021

zw0610 commented Aug 17, 2021

gaocegege commented Aug 17, 2021

johnugeorge commented Aug 17, 2021

zw0610 commented Aug 17, 2021 • edited Loading

gaocegege commented Aug 17, 2021

gaocegege commented Aug 17, 2021 • edited Loading

johnugeorge commented Aug 17, 2021

terrytangyuan commented Aug 17, 2021

zw0610 commented Aug 23, 2021

WEICHINLIN commented Nov 22, 2021

zw0610 commented Nov 22, 2021

gaocegege commented Nov 22, 2021 • edited Loading

WEICHINLIN commented Nov 22, 2021

gaocegege commented Nov 22, 2021

WEICHINLIN commented Jan 8, 2022

stale bot commented Apr 16, 2022

Windfarer commented Jul 29, 2024

gaocegege commented Jul 30, 2024

andreyvelich commented Jul 30, 2024

zw0610 commented Aug 17, 2021 •

edited

Loading

gaocegege commented Aug 17, 2021 •

edited

Loading

gaocegege commented Nov 22, 2021 •

edited

Loading