
Support gang scheduling with Apache YuniKorn #2396

Merged (15 commits, Sep 27, 2024)
Conversation

@yangwwei (Contributor) commented Sep 21, 2024

Add Gang Scheduling support with Apache YuniKorn.

This follows the doc from: https://yunikorn.apache.org/docs/next/user_guide/gang_scheduling/.

What's implemented in this PR

When YuniKorn is enabled and the RayCluster CR has the label "ray.io/gang-scheduling-enabled" defined, gang scheduling is enabled for that RayCluster. The operator will attach the task groups info to each pod via the yunikorn.apache.org/task-groups annotation:

[
  {
    "name": "headgroup",
    "minMember": 1,
    "minResource": {
      "cpu": "100m",
      "memory": "50M"
    },
    "nodeSelector": {},
    "tolerations": [],
    "affinity": {},
    "topologySpreadConstraints": []
  },
  {
    "name": "worker-group",
    "minMember": 2,
    "minResource": {
      "cpu": "100m",
      "memory": "50M"
    },
    "nodeSelector": {},
    "tolerations": [],
    "affinity": {},
    "topologySpreadConstraints": []
  }
]

and then the head pod in the headgroup will be assigned the label:

yunikorn.apache.org/task-group-name=headgroup

and the worker pods in the worker-group will be assigned the label:

yunikorn.apache.org/task-group-name=worker-group

Underneath, YuniKorn will be able to differentiate the worker group pods from the head group pods, and then count their minimum resources and number of members before scheduling them. This ensures YuniKorn can reserve slots for the minimum number of pods before actually scheduling them.

To use this feature

Prerequisite:

  1. Start the Ray operator with "--batch-scheduler=yunikorn"
  2. Install YuniKorn: https://yunikorn.apache.org/docs/#install

Steps

  1. Set the following labels in the Ray cluster object
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    ray.io/gang-scheduling-enabled: "true"
    yunikorn.apache.org/app-id: my-ray-cluster-0001
    yunikorn.apache.org/queue: root.default
    ...
  2. Create the Ray cluster on Kubernetes
  3. The Ray cluster will be scheduled in the root.default queue with gang scheduling enforced.

I will open a doc PR to add a tutorial on how to run this with Apache YuniKorn. This PR includes the implementation and unit tests. The logic is very simple: when the scheduler finds that gang scheduling is enabled, it populates the taskGroups info into the pod annotations. The task group format is:

[ 
  {
    "name":"headgroup",
    "minMember":1,
    "minResource": {"cpu":"1"},
    ...
  }, 
 { 
  "name":"workergroup",
  "minMember":1,
  "minResource":{"cpu":"1"},
  ...
 },
]

The head pod belongs to the group "headgroup", and each worker group has its own group.

@yangwwei (Contributor, Author)

@kevin85421 pls help to review, thanks!

Comment on lines 63 to 64
y.populatePodLabels(app, pod, RayClusterApplicationIDLabelName, YuniKornPodApplicationIDLabelName)
y.populatePodLabels(app, pod, RayClusterQueueLabelName, YuniKornPodQueueLabelName)
Member

I don't know why the label names need to be translated from RayClusterApplicationIDLabelName to YuniKornPodApplicationIDLabelName and from RayClusterQueueLabelName to YuniKornPodQueueLabelName.

Does Yunikorn read the labels on resources other than Pods? If not, why not set YuniKornPodApplicationIDLabelName to yunikorn.apache.org/app-id and YuniKornPodQueueLabelName to yunikorn.apache.org/queue and set them directly on RayCluster, and then apply the same labels to all the Pods that controlled by RayCluster?

Contributor Author

the caveat is, YuniKorn depends on labels to identify apps, but it is currently using a very simple format:

  • YuniKornPodApplicationIDLabelName: applicationId
  • YuniKornPodQueueLabelName: queue

but in Ray, I think it makes sense to add the domain name to the label, to indicate these labels are set for the scheduler.

  • yunikorn.apache.org/application-id
  • yunikorn.apache.org/queue-name

the code logic is very simple: just translate these labels and set them on pods, without introducing any inconsistency in the Ray cluster spec. Hope that makes sense.
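The translation described here can be sketched as below. The constant values reflect the outcome later in this thread (Ray-side names match YuniKorn's domain-scoped labels, translated to the legacy pod label names for compatibility); the exact constant and function names are assumptions, not the real KubeRay identifiers.

```go
package main

import "fmt"

// Ray-side (domain-scoped) label names on the RayCluster CR, and the
// legacy YuniKorn pod label names they are translated to. Names here
// are illustrative approximations of the constants discussed in this PR.
const (
	RayClusterApplicationIDLabelName  = "yunikorn.apache.org/app-id"
	RayClusterQueueLabelName          = "yunikorn.apache.org/queue"
	YuniKornPodApplicationIDLabelName = "applicationId"
	YuniKornPodQueueLabelName         = "queue"
)

// populatePodLabels copies a label value from the RayCluster's labels
// onto the pod under the YuniKorn-recognized key.
func populatePodLabels(clusterLabels, podLabels map[string]string, rayKey, yunikornKey string) {
	podLabels[yunikornKey] = clusterLabels[rayKey]
}

func main() {
	cluster := map[string]string{
		RayClusterApplicationIDLabelName: "my-ray-cluster-0001",
		RayClusterQueueLabelName:         "root.default",
	}
	pod := map[string]string{}
	populatePodLabels(cluster, pod, RayClusterApplicationIDLabelName, YuniKornPodApplicationIDLabelName)
	populatePodLabels(cluster, pod, RayClusterQueueLabelName, YuniKornPodQueueLabelName)
	fmt.Println(pod["applicationId"], pod["queue"])
}
```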

@MortalHappiness (Member) commented Sep 26, 2024

But looking at the doc here:

https://yunikorn.apache.org/docs/next/user_guide/labels_and_annotations_in_yunikorn

there are two labels, yunikorn.apache.org/app-id and yunikorn.apache.org/queue, and the applicationId and queue labels are now legacy.

Contributor Author

Oh, good catch. It seems the community has moved to a domain-scoped label approach recently. Let me take a look and run some tests; I will update the PR once I confirm these labels work. Thanks!

Contributor Author

hi @MortalHappiness I just found that these labels are only supported in v1.6, which was released just a few weeks ago. For better compatibility with older versions of YuniKorn deployments, I think it's better to continue to support the legacy label names. However, I have modified RayClusterApplicationIDLabelName and RayClusterQueueLabelName in the package to be the same as YuniKorn upstream. We can remove this translation layer and directly populate the labels to pods after a few releases. Does this sound good?

Member

SGTM. Thanks!

Member

How about adding a TODO comment here to let us know we need to remove this in a future release?

Contributor Author

Sure, updated.

Member

@yangwwei Would you mind briefly adding comments for the context you mentioned in https://github.com/ray-project/kuberay/pull/2396/files#r1777710831?

Contributor Author

Sure

}
pod.Annotations[YuniKornTaskGroupsAnnotationName] = taskGroupsAnnotationValue

y.log.Info("Gang Scheduling enabled for RayCluster",
Member

I am not familiar with Yunikorn. Can we infer that gang scheduling is enabled if the annotation yunikorn.apache.org/task-groups is set? If not, should we move this log to:

if y.isGangSchedulingEnabled(app) {
 ...
}

instead?

Contributor Author

I may not fully understand this. Basically, "yunikorn.apache.org/task-groups" only needs to be set on Ray pods by the operator; it is internal metadata. We only use one flag, "ray.io/gang-scheduling-enabled", in Ray to tell whether we want to enable gang scheduling for this Ray cluster. Only when it is enabled does the operator populate the task groups info to the pods; then YuniKorn will honor it and apply gang scheduling.
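The enablement check described here could look like the sketch below. The label name is from this PR's description; the receiver-free helper and the app type are simplifications of the actual KubeRay code (which calls y.isGangSchedulingEnabled(app)).

```go
package main

import "fmt"

// RayClusterGangSchedulingLabelName is the label the operator checks
// (from the PR description).
const RayClusterGangSchedulingLabelName = "ray.io/gang-scheduling-enabled"

// app stands in for the RayCluster CR metadata (hypothetical type).
type app struct {
	Labels map[string]string
}

// isGangSchedulingEnabled reports whether the RayCluster carries the
// gang-scheduling label, mirroring the check discussed in this thread.
func isGangSchedulingEnabled(a *app) bool {
	_, ok := a.Labels[RayClusterGangSchedulingLabelName]
	return ok
}

func main() {
	enabled := &app{Labels: map[string]string{RayClusterGangSchedulingLabelName: "true"}}
	disabled := &app{Labels: map[string]string{}}
	fmt.Println(isGangSchedulingEnabled(enabled), isGangSchedulingEnabled(disabled))
}
```

Only when this returns true does the operator go on to attach the task-groups annotation to the pods.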

Member

I find it odd that we print the log saying "Gang Scheduling enabled for RayCluster" in populateTaskGroupsAnnotationToPod, a function that may also be used by other scheduling algorithms.

That's why I’m asking if we should move the log to:

if y.isGangSchedulingEnabled(app) {
 ...
}

instead?

Contributor Author

I see what you mean. Basically, populateTaskGroupsAnnotationToPod will ONLY be used for the gang scheduling case, so it is safe to print the log inside the function. Also, this function doesn't return any error; it only logs whether or not gang scheduling is enabled (by checking whether the required task groups were added to the pod).

Member

Got it. Thanks!

@kevin85421 (Member) left a comment

Summarize this PR:

  1. Every head Pod or worker Pod has the same applicationId and queue label values if they belong to the same RayCluster CR.

  2. If gang scheduling is enabled, every head Pod or worker Pod has the same value of the task-groups annotation if they belong to the same RayCluster CR. Pods belonging to different Pod groups (head group + worker groups) have different values of the task-group annotation.

Is this correct? In addition, what's the behavior if gang scheduling is not enabled?

@MortalHappiness (Member) commented Sep 27, 2024

I manually ran it and it seems to be working.

Config:

  queues.yaml: |
    partitions:
      - name: default
        queues:
          - name: root
            queues:
              - name: test
                submitacl: "*"
                parent: false
                resources:
                  guaranteed:
                    memory: 6G
                    vcore: 4
                  max:
                    memory: 6G
                    vcore: 4

RayCluster:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: test-yunikorn-0
  labels:
    ray.io/gang-scheduling-enabled: "true"
    yunikorn.apache.org/app-id: test-yunikorn-0
    yunikorn.apache.org/queue: root.test
spec:
  rayVersion: "2.34.0"
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.34.0
            resources:
              limits:
                cpu: "1"
                memory: "2Gi"
              requests:
                cpu: "1"
                memory: "2Gi"
  workerGroupSpecs:
    - groupName: worker
      rayStartParams: {}
      replicas: 2
      minReplicas: 2
      maxReplicas: 2
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.34.0
              resources:
                limits:
                  cpu: "1"
                  memory: "1Gi"
                requests:
                  cpu: "1"
                  memory: "1Gi"

Apply the RayCluster YAML above, then replace test-yunikorn-0 with test-yunikorn-1 and apply it again.


@yangwwei (Contributor, Author)

Summarize this PR:

  1. Every head Pod or worker Pod has the same applicationId and queue label values if they belong to the same RayCluster CR.
  2. If gang scheduling is enabled, every head Pod or worker Pod has the same value of the task-groups annotation if they belong to the same RayCluster CR. Pods belonging to different Pod groups (head group + worker groups) have different values of the task-group annotation.

Is this correct? In addition, what's the behavior if gang scheduling is not enabled?

This is correct. Let me update the description of this PR too.
If gang scheduling is not enabled (i.e. ray.io/gang-scheduling-enabled is not set), the task groups info will not be added to the Ray pods, and the Ray cluster will still be scheduled like a normal application: pods get scheduled as long as there are resources in the queue and the cluster.

@MortalHappiness (Member) left a comment

LGTM. I think this PR can be merged once @kevin85421's comments are resolved.

cc @andrewsykim

@kevin85421 (Member) left a comment

the last comment: #2396 (comment). Others LGTM

4 participants