Support gang scheduling with Apache YuniKorn #2396
Conversation
@kevin85421 pls help to review, thanks!
ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_scheduler.go
ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_scheduler_test.go
```go
y.populatePodLabels(app, pod, RayClusterApplicationIDLabelName, YuniKornPodApplicationIDLabelName)
y.populatePodLabels(app, pod, RayClusterQueueLabelName, YuniKornPodQueueLabelName)
```
I don't know why the label names need to be translated from `RayClusterApplicationIDLabelName` to `YuniKornPodApplicationIDLabelName` and from `RayClusterQueueLabelName` to `YuniKornPodQueueLabelName`.

Does YuniKorn read the labels on resources other than Pods? If not, why not set `YuniKornPodApplicationIDLabelName` to `yunikorn.apache.org/app-id` and `YuniKornPodQueueLabelName` to `yunikorn.apache.org/queue`, set them directly on the RayCluster, and then apply the same labels to all the Pods controlled by the RayCluster?
The caveat is that YuniKorn depends on labels to identify apps, but it currently uses a very simple format:

- `YuniKornPodApplicationIDLabelName`: `applicationId`
- `YuniKornPodQueueLabelName`: `queue`

In Ray, though, I think it makes sense to add the domain name to the labels, to indicate that these labels are set for the scheduler:

- yunikorn.apache.org/application-id
- yunikorn.apache.org/queue-name

The code logic is very simple: just translate these labels and set them on the pods, without introducing any inconsistency in the Ray cluster spec. Hope that makes sense.
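A rough sketch of the translation described here (constant and helper names are assumptions for illustration, not the exact KubeRay source): the operator simply copies the value from the cluster-level, domain-scoped key to the legacy pod-level key that the scheduler reads.

```go
package main

import "fmt"

// Assumed names for illustration; the real constants live in the
// KubeRay yunikorn batchscheduler package.
const (
	RayClusterApplicationIDLabelName  = "yunikorn.apache.org/app-id"
	RayClusterQueueLabelName          = "yunikorn.apache.org/queue"
	YuniKornPodApplicationIDLabelName = "applicationId" // legacy name still read by older YuniKorn
	YuniKornPodQueueLabelName         = "queue"         // legacy name still read by older YuniKorn
)

// populatePodLabels copies sourceKey from the RayCluster's labels to
// targetKey on the pod, so the scheduler sees the name it expects.
func populatePodLabels(clusterLabels, podLabels map[string]string, sourceKey, targetKey string) {
	if v, ok := clusterLabels[sourceKey]; ok {
		podLabels[targetKey] = v
	}
}

func main() {
	cluster := map[string]string{
		RayClusterApplicationIDLabelName: "test-yunikorn-0",
		RayClusterQueueLabelName:         "root.test",
	}
	pod := map[string]string{}
	populatePodLabels(cluster, pod, RayClusterApplicationIDLabelName, YuniKornPodApplicationIDLabelName)
	populatePodLabels(cluster, pod, RayClusterQueueLabelName, YuniKornPodQueueLabelName)
	fmt.Println(pod[YuniKornPodApplicationIDLabelName], pod[YuniKornPodQueueLabelName])
}
```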
But looking at the doc here:
https://yunikorn.apache.org/docs/next/user_guide/labels_and_annotations_in_yunikorn
there are two labels, `yunikorn.apache.org/app-id` and `yunikorn.apache.org/queue`, and the `applicationId` and `queue` labels are legacy now.
Oh, good catch. It seems the community has moved to a domain-scoped label approach recently. Let me take a look and do some tests; I will update the PR once I confirm these labels work. Thanks!
hi @MortalHappiness, I just found that these labels are only supported in v1.6, which was released just a few weeks ago. For better compatibility with older versions of YuniKorn deployments, I think it's better to continue to support the legacy label names. However, I have modified `RayClusterApplicationIDLabelName` and `RayClusterQueueLabelName` in the package to be the same as YuniKorn upstream. We can remove this translation layer and directly populate the labels to pods after a few releases. Does this sound good?
SGTM. Thanks!
How about adding a TODO comment here so we know we need to remove this in a future release?
Sure, updated.
@yangwwei Would you mind briefly adding comments for the context you mentioned in https://github.com/ray-project/kuberay/pull/2396/files#r1777710831?
Sure
```go
y.populatePodLabels(app, pod, RayClusterApplicationIDLabelName, YuniKornPodApplicationIDLabelName)
y.populatePodLabels(app, pod, RayClusterQueueLabelName, YuniKornPodQueueLabelName)
```
@yangwwei Would you mind briefly adding comments for the context you mentioned in https://github.com/ray-project/kuberay/pull/2396/files#r1777710831?
```go
}
pod.Annotations[YuniKornTaskGroupsAnnotationName] = taskGroupsAnnotationValue

y.log.Info("Gang Scheduling enabled for RayCluster",
```
I am not familiar with YuniKorn. Can we infer that gang scheduling is enabled if the annotation `yunikorn.apache.org/task-groups` is set? If not, should we move this log into:

```go
if y.isGangSchedulingEnabled(app) {
    ...
}
```

instead?
I may not fully understand this. Basically, "yunikorn.apache.org/task-groups" only needs to be set on Ray pods by the operator; it is internal metadata. We only use one flag, "ray.io/gang-scheduling-enabled", in Ray to tell whether we want to enable gang scheduling for this Ray cluster. Only when it is enabled does the operator populate the task groups info to the pods; then YuniKorn will honor it and apply gang scheduling.
I found it odd that we print the log saying "Gang Scheduling enabled for RayCluster" in `populateTaskGroupsAnnotationToPod`, a function that may also be used by other scheduling algorithms. That's why I'm asking if we should move the log into:

```go
if y.isGangSchedulingEnabled(app) {
    ...
}
```

instead?
I see what you mean. Basically, `populateTaskGroupsAnnotationToPod` will ONLY be used for the gang scheduling case, so it is safe to print the log inside the function. Also, this function doesn't return any error; it only logs whether or not gang scheduling is enabled (by checking if the required task groups were added to the pod).
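The control flow described in this thread can be sketched as follows (function and label names are assumptions based on the discussion, not the exact KubeRay source): the `ray.io/gang-scheduling-enabled` label on the RayCluster is the only switch, and the task-groups metadata is populated only when it is present.

```go
package main

import "fmt"

// Assumed label name, taken from this review thread.
const RayGangSchedulingEnabledLabelName = "ray.io/gang-scheduling-enabled"

// isGangSchedulingEnabled reports whether the RayCluster opted in to gang
// scheduling; presence of the label is the switch discussed above.
func isGangSchedulingEnabled(clusterLabels map[string]string) bool {
	_, ok := clusterLabels[RayGangSchedulingEnabledLabelName]
	return ok
}

func main() {
	labels := map[string]string{RayGangSchedulingEnabledLabelName: "true"}
	if isGangSchedulingEnabled(labels) {
		// Only in this branch would the operator attach the
		// yunikorn.apache.org/task-groups metadata to pods and log
		// "Gang Scheduling enabled for RayCluster".
		fmt.Println("gang scheduling enabled")
	}
}
```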
Got it. Thanks!
ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_scheduler.go
ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_task_groups.go
Summarize this PR:

- Every head Pod or worker Pod has the same `applicationId` and `queue` label values if they belong to the same RayCluster CR.
- If gang scheduling is enabled, every head Pod or worker Pod has the same value of the `task-groups` annotation if they belong to the same RayCluster CR. Pods belonging to different Pod groups (head group + worker groups) have different values of the `task-group` annotation.

Is this correct? In addition, what's the behavior if gang scheduling is not enabled?
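If that summary is right, a worker Pod created from such a RayCluster would carry metadata roughly like the following sketch (values borrowed from the test config in this thread; the `task-group-name` key follows the YuniKorn gang scheduling docs and is an assumption about this PR's output, not confirmed here):

```yaml
metadata:
  labels:
    applicationId: test-yunikorn-0   # same for every pod of the CR
    queue: root.test                 # same for every pod of the CR
  annotations:
    yunikorn.apache.org/task-groups: '[...]'      # same list on every pod
    yunikorn.apache.org/task-group-name: worker   # differs per pod group
```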
I manually ran it and it seems to be working.

Config:

```yaml
queues.yaml: |
  partitions:
    - name: default
      queues:
        - name: root
          queues:
            - name: test
              submitacl: "*"
              parent: false
              resources:
                guaranteed:
                  memory: 6G
                  vcore: 4
                max:
                  memory: 6G
                  vcore: 4
```

RayCluster:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: test-yunikorn-0
  labels:
    ray.io/gang-scheduling-enabled: "true"
    yunikorn.apache.org/app-id: test-yunikorn-0
    yunikorn.apache.org/queue: root.test
spec:
  rayVersion: "2.34.0"
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.34.0
            resources:
              limits:
                cpu: "1"
                memory: "2Gi"
              requests:
                cpu: "1"
                memory: "2Gi"
  workerGroupSpecs:
    - groupName: worker
      rayStartParams: {}
      replicas: 2
      minReplicas: 2
      maxReplicas: 2
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.34.0
              resources:
                limits:
                  cpu: "1"
                  memory: "1Gi"
                requests:
                  cpu: "1"
                  memory: "1Gi"
```

Apply the RayCluster yaml above, and then replace …
This is correct. Let me update the description of this PR too.
LGTM. I think this PR can be merged once @kevin85421's comments are resolved.
cc @andrewsykim
The last comment: #2396 (comment). Others LGTM.
Add Gang Scheduling support with Apache YuniKorn.
This follows the doc from: https://yunikorn.apache.org/docs/next/user_guide/gang_scheduling/.
What's implemented in this PR
When YuniKorn is enabled, and the RayCluster CR has the label "ray.io/gang-scheduling-enabled" defined, gang scheduling is enabled for this RayCluster. The operator will add the following labels to pods; these are the task groups info attached to each pod:

`yunikorn.apache.org/task-groups`:

and then the head pod in the `headgroup` will be assigned one label:

and the worker pod in the `worker-group` will be assigned one more label:

Underneath, YuniKorn will be able to differentiate the worker group pods from the head group pods, and then count their minimal resources and number of members before scheduling them. This is to ensure YuniKorn can reserve slots for the minimal number of pods before actually scheduling them.
To use this feature

Prerequisite:

Steps

`root.default` queue with gang scheduling enforced.

I will put up a doc PR to add a tutorial on how to run this with Apache YuniKorn. This PR includes the implementation and UT. The logic is very simple: when the scheduler finds that gang scheduling is enabled, it populates the `taskGroups` info to the pod annotations. The task group format is:

and the head pod will belong to the group "headgroup", while each worker group will have its own group.