[Refactor] Support adding custom accelerator to resources in rayStartParams #2425

Merged

Conversation

mounchin
Contributor

@mounchin mounchin commented Oct 3, 2024

Why are these changes needed?

Related issue number

ray-project/ray#44361

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@mounchin mounchin marked this pull request as ready for review October 3, 2024 19:24
if resourcesMap != nil && !isCustomAcceleratorResourceAdded {
    if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() {
        if err := addCustomAcceleratorToResourcesIfNotExists(rayStartParams, resourcesMap, rayResourceName, resourceValue.Value()); err != nil {
            log.Error(err, fmt.Sprintf("failed to add %s to resources", rayResourceName))
Collaborator

Can we return an error instead and surface the log in generateRayStartCommand? This way you don't need to pass the logger

Contributor Author

Returning an error without going through all the entries of resourceLimits might cause us to miss parsing num-gpus if it appears later in the entries.

Member

If there is an error, I think we should return it and propagate it to Reconcile to surface the issue immediately. If we continue with the faulty CR spec, the behavior will be undefined.

Contributor Author

Changed the behavior to panic on any error while marshalling/unmarshalling the resources string.

Collaborator

I don't think we should panic; that will crash the entire process and impact other resources that don't have issues serializing custom accelerators.

Contributor Author

@kevin85421 what do you think?

@andrewsykim as far as I have tested, if you specify a bad resources string in the YAML, it does not get applied to the cluster and fails with an error saying it's not a valid object.

Collaborator

What's the issue with returning an error here instead of panicking though?

Contributor Author

@andrewsykim return the error where? Do you mean just return the error and log it, instead of panicking?

Collaborator

I see where the confusion is. Typically we follow the pattern of returning and propagating errors to Reconcile to ensure errors are requeued by controller-runtime. However, I see that returning an error here would require a much larger refactor, since generateRayStartParam and BuildPod both do not return an error.

I suggest for now we go back to logging the error instead of panicking.
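
For context, the controller-runtime pattern being described looks roughly like this (a minimal sketch, not KubeRay's actual Reconcile; the buildPodForRequest helper is hypothetical):

    package main // sketch only

    import (
        "context"

        ctrl "sigs.k8s.io/controller-runtime"
    )

    type RayClusterReconciler struct{} // stand-in type for this sketch

    // Reconcile propagates errors so controller-runtime requeues the request.
    func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        if err := buildPodForRequest(ctx, req); err != nil { // hypothetical helper
            // A non-nil error surfaces the issue immediately and requeues.
            return ctrl.Result{}, err
        }
        return ctrl.Result{}, nil
    }

    func buildPodForRequest(ctx context.Context, req ctrl.Request) error { return nil }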

Contributor Author

Changed it back to logging the error instead of panicking.

resourcesMap, err := getResourcesMap(rayStartParams)
if err != nil {
    return err
}

func addWellKnownAcceleratorResources(log logr.Logger, rayStartParams map[string]string, resourceLimits corev1.ResourceList) {
Member

Could you pass ctx instead of logr.Logger and retrieve the logger from ctx?
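
For reference, the ctx-based logger pattern from controller-runtime looks roughly like this (a minimal sketch; the function signature is trimmed for illustration):

    import (
        "context"

        ctrl "sigs.k8s.io/controller-runtime"
    )

    func addWellKnownAcceleratorResources(ctx context.Context, rayStartParams map[string]string) {
        // Retrieve the logger that controller-runtime stored in ctx,
        // instead of threading a logr.Logger parameter through every call.
        log := ctrl.LoggerFrom(ctx)
        log.Info("processing rayStartParams", "numParams", len(rayStartParams))
    }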

Contributor Author

Removed the logger. It was initially added because the code was not throwing an error, so it would help with debugging.

if resourcesMap != nil && !isCustomAcceleratorResourceAdded {
    if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() {
        if err := addCustomAcceleratorToResourcesIfNotExists(rayStartParams, resourcesMap, rayResourceName, resourceValue.Value()); err != nil {
            log.Error(err, fmt.Sprintf("failed to add %s to resources", rayResourceName))
Member

Please avoid using fmt.Sprintf in log functions. You can check here for the reason.

Contributor Author

Changed
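
The change being referenced replaces the formatted message with logr's structured key/value form, roughly (a sketch):

    // Before: the message text varies per call, which hurts log
    // aggregation and builds the string even if the line is filtered.
    log.Error(err, fmt.Sprintf("failed to add %s to resources", rayResourceName))

    // After: a constant message plus structured key/value pairs.
    log.Error(err, "failed to add resource to the resources map",
        "rayResourceName", rayResourceName)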

if err != nil {
    return err
}

func addWellKnownAcceleratorResources(log logr.Logger, rayStartParams map[string]string, resourceLimits corev1.ResourceList) {
    resourcesMap, _ := getResourcesMap(rayStartParams)
Member

Do we need to check for the error returned by getResourcesMap?

Contributor Author

Changed the code to check and panic

@@ -9,6 +9,8 @@ import (
    "strconv"
    "strings"

    "github.com/go-logr/logr"
Member

Remove the import and retrieve the logger from ctx instead.

}

// Add the first encountered custom accelerator resource from the resource limits to the rayStartParams if not already present
if resourcesMap != nil && !isCustomAcceleratorResourceAdded {
Member

I think getResourcesMap should promise not to return nil if err is nil. In that case, we don't need to check resourcesMap != nil here.

Contributor Author

Yes, that's right. getResourcesMap returns a non-nil value when err is nil, and returns value = nil in case the error is not nil.
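
A minimal sketch of that contract, assuming the resources are stored as a single-quoted JSON object in rayStartParams (the actual KubeRay implementation may differ):

    import (
        "encoding/json"
        "strings"
    )

    // getResourcesMap parses the "resources" entry of rayStartParams.
    // Contract: the returned map is non-nil exactly when err is nil,
    // so callers only need to check err.
    func getResourcesMap(rayStartParams map[string]string) (map[string]float64, error) {
        resources := map[string]float64{}
        raw, ok := rayStartParams["resources"]
        if !ok {
            return resources, nil // no resources specified yet
        }
        // Strip the single quotes that ray start expects around the JSON object.
        raw = strings.Trim(raw, "'")
        if err := json.Unmarshal([]byte(raw), &resources); err != nil {
            return nil, err
        }
        return resources, nil
    }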


// Add the first encountered custom accelerator resource from the resource limits to the rayStartParams if not already present
if resourcesMap != nil && !isCustomAcceleratorResourceAdded {
    if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() {
        if err := addCustomAcceleratorToResourcesIfNotExists(rayStartParams, resourcesMap, rayResourceName, resourceValue.Value()); err != nil {
Member

Can you avoid updating rayStartParams and resourcesMap in addCustomAcceleratorToResourcesIfNotExists, especially since this loop also checks resourcesMap? Maybe addCustomAcceleratorToResourcesIfNotExists is not necessary.

Contributor Author

Removed the function entirely and inlined its code into the calling function.

@kevin85421 kevin85421 self-assigned this Oct 4, 2024
@kevin85421 kevin85421 changed the title [Refactor] Support adding custom accelerator to resources in rayStart… [Refactor] Support adding custom accelerator to resources in rayStartParams Oct 4, 2024
Member

@kevin85421 kevin85421 left a comment

Overall LGTM! Left some nit comments.

if !isCustomAcceleratorResourceAdded {
    if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() {
        if _, exists := resourcesMap[rayResourceName]; !exists {
            resourcesMap[rayResourceName] = float64(resourceValue.Value())
Member

Maybe use resourceValue.AsApproximateFloat64 instead? I guess the precision loss will be less than converting to int64 first and then to float64.
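
For reference, the difference between the two conversions looks like this (a small sketch using resource.Quantity from k8s.io/apimachinery):

    import "k8s.io/apimachinery/pkg/api/resource"

    func compareConversions() (float64, float64) {
        q := resource.MustParse("500m") // half a unit

        // Value() rounds the quantity up to the nearest int64 (here: 1),
        // so converting that to float64 drops the fractional part.
        viaInt := float64(q.Value()) // 1.0

        // AsApproximateFloat64 converts directly to float64,
        // preserving fractional values.
        viaFloat := q.AsApproximateFloat64() // 0.5

        return viaInt, viaFloat
    }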

Contributor Author

Sure, ok to open another PR for this?

    return fmt.Errorf("failed to marshal resources map to string: %w", err)
}

rayStartParams["resources"] = fmt.Sprintf("'%s'", updatedResourcesStr)
Member

why change from string(updatedResourcesStr) to fmt.Sprintf("'%s'", updatedResourcesStr)?

Contributor Author

The ray start cmd expects the resources string to be passed in the format below (with quotes around the string):

ray start --head --num-cpus=3 --num-gpus=4 --resources='{"special_hardware": 1, "custom_label": 1}'

Ref: https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#specifying-node-resources

@mounchin
Contributor Author

mounchin commented Oct 8, 2024

@kevin85421 can you merge the PR too?

@kevin85421 kevin85421 merged commit bf21d2d into ray-project:master Oct 8, 2024
27 checks passed
@mounchin mounchin deleted the feature/support-ray-resources-update branch October 8, 2024 20:02
@andrewsykim
Collaborator

@mounchin I saw a test failure that seems related to this PR:

    --- FAIL: TestGenerateRayStartCommand/HeadNode_with_multiple_custom_accelerators (0.00s)
        pod_test.go:1337: 
            	Error Trace:	/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/common/pod_test.go:1337
            	Error:      	Not equal: 
            	            	expected: "ray start --head  --num-gpus=1  --resources='{\"tpu\":8}' "
            	            	actual  : "ray start --head  --num-gpus=1  --resources='{\"neuron_cores\":4}' "
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1 +1 @@
            	            	-ray start --head  --num-gpus=1  --resources='{"tpu":8}' 
            	            	+ray start --head  --num-gpus=1  --resources='{"neuron_cores":4}' 

Can you take a look please?

@mounchin
Contributor Author

mounchin commented Oct 16, 2024

@mounchin I saw a test failure that seems related to this PR:

    --- FAIL: TestGenerateRayStartCommand/HeadNode_with_multiple_custom_accelerators (0.00s)
        pod_test.go:1337: 
            	Error Trace:	/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/common/pod_test.go:1337
            	Error:      	Not equal: 
            	            	expected: "ray start --head  --num-gpus=1  --resources='{\"tpu\":8}' "
            	            	actual  : "ray start --head  --num-gpus=1  --resources='{\"neuron_cores\":4}' "
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1 +1 @@
            	            	-ray start --head  --num-gpus=1  --resources='{"tpu":8}' 
            	            	+ray start --head  --num-gpus=1  --resources='{"neuron_cores":4}' 

Can you take a look please?

  • This is because the code loops over resourceLimits, which is a map, so it just picks the first custom accelerator resource it encounters and adds it to the ray start command.

  • With that said, there are three options for us:

    • Have consistent behavior when generating this command with multiple custom accelerators, by sorting the resource keys before looping (see the sketch below)
    • Add two expected outputs in the test and assert that the result matches either one of them
    • Remove the test, as it's not possible to have two custom accelerator types in one instance.

    Let me know what you think.
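
The first option would look roughly like this (a sketch; Go randomizes map iteration order, so sorting the keys makes the chosen accelerator deterministic):

    import (
        "sort"

        corev1 "k8s.io/api/core/v1"
    )

    // sortedResourceNames returns the keys of resourceLimits in a stable order.
    func sortedResourceNames(resourceLimits corev1.ResourceList) []corev1.ResourceName {
        keys := make([]corev1.ResourceName, 0, len(resourceLimits))
        for name := range resourceLimits {
            keys = append(keys, name)
        }
        // Sort so the loop always visits entries in the same order.
        sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
        return keys
    }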

@mounchin
Contributor Author

@andrewsykim let me know your thoughts on ^^

@andrewsykim
Collaborator

I think a common pattern is to sort the actual output before comparing to the expected. Can we do that for the generated Ray start command?

@mounchin
Contributor Author

But that won't work here, because the ray start command is generated based on whichever resourceLimits entry the loop encounters first.

@andrewsykim
Collaborator

Have a consistent behavior on generating this command on multiple custom accelerators, by sorting the resourceKeys before looping

This seems like the most viable option then

@mounchin
Contributor Author

Have a consistent behavior on generating this command on multiple custom accelerators, by sorting the resourceKeys before looping

This seems like the most viable option then

@andrewsykim please review this PR #2464
