[Refactor] Support adding custom accelerator to resources in rayStartParams #2425

Merged

Conversation

mounchin
Contributor

@mounchin mounchin commented Oct 3, 2024

Why are these changes needed?

Related issue number

ray-project/ray#44361

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@mounchin mounchin marked this pull request as ready for review October 3, 2024 19:24
if resourcesMap != nil && !isCustomAcceleratorResourceAdded {
    if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() {
        if err := addCustomAcceleratorToResourcesIfNotExists(rayStartParams, resourcesMap, rayResourceName, resourceValue.Value()); err != nil {
            log.Error(err, fmt.Sprintf("failed to add %s to resources", rayResourceName))
Collaborator

Can we return an error instead and surface the log in generateRayStartCommand? This way you don't need to pass the logger

Contributor Author

Returning an error without going through all the entries of resourceLimits might cause us to miss parsing num-gpus if it appears later in the entries.

Member

If there is an error, I think we should return it and propagate it to Reconcile to surface the issue immediately. If we continue with the faulty CR spec, the behavior will be undefined.

Contributor Author

Changed the behavior to panic on any error while marshalling/unmarshalling the resources string.

Collaborator

I don't think we should panic; that will crash the entire process and impact other resources that don't have issues serializing custom accelerators.

Contributor Author

@kevin85421 what do you think?

@andrewsykim as far as I have tested, if you specify a bad resources string in the YAML, it does not get applied to the cluster and fails with an error saying it's not a valid object.

Collaborator

What's the issue with returning an error here instead of panicking though?

Contributor Author

@andrewsykim return the error where? Do you mean just return the error and log it, instead of panicking?

Collaborator

I see where the confusion is. Typically we follow the pattern of returning and propagating errors to Reconcile to ensure errors are requeued by controller-runtime. However, I see that returning an error here would require a much larger refactor, since generateRayStartParam and BuildPod both do not return an error.

I suggest for now we go back to logging the error instead of panicking.
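
For context, the controller-runtime pattern being described looks roughly like this (a minimal sketch, not KubeRay's actual Reconcile; the buildPodForRequest helper is hypothetical):

    package main // sketch only

    import (
        "context"

        ctrl "sigs.k8s.io/controller-runtime"
    )

    type RayClusterReconciler struct{} // stand-in type for this sketch

    // Reconcile propagates errors so controller-runtime requeues the request.
    func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        if err := buildPodForRequest(ctx, req); err != nil { // hypothetical helper
            // A non-nil error surfaces the issue immediately and requeues.
            return ctrl.Result{}, err
        }
        return ctrl.Result{}, nil
    }

    func buildPodForRequest(ctx context.Context, req ctrl.Request) error { return nil }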

Contributor Author

Changed it back to logging the error instead of panicking.

resourcesMap, err := getResourcesMap(rayStartParams)
if err != nil {
    return err
}

func addWellKnownAcceleratorResources(log logr.Logger, rayStartParams map[string]string, resourceLimits corev1.ResourceList) {
Member

Could you pass ctx instead of logr.Logger and retrieve the logger from ctx?
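
For reference, the ctx-based logger pattern from controller-runtime looks roughly like this (a minimal sketch; the function signature is trimmed for illustration):

    import (
        "context"

        ctrl "sigs.k8s.io/controller-runtime"
    )

    func addWellKnownAcceleratorResources(ctx context.Context, rayStartParams map[string]string) {
        // Retrieve the logger that controller-runtime stored in ctx,
        // instead of threading a logr.Logger parameter through every call.
        log := ctrl.LoggerFrom(ctx)
        log.Info("processing rayStartParams", "numParams", len(rayStartParams))
    }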

Contributor Author

Removed the logger. It was initially added because the code was not throwing an error, so it would help with debugging.

if resourcesMap != nil && !isCustomAcceleratorResourceAdded {
    if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() {
        if err := addCustomAcceleratorToResourcesIfNotExists(rayStartParams, resourcesMap, rayResourceName, resourceValue.Value()); err != nil {
            log.Error(err, fmt.Sprintf("failed to add %s to resources", rayResourceName))
Member

Please avoid using fmt.Sprintf in log functions. You can check here for the reason.

Contributor Author

Changed
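
The change being referenced replaces the formatted message with logr's structured key/value form, roughly (a sketch):

    // Before: the message text varies per call, which hurts log
    // aggregation and builds the string even if the line is filtered.
    log.Error(err, fmt.Sprintf("failed to add %s to resources", rayResourceName))

    // After: a constant message plus structured key/value pairs.
    log.Error(err, "failed to add resource to the resources map",
        "rayResourceName", rayResourceName)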

if err != nil {
    return err
}

func addWellKnownAcceleratorResources(log logr.Logger, rayStartParams map[string]string, resourceLimits corev1.ResourceList) {
    resourcesMap, _ := getResourcesMap(rayStartParams)
Member

Do we need to check for the error returned by getResourcesMap?

Contributor Author

Changed the code to check and panic

@@ -9,6 +9,8 @@ import (
    "strconv"
    "strings"

    "github.com/go-logr/logr"
Member

Remove the import and retrieve the logger from ctx instead.

}

// Add the first encountered custom accelerator resource from the resource limits to the rayStartParams if not already present
if resourcesMap != nil && !isCustomAcceleratorResourceAdded {
Member

I think getResourcesMap should promise not to return nil if err is nil. In that case, we don't need to check resourcesMap != nil here.

Contributor Author

Yes, that's right. getResourcesMap returns a non-nil value when err is nil, and returns value = nil in case the error is not nil.
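
A minimal sketch of that contract, assuming the resources are stored as a single-quoted JSON object in rayStartParams (the actual KubeRay implementation may differ):

    import (
        "encoding/json"
        "strings"
    )

    // getResourcesMap parses the "resources" entry of rayStartParams.
    // Contract: the returned map is non-nil exactly when err is nil,
    // so callers only need to check err.
    func getResourcesMap(rayStartParams map[string]string) (map[string]float64, error) {
        resources := map[string]float64{}
        raw, ok := rayStartParams["resources"]
        if !ok {
            return resources, nil // no resources specified yet
        }
        // Strip the single quotes that ray start expects around the JSON object.
        raw = strings.Trim(raw, "'")
        if err := json.Unmarshal([]byte(raw), &resources); err != nil {
            return nil, err
        }
        return resources, nil
    }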


// Add the first encountered custom accelerator resource from the resource limits to the rayStartParams if not already present
if resourcesMap != nil && !isCustomAcceleratorResourceAdded {
    if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() {
        if err := addCustomAcceleratorToResourcesIfNotExists(rayStartParams, resourcesMap, rayResourceName, resourceValue.Value()); err != nil {
Member

Can you avoid updating rayStartParams and resourcesMap in addCustomAcceleratorToResourcesIfNotExists, especially since this loop also checks resourcesMap? Maybe addCustomAcceleratorToResourcesIfNotExists is not necessary.

Contributor Author

Removed the function entirely and inlined its code into the calling function.

@kevin85421 kevin85421 self-assigned this Oct 4, 2024
@kevin85421 kevin85421 changed the title [Refactor] Support adding custom accelerator to resources in rayStart… [Refactor] Support adding custom accelerator to resources in rayStartParams Oct 4, 2024
Member

@kevin85421 kevin85421 left a comment

Overall LGTM! Left some nit comments.

if !isCustomAcceleratorResourceAdded {
    if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() {
        if _, exists := resourcesMap[rayResourceName]; !exists {
            resourcesMap[rayResourceName] = float64(resourceValue.Value())
Member

Maybe use resourceValue.AsApproximateFloat64 instead? I guess the precision loss will be less than converting to int64 first and then to float64.
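
For reference, the difference between the two conversions looks like this (a small sketch using resource.Quantity from k8s.io/apimachinery):

    import "k8s.io/apimachinery/pkg/api/resource"

    func compareConversions() (float64, float64) {
        q := resource.MustParse("500m") // half a unit

        // Value() rounds the quantity up to the nearest int64 (here: 1),
        // so converting that to float64 drops the fractional part.
        viaInt := float64(q.Value()) // 1.0

        // AsApproximateFloat64 converts directly to float64,
        // preserving fractional values.
        viaFloat := q.AsApproximateFloat64() // 0.5

        return viaInt, viaFloat
    }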

Contributor Author

Sure, ok to open another PR for this?

    return fmt.Errorf("failed to marshal resources map to string: %w", err)
}

rayStartParams["resources"] = fmt.Sprintf("'%s'", updatedResourcesStr)
Member

why change from string(updatedResourcesStr) to fmt.Sprintf("'%s'", updatedResourcesStr)?

Contributor Author

The ray start cmd expects the resources string to be passed in the format below (with quotes around the string):

ray start --head --num-cpus=3 --num-gpus=4 --resources='{"special_hardware": 1, "custom_label": 1}'

Ref: https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#specifying-node-resources

@mounchin
Contributor Author

mounchin commented Oct 8, 2024

@kevin85421 can you merge the PR too?

@kevin85421 kevin85421 merged commit bf21d2d into ray-project:master Oct 8, 2024
27 checks passed
@mounchin mounchin deleted the feature/support-ray-resources-update branch October 8, 2024 20:02
@andrewsykim
Collaborator

@mounchin I saw a test failure that seems related to this PR:

    --- FAIL: TestGenerateRayStartCommand/HeadNode_with_multiple_custom_accelerators (0.00s)
        pod_test.go:1337: 
            	Error Trace:	/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/common/pod_test.go:1337
            	Error:      	Not equal: 
            	            	expected: "ray start --head  --num-gpus=1  --resources='{\"tpu\":8}' "
            	            	actual  : "ray start --head  --num-gpus=1  --resources='{\"neuron_cores\":4}' "
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1 +1 @@
            	            	-ray start --head  --num-gpus=1  --resources='{"tpu":8}' 
            	            	+ray start --head  --num-gpus=1  --resources='{"neuron_cores":4}' 

Can you take a look please?

@mounchin
Contributor Author

mounchin commented Oct 16, 2024

@mounchin I saw a test failure that seems related to this PR:

    --- FAIL: TestGenerateRayStartCommand/HeadNode_with_multiple_custom_accelerators (0.00s)
        pod_test.go:1337: 
            	Error Trace:	/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/common/pod_test.go:1337
            	Error:      	Not equal: 
            	            	expected: "ray start --head  --num-gpus=1  --resources='{\"tpu\":8}' "
            	            	actual  : "ray start --head  --num-gpus=1  --resources='{\"neuron_cores\":4}' "
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1 +1 @@
            	            	-ray start --head  --num-gpus=1  --resources='{"tpu":8}' 
            	            	+ray start --head  --num-gpus=1  --resources='{"neuron_cores":4}' 

Can you take a look please?

  • This is because the code loops over resourceLimits, which is a map, so it just picks the first custom accelerator resource it encounters and adds it to the ray start command.

  • With that said, there are three options for us:

    • Have consistent behavior when generating this command with multiple custom accelerators, by sorting the resource keys before looping (see the sketch below)
    • Add two expected outputs in the test and assert that the result matches either one of them
    • Remove the test, as it's not possible to have two custom accelerator types in one instance.

    Let me know what you think.
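
The first option would look roughly like this (a sketch; Go randomizes map iteration order, so sorting the keys makes the chosen accelerator deterministic):

    import (
        "sort"

        corev1 "k8s.io/api/core/v1"
    )

    // sortedResourceNames returns the keys of resourceLimits in a stable order.
    func sortedResourceNames(resourceLimits corev1.ResourceList) []corev1.ResourceName {
        keys := make([]corev1.ResourceName, 0, len(resourceLimits))
        for name := range resourceLimits {
            keys = append(keys, name)
        }
        // Sort so the loop always visits entries in the same order.
        sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
        return keys
    }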

@mounchin
Contributor Author

@andrewsykim let me know your thoughts on ^^

@andrewsykim
Collaborator

I think a common pattern is to sort the actual output before comparing to the expected. Can we do that for the generated Ray start command?

@mounchin
Contributor Author

But that won't work here, because the ray start command is generated based on whichever resourceLimits entry the loop encounters first.

@andrewsykim
Collaborator

Have a consistent behavior on generating this command on multiple custom accelerators, by sorting the resourceKeys before looping

This seems like the most viable option then

@mounchin
Contributor Author

Have a consistent behavior on generating this command on multiple custom accelerators, by sorting the resourceKeys before looping

This seems like the most viable option then

@andrewsykim please review this PR #2464
