-
Notifications
You must be signed in to change notification settings - Fork 402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Refactor] Support adding custom accelerator to resources in rayStartParams #2425
[Refactor] Support adding custom accelerator to resources in rayStartParams #2425
Conversation
if resourcesMap != nil && !isCustomAcceleratorResourceAdded { | ||
if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() { | ||
if err := addCustomAcceleratorToResourcesIfNotExists(rayStartParams, resourcesMap, rayResourceName, resourceValue.Value()); err != nil { | ||
log.Error(err, fmt.Sprintf("failed to add %s to resources", rayResourceName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we return an error instead and surface the log in generateRayStartCommand? This way you don't need to pass the logger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning error, with out going through all the entries of resourceLimits, might cause missing the num-gpus
parsing, if its later in the entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is an error, I think we should return it and propagate it to Reconcile
to surface the issue immediately. If we continue with the faulty CR spec, the behavior will be undefined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the behavior to panic on any error while marshalling/unmarshalling the resources string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should panic, that will cause the entire process to crash and impact other resources that don't have issues serializing custom accelerators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevin85421 what do you think?
@andrewsykim as far as I have tested, if you specify bad resources string in the yaml, it does not get applied to the cluster and fails with error says its not a valid object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the issue with returning an error here instead of panicing though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewsykim return error
where? Do you mean just return error and log it, instead of panicing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see where the confusion is. Typically we are following the pattern of returning and propagating errors to Reconcile to ensure errors are requeued by controller-runtime. However, I see that returning an error here would require a much larger refactor since generateRayStartParam
and the BuildPod
both do not return an error.
I suggest for now we go back to logging the error instead of panicing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it back to log error, instead of panicing.
resourcesMap, err := getResourcesMap(rayStartParams) | ||
if err != nil { | ||
return err | ||
func addWellKnownAcceleratorResources(log logr.Logger, rayStartParams map[string]string, resourceLimits corev1.ResourceList) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you pass ctx
instead of logr.Logger
and retrieve the logger from ctx
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the logger, it was initially added as the code was not throwing an error, would help for debugging.
if resourcesMap != nil && !isCustomAcceleratorResourceAdded { | ||
if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() { | ||
if err := addCustomAcceleratorToResourcesIfNotExists(rayStartParams, resourcesMap, rayResourceName, resourceValue.Value()); err != nil { | ||
log.Error(err, fmt.Sprintf("failed to add %s to resources", rayResourceName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid using fmt.Sprintf
in log function. You can check here for the reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
if err != nil { | ||
return err | ||
func addWellKnownAcceleratorResources(log logr.Logger, rayStartParams map[string]string, resourceLimits corev1.ResourceList) { | ||
resourcesMap, _ := getResourcesMap(rayStartParams) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to check for the error returned by getResourcesMap
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the code to check and panic
@@ -9,6 +9,8 @@ import ( | |||
"strconv" | |||
"strings" | |||
|
|||
"github.com/go-logr/logr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the import and retrieve the logger from ctx
instead.
} | ||
|
||
// Add the first encountered custom accelerator resource from the resource limits to the rayStartParams if not already present | ||
if resourcesMap != nil && !isCustomAcceleratorResourceAdded { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think getResourceMap
should promise not to return nil
if err
is not nil. In that case, we don't need to check resourcesMap != nil
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes thats right, getResourceMap
either returns value if err is not nil
and returns value = nil
incase error is not nil
.
if resourcesMap != nil && !isCustomAcceleratorResourceAdded { | ||
if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() { | ||
if err := addCustomAcceleratorToResourcesIfNotExists(rayStartParams, resourcesMap, rayResourceName, resourceValue.Value()); err != nil { | ||
log.Error(err, fmt.Sprintf("failed to add %s to resources", rayResourceName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is an error, I think we should return it and propagate it to Reconcile
to surface the issue immediately. If we continue with the faulty CR spec, the behavior will be undefined.
// Add the first encountered custom accelerator resource from the resource limits to the rayStartParams if not already present | ||
if resourcesMap != nil && !isCustomAcceleratorResourceAdded { | ||
if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() { | ||
if err := addCustomAcceleratorToResourcesIfNotExists(rayStartParams, resourcesMap, rayResourceName, resourceValue.Value()); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you avoid updating rayStartParams
and resourcesMap
in addCustomAcceleratorToResourcesIfNotExists
, especially since this loop also checks resourcesMap
? Maybe addCustomAcceleratorToResourcesIfNotExists
is not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the function entirely and added the code in the same function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM! Left some nit comments.
if !isCustomAcceleratorResourceAdded { | ||
if rayResourceName, ok := customAcceleratorToRayResourceMap[resourceKeyString]; ok && !resourceValue.IsZero() { | ||
if _, exists := resourcesMap[rayResourceName]; !exists { | ||
resourcesMap[rayResourceName] = float64(resourceValue.Value()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use resourceValue.AsApproximateFloat64
instead? I guess the precision loss will be less than converting to int64 first and then to float64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, ok to open another PR for this?
return fmt.Errorf("failed to marshal resources map to string: %w", err) | ||
} | ||
|
||
rayStartParams["resources"] = fmt.Sprintf("'%s'", updatedResourcesStr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change from string(updatedResourcesStr)
to fmt.Sprintf("'%s'", updatedResourcesStr)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ray start
cmd expects the resources string to passed in the below format(with quotes around the string)
ray start --head --num-cpus=3 --num-gpus=4 --resources='{"special_hardware": 1, "custom_label": 1}'
Ref: https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#specifying-node-resources
@kevin85421 can you merge the PR too? |
@mounchin I saw a test failure that seems related to this PR:
Can you take a look please? |
|
@andrewsykim let me know your thoughts on ^^ |
I think a common pattern is to sort the actual output before comparing to the expected. Can we do that for the generated Ray start command? |
But that wont work here, because based on the |
This seems like the most viable option then |
@andrewsykim please review this PR #2464 |
Why are these changes needed?
Related issue number
ray-project/ray#44361
Checks