
Split head memory and cpu requests/limits #579

Merged

Conversation

Contributor

@Bobbins228 Bobbins228 commented Jul 2, 2024

Issue link

Closes: RHOAIENG-9259

What changes have been made

  • Split the head CPU and memory resources into requests/limits, similar to update SDK args #547
  • Added deprecation warnings for the old variables head_cpus and head_memory
  • Updated head/worker_extended_resource_request to include string values due to a failing get_cluster method
  • Updated notebook WF tests to reflect the new parameters
  • Updated existing e2e tests with the new parameters
  • Added documentation for the deprecated variables

Verification steps

Setup

Notebook server ODH/RHOAI/Local

  • Clone this repository with git clone https://github.com/project-codeflare/codeflare-sdk.git
  • Check out this PR's branch
  • Run poetry build (install Poetry first if needed: pip install poetry)
  • Run pip install --force-reinstall dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl
  • Restart your notebook kernel

Testing

Testing the deprecated args head_cpus and head_memory

Follow the basic Ray demo. Set the head_cpus and head_memory parameters to values of your choosing.
You should get a warning that the parameters are deprecated and that the new ones should be used instead.

The head CPU requests and limits should both match the values you entered above.
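The deprecation mapping described above can be sketched roughly like this. This is a minimal, stdlib-only illustration; resolve_head_cpu is a hypothetical helper, not the SDK's actual code:

```python
import warnings

def resolve_head_cpu(head_cpus=None, head_cpu_requests=None, head_cpu_limits=None):
    """Illustrative sketch: map the deprecated head_cpus arg onto the
    new requests/limits pair, warning when the old arg is used."""
    if head_cpus is not None:
        warnings.warn(
            "head_cpus is deprecated, use head_cpu_requests and head_cpu_limits",
            DeprecationWarning,
            stacklevel=2,
        )
        # The old single value becomes both the request and the limit,
        # unless the new args were given explicitly.
        if head_cpu_requests is None:
            head_cpu_requests = head_cpus
        if head_cpu_limits is None:
            head_cpu_limits = head_cpus
    return head_cpu_requests, head_cpu_limits
```

The same pattern would apply to head_memory versus head_memory_requests/head_memory_limits.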

Testing the new requests/limits args

In the ClusterConfiguration add the parameters

  • head_cpu_requests
  • head_cpu_limits
  • head_memory_requests
  • head_memory_limits

Set them to values of your choosing and the head pod of the Ray Cluster should reflect these values.
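With these four parameters, the head pod's resource block ends up shaped like a standard Kubernetes requests/limits spec. A minimal illustrative sketch (head_resources_spec and its default values are assumptions for illustration, not the SDK's API):

```python
def head_resources_spec(head_cpu_requests="2", head_cpu_limits="2",
                        head_memory_requests="8G", head_memory_limits="8G"):
    """Build a Kubernetes-style resources block for the Ray head pod
    from the four new ClusterConfiguration parameters.
    Hypothetical helper; defaults here are made up for the example."""
    return {
        "requests": {"cpu": head_cpu_requests, "memory": head_memory_requests},
        "limits": {"cpu": head_cpu_limits, "memory": head_memory_limits},
    }
```

Checking the generated RayCluster YAML (or the running head pod) should show the requests and limits sections populated independently, rather than both being derived from a single head_cpus/head_memory value.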

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

Collaborator

@ChristianZaccaria ChristianZaccaria left a comment


There is one thing to note, perhaps unrelated to this PR: a user can input essentially ANY value in the ClusterConfiguration parameters.

For example, I can set head_cpu_requests=True, or even head_gpus=True, and that is reflected in the YAML as a bool. I believe this is not the expected behaviour. Note that this was tested on KinD as my OpenShift cluster isn't working at the moment.

@Bobbins228
Contributor Author

@ChristianZaccaria
This is not expected behaviour at all :(
I can have a look at adding some validation to ensure that the head/worker requests/limits are of the correct type.
Good catch!

@ChristianZaccaria
Collaborator

@Bobbins228 I couldn't get further, but I suppose maybe cluster.up() will already capture that and throw an error for using the wrong datatypes. However, you're right, there seems to be no validation when creating the yaml file.

@Bobbins228
Contributor Author

@ChristianZaccaria This is insane! It seems you can pretty much set any of the variables to whatever type you like.
I will create a Jira for fixing the validation on all ClusterConfiguration parameters.
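The kind of type validation discussed here could look roughly like the following. validate_resource is a hypothetical helper, not part of the SDK; note that bool needs an explicit check because Python treats it as a subclass of int, which is exactly how a value like head_cpu_requests=True can slip past a naive int check:

```python
def validate_resource(name, value):
    """Illustrative sketch: accept only int or str resource values.

    isinstance(True, int) is True in Python, so bool must be rejected
    explicitly before the int/str check.
    """
    if isinstance(value, bool) or not isinstance(value, (int, str)):
        raise TypeError(
            f"{name} must be an int or str, got {type(value).__name__}"
        )
    return value
```

Applied across the ClusterConfiguration parameters, a check like this would fail fast at configuration time instead of emitting an invalid YAML that only errors (if at all) once the cluster is created.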

@Bobbins228 Bobbins228 added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 9, 2024
@Bobbins228
Contributor Author

Applied do not merge label until RHOAIENG-9259 is a priority again.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 12, 2024
@Bobbins228 Bobbins228 removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 9, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 9, 2024
Collaborator

@KPostOffice KPostOffice left a comment


This looks good to me, just some docs changes

Review comments on docs/cluster-configuration.md (outdated, resolved)
@Bobbins228
Contributor Author

/retest

Collaborator

@KPostOffice KPostOffice left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 19, 2024
Contributor

openshift-ci bot commented Sep 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: KPostOffice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 19, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 1235fc8 into project-codeflare:main Sep 19, 2024
10 checks passed