cluster status safeguards #254

Merged
merged 1 commit into from
Aug 7, 2023

Collaborator

@Maxusmusti Maxusmusti commented Jul 26, 2023

Issue link

#250

What changes have been made

Added error handling so that cluster.status() and cluster.wait_ready() no longer crash when the AppWrapper (AW) status is missing.
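
A minimal sketch of the kind of guard this describes, not the actual diff. It assumes the AppWrapper is available as a plain dict and that its state sits under `status.state`; the helper name `appwrapper_state` is hypothetical and is not part of the SDK.

```python
from typing import Any, Mapping, Optional


def appwrapper_state(aw: Mapping[str, Any]) -> Optional[str]:
    """Return the AppWrapper state in lowercase, or None if not reported yet.

    Hypothetical helper: the `status.state` layout is an assumption.
    """
    # A freshly created resource may have no "status" key at all, or a null one.
    status = aw.get("status") or {}
    state = status.get("state")
    return state.lower() if isinstance(state, str) else None


# Right after creation the controller may not have written a status yet.
fresh = {"metadata": {"name": "demo"}}
running = {"metadata": {"name": "demo"}, "status": {"state": "Running"}}

print(appwrapper_state(fresh))    # None
print(appwrapper_state(running))  # running
```

Returning None instead of raising lets a caller treat "no status yet" as "not ready" and keep polling.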

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@kpouget

kpouget commented Jul 26, 2023

@Maxusmusti, is there an easy way to test this PR? Ideally, something I could put there to confirm that it does the trick?

@MichaelClifford
Collaborator

Tested it in my ODH environment, and it appears to overcome the issue with wait_ready() failing.

@Maxusmusti
Collaborator Author

@kpouget This should hopefully fix the MissingModel issue you were running into. Running cluster.status() or cluster.wait_ready() immediately after cluster.up() should never crash or error out on the second call. To give MCAD a bit more load, you can run:

cluster.up()
cluster.down()
cluster.up()
cluster.wait_ready()

and it should never crash throughout the wait_ready()
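
To repeat that sequence several times, a loop along these lines could be used. This is only a sketch: `stress_cluster` and `FakeCluster` are hypothetical names, and the only assumption about the real object is that it exposes `up()`, `down()`, and `wait_ready()` as used above. `FakeCluster` is a stand-in so the snippet runs without a cluster.

```python
class FakeCluster:
    """Stand-in that records calls; swap in a real cluster object to test."""

    def __init__(self):
        self.calls = []

    def up(self):
        self.calls.append("up")

    def down(self):
        self.calls.append("down")

    def wait_ready(self):
        self.calls.append("wait_ready")


def stress_cluster(cluster, cycles=3):
    """Run up -> down -> up -> wait_ready `cycles` times, tearing down after each.

    Returns the number of completed cycles. Exceptions are not caught, so a
    crash in wait_ready() stays visible to the caller.
    """
    completed = 0
    for _ in range(cycles):
        cluster.up()
        cluster.down()
        cluster.up()
        cluster.wait_ready()
        cluster.down()  # clean up before the next cycle
        completed += 1
    return completed


fake = FakeCluster()
print(stress_cluster(fake, cycles=2))  # 2
```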

You can install this specific fork/branch in a couple of ways: either clone the repo (check out the correct branch) and run pip install -e ., or pip install directly from git: https://stackoverflow.com/questions/20101834/pip-install-from-git-repo-branch

kpouget added a commit to kpouget/ci-artifacts that referenced this pull request Jul 27, 2023
@kpouget

kpouget commented Jul 28, 2023

@Maxusmusti, I cannot test this PR because #255 is blocking my automation.

If you want to try it while I'm away, post this comment in openshift-psap/ci-artifacts#876:

test codeflare-light

Then navigate to this path to see the pod execution logs:
/artifacts/e2e/test/artifacts/000__sdk_user_run_many/000__local_ci__run_multi/ci-pods_artifacts/ci-pod-0/run.log

@openshift-ci
Contributor

openshift-ci bot commented Aug 7, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MichaelClifford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 7, 2023
@MichaelClifford
Collaborator

/LGTM

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 7, 2023
@MichaelClifford MichaelClifford merged commit f34697a into project-codeflare:main Aug 7, 2023
2 checks passed
kpouget added a commit to kpouget/topsail that referenced this pull request Aug 14, 2023
kpouget added a commit to kpouget/topsail that referenced this pull request Aug 25, 2023