[Bug] A mini RayService requires more than 90 seconds to converge #838

Closed
2 tasks done
kevin85421 opened this issue Dec 13, 2022 · 4 comments
Assignees
Labels
bug Something isn't working rayservice stability Pertains to basic infrastructure stability

Comments

@kevin85421
Member

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

See #837 for more details.

test_sample_rayservice_yamls.py is very flaky: it fails in roughly 20% of runs. It used to be very stable; #731 reported, "I have run more than ten times on my cluster, and the result is always pass."

  • Root cause: RayServiceAddCREvent fails to converge because rayservice-sample-serve-svc does not appear within the 90-second timeout. Convergence is too slow.
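
The convergence check described above amounts to polling for a condition until a deadline. Here is a minimal sketch of that pattern, not the test framework's actual code: `wait_until` and `service_exists` are hypothetical names, and the predicate is a stand-in for "does rayservice-sample-serve-svc exist yet?".

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout_s: float = 90.0,
               interval_s: float = 1.0) -> float:
    """Poll `predicate` until it returns True and return the elapsed seconds.

    Raises TimeoutError if the predicate is still False after `timeout_s`.
    """
    start = time.monotonic()
    while True:
        if predicate():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(interval_s)


# Hypothetical stand-in for a real lookup of the Kubernetes service:
# it reports "missing" twice and "present" on the third poll.
calls = {"n": 0}


def service_exists() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


elapsed = wait_until(service_exists, timeout_s=1.0, interval_s=0.01)
print(f"converged after {calls['n']} polls")
```

Returning the elapsed time makes it easy to log how close each run comes to the limit, which helps tell a slow environment apart from a genuine failure to converge.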

Reproduction script

Run the following command several times.

RAY_IMAGE=rayproject/ray:2.1.0 OPERATOR_IMAGE=kuberay/operator:nightly python3 tests/test_sample_rayservice_yamls.py
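
To put a number on the flakiness, the repeated runs can be automated. The sketch below is only an illustration: `failure_rate` is a hypothetical helper, and it assumes the test script exits with a non-zero status when it fails. The two demo commands are trivial Python one-liners, not the real test.

```python
import subprocess
import sys
from typing import Sequence


def failure_rate(cmd: Sequence[str], runs: int) -> float:
    """Run `cmd` `runs` times and return the fraction of non-zero exits."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = 0
    for _ in range(runs):
        # The child process inherits the current environment, so variables
        # such as RAY_IMAGE and OPERATOR_IMAGE would be passed through.
        result = subprocess.run(cmd,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures / runs


# Demo commands standing in for the real test invocation.
always_pass = [sys.executable, "-c", "pass"]
always_fail = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(failure_rate(always_pass, 2))
print(failure_rate(always_fail, 2))
```

For the real test, `cmd` would be something like `["python3", "tests/test_sample_rayservice_yamls.py"]`, with RAY_IMAGE and OPERATOR_IMAGE exported in the shell beforehand.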

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kevin85421 kevin85421 added the bug Something isn't working label Dec 13, 2022
@kevin85421 kevin85421 self-assigned this Dec 13, 2022
@kevin85421 kevin85421 added the stability Pertains to basic infrastructure stability label Dec 13, 2022
@kevin85421 kevin85421 added this to the v0.5.0 release milestone Dec 13, 2022
@kevin85421
Member Author

@architkulkarni
Contributor

Just recording the time it took when I ran this test on my laptop:
--- RayServiceAddCREvent 205.55406618118286 seconds ---

@kevin85421
Member Author

The test became stable after #1000: the RayService test passed 25 times in a row.

@kevin85421
Member Author

RAY_IMAGE=rayproject/ray:2.4.0 OPERATOR_IMAGE=kuberay/operator:nightly python3 tests/test_sample_rayservice_yamls.py

With this command, RayServiceAddCREvent converges in 32 seconds. My guess is that the root cause of this issue is RAY_IMAGE. The current test framework does not automatically replace the image in the sample YAML with RAY_IMAGE, so if RAY_IMAGE differs from the value of image in the YAML file, the Pod pulls the image from DockerHub instead, which presumably accounts for the slow convergence.
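
One way to catch this mismatch before running the test is to compare RAY_IMAGE against every `image:` value in the sample YAML. The sketch below is an assumption-laden illustration, not part of the test framework: `find_mismatched_images` is a hypothetical helper, the YAML fragment is made up for the demo, and a simple regex is used instead of a YAML parser to keep the example dependency-free.

```python
import re
from typing import List

# Matches lines such as "image: repo/name:tag" or "- image: 'repo/name:tag'".
IMAGE_LINE = re.compile(r"^\s*(?:-\s*)?image:\s*[\"']?([^\s\"'#]+)",
                        re.MULTILINE)


def find_mismatched_images(yaml_text: str, ray_image: str) -> List[str]:
    """Return the sorted, de-duplicated image values that differ from `ray_image`."""
    found = IMAGE_LINE.findall(yaml_text)
    return sorted({image for image in found if image != ray_image})


# Made-up fragment for illustration only; not the real sample YAML.
sample_yaml = """
containers:
  - name: ray-head
    image: rayproject/ray:2.4.0
  - name: ray-worker
    image: rayproject/ray:2.4.0
"""

print(find_mismatched_images(sample_yaml, "rayproject/ray:2.1.0"))
print(find_mismatched_images(sample_yaml, "rayproject/ray:2.4.0"))
```

A non-empty result means at least one Pod would use an image other than RAY_IMAGE, so the timing of the run may include an image pull from DockerHub.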


3 participants