[Bug] compatibility test for the nightly Ray image fails #1055
Conversation
(1) In most cases. (2) Without #1036, all Pods in the cluster will crash (the head Pod crashes once, and each worker Pod crashes twice). With #1036, only the head Pod will crash. This PR relieves the flakiness, but the test is still flaky; see the section "Experiments for 73b37b1" for more details. In GitHub Actions, it passes twice consecutively without any retry. That's why I think this PR is ready for review.
Looks good to me! If I understand correctly, this PR takes the test from consistently failing to merely flaky. Feel free to keep the original flakiness issue open, or open a new issue for the remaining flakiness.
[Bug] compatibility test for the nightly Ray image fails (#1055)
Why are these changes needed?
- Use an HTTP request to verify the Serve deployment after the cluster recovers from a failure, to fix "[Serve] Cannot get serve deployment after a RayCluster recovers" ray#34799 (see the first sketch after this list).
  - For Ray 2.1.0, containers require tens of seconds to become "READY" after the Pod is running. In addition, the Serve deployment takes a few seconds to become ready to serve requests after all containers are "READY".
- With "[Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed" #1036, the worker Pods will not crash after the GCS server process is killed. Hence, the output of `test_detached_actor_2.py` may differ depending on whether the actor is assigned to the head or a worker (see the second sketch after this list).
  - `test_detached_actor_1.py` calls `increment()` twice, and `test_detached_actor_2.py` calls `increment()` once. Hence, if the actor is scheduled on a worker, the output of `test_detached_actor_2.py` should be 3. On the other hand, the output should be 1 if it is scheduled on the head Pod.
  - Add `num-cpus: 0` to prevent the actor from being scheduled on the head.
  - Update `test_detached_actor_2.py` accordingly (`assert(val == 3)`).
  - Specify `ray_namespace`.
- Update `rayStartParams` in `ray-service.yaml.template`. A worker Pod with `node-ip-address: $$MY_POD_IP` cannot connect to the head (see "[Bug] Job Sample YAML `ray_v1alpha1_rayjob.yaml` fails with empty `node-ip-address` `$MY_POD_IP`" #805 for more details).
- Connect to the GCS (`ray.init()`) rather than the Ray client (port 10001) ("[Feature] Connect to RayCluster via GCS port rather than Ray client in compatibility test" #848). When using `ray.init()` instead of the Ray client, the tests become very unstable (see the third sketch after this list).
- (Bug? Ray 2.1.0) In some cases, the HTTPProxy will not be created on the head Pod after the cluster recovers from a failure.
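
The first sketch shows what this kind of HTTP-based verification can look like, assuming the Serve application is reachable at `http://localhost:8000/` and returns a fixed payload. The function name, URL, expected response, and timeout values are illustrative placeholders, not the exact ones used in the compatibility test.

```python
import time

import requests


def wait_for_serve_deployment(url: str = "http://localhost:8000/",
                              expected: str = "Hello world!",
                              timeout_s: int = 180) -> None:
    """Poll the Serve HTTP endpoint until it returns the expected payload.

    Containers can take tens of seconds to become READY after the Pod is
    running, and the Serve deployment needs a few more seconds after that,
    so keep retrying until the timeout expires.
    """
    deadline = time.time() + timeout_s
    last_err = None
    while time.time() < deadline:
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code == 200 and resp.text == expected:
                return  # the deployment is serving traffic again
            last_err = f"unexpected response: {resp.status_code} {resp.text!r}"
        except requests.RequestException as e:
            last_err = str(e)
        time.sleep(5)
    raise TimeoutError(f"Serve deployment not ready after {timeout_s}s: {last_err}")
```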
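The second sketch outlines the detached-actor pattern behind `test_detached_actor_1.py` and `test_detached_actor_2.py`. The actor name, the `ft-test` namespace, and the counter implementation are assumptions for illustration, not the actual test code.

```python
import ray

# test_detached_actor_1.py-style script: create a detached counter actor and
# call increment() twice. The detached actor outlives this driver script.
ray.init(address="auto", namespace="ft-test")  # namespace is a placeholder


@ray.remote(num_cpus=1)
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value


# With `num-cpus: 0` on the head Pod, this actor can only be scheduled on a
# worker Pod.
counter = Counter.options(name="counter", lifetime="detached").remote()
assert ray.get(counter.increment.remote()) == 1
assert ray.get(counter.increment.remote()) == 2
```

After the GCS server is killed and recovers, the second script retrieves the same actor by name, which only works when both scripts use the same namespace (this is the `ray_namespace` point above):

```python
import ray

# test_detached_actor_2.py-style script: run after the GCS server on the head
# Pod has been killed and recovered. Because the worker Pods no longer crash
# (#1036), the detached actor keeps its state and a third increment() returns 3.
ray.init(address="auto", namespace="ft-test")  # must match the first script

counter = ray.get_actor("counter")
val = ray.get(counter.increment.remote())
assert val == 3
```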
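The third sketch contrasts the two connection modes. The head service name is hypothetical; 6379 (GCS) and 10001 (Ray client server) are Ray's default ports, and the script is assumed to run somewhere that can resolve the local GCS address (for example, inside the head Pod).

```python
import ray

# Ray client approach: connect through the Ray client server on port 10001,
# typically from outside the cluster.
# ray.init("ray://<head-service>:10001")

# GCS approach: connect to the GCS directly; "auto" resolves the local GCS
# address (port 6379) when the script runs inside the cluster.
ray.init(address="auto")

print(ray.cluster_resources())
```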
Experiments for 73b37b1
- I ran `RayFTTestCase` 25 times on my devbox. `test_detached_actor` never fails; `test_ray_serve` fails 6 times:
  - `test_ray_serve_1.py` * 5
  - `test_ray_serve_2.py` * 1
- The error message from `test_ray_serve_1.py` is from the link. The reason seems to be a failure to get the HTTPProxy actors, which is similar to my observation above: "(Bug? Ray 2.1.0) In some cases, the HTTPProxy will not be created on the head Pod after the cluster recovers from a failure." I will file an issue later, but the check seems to be legacy code.

Related issue number
Closes #1053
#848
Closes ray-project/ray#34799
Checks