
[Bug] Head pod is deleted rather than restarted when gcs_server on head pod is killed. #638

Closed · 2 tasks done
kevin85421 opened this issue Oct 17, 2022 · 0 comments · Fixed by #1341
Labels: bug (Something isn't working)

kevin85421 commented Oct 17, 2022

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ci

What happened + What you expected to happen

  • rayproject/ray:nightly
    • When the gcs_server process on the head pod is killed, the head pod is deleted by the KubeRay operator, which then creates a new one. (Link1, Link2)
  • rayproject/ray:2.0.0
    • When the gcs_server process on the head pod is killed, the head pod is restarted according to the Pod's restartPolicy.

The restartPolicy and KubeRay's recovery behavior are coupled together; we need a further discussion about how they should interact.
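As a rough, hedged sketch (not part of the original report) of how the difference can be observed with the kubernetes Python client: the namespace, label selector, and polling intervals below are assumptions. The idea is to kill gcs_server in the head pod and then check whether the same Pod object is restarted in place or replaced by a new one.

import time
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
v1 = client.CoreV1Api()

def get_head_pod(namespace="default"):
    # Assumption: the head pod carries the ray.io/node-type=head label.
    pods = v1.list_namespaced_pod(namespace, label_selector="ray.io/node-type=head")
    return pods.items[0]

old = get_head_pod()
old_uid = old.metadata.uid

# Kill the gcs_server process inside the head pod.
stream(v1.connect_get_namespaced_pod_exec, old.metadata.name, "default",
       command=["pkill", "gcs_server"],
       stderr=True, stdin=False, stdout=True, tty=False)

# Watch whether the pod is replaced (new UID) or restarted in place.
for _ in range(60):
    time.sleep(5)
    current = get_head_pod()
    if current.metadata.uid != old_uid:
        print("Head pod was deleted and recreated:", current.metadata.name)
        break
    if current.status.container_statuses[0].restart_count > 0:
        print("Head pod container was restarted in place (restartPolicy)")
        break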

Reproduction script

This cannot be reproduced every time.

RAY_IMAGE=rayproject/ray:nightly python3 tests/compatibility-test.py RayFTTestCase.test_detached_actor 2>&1 | tee log
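Because the failure is not deterministic, a hedged sketch of a retry wrapper around the reproduction command may help; the attempt count and log file names below are arbitrary assumptions, not part of the original report.

import os
import subprocess

CMD = ["python3", "tests/compatibility-test.py", "RayFTTestCase.test_detached_actor"]

for attempt in range(1, 6):
    # Rerun the flaky reproduction and keep a log per attempt.
    result = subprocess.run(
        CMD,
        env={**os.environ, "RAY_IMAGE": "rayproject/ray:nightly"},
        capture_output=True, text=True,
    )
    with open(f"log-{attempt}.txt", "w") as f:
        f.write(result.stdout + result.stderr)
    if result.returncode != 0:
        print(f"Attempt {attempt} reproduced the failure; see log-{attempt}.txt")
        break
else:
    print("Failure was not reproduced in 5 attempts")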

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kevin85421 kevin85421 added the bug Something isn't working label Oct 17, 2022
@kevin85421 kevin85421 self-assigned this Oct 17, 2022
DmitriGekhtman pushed a commit that referenced this issue Dec 1, 2022
…configuration framework (#759)

Refactors for integration tests --

Test operator chart: This PR uses the kuberay-operator chart to install the KubeRay operator, so the operator chart itself is now exercised by the tests.

Refactor: class CONST and class KubernetesClusterManager should be singleton classes. However, the singleton design pattern is often discouraged, so we need to think this through before converting these two classes into singletons.
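For reference, a minimal sketch of one common Python singleton idiom; the class name reuses KubernetesClusterManager purely for illustration and does not reflect the actual test code.

class KubernetesClusterManager:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Return the one shared instance, creating it on first use.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # Guard so repeated construction does not wipe shared state.
        if getattr(self, "_initialized", False):
            return
        self._initialized = True
        self.k8s_client_dict = {}

assert KubernetesClusterManager() is KubernetesClusterManager()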

Refactor: Replace os with subprocess. The following paragraph is from Python's official documentation.

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.
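A hedged before/after sketch of that replacement (the kubectl command is just an illustrative example):

import os
import subprocess

# Before: os.system only returns an exit status; the output goes straight to
# the terminal and cannot be asserted on.
os.system("kubectl get pods")

# After: subprocess.run captures stdout/stderr and raises CalledProcessError
# on a nonzero exit code, so tests can assert on the actual output.
result = subprocess.run(
    ["kubectl", "get", "pods"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)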

Skip test_kill_head due to

[Bug] Head pod is deleted rather than restarted when gcs_server on head pod is killed. #638
[Bug] Worker pods crash unexpectedly when gcs_server on head pod is killed  #634.
Refactor: Replace all existing k8s api clients with K8S_CLUSTER_MANAGER.
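Regarding the test_kill_head skip above, a hedged sketch of how a unittest case can be skipped with a pointer back to the blocking issues; the exact decorator message is an assumption.

import unittest

class RayFTTestCase(unittest.TestCase):
    @unittest.skip("Blocked by #638 and #634: head pod is deleted rather than "
                   "restarted when gcs_server is killed")
    def test_kill_head(self):
        ...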

Refactor and reduce the flakiness of test_ray_serve_work

working_dir is out of date (see this comment for more details), but the tests sometimes pass anyway because of a flaw in the original test logic. => Solution: Update working_dir in ray-service.yaml.template.
To elaborate, the flaw in the test logic mentioned above is that it only checks the exit code rather than STDOUT.
Even when the Pods are READY and RUNNING, the RayService still needs tens of seconds before it can serve requests. The time.sleep(60) call is a workaround and should be removed once [RayService] Track whether Serve app is ready before switching clusters #730 is merged.
Remove the NodePort service in RayServiceTestCase and use a curl Pod to communicate with Ray directly via the ClusterIP service. The original approach of a Docker container with network_mode='host' plus a NodePort service was awkward.
Refactor: remove the unused RayService templates ray-service-cluster-update.yaml.template and ray-service-serve-update.yaml.template. The original buggy test logic only checked the exit code rather than the STDOUT of the curl commands, so the different templates served no purpose in RayServiceTestCase. A sketch of the STDOUT check and readiness polling follows below.
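A hedged sketch of that STDOUT check and readiness polling; the curl Pod name, service URL, and expected response below are assumptions for illustration only.

import subprocess
import time

def query_serve(url: str) -> str:
    # Run curl inside a helper Pod so the ClusterIP service is reachable.
    result = subprocess.run(
        ["kubectl", "exec", "curl-pod", "--", "curl", "-s", url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Poll until the Serve app actually answers, instead of sleeping a fixed 60
# seconds, and assert on STDOUT rather than only the curl exit code.
deadline = time.time() + 180
while time.time() < deadline:
    try:
        if query_serve("http://rayservice-sample-serve-svc:8000/") == "expected response":
            break
    except subprocess.CalledProcessError:
        pass
    time.sleep(5)
else:
    raise AssertionError("RayService did not become ready in time")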

Refactor: Because the APIServer is not covered by any test case, remove everything related to the APIServer Docker image from the compatibility test.
@kevin85421 kevin85421 added this to the v0.5.0 release milestone Dec 1, 2022
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this issue Sep 24, 2023
…configuration framework (ray-project#759)
