[Hotfix][release blocker][RayJob] Prevent HTTP client from submitting jobs before dashboard initialization completes #1000
Conversation
@architkulkarni would you mind adding more details about the race condition bug in Ray? In addition, would you mind testing RayJob with this PR on your laptop? We can run it more than 10 times to check for flakiness. I will test RayService on my devbox. Thanks!
Tested 10x locally, passed every time. (A sketch of a repeat loop is included after the script below.)
# This script is used to rebuild the KubeRay operator image and load it into the Kind cluster.
# Step 0: Delete the existing Kind cluster
kind delete cluster
# Step 1: Create a Kind cluster
kind create cluster --image=kindest/node:v1.24.0
# Step 2: Modify KubeRay source code
# For example, add a log "Hello KubeRay" in the function `Reconcile` in `raycluster_controller.go`.
# Step 3: Build a Docker image
# This command will copy the source code directory into the image, and build it.
# Command: IMG={IMG_REPO}:{IMG_TAG} make docker-build
IMG=kuberay/operator:nightly make docker-build
# Step 4: Load the custom KubeRay image into the Kind cluster.
# Command: kind load docker-image {IMG_REPO}:{IMG_TAG}
kind load docker-image kuberay/operator:nightly
# Step 5: Keep consistency
# If you update RBAC or CRD, you need to synchronize them.
# See the section "Consistency check" for more information.
# Step 6: Install KubeRay operator with the custom image via local Helm chart
# (Path: helm-chart/kuberay-operator)
# Command: helm install kuberay-operator --set image.repository={IMG_REPO} --set image.tag={IMG_TAG} .
helm install kuberay-operator --set image.repository=kuberay/operator --set image.tag=nightly ~/kuberay/helm-chart/kuberay-operator
# Step 7: Submit a RayJob
kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml
# Step 8: Check the status of the pods
watch kubectl get pods
# When the pods are ready, check `kubectl logs` to see that the job has status SUCCEEDED
# (Retry) Delete job
# kubectl delete -f config/samples/ray_v1alpha1_rayjob.yaml
# Resubmit job
# kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml
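A minimal sketch of how Steps 7 and 8 could be wrapped in a loop to repeat the test 10 times, similar to the RayService loop below. The RayJob name `rayjob-sample`, the `.status.jobStatus` field, and the 120-second wait are assumptions based on the sample manifest; adjust them to match your setup.
#!/bin/bash
# Sketch: repeat the RayJob submission (Steps 7 and 8) to check for flakiness.
# Assumptions: the sample RayJob is named `rayjob-sample`, its status is exposed
# in `.status.jobStatus`, and 120 seconds is enough for the job to finish.
for i in {1..10}
do
  kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml
  sleep 120
  # Expect SUCCEEDED here; anything else indicates a flaky run.
  kubectl get rayjob rayjob-sample -o jsonpath='{.status.jobStatus}{"\n"}'
  kubectl delete -f config/samples/ray_v1alpha1_rayjob.yaml
done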
Looks good to me. I added the link to the Ray bug and my test results.
Tested RayService 25 times locally, passed every time.
#!/bin/bash
for i in {1..25}
do
RAY_IMAGE=rayproject/ray:2.3.0 OPERATOR_IMAGE=controller:latest python3 tests/test_sample_rayservice_yamls.py 2>&1 | tee log_$i.txt
done
Why are these changes needed?
In RayJob, the HTTP client submits a job to the RayCluster only when all Pods are running. However, Pod readiness does not necessarily mean that the dashboard is ready. Furthermore, the HTTP client has a timeout of 2 seconds, which is shorter than the time the dashboard needs to initialize. Hence, unexpected behavior can occur:
When the KubeRay operator submits a job, the HTTP client may time out even though the RayCluster receives the request and launches the job successfully. The KubeRay operator then resubmits the job because it thinks the submission failed. However, @architkulkarni discovered that resubmitting a job with the same name as a previously submitted job can cause the first submission to fail. This bug was fixed in Ray 2.4.0. Bug: ray-project/ray#31356
Based on the logs of my KubeRay operator, it takes at least 4~5 seconds to receive the response, i.e., for the dashboard to finish initializing.
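To illustrate the idea of the fix outside the operator (the actual change lives in the operator's Go HTTP client): wait until the dashboard actually answers requests instead of relying on Pod readiness alone. This is only a hedged shell sketch; it assumes the head service has been port-forwarded to localhost:8265, and the 60-second overall budget is arbitrary. The 2-second per-request timeout mirrors the client timeout described above.
# Sketch: poll the Ray dashboard before submitting a job, instead of relying
# on Pod readiness alone. Assumes the head service is port-forwarded first:
#   kubectl port-forward service/$HEAD_SVC 8265:8265
# where $HEAD_SVC is the RayCluster head service name.
for i in {1..30}
do
  # /api/jobs/ is served by the Ray job submission server on the dashboard;
  # --max-time 2 mirrors the 2-second client timeout mentioned above.
  if curl --silent --fail --max-time 2 http://localhost:8265/api/jobs/ > /dev/null
  then
    echo "Dashboard is ready; safe to submit the job."
    break
  fi
  echo "Dashboard not ready yet; retrying..."
  sleep 2
done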
Related issue number
Checks