
[Hotfix][release blocker][RayJob] Prevent HTTP client from submitting jobs before dashboard initialization completes #1000

Merged 1 commit into ray-project:master on Mar 31, 2023

Conversation

@kevin85421 (Member) commented Mar 31, 2023

Why are these changes needed?

In RayJob, the HTTP client submits a job to the RayCluster only when all Pods are running. However, Pod readiness does not necessarily imply that the dashboard is ready. Furthermore, the HTTP client has a timeout of 2 seconds, which is shorter than the time the dashboard needs to initialize. As a result, odd behavior can occur:

During job submission, the KubeRay operator's HTTP client may fail with a timeout even though the RayCluster received the request and launched the job successfully. The KubeRay operator then resubmits the job because it believes the submission failed. However, @architkulkarni discovered that resubmitting a job with the same name as a previously submitted job can cause the first submission to fail. This bug was fixed in Ray 2.4.0. Bug: ray-project/ray#31356

The following log is from my KubeRay operator. It takes at least 4–5 seconds to receive the response, i.e., for the dashboard to finish initializing.

2023-03-31T18:00:10.480Z	INFO	controllers.RayJob	Submit a ray job	{"rayJob": "rayjob-sample", "jobInfo": "{\"entrypoint\":\"python /home/ray/samples/sample_code.py\",\"job_id\":\"rayjob-sample-jh58c\",\"runtime_env\":{\"env_vars\":{\"counter_name\":\"test_counter\"},\"pip\":[\"requests==2.26.0\",\"pendulum==2.1.2\"]}}"}
2023-03-31T18:00:14.328Z	INFO	controllers.RayJob	Job successfully submitted	{"RayJob": "rayjob-sample", "jobId": "rayjob-sample-jh58c"}
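
For illustration, here is a minimal Go sketch of the mitigation direction (not the actual PR diff): poll the dashboard's HTTP endpoint until it responds before submitting the job, instead of relying on Pod readiness alone. The `/api/version` path, the head service URL, the helper name `waitForDashboard`, and the timing values are all assumptions for the example.

// Sketch only: wait for the Ray dashboard to answer HTTP requests before
// submitting a job. Endpoint, URL, and timing values are illustrative.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitForDashboard polls baseURL until the dashboard responds with 200 OK
// or the deadline expires. "/api/version" is assumed to be served by the
// Ray dashboard once it has finished initializing.
func waitForDashboard(baseURL string, deadline time.Duration) error {
	client := &http.Client{Timeout: 2 * time.Second} // per-request timeout
	stop := time.Now().Add(deadline)
	for time.Now().Before(stop) {
		resp, err := client.Get(baseURL + "/api/version")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // dashboard is up; safe to submit the job once
			}
		}
		time.Sleep(time.Second) // back off before the next probe
	}
	return fmt.Errorf("dashboard at %s not ready within %s", baseURL, deadline)
}

func main() {
	// "raycluster-sample-head-svc" is a hypothetical head service name.
	if err := waitForDashboard("http://raycluster-sample-head-svc:8265", 30*time.Second); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("dashboard ready; submitting job")
}

Probing before a single submission avoids the resubmission path entirely, which also sidesteps the same-name race in Ray versions before 2.4.0.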

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 (Member, Author) commented:

> discovered that resubmitting a job with the same name as a previously submitted job can cause the first submission to fail. This bug was fixed in Ray 2.4.0.

@architkulkarni would you mind adding more details about the race condition bug in Ray? In addition, would you mind testing RayJob with this PR on your laptop? We can run it more than 10 times to check for flakiness. I will test RayService on my devbox. Thanks!

@kevin85421 kevin85421 marked this pull request as ready for review March 31, 2023 18:05
@architkulkarni (Contributor) commented Mar 31, 2023

Tested 10x locally, passed every time.

# This script rebuilds the KubeRay operator image and loads it into the Kind cluster.
# Step 0: Delete the existing Kind cluster
kind delete cluster 

# Step 1: Create a Kind cluster
kind create cluster --image=kindest/node:v1.24.0

# Step 2: Modify KubeRay source code
# For example, add a log "Hello KubeRay" in the function `Reconcile` in `raycluster_controller.go`.

# Step 3: Build a Docker image
#         This command will copy the source code directory into the image, and build it.
# Command: IMG={IMG_REPO}:{IMG_TAG} make docker-build
IMG=kuberay/operator:nightly make docker-build

# Step 4: Load the custom KubeRay image into the Kind cluster.
# Command: kind load docker-image {IMG_REPO}:{IMG_TAG}
kind load docker-image kuberay/operator:nightly

# Step 5: Keep consistency
# If you update RBAC or CRD, you need to synchronize them.
# See the section "Consistency check" for more information.

# Step 6: Install KubeRay operator with the custom image via local Helm chart
# (Path: helm-chart/kuberay-operator)
# Command: helm install kuberay-operator --set image.repository={IMG_REPO} --set image.tag={IMG_TAG} .
helm install kuberay-operator --set image.repository=kuberay/operator --set image.tag=nightly ~/kuberay/helm-chart/kuberay-operator

# Step 7: Submit a RayJob
kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml   

# Step 8: Check the status of the pods
watch kubectl get pods

# When the pods are ready, check `kubectl logs` to see that the job has status SUCCEEDED

# (Retry) Delete job
# kubectl delete -f config/samples/ray_v1alpha1_rayjob.yaml
# Resubmit job
# kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml   

@architkulkarni (Contributor) left a comment

Looks good to me. I added the link to the Ray bug and my test results.

@kevin85421 (Member, Author) commented:

Tested RayService 25 times locally, passed every time.

#!/bin/bash
for i in {1..25} 
do
  RAY_IMAGE=rayproject/ray:2.3.0 OPERATOR_IMAGE=controller:latest python3 tests/test_sample_rayservice_yamls.py 2>&1 | tee log_$i.txt
done

@kevin85421 kevin85421 merged commit 00dc45a into ray-project:master Mar 31, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…fore dashboard initialization completes (ray-project#1000)

Prevent HTTP client from submitting jobs before dashboard initialization completes