
[release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster #1251

Merged
6 commits merged into ray-project:master on Jul 19, 2023

Conversation

@kevin85421 (Member) commented on Jul 18, 2023

Why are these changes needed?

  • Remove the feature gate PrioritizeWorkersToDelete: The default value of this feature flag has been set to true for a year. ([autoscaler] Flip prioritize-workers-to-delete feature flag #379)

  • There are two root causes for this issue:

    • (1) If a Pod in runningPods is also present in workersToDelete, it is deleted but, without this pull request, it is not removed from runningPods. KubeRay therefore randomly deletes the same number of additional Pods as there are Pods present in both runningPods and workersToDelete. Refer to [Bug] pod randomly deleted by operator in scale-in #1192 (comment) for more details. A sketch of the fix idea follows this list.
    • (2) At L575, isPodRunningOrPendingAndNotDeleting(pod) always returns false because pod here is constructed locally and never sets pod.Status.Phase (see the code below).
      for _, podsToDelete := range worker.ScaleStrategy.WorkersToDelete {
          pod := corev1.Pod{}
          pod.Name = podsToDelete
          pod.Namespace = utils.GetNamespace(instance.ObjectMeta)
          r.Log.Info("Deleting pod", "namespace", pod.Namespace, "name", pod.Name)
          if err := r.Delete(ctx, &pod); err != nil {
              if !errors.IsNotFound(err) {
                  return err
              }
              r.Log.Info("reconcilePods", "unable to delete worker ", pod.Name)
          } else {
              // For example, the failed Pod (Status: corev1.PodFailed) is not counted in the `runningPods` variable.
              // Therefore, we should not update `diff` when we delete a failed Pod.
              if isPodRunningOrPendingAndNotDeleting(pod) {
                  diff++
              }
              r.Recorder.Eventf(instance, corev1.EventTypeNormal, "Deleted", "Deleted pod %s", pod.Name)
          }
      }
  • Why couldn't TestReconcile_RemoveWorkersToDelete_OK detect this double deletion issue earlier?

    • Without this PR, workersToDelete in the test contains pod1 and pod2. When the test calls reconcilePods, these two Pods are deleted by this line. They are still in runningPods, so KubeRay then randomly deletes the first two Pods in the PodList in this line. With the fake Kubernetes client, the first two Pods in the PodList are always pod1 and pod2, so the test does not fail.
    • I updated workersToDelete to pod3 and pod4 in #1252 ((DO NOT MERGE) #1251 Control Group). Both pod3 and pod4 are deleted, and pod1 and pod2 are also deleted by the random Pod deletion because they are the first two Pods in the PodList (GitHub Actions). A toy simulation of this ordering effect follows this list.
      (Screenshot of the failing GitHub Actions run, Jul 18, 2023.)
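
For illustration only, here is a minimal, self-contained sketch (not the code merged in this PR) of the idea behind the fix for root cause (1): once a Pod named in workersToDelete has been deleted, it must also be dropped from runningPods so that the subsequent diff calculation does not trigger a second, random deletion. The helper name removeDeletedWorkers and the main function are hypothetical.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // removeDeletedWorkers is a hypothetical helper: it returns runningPods without any Pod
    // whose name appears in workersToDelete, so those Pods are no longer counted when the
    // reconciler computes how many additional Pods it still needs to remove.
    func removeDeletedWorkers(runningPods []corev1.Pod, workersToDelete []string) []corev1.Pod {
        deleted := make(map[string]struct{}, len(workersToDelete))
        for _, name := range workersToDelete {
            deleted[name] = struct{}{}
        }
        remaining := make([]corev1.Pod, 0, len(runningPods))
        for _, pod := range runningPods {
            if _, ok := deleted[pod.Name]; ok {
                continue // already deleted via WorkersToDelete; exclude it from further diff calculations
            }
            remaining = append(remaining, pod)
        }
        return remaining
    }

    func main() {
        running := []corev1.Pod{
            {ObjectMeta: metav1.ObjectMeta{Name: "pod1"}},
            {ObjectMeta: metav1.ObjectMeta{Name: "pod2"}},
            {ObjectMeta: metav1.ObjectMeta{Name: "pod3"}},
        }
        remaining := removeDeletedWorkers(running, []string{"pod2"})
        fmt.Println(len(remaining)) // 2: pod2 no longer counts toward the remaining scale-down work
    }

For root cause (2), note that a corev1.Pod{} built only from a name and namespace has an empty Status.Phase, so any phase-based check on it (such as isPodRunningOrPendingAndNotDeleting) can never return true; the check has to run against the Pod as it exists in runningPods (or a Pod fetched from the API server), not against the locally constructed stub.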
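
The following is a toy simulation (plain Go, not KubeRay code) of why the fake client's deterministic Pod ordering hid the double deletion: without the fix, the reconciler deletes the Pods in workersToDelete and then "randomly" deletes the same number of Pods again, starting from the front of the PodList. When workersToDelete is {pod1, pod2}, both deletion paths hit the same Pods and the test passes; with {pod3, pod4}, the collateral deletion of pod1 and pod2 becomes visible.

    package main

    import "fmt"

    func main() {
        // The fake Kubernetes client returns Pods in a deterministic order.
        podList := []string{"pod1", "pod2", "pod3", "pod4", "pod5"}

        // Count how many times each Pod is deleted: once via workersToDelete,
        // and once more for the first len(workersToDelete) Pods in the PodList.
        simulate := func(workersToDelete []string) map[string]int {
            deletions := map[string]int{}
            for _, name := range workersToDelete {
                deletions[name]++
            }
            for i := 0; i < len(workersToDelete); i++ {
                deletions[podList[i]]++
            }
            return deletions
        }

        fmt.Println(simulate([]string{"pod1", "pod2"})) // only pod1 and pod2 affected: the test cannot tell
        fmt.Println(simulate([]string{"pod3", "pod4"})) // pod1 and pod2 deleted as collateral: the bug is visible
    }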

Related issue number

Closes #1192

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Reproduce

kind create cluster --image=kindest/node:v1.23.0
helm install kuberay-operator kuberay/kuberay-operator --version 0.6.0-rc.0 

# Install an autoscaling-enabled RayCluster with 1 head and 1 worker
# (path: ray-operator/config/samples)
kubectl apply -f ray-cluster.autoscaler.yaml

# Scale up the cluster to 4 CPUs => 1 head + 3 workers 
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec $HEAD_POD -it -c ray-head -- python -c "import ray;ray.init();ray.autoscaler.sdk.request_resources(num_cpus=4)"

# Scale down the cluster to 3 CPUs => expected result: 1 head + 2 workers
# There are two possibilities you may observe:
# Case 1: Observe the double deletion => KubeRay randomly kills a Pod that is not included in `workersToDelete`.
# Case 2: Only one Pod is deleted => KubeRay randomly kills a Pod that is also included in `workersToDelete`.
kubectl exec $HEAD_POD -it -c ray-head -- python -c "import ray;ray.init();ray.autoscaler.sdk.request_resources(num_cpus=3)"

# If you observe Case 2, you can scale up to 4 CPUs and then scale down to 3 CPUs again.

@kevin85421 marked this pull request as ready for review on July 19, 2023, 07:05
@kevin85421 changed the title from "[Bug][Autoscaler] Randomly delete Pods when scaling down the cluster" to "[release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster" on Jul 19, 2023
@kevin85421 merged commit f880b72 into ray-project:master on Jul 19, 2023
18 checks passed
kevin85421 added a commit to kevin85421/kuberay that referenced this pull request on Jul 20, 2023: "…the cluster (ray-project#1251)" (Randomly delete Pods when scaling down the cluster)
kevin85421 added a commit to kevin85421/kuberay that referenced this pull request on Jul 20, 2023: "…the cluster (ray-project#1251)" (Randomly delete Pods when scaling down the cluster)
kevin85421 added a commit that referenced this pull request on Jul 20, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request on Sep 24, 2023: "…the cluster (ray-project#1251)" (Randomly delete Pods when scaling down the cluster)