
[autoscaler] Improve autoscaler auto-configuration, upstream recent improvements to Kuberay NodeProvider #274

Merged

Conversation

@DmitriGekhtman DmitriGekhtman commented May 24, 2022

Why are these changes needed?

Upstreams recent autoscaler changes from the Ray repo.

  • The autoscaler now runs in the Ray image, using the new entrypoint `ray kuberay-autoscaler`.

  • Redis passwords are removed from the autoscaler configuration, since they are not relevant for Ray >= 1.11.0 and autoscaling is not supported with older Ray versions.

  • The autoscaler and Ray container need to share a log volume so that Ray drivers can receive autoscaling events. This PR has the operator configure the relevant volume mounts when autoscaling is enabled. Existing code for configuring volumeMounts is generalized for this purpose.

  • Some unit test logic around volume mounts is added.

  • Documentation is updated.

    • Deployment steps for setting up autoscaling clusters now include a kustomize overlay that sets the "prioritizeWorkersToDelete" feature flag on the operator.
    • Commands involving CRD creation are updated from "apply" to "create", due to large CRD size ([Bug] Issues with RayCluster CRD and kubectl apply #271).
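The shared log volume described above can be pictured roughly as follows. This is a hand-written sketch, not the operator's actual output: the container names, image tag, and the `/tmp/ray` mount path are assumptions.

```yaml
# Both containers mount the same emptyDir, so autoscaling event logs
# written by the autoscaler sidecar are visible to Ray drivers.
spec:
  containers:
    - name: ray-head
      image: rayproject/ray:1.12.1
      volumeMounts:
        - name: ray-logs
          mountPath: /tmp/ray
    - name: autoscaler
      image: rayproject/ray:1.12.1
      command: ["ray", "kuberay-autoscaler"]
      volumeMounts:
        - name: ray-logs
          mountPath: /tmp/ray
  volumes:
    - name: ray-logs
      emptyDir: {}
```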

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests: Deployed with the updated instructions, checked that autoscaling works.

@@ -8,7 +8,7 @@ metadata:
# A unique identifier for the head node and workers of this cluster.
name: raycluster-autoscaler
spec:
- rayVersion: 'nightly'
+ rayVersion: '1.12.1'
Collaborator Author

The latest Ray release is compatible with the pinned autoscaler image.

- func addEmptyDir(container *v1.Container, pod *v1.Pod) {
-     if checkIfVolumeMounted(container, pod) {
+ func addEmptyDir(container *v1.Container, pod *v1.Pod, volumeName string, volumeMountPath string, storageMedium v1.StorageMedium) {
+     if checkIfVolumeMounted(container, pod, volumeMountPath) {
Collaborator Author

Generalized for purposes of adding ray-log volume.

container.VolumeMounts = append(container.VolumeMounts, mountedVolume)
}
}

func checkIfVolumeMounted(container *v1.Container, pod *v1.Pod) bool {
//Format an emptyDir volume.
//When the storage medium is memory, set the size limit based on container resources.
Collaborator Author

The size limit calculation is kept as it was previously.

@@ -47,7 +49,7 @@ var instance = rayiov1alpha1.RayCluster{
Containers: []v1.Container{
{
Name: "ray-head",
Image: "rayproject/autoscaler",
Collaborator Author

rayproject/autoscaler is deprecated

},
Limits: v1.ResourceList{
v1.ResourceCPU: resource.MustParse("1"),
v1.ResourceMemory: testMemoryLimit,
Collaborator Author

Added to test /dev/shm volume size limit.

@DmitriGekhtman DmitriGekhtman marked this pull request as ready for review May 24, 2022 18:16

DmitriGekhtman commented May 27, 2022

Though actually, right now this wouldn't work: there's a bug with the autoscaler container entrypoint that prevents logs from being written to the right place -- I'm working on fixing that.

That bug fix is here.


> Note: For compatibility with the Ray autoscaler, the KubeRay Operator's entrypoint
> must include the flag `--prioritize-workers-to-delete`. The kustomization overlay
Collaborator

We should make sure to have a plan to remove this flag and make it the default behavior.

@pcmoritz

Small consistency ask before we merge: Can we always have a space after // in comments to be consistent?

@DmitriGekhtman

> Small consistency ask before we merge: Can we always have a space after // in comments to be consistent?

I think I got all of the ones from this PR.
This is sad: golang/go#30540

@@ -259,14 +359,6 @@ func TestDefaultHeadPodTemplate_WithAutoscalingEnabled(t *testing.T) {
}
}

func TestBuildAutoscalerContainer(t *testing.T) {
Collaborator Author

Logic moved into TestBuildPod_with_autoscaler_enabled.

@pcmoritz

Thanks for updating the comments :)

}

// not found, use second container
// (This branch shouldn't be accessed -- the autoscaler container should be present.)
Collaborator

Should we log an error or warning here, or possibly even exit the program if this happens?

Collaborator Author

If this were a Python program, I'd add an assertion.

The most correct thing would probably be to bubble an error a couple layers up the call stack, but that's a bit too much work for something that's not logically possible.

I will add a log statement.

Collaborator Author

Or panic, maybe.

Collaborator

yeah, why don't we add a warning log statement both here and in the function above that we use the first or second container respectively :)

Collaborator Author

Panic for autoscaler container, since it's supposed to be logically impossible.
Info statements for Ray container. IMO, using the first container should be the standard (also, that's the pattern suggested by the sample configs.)

}

// This should be unreachable.
panic("Autoscaler container not found!")
@DmitriGekhtman DmitriGekhtman May 28, 2022

I think it's standard to panic in a code branch that's not supposed to be accessible.

return i
}
}
}
// not found, use first container
log.Info("Head pod container with index 0 identified as Ray container.")
@DmitriGekhtman DmitriGekhtman May 28, 2022

I think it would make sense to document the requirement that the Ray container goes first.
To me, it seems odd to identify a container by an env variable -- also, none of the kuberay sample configs I've seen do that.

@pcmoritz

SGTM!

@DmitriGekhtman DmitriGekhtman merged commit eac60b3 into ray-project:master May 28, 2022
@DmitriGekhtman DmitriGekhtman deleted the dmitri/update-autoscaler-image branch May 29, 2022 01:32
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…mprovements to Kuberay NodeProvider (ray-project#274)

Upstreams recent autoscaler changes from the Ray repo.
4 participants