
[Feature] Clean up init container configuration and startup sequence. #476

Closed
2 tasks done
DmitriGekhtman opened this issue Aug 15, 2022 · 9 comments · Fixed by #973
Labels: enhancement (New feature or request), P1 (Issue that should be fixed within a few weeks), stability (Pertains to basic infrastructure stability)

Comments

@DmitriGekhtman
Collaborator

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

At the moment, we instruct users to include an init container in each worker group spec.
The purpose of the init container is to wait for the service exposing the head GCS server to be created before the worker attempts ray start.

There are two issues with the current setup:

  • Having to include the init container makes the minimal configuration for a RayCluster messier. If an init container is necessary, it would be better to have the KubeRay operator create it by default.
  • The current logic is not quite correct, for the following reason:
    After the initContainer determines that the head service has been created, the Ray worker container immediately runs ray start,
    whether or not the GCS is actually ready. ray start has internal retry logic that eventually gives up if the head is not up
    quickly enough -- the worker container will then crash-loop. (This is not that bad in practice, given the typical time scales for
    provisioning Ray pods and given ray start's internal timeout.)

The tasks are to simplify configuration and correct the logic.

Two ways to correct the logic:

  1. Implement an initContainer that waits for the GCS to be ready.
  2. Drop the initContainer and just have the Ray container's entry-point wait as long as necessary.

The advantage of option 2 is that it's simpler.

The advantage of option 1 is that it's arguably more idiomatic and gives more feedback to a user who is examining worker pod status with kubectl get pod -- the user can distinguish the worker container's initializing state from its Running state.
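
For concreteness, option 2 could look roughly like the worker container command sketched below. The head service env var ($RAY_IP), the GCS port (6379), the image tag, and the five-second backoff are placeholder assumptions, and the readiness probe shown (ray health-check, discussed later in this thread) is just one possibility.

    containers:
      - name: ray-worker
        image: rayproject/ray:2.0.0
        # Sketch: block in the entrypoint until the GCS responds, then run ray start.
        command: ['sh', '-c', "until ray health-check --address $RAY_IP:6379 > /dev/null 2>&1; do echo waiting for GCS at $RAY_IP:6379; sleep 5; done && ray start --address=$RAY_IP:6379 --block"]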

If we stick with an initContainer (option 1 above), we can either

  1. Have the operator configure it automatically OR
  2. Leave it alone, leave it to Helm to hide that configuration, and invest in Helm as the preferred means of deploying.

Use case

Interface cleanup.

Related issues

This falls under the generic category of "interface cleanup", for which we have this issue:
#368

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
DmitriGekhtman added the enhancement (New feature or request) label on Aug 15, 2022
DmitriGekhtman added this to the v0.4.0 release milestone on Aug 15, 2022
@akanso
Collaborator

akanso commented Aug 15, 2022

I think shielding the user from the init-container logic has its advantages.

We can have the Operator add the init-container in the pod. (We need to make sure here that we append it to the list of init containers that the user may have already defined for other purposes).

Since Helm is not the only way to deploy, I don't think we should add the init container there; some users of KubeRay would miss this logic completely if they do not use Helm.

One question here: how does the init container "wait for the GCS to be ready"?

@DmitriGekhtman
Collaborator Author

DmitriGekhtman commented Aug 15, 2022

All Ray versions since ~1.4.0 have a CLI command, ray health-check, that can health-check the GCS.

One idea is to use the Ray image for the init container and loop on
ray health-check --address <head-service>:<port> with a five-second backoff, tolerating any error.

There's no overhead from pulling the Ray image, since you need it anyway to run Ray.
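
As a rough sketch of that idea (the container name, image tag, port, and $RAY_IP env var below are placeholders, not a final design), the operator-injected init container might look something like this:

    initContainers:
      # Sketch: reuse the Ray image and poll the GCS until it responds.
      - name: wait-gcs-ready
        image: rayproject/ray:2.0.0   # same image as the Ray worker container
        command: ['sh', '-c', "until ray health-check --address $RAY_IP:6379 > /dev/null 2>&1; do echo waiting for GCS at $RAY_IP:6379; sleep 5; done"]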

DmitriGekhtman added the P2 (Important issue, but not time critical) and P1 (Issue that should be fixed within a few weeks) labels and removed the P2 label on Nov 3, 2022
@DmitriGekhtman
Collaborator Author

It's not urgent to fix in the next release -- it works well enough to copy-paste the extra configuration.

@DmitriGekhtman
Collaborator Author

Some recent discussion:
https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669647595429959
https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669991115558659

Just a basic ping to the GCS server should do the trick.

DmitriGekhtman added the stability (Pertains to basic infrastructure stability) label on Dec 2, 2022
@kevin85421
Member

Hi @DmitriGekhtman,

Would you mind explaining some details about this issue? I have read this issue and the related Slack discussions. In my understanding, the current solution is to

  • Make the KubeRay operator configure the initContainer for users.
    • Use ray health-check to wait until GCS is ready.
  • Remove initContainer from all sample YAML files. For example,
    initContainers:
      # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
      - name: init
        image: busybox:1.28
        command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for K8s Service $RAY_IP; sleep 2; done"]

Is it correct? Thank you!

kevin85421 self-assigned this on Jan 1, 2023
@DmitriGekhtman
Collaborator Author

I'd recommend

First, remove initContainers from all sample configs. They accomplish nothing for recent Ray versions, but might have been necessary for successful startup with very old Ray versions for which we do not guarantee compatibility.

Next, investigate whether we need to change anything in the worker startup sequence.
One way or another, Ray workers need to wait for the head to start. Currently this is accomplished by retrying the GCS connection in the ray start Python code. If there's a timeout, the container crash-loops.

@DmitriGekhtman
Collaborator Author

Next, investigate whether we need to change anything in the worker startup sequence.
One way or another, Ray workers need to wait for the head to start. Currently this is accomplished by retrying the GCS connection in the ray start Python code. If there's a timeout, the container crash-loops.

One option is to modify the Ray code to make the number of retries adjustable via an env variable,
then have the operator set that env variable. This would prevent crash-loops for new Ray versions.

I think the retry logic is here.
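
If that approach were taken, the operator could inject something like the snippet below into the worker container spec. This is purely illustrative: RAY_GCS_CONNECT_RETRY_COUNT is a hypothetical variable name (the corresponding knob would first have to be added to Ray's retry logic), and the value is arbitrary.

    containers:
      - name: ray-worker
        image: rayproject/ray:2.0.0
        env:
          # Hypothetical knob (does not exist in Ray today): raise the GCS connection
          # retry budget so the worker waits for the head instead of crash-looping.
          - name: RAY_GCS_CONNECT_RETRY_COUNT
            value: "240"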

@kevin85421
Member

First, remove initContainers from all sample configs. They accomplish nothing for recent Ray versions, but might have been necessary for successful startup with very old Ray versions for which we do not guarantee compatibility.

What's the difference between "recent Ray versions" and "very old Ray versions" mentioned above? Thank you!

@DmitriGekhtman
Collaborator Author

Uh, good question. I actually don't know - @akanso mentioned that the init containers were necessary to get things to work for older Ray versions, roughly two years ago.
