
[Feature] Clean up init container configuration and startup sequence. #476

Closed
2 tasks done
DmitriGekhtman opened this issue Aug 15, 2022 · 9 comments · Fixed by #973
Labels: enhancement (New feature or request), P1 (Issue that should be fixed within a few weeks), stability (Pertains to basic infrastructure stability)

Comments

@DmitriGekhtman
Collaborator

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

At the moment, we instruct users to include an init container in each worker group spec.
The purpose of the init container is to wait for the service exposing the head GCS server to be created before the worker attempts ray start.

There are two issues with the current setup:

  • Having to include the init container makes the minimal configuration for a RayCluster messier. If an init container is necessary, it would be better to have the KubeRay operator create it by default.
  • The current logic is not quite correct, for the following reason:
    After the initContainer determines that the head service has been created, the Ray worker container immediately runs ray start,
    whether or not the GCS is actually ready. ray start has internal retry logic that eventually gives up if the head is not up
    quickly enough -- the worker container will then crash-loop. (This is not that bad in practice, given the typical time scales for
    provisioning Ray pods and given ray start's internal timeout.)

The tasks are to simplify configuration and correct the logic.

Two ways to correct the logic:

  1. Implement an initContainer that waits for the GCS to be ready.
  2. Drop the initContainer and just have the Ray container's entry-point wait as long as necessary.

The advantage of option 2 is that it's simpler.

The advantage of option 1 is that it's arguably more idiomatic and gives more feedback to a user who is examining worker pod status with kubectl get pod -- the user can distinguish the worker container's initializing state from its Running state.
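
For concreteness, option 2 could look roughly like the worker container command sketched below. The head service env var ($RAY_IP), the GCS port (6379), the image tag, and the five-second backoff are placeholder assumptions, and the readiness probe shown (ray health-check, discussed later in this thread) is just one possibility.

    containers:
      - name: ray-worker
        image: rayproject/ray:2.0.0
        # Sketch: block in the entrypoint until the GCS responds, then run ray start.
        command: ['sh', '-c', "until ray health-check --address $RAY_IP:6379 > /dev/null 2>&1; do echo waiting for GCS at $RAY_IP:6379; sleep 5; done && ray start --address=$RAY_IP:6379 --block"]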

If we stick with an initContainer (option 1 above), we can either

  1. Have the operator configure it automatically OR
  2. Leave it alone, leave it to Helm to hide that configuration, and invest in Helm as the preferred means of deploying.

Use case

Interface cleanup.

Related issues

This falls under the generic category of "interface cleanup", for which we have this issue:
#368

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
DmitriGekhtman added the enhancement (New feature or request) label on Aug 15, 2022
DmitriGekhtman added this to the v0.4.0 release milestone on Aug 15, 2022
@akanso
Collaborator

akanso commented Aug 15, 2022

I think shielding the user from the init-container logic has its advantages.

We can have the Operator add the init-container in the pod. (We need to make sure here that we append it to the list of init containers that the user may have already defined for other purposes).

Since Helm is not the only way to deploy, I don't think we should add the init container there; some users of KubeRay would miss this logic completely if they do not use Helm.

One question here: how does the init container "wait for the GCS to be ready"?

@DmitriGekhtman
Collaborator Author

DmitriGekhtman commented Aug 15, 2022

All Ray versions since ~1.4.0 have a CLI command, ray health-check, that can health-check the GCS.

One idea is to use the Ray image for the init container and loop on
ray health-check --address <head-service>:<port> with a five-second backoff, tolerating any error.

There's no overhead from pulling the Ray image, since you need it anyway to run Ray.
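
As a rough sketch of that idea (the container name, image tag, port, and $RAY_IP env var below are placeholders, not a final design), the operator-injected init container might look something like this:

    initContainers:
      # Sketch: reuse the Ray image and poll the GCS until it responds.
      - name: wait-gcs-ready
        image: rayproject/ray:2.0.0   # same image as the Ray worker container
        command: ['sh', '-c', "until ray health-check --address $RAY_IP:6379 > /dev/null 2>&1; do echo waiting for GCS at $RAY_IP:6379; sleep 5; done"]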

DmitriGekhtman added the P2 (Important issue, but not time critical) and P1 (Issue that should be fixed within a few weeks) labels and removed the P2 label on Nov 3, 2022
@DmitriGekhtman
Collaborator Author

It's not urgent to fix in the next release -- it works well enough to copy-paste the extra configuration.

@DmitriGekhtman
Collaborator Author

Some recent discussion:
https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669647595429959
https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669991115558659

Just a basic ping to the GCS server should do the trick.

DmitriGekhtman added the stability (Pertains to basic infrastructure stability) label on Dec 2, 2022
@kevin85421
Member

Hi @DmitriGekhtman,

Would you mind explaining some details about this issue? I have read this issue and the related Slack discussions. In my understanding, the current solution is to

  • Make the KubeRay operator configure the initContainer for users.
    • Use ray health-check to wait until GCS is ready.
  • Remove initContainer from all sample YAML files. For example,
    initContainers:
      # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
      - name: init
        image: busybox:1.28
        command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for K8s Service $RAY_IP; sleep 2; done"]

Is it correct? Thank you!

kevin85421 self-assigned this on Jan 1, 2023
@DmitriGekhtman
Collaborator Author

I'd recommend

First, remove initContainers from all sample configs. They accomplish nothing for recent Ray versions, but might have been necessary for successful startup with very old Ray versions for which we do not guarantee compatibility.

Next, investigate whether we need to change anything in the worker startup sequence.
One way or another, Ray workers need to wait for the head to start. Currently this is accomplished by retrying the GCS connection in the ray start Python code. If there's a timeout, the container crash-loops.

@DmitriGekhtman
Collaborator Author

Next, investigate whether we need to change anything in the worker startup sequence.
One way or another, Ray workers need to wait for the head to start. Currently this is accomplished by retrying the GCS connection in the ray start Python code. If there's a timeout, the container crash-loops.

One option is to modify the Ray code to make the number of retries adjustable via an env variable,
then have the operator set that env variable. This would prevent crash-loops for new Ray versions.

I think the retry logic is here.
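
If that approach were taken, the operator could inject something like the snippet below into the worker container spec. This is purely illustrative: RAY_GCS_CONNECT_RETRY_COUNT is a hypothetical variable name (the corresponding knob would first have to be added to Ray's retry logic), and the value is arbitrary.

    containers:
      - name: ray-worker
        image: rayproject/ray:2.0.0
        env:
          # Hypothetical knob (does not exist in Ray today): raise the GCS connection
          # retry budget so the worker waits for the head instead of crash-looping.
          - name: RAY_GCS_CONNECT_RETRY_COUNT
            value: "240"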

@kevin85421
Member

First, remove initContainers from all sample configs. They accomplish nothing for recent Ray versions, but might have been necessary for successful startup with very old Ray versions for which we do not guarantee compatibility.

What's the difference between "recent Ray versions" and "very old Ray versions" mentioned above? Thank you!

@DmitriGekhtman
Collaborator Author

Uh, good question. I actually don't know - @akanso mentioned that the init containers were necessary to get things to work for older Ray versions, roughly two years ago.
