
[K8S] Make Ray head service headless #33030

Merged
merged 1 commit into ray-project:master on Mar 8, 2023

Conversation

@YQ-Wang (Contributor) commented Mar 4, 2023

Why are these changes needed?

We have recently discovered that some submitted jobs may remain pending for up to two minutes before failing, and once the first job hits this issue, subsequent jobs always have the same problem. The issue arises from the random allocation of the job agent, which can be located on either the head or a worker. Our investigation revealed that when the job agent is on a worker, the job is left pending and eventually fails.

This is because the worker attempts to use the head service IP as the GCS address. Although the head initially uses its external (pod) IP as the GCS address, it fails to connect to the GCS when using the GcsAioClient, because it cannot reach the service IP (GCS address) from its own pod.

In addition, the choose_agent code that selects the job_agent caches the job_agent reference, which is why, once the first job fails, subsequent jobs hit the same problem.

To address this problem, this PR makes the Ray head service headless, allowing the Ray head to reach the service address (the GCS address) from its own pod.
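
For context: a Kubernetes Service is headless when its spec sets clusterIP: None, in which case DNS resolves the service name directly to the backing Pod IPs rather than to a virtual ClusterIP. The sketch below shows what such a head service could look like; it reuses the service-ray-cluster name and GCS port 6380 from the config touched by this PR, but the selector label and exact port list are illustrative assumptions, not copied from the actual YAML.

```sh
# Sketch only: an assumed headless version of the head service.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: service-ray-cluster
spec:
  clusterIP: None          # headless: DNS returns the head Pod IP directly
  selector:
    app: ray-cluster-head  # assumed label on the head Pod
  ports:
    - name: gcs-server
      port: 6380
    - name: dashboard
      port: 8265
EOF

# A headless Service reports "None" as its clusterIP:
kubectl get svc service-ray-cluster -o jsonpath='{.spec.clusterIP}'
# None
```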

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Tested locally and in a cluster
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Yiqing Wang <[email protected]>
@YQ-Wang (Contributor, Author) commented Mar 4, 2023

@architkulkarni @jjyao Please feel free to add more context here. Thanks!

I think we may need to check KubeRay as well to see how it handles the networking.

@@ -198,7 +199,7 @@ spec:
          imagePullPolicy: Always
          command: ["/bin/bash", "-c", "--"]
          args:
-           - "ray start --num-cpus=$MY_CPU_REQUEST --address=$SERVICE_RAY_CLUSTER_SERVICE_HOST:$SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER --object-manager-port=8076 --node-manager-port=8077 --dashboard-agent-grpc-port=8078 --dashboard-agent-listen-port=52365 --block"
+           - "ray start --num-cpus=$MY_CPU_REQUEST --address=service-ray-cluster:6380 --object-manager-port=8076 --node-manager-port=8077 --dashboard-agent-grpc-port=8078 --dashboard-agent-listen-port=52365 --block"
A Collaborator commented on the changed line:

why do we need to change this line?

@architkulkarni (Contributor) left a comment:

Thanks for adding the context!

@kevin85421 (Member) commented:

> Our investigation revealed that when the job agent is on a worker, the job is left pending and eventually fails.
> This is because the worker attempts to use the head service IP as the GCS address. Although the head initially uses its external (pod) IP as the GCS address, it fails to connect to the GCS when using the GcsAioClient, because it cannot reach the service IP (GCS address) from its own pod.

I cannot understand this part. The following is my understanding of this paragraph after discussing it with @architkulkarni. Can you please confirm whether my understanding is correct?

  1. Ray head Pod and worker Pods use different addresses to connect to the same GCS server.

    • GCS address used in Ray worker Pods: "external address" => What's the external address? Is it the name of the head ClusterIP service?
  2. Ray head cannot connect to the GCS server via "external address".

> To address this problem, this PR makes the Ray head service headless, allowing the Ray head to reach the service address (the GCS address) from its own pod.

Q: Does this mean that the headless Ray head service allows the Ray head Pod to access the Ray head service?

Would you mind adding some details about how to reproduce it? I hope to check whether KubeRay has the same issue or not. Thanks!

@YQ-Wang (Contributor, Author) commented Mar 6, 2023

@kevin85421 The worker uses whatever is passed to the ray start CLI as the GCS address; in this case, it is the service IP. The head node initially uses its external pod IP (10.x.x.x) to set up the GCS, but it also uses the same service IP that the workers use in the GcsAioClient to connect to the GCS. It cannot reach the service IP (GCS address) from its own pod unless we make the head service headless.

I can reproduce it by kubectl apply-ing the config YAML in the original codebase (a rough sketch follows). Note that it is not guaranteed to happen every time, because the job agent is placed randomly on either the head or a worker; the issue only occurs when it lands on a worker.
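
A rough sketch of that reproduction, assuming the pre-PR static config referenced later in this thread (the port-forward and the trivial job script are illustrative; the failure only shows up when the job agent happens to land on a worker):

```sh
# Apply the static Ray cluster config (pre-PR, i.e. with a non-headless head service).
kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml

# Assuming the head service exposes the dashboard on 8265, forward it locally.
kubectl port-forward service/service-ray-cluster 8265:8265 &

# Submit a trivial job a few times; the agent is picked randomly, and when it
# lands on a worker the job stays PENDING for up to ~2 minutes and then fails.
ray job submit --address http://127.0.0.1:8265 -- python -c "import ray; ray.init(); print('ok')"
```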

@kevin85421 (Member) commented:

I had an offline chat with @YQ-Wang. The following is my current understanding:

Root cause

  1. When the Ray head Pod starts, it will use the external IP (i.e. the Pod IP) to connect to the GCS server. => Succeeds

    • external IP is equivalent to Pod IP (10.x.x.x). You can use kubectl get pods -o wide to check it.

      Example
      ```sh
      ➜  ~ kubectl get pods -o wide
      NAME                                          READY   STATUS    RESTARTS   AGE   IP            NODE                 NOMINATED NODE   READINESS GATES
      kuberay-operator-7fb4677468-gd7nx             1/1     Running   0          78m   10.244.0.5    kind-control-plane   <none>           <none>
      raycluster-kuberay-head-qkdfl                 1/1     Running   0          13m   10.244.0.10   kind-control-plane   <none>           <none>
      raycluster-kuberay-worker-workergroup-6bc7z   1/1     Running   0          13m   10.244.0.11   kind-control-plane   <none>           <none>
      ```
      
  2. Ray worker Pods use the address specified by ray start --address ${ADDR} ... to connect to the GCS server. Without this PR, worker Pods use the head's Kubernetes ClusterIP service. To elaborate, they use the two environment variables set by Kubernetes to get the ClusterIP address, i.e.

    • $SERVICE_RAY_CLUSTER_SERVICE_HOST:$SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER
  3. When a user runs ray job submit to submit a job to the head Pod, the job is allocated to a job agent on either the head Pod or a worker Pod. If the job is allocated to an agent on a worker Pod, the job agent communicates with the head Pod and asks the head Pod to connect to the GCS server via the worker node's GCS server address (i.e. $SERVICE_RAY_CLUSTER_SERVICE_HOST:$SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER).

  4. However, there is a limitation in Kubernetes: a Pod cannot communicate with itself via its non-headless Kubernetes service. Hence, the head Pod cannot use its ClusterIP service (i.e. $SERVICE_RAY_CLUSTER_SERVICE_HOST:$SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER) to communicate with itself. That is why the job fails.
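
In an environment where this limitation applies, the symptom in point 4 would look roughly like the following from inside the head Pod. This is a sketch, not output captured from a real cluster: ray health-check and the 6380 GCS port come from the config in this PR, and the pod IP is a placeholder.

```sh
# From inside the head Pod: the ClusterIP service address cannot be reached ...
ray health-check --address $SERVICE_RAY_CLUSTER_SERVICE_HOST:$SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER; echo $?
# non-zero => GCS unreachable through the head's own ClusterIP service

# ... while the Pod's own IP works fine.
ray health-check --address <head-pod-ip>:6380; echo $?
# 0 => GCS healthy via the Pod IP
```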

This PR

  1. Kubernetes allows a Pod to communicate with itself via its headless Kubernetes service.

  2. However, when a Kubernetes service is headless, Kubernetes will not set these two environment variables ($SERVICE_RAY_CLUSTER_SERVICE_HOST, $SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER) automatically.
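
Both points can be checked quickly; a sketch, with <head-pod> as a placeholder and assuming the Ray image ships Python:

```sh
# 1. With a headless head service, the SERVICE_RAY_CLUSTER_* variables are no longer injected.
kubectl exec <head-pod> -- env | grep SERVICE_RAY_CLUSTER
# (no output)

# 2. The service name still resolves, but now directly to the head Pod IP,
#    which is why the head Pod can reach itself through the service name.
kubectl exec <head-pod> -- python -c 'import socket; print(socket.gethostbyname("service-ray-cluster"))'
# 10.x.x.x  (the head Pod IP)
```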

My question

Thanks @YQ-Wang for the explanation! I tried to verify the limitation "a Pod cannot communicate with itself via its non-headless Kubernetes service". Nevertheless, the result is unexpected.

# Create a Kubernetes cluster
kind create cluster --image=kindest/node:v1.23.0

# Install a KubeRay operator and a RayCluster custom resource.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 0.4.0
helm install raycluster kuberay/ray-cluster --version 0.4.0

# Make sure the head service is not a headless service
kubectl get svc raycluster-kuberay-head-svc -o jsonpath='{.spec.clusterIP}'
# 10.96.39.228

# Log in to the head Pod
kubectl exec -it ${HEAD_POD} -- bash

# Install curl in the head Pod
sudo apt-get update; sudo apt-get install -y curl

# Try to communicate with the head Pod itself via head service in the head Pod
curl -I raycluster-kuberay-head-svc:8265

# HTTP/1.1 200 OK
# Content-Type: text/html
# Etag: "170de3004cf1c800-bdc"
# Last-Modified: Tue, 23 Aug 2022 05:43:48 GMT
# Content-Length: 3036
# Accept-Ranges: bytes
# Date: Tue, 07 Mar 2023 00:44:36 GMT
# Server: Python/3.7 aiohttp/3.8.1

ray health-check --address raycluster-kuberay-head-svc:6379
echo $?
# 0 => GCS is healthy

Does the restriction only happen in specific Kubernetes versions? Thanks!

@YQ-Wang (Contributor, Author) commented Mar 7, 2023

@kevin85421 It seems you are using KubeRay; I am not sure if there are any CRDs that might impact the default network settings. Please check out this article, which can help you understand regular K8s networking.

@kevin85421 (Member) commented:

I tried to deploy a Ray cluster on both Kubernetes v1.19.11 and v1.23.0 with this doc, but still got the same result. I am not sure whether this only happens in specific Kubernetes versions or distributions (e.g. OpenShift).

kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml

kubectl get svc service-ray-cluster -o jsonpath='{.spec.clusterIP}'
# 10.96.89.173 => this is not a headless service

[Screenshots attached: Screen Shot 2023-03-06 at 5 20 12 PM, Screen Shot 2023-03-06 at 5 35 56 PM]

@YQ-Wang (Contributor, Author) commented Mar 7, 2023

@kevin85421 I did some research online, and it looks like the CNI plugin is disabling hairpin traffic.
Could you try the example here?

Source:
kubernetes/kubernetes#45790
containernetworking/cni#476
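
For reference, hairpin mode can be inspected per bridge port on the Kubernetes node via sysfs; a sketch (the path assumes a bridge-based CNI, and the port names depend on the plugin):

```sh
# Run on the node itself, not inside a pod: print hairpin_mode for every bridge port.
# 1 means hairpin traffic is allowed (a pod can reach itself through a Service VIP); 0 means it is not.
for port in /sys/class/net/*/brport/hairpin_mode; do
  echo "$port: $(cat "$port")"
done
```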

@kevin85421 (Member) left a comment:

LGTM. This PR is ready to merge.

I failed to disable hairpin mode. I opened kubernetes-sigs/kind#3118 to track the progress. KubeRay may need to make its service headless too.

@architkulkarni (Contributor) commented:

Linkcheck failure unrelated (tune/examples/tune-xgboost: line 30095) broken https://www.sciencedirect.com/science/article/abs/pii/S0167947301000652 - 403 Client Error: Forbidden for url: https://www.sciencedirect.com/science/article/abs/pii/S0167947301000652

Merging.

@architkulkarni architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Mar 8, 2023
@architkulkarni architkulkarni merged commit 541c5f5 into ray-project:master Mar 8, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023