[K8S] Make Ray head service headless #33030
Conversation
Signed-off-by: Yiqing Wang <[email protected]>
@architkulkarni @jjyao Please feel free to add more context here. Thanks! I think we may also need to check how KubeRay handles the networking.
@@ -198,7 +199,7 @@ spec:
        imagePullPolicy: Always
        command: ["/bin/bash", "-c", "--"]
        args:
-         - "ray start --num-cpus=$MY_CPU_REQUEST --address=$SERVICE_RAY_CLUSTER_SERVICE_HOST:$SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER --object-manager-port=8076 --node-manager-port=8077 --dashboard-agent-grpc-port=8078 --dashboard-agent-listen-port=52365 --block"
+         - "ray start --num-cpus=$MY_CPU_REQUEST --address=service-ray-cluster:6380 --object-manager-port=8076 --node-manager-port=8077 --dashboard-agent-grpc-port=8078 --dashboard-agent-listen-port=52365 --block"
Why do we need to change this line?
If the service is headless, its IP cannot be exposed through the environment variables.
https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables
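A minimal way to see this from inside a Pod in the same namespace (a sketch only; the Service name service-ray-cluster comes from this config, and Python being available in the image is an assumption):
# Environment variables are only injected for Services that have a cluster IP.
# With a non-headless Service named "service-ray-cluster" you should see
# SERVICE_RAY_CLUSTER_SERVICE_HOST and SERVICE_RAY_CLUSTER_SERVICE_PORT_* in the Pod;
# with a headless Service they are absent and only the DNS name works.
env | grep SERVICE_RAY_CLUSTER || echo "no Service env vars (expected for a headless Service)"
# The DNS name still resolves (directly to the head Pod IP when the Service is headless):
python -c "import socket; print(socket.gethostbyname('service-ray-cluster'))"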
Thanks for adding the context!
I cannot understand this part. The following is my understanding of this paragraph after discussing it with @architkulkarni. Could you please confirm whether my understanding is correct?
Q: Does this mean that the headless Ray head service allows the Ray head Pod to access the Ray head service? Would you mind adding some details about how to reproduce it? I hope to check whether KubeRay has the same issue or not. Thanks!
@kevin85421 The worker uses whatever is passed in the ray start CLI as the GCS address, which in this case is the service IP. The head node initially uses its external Pod IP (10.xxx) to set up the GCS, but it also uses the same service IP as the workers in the GcsAioClient to connect to the GCS. It cannot reach the service IP (the GCS address) from its own Pod unless we make the head service headless. I can reproduce it by kubectl applying the config yaml in the original codebase. Note that it is not guaranteed to happen every time, because the job agent is placed randomly on either the head or a worker; the issue only occurs when it is on a worker.
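For completeness, a rough repro sketch under these assumptions (the file path is the one referenced later in this thread, the dashboard is assumed to be on port 8265, and the failure is nondeterministic because the job agent may land on either the head or a worker):
# Deploy the static Ray cluster from the original config
kubectl apply -f doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml
# Forward the dashboard port and submit a trivial job
kubectl port-forward service/service-ray-cluster 8265:8265 &
ray job submit --address http://localhost:8265 -- python -c "print('ok')"
# When the chosen job agent runs on a worker, the job stays PENDING and eventually fails.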
I had an offline chat with @YQ-Wang. The following is my current understanding:

Root cause

This PR

My question

Thanks @YQ-Wang for the explanation! I tried to verify the limitation "However, there is a limitation for Kubernetes. A Pod cannot communicate with itself via its non-headless Kubernetes service." Nevertheless, the result is weird.

# Create a Kubernetes cluster
kind create cluster --image=kindest/node:v1.23.0
# Install a KubeRay operator and a RayCluster custom resource.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 0.4.0
helm install raycluster kuberay/ray-cluster --version 0.4.0
# Make sure the head service is not a headless service
kubectl get svc raycluster-kuberay-head-svc -o jsonpath='{.spec.clusterIP}'
# 10.96.39.228
# Log in to the head Pod
kubectl exec -it ${HEAD_POD} -- bash
# Install curl in the head Pod
sudo apt-get update; sudo apt-get install -y curl
# Try to communicate with the head Pod itself via head service in the head Pod
curl -I raycluster-kuberay-head-svc:8265
# HTTP/1.1 200 OK
# Content-Type: text/html
# Etag: "170de3004cf1c800-bdc"
# Last-Modified: Tue, 23 Aug 2022 05:43:48 GMT
# Content-Length: 3036
# Accept-Ranges: bytes
# Date: Tue, 07 Mar 2023 00:44:36 GMT
# Server: Python/3.7 aiohttp/3.8.1
ray health-check --address raycluster-kuberay-head-svc:6379
echo $?
# 0 => GCS is healthy

Does the restriction only happen in specific Kubernetes versions? Thanks!
@kevin85421 It seems you are using KubeRay; I am not sure if there are any CRDs that might impact the default network settings. Please check out this article, which can help you understand regular Kubernetes networking.
I tried to deploy the Ray cluster on both Kubernetes v1.19.11 and v1.23.0 with this doc, but still got the same result. I am not sure whether this only happens in specific Kubernetes versions or distributions (e.g. OpenShift) or not.

kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml
kubectl get svc service-ray-cluster -o jsonpath='{.spec.clusterIP}'
# 10.96.89.173 => this is not a headless service
@kevin85421 I did some research online, and it is the CNI plugin disabling hairpin traffic. Source:
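A rough way to inspect this on a node, assuming a Linux-bridge-based CNI whose bridge is named cni0 (the bridge name varies by plugin, so treat this as a sketch rather than a definitive check):
# Run on the Kubernetes node, not inside a Pod: list hairpin mode for every port on the bridge.
# 1 means hairpin is enabled (a Pod can reach itself back through its own veth via the Service),
# 0 means it is disabled.
for port in /sys/class/net/cni0/brif/*; do
  echo "$(basename "$port"): $(cat "$port/hairpin_mode")"
done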
LGTM. This PR is ready to merge.
I failed to disable the hairpin mode. I opened an issue, kubernetes-sigs/kind#3118, to track the progress. KubeRay may need to make its service headless too.
Linkcheck failure is unrelated: (tune/examples/tune-xgboost: line 30095) broken https://www.sciencedirect.com/science/article/abs/pii/S0167947301000652 - 403 Client Error: Forbidden for url: https://www.sciencedirect.com/science/article/abs/pii/S0167947301000652. Merging.
Why are these changes needed?
We have recently discovered that some submitted jobs may remain pending for up to two minutes before failing, and once the first job hits this issue, subsequent jobs always have the same problem. The issue arises from the random placement of the job agent, which can be located on either the head or a worker. Our investigation revealed that when the job agent is on a worker, the job is left pending and eventually fails.
This is because the worker uses the head service IP as the GCS address. Although the head initially uses its external Pod IP as the GCS address, it fails to connect to the GCS through the GcsAioClient because it cannot reach the service IP (the GCS address) from its own Pod.
In addition, the choose_agent code that selects the job agent caches the job agent reference, which is why, if the first job fails, subsequent jobs always have the same problem.
To address this, this PR makes the Ray head service headless, allowing the Ray head to reach the service address from its own Pod.
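As a quick sanity check after this change (a sketch that mirrors the commands used earlier in this thread; <head-pod> is a placeholder for whatever kubectl get pods reports for the head):
# A headless Service reports "None" as its cluster IP:
kubectl get svc service-ray-cluster -o jsonpath='{.spec.clusterIP}'
# Expected after this PR: None
# The Service DNS name now resolves directly to the head Pod IP, so the head can reach
# the GCS address through the Service name from inside its own Pod:
kubectl exec -it <head-pod> -- python -c "import socket; print(socket.gethostbyname('service-ray-cluster'))"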
Related issue number
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.