
[K8S] Make Ray head service headless #33030

Merged
merged 1 commit into ray-project:master on Mar 8, 2023

Conversation

@YQ-Wang (Contributor) commented Mar 4, 2023

Why are these changes needed?

We have recently discovered that some submitted jobs may remain pending for up to two minutes before failing, and once the first job hits this issue, subsequent jobs always have the same problem. The issue arises from the random allocation of the job agent, which can be located on either the head or a worker. Our investigation revealed that when the job agent is on a worker, the job is left pending and eventually fails.

This is because the worker attempts to use the head service IP as the GCS address. Although the head initially uses its external (pod) IP as the GCS address, it fails to connect to the GCS when using the GcsAioClient, because it cannot reach the service IP (GCS address) from its own pod.

In addition, the choose_agent code that selects the job_agent caches the job_agent reference, which is why, once the first job fails, subsequent jobs hit the same problem.

To address this problem, this PR makes the Ray head service headless, allowing the Ray head to reach the service address (the GCS address) from its own pod.
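
For context: a Kubernetes Service is headless when its spec sets clusterIP: None, in which case DNS resolves the service name directly to the backing Pod IPs rather than to a virtual ClusterIP. The sketch below shows what such a head service could look like; it reuses the service-ray-cluster name and GCS port 6380 from the config touched by this PR, but the selector label and exact port list are illustrative assumptions, not copied from the actual YAML.

```sh
# Sketch only: an assumed headless version of the head service.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: service-ray-cluster
spec:
  clusterIP: None          # headless: DNS returns the head Pod IP directly
  selector:
    app: ray-cluster-head  # assumed label on the head Pod
  ports:
    - name: gcs-server
      port: 6380
    - name: dashboard
      port: 8265
EOF

# A headless Service reports "None" as its clusterIP:
kubectl get svc service-ray-cluster -o jsonpath='{.spec.clusterIP}'
# None
```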

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Tested locally and in a cluster
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Yiqing Wang <[email protected]>
@YQ-Wang (Contributor, Author) commented Mar 4, 2023

@architkulkarni @jjyao Please feel free to add more context here. Thanks!

I think we may need to check KubeRay as well to see how it handles the networking.

@@ -198,7 +199,7 @@ spec:
          imagePullPolicy: Always
          command: ["/bin/bash", "-c", "--"]
          args:
-           - "ray start --num-cpus=$MY_CPU_REQUEST --address=$SERVICE_RAY_CLUSTER_SERVICE_HOST:$SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER --object-manager-port=8076 --node-manager-port=8077 --dashboard-agent-grpc-port=8078 --dashboard-agent-listen-port=52365 --block"
+           - "ray start --num-cpus=$MY_CPU_REQUEST --address=service-ray-cluster:6380 --object-manager-port=8076 --node-manager-port=8077 --dashboard-agent-grpc-port=8078 --dashboard-agent-listen-port=52365 --block"
A Collaborator commented on the changed line:

why do we need to change this line?

@architkulkarni (Contributor) left a comment:

Thanks for adding the context!

@kevin85421 (Member) commented:

> Our investigation revealed that when the job agent is on a worker, the job is left pending and eventually fails.
> This is because the worker attempts to use the head service IP as the GCS address. Although the head initially uses its external (pod) IP as the GCS address, it fails to connect to the GCS when using the GcsAioClient, because it cannot reach the service IP (GCS address) from its own pod.

I cannot understand this part. The following is my understanding of this paragraph after discussing it with @architkulkarni. Can you please confirm whether my understanding is correct?

  1. Ray head Pod and worker Pods use different addresses to connect to the same GCS server.

    • GCS address used in Ray worker Pods: "external address" => What's the external address? Is it the name of the head ClusterIP service?
  2. Ray head cannot connect to the GCS server via "external address".

> To address this problem, this PR makes the Ray head service headless, allowing the Ray head to reach the service address (the GCS address) from its own pod.

Q: Does this mean that the headless Ray head service allows the Ray head Pod to access the Ray head service?

Would you mind adding some details about how to reproduce it? I hope to check whether KubeRay has the same issue or not. Thanks!

@YQ-Wang (Contributor, Author) commented Mar 6, 2023

@kevin85421 The worker uses whatever is passed to the ray start CLI as the GCS address; in this case, it is the service IP. The head node initially uses its external pod IP (10.x.x.x) to set up the GCS, but it also uses the same service IP that the workers use in the GcsAioClient to connect to the GCS. It cannot reach the service IP (GCS address) from its own pod unless we make the head service headless.

I can reproduce it by kubectl apply-ing the config YAML in the original codebase (a rough sketch follows). Note that it is not guaranteed to happen every time, because the job agent is placed randomly on either the head or a worker; the issue only occurs when it lands on a worker.
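
A rough sketch of that reproduction, assuming the pre-PR static config referenced later in this thread (the port-forward and the trivial job script are illustrative; the failure only shows up when the job agent happens to land on a worker):

```sh
# Apply the static Ray cluster config (pre-PR, i.e. with a non-headless head service).
kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml

# Assuming the head service exposes the dashboard on 8265, forward it locally.
kubectl port-forward service/service-ray-cluster 8265:8265 &

# Submit a trivial job a few times; the agent is picked randomly, and when it
# lands on a worker the job stays PENDING for up to ~2 minutes and then fails.
ray job submit --address http://127.0.0.1:8265 -- python -c "import ray; ray.init(); print('ok')"
```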

@kevin85421 (Member) commented:

I had an offline chat with @YQ-Wang. The following is my current understanding:

Root cause

  1. When the Ray head Pod starts, it will use the external IP (i.e. the Pod IP) to connect to the GCS server. => Succeeds

    • external IP is equivalent to Pod IP (10.x.x.x). You can use kubectl get pods -o wide to check it.

      Example
      ```sh
      ➜  ~ kubectl get pods -o wide
      NAME                                          READY   STATUS    RESTARTS   AGE   IP            NODE                 NOMINATED NODE   READINESS GATES
      kuberay-operator-7fb4677468-gd7nx             1/1     Running   0          78m   10.244.0.5    kind-control-plane   <none>           <none>
      raycluster-kuberay-head-qkdfl                 1/1     Running   0          13m   10.244.0.10   kind-control-plane   <none>           <none>
      raycluster-kuberay-worker-workergroup-6bc7z   1/1     Running   0          13m   10.244.0.11   kind-control-plane   <none>           <none>
      ```
      
  2. Ray worker Pods use the address specified by ray start --address ${ADDR} ... to connect to the GCS server. Without this PR, worker Pods use the head's Kubernetes ClusterIP service. To elaborate, they use the two environment variables set by Kubernetes to get the ClusterIP address, i.e.

    • $SERVICE_RAY_CLUSTER_SERVICE_HOST:$SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER
  3. When a user runs ray job submit to submit a job to the head Pod, the job is allocated to a job agent on either the head Pod or a worker Pod. If the job is allocated to an agent on a worker Pod, the job agent communicates with the head Pod and asks the head Pod to connect to the GCS server via the worker node's GCS server address (i.e. $SERVICE_RAY_CLUSTER_SERVICE_HOST:$SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER).

  4. However, there is a limitation in Kubernetes: a Pod cannot communicate with itself via its non-headless Kubernetes service. Hence, the head Pod cannot use its ClusterIP service (i.e. $SERVICE_RAY_CLUSTER_SERVICE_HOST:$SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER) to communicate with itself. That is why the job fails.
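
In an environment where this limitation applies, the symptom in point 4 would look roughly like the following from inside the head Pod. This is a sketch, not output captured from a real cluster: ray health-check and the 6380 GCS port come from the config in this PR, and the pod IP is a placeholder.

```sh
# From inside the head Pod: the ClusterIP service address cannot be reached ...
ray health-check --address $SERVICE_RAY_CLUSTER_SERVICE_HOST:$SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER; echo $?
# non-zero => GCS unreachable through the head's own ClusterIP service

# ... while the Pod's own IP works fine.
ray health-check --address <head-pod-ip>:6380; echo $?
# 0 => GCS healthy via the Pod IP
```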

This PR

  1. Kubernetes allows a Pod to communicate with itself via its headless Kubernetes service.

  2. However, when a Kubernetes service is headless, Kubernetes will not set these two environment variables ($SERVICE_RAY_CLUSTER_SERVICE_HOST, $SERVICE_RAY_CLUSTER_SERVICE_PORT_GCS_SERVER) automatically.
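
Both points can be checked quickly; a sketch, with <head-pod> as a placeholder and assuming the Ray image ships Python:

```sh
# 1. With a headless head service, the SERVICE_RAY_CLUSTER_* variables are no longer injected.
kubectl exec <head-pod> -- env | grep SERVICE_RAY_CLUSTER
# (no output)

# 2. The service name still resolves, but now directly to the head Pod IP,
#    which is why the head Pod can reach itself through the service name.
kubectl exec <head-pod> -- python -c 'import socket; print(socket.gethostbyname("service-ray-cluster"))'
# 10.x.x.x  (the head Pod IP)
```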

My question

Thanks @YQ-Wang for the explanation! I tried to verify the limitation "a Pod cannot communicate with itself via its non-headless Kubernetes service". Nevertheless, the result is unexpected.

# Create a Kubernetes cluster
kind create cluster --image=kindest/node:v1.23.0

# Install a KubeRay operator and a RayCluster custom resource.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 0.4.0
helm install raycluster kuberay/ray-cluster --version 0.4.0

# Make sure the head service is not a headless service
kubectl get svc raycluster-kuberay-head-svc -o jsonpath='{.spec.clusterIP}'
# 10.96.39.228

# Log in to the head Pod
kubectl exec -it ${HEAD_POD} -- bash

# Install curl in the head Pod
sudo apt-get update; sudo apt-get install -y curl

# Try to communicate with the head Pod itself via head service in the head Pod
curl -I raycluster-kuberay-head-svc:8265

# HTTP/1.1 200 OK
# Content-Type: text/html
# Etag: "170de3004cf1c800-bdc"
# Last-Modified: Tue, 23 Aug 2022 05:43:48 GMT
# Content-Length: 3036
# Accept-Ranges: bytes
# Date: Tue, 07 Mar 2023 00:44:36 GMT
# Server: Python/3.7 aiohttp/3.8.1

ray health-check --address raycluster-kuberay-head-svc:6379
echo $?
# 0 => GCS is healthy

Does the restriction only happen in specific Kubernetes versions? Thanks!

@YQ-Wang (Contributor, Author) commented Mar 7, 2023

@kevin85421 It seems you are using KubeRay; I am not sure if there are any CRDs that might impact the default network settings. Please check out this article, which can help you understand regular K8s networking.

@kevin85421 (Member) commented:

I tried to deploy a Ray cluster on both Kubernetes v1.19.11 and v1.23.0 with this doc, but still got the same result. I am not sure whether this only happens in specific Kubernetes versions or distributions (e.g. OpenShift).

kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml

kubectl get svc service-ray-cluster -o jsonpath='{.spec.clusterIP}'
# 10.96.89.173 => this is not a headless service

[Screenshots attached: Screen Shot 2023-03-06 at 5 20 12 PM, Screen Shot 2023-03-06 at 5 35 56 PM]

@YQ-Wang (Contributor, Author) commented Mar 7, 2023

@kevin85421 I did some research online, and it looks like the CNI plugin is disabling hairpin traffic.
Could you try the example here?

Source:
kubernetes/kubernetes#45790
containernetworking/cni#476
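
For reference, hairpin mode can be inspected per bridge port on the Kubernetes node via sysfs; a sketch (the path assumes a bridge-based CNI, and the port names depend on the plugin):

```sh
# Run on the node itself, not inside a pod: print hairpin_mode for every bridge port.
# 1 means hairpin traffic is allowed (a pod can reach itself through a Service VIP); 0 means it is not.
for port in /sys/class/net/*/brport/hairpin_mode; do
  echo "$port: $(cat "$port")"
done
```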

@kevin85421 (Member) left a comment:

LGTM. This PR is ready to merge.

I failed to disable hairpin mode. I opened kubernetes-sigs/kind#3118 to track the progress. KubeRay may need to make its service headless too.

@architkulkarni (Contributor) commented:

Linkcheck failure unrelated (tune/examples/tune-xgboost: line 30095) broken https://www.sciencedirect.com/science/article/abs/pii/S0167947301000652 - 403 Client Error: Forbidden for url: https://www.sciencedirect.com/science/article/abs/pii/S0167947301000652

Merging.

@architkulkarni architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Mar 8, 2023
@architkulkarni architkulkarni merged commit 541c5f5 into ray-project:master Mar 8, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023