
[Doc] Improve FAQ page and RayService troubleshooting guide #1225

Merged 4 commits on Jul 10, 2023
65 changes: 36 additions & 29 deletions docs/guidance/FAQ.md
Welcome to the Frequently Asked Questions page for KubeRay. This document addresses common inquiries.
If you don't find an answer to your question here, please don't hesitate to connect with us via our [community channels](https://github.com/ray-project/kuberay#getting-involved).

## Contents
- [Worker Init Container](#worker-init-container)
- [cluster domain](#cluster-domain)
# Contents
- [Worker init container](#worker-init-container)
- [Cluster domain](#cluster-domain)
- [RayService](#rayservice)

### Worker Init Container
## Worker init container

When starting a RayCluster, the worker Pod needs to wait until the head Pod is started in order to connect to the head successfully.
To achieve this, the KubeRay operator will automatically inject an [init container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) into the worker Pod to wait for the head Pod to be ready before starting the worker container. The init container will continuously check if the head's GCS server is ready or not.
The KubeRay operator will inject a default [init container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) into every worker Pod.
This init container is responsible for waiting until the Global Control Service (GCS) on the head Pod is ready before establishing a connection to the head.
The init container will use `ray health-check` to check the GCS server status continuously.
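As a rough sketch (the container name, image, and command below are illustrative assumptions, not the operator's literal output), the injected init container looks something like this in the worker Pod spec:

```yaml
# Hypothetical sketch of the injected init container; the container actually
# generated by the KubeRay operator may differ in name, image, and command.
initContainers:
  - name: wait-gcs-ready
    image: rayproject/ray:2.5.0   # assumption: mirrors the worker container's image
    command: ["/bin/bash", "-lc", "--"]
    args:
      - |
        # Keep polling the head's GCS until `ray health-check` succeeds.
        until ray health-check --address $FQ_RAY_IP:6379 > /dev/null 2>&1; do
          echo "Waiting for GCS to be ready."
          sleep 5
        done
```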

Related questions:
- [Why are my worker Pods stuck in `Init:0/1` status, how can I troubleshoot the worker init container?](#why-are-my-worker-pods-stuck-in-init01-status-how-can-i-troubleshoot-the-worker-init-container)
- [I do not want to use the default worker init container, how can I disable the auto-injection and add my own?](#i-do-not-want-to-use-the-default-worker-init-container-how-can-i-disable-the-auto-injection-and-add-my-own)
The default worker init container may not work for all use cases, or users may want to customize the init container.

### Cluster Domain
### 1. Init container troubleshooting

Each Kubernetes cluster is assigned a unique cluster domain during installation. This domain helps differentiate between names local to the cluster and external names. The `cluster_domain` can be customized as outlined in the [Kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#introduction). The default value for `cluster_domain` is `cluster.local`.
Some common causes for the worker init container being stuck in `Init:0/1` status are:

The cluster domain plays a critical role in service discovery and inter-service communication within the cluster. It is part of the Fully Qualified Domain Name (FQDN) for services within the cluster. See [here](https://github.com/kubernetes/website/blob/main/content/en/docs/concepts/services-networking/dns-pod-service.md#aaaaa-records-1) for examples. In the context of KubeRay, workers use the FQDN of the head service to establish a connection to the head.
* The GCS server process has failed in the head Pod. Please inspect the log directory `/tmp/ray/session_latest/logs/` in the head Pod for errors related to the GCS server.
* The `ray` executable is not included in the `$PATH` for the image, so the init container will fail to run `ray health-check`.
* The `CLUSTER_DOMAIN` environment variable is not set correctly. See the section [cluster domain](#cluster-domain) for more details.
* The worker init container shares the same `ImagePullPolicy`, `SecurityContext`, `Env`, `VolumeMounts`, and `Resources` as the worker Pod template. Sharing these settings can cause a deadlock. See [#1130](https://github.com/ray-project/kuberay/issues/1130) for more details.

Related questions:
- [How can I set a custom cluster domain if mine is not `cluster.local`?](#how-can-i-set-a-custom-cluster-domain-if-mine-is-not-clusterlocal)
If the init container remains stuck in `Init:0/1` status for 2 minutes, KubeRay stops redirecting the output messages to `/dev/null` and instead prints them to the worker Pod logs.
To troubleshoot further, inspect the logs with `kubectl logs`.

### 2. Disable the init container injection

## Questions

### Why are my worker Pods stuck in `Init:0/1` status, how can I troubleshoot the worker init container?

Worker Pods might be stuck in `Init:0/1` status for several reasons. The default worker init container only progresses when the GCS server in the head Pod is ready. Here are some common causes for the issue:
- The GCS server process failed in the head Pod. Inspect the head Pod logs for errors related to the GCS server.
- Ray is not included in the `$PATH` in the worker init container. The init container uses `ray health-check` to check the GCS server status.
- The cluster domain is not set correctly. See [cluster-domain](#cluster-domain) for more details. The init container uses the Fully Qualified Domain Name (FQDN) of the head service to connect to the GCS server.
- The worker init container shares the same ImagePullPolicy, SecurityContext, Env, VolumeMounts, and Resources as the worker Pod template. Any setting requiring a sidecar container could lead to a deadlock. Refer to [issue 1130](https://github.com/ray-project/kuberay/issues/1130) for additional details.
If you want to customize the worker init container, you can disable the init container injection and add your own.
To disable the injection, set the `ENABLE_INIT_CONTAINER_INJECTION` environment variable in the KubeRay operator to `false` (applicable from KubeRay v0.5.2).
Please refer to [#1069](https://github.com/ray-project/kuberay/pull/1069) and the [KubeRay Helm chart](https://github.com/ray-project/kuberay/blob/ddb5e528c29c2e1fb80994f05b1bd162ecbaf9f2/helm-chart/kuberay-operator/values.yaml#L83-L87) for instructions on how to set the environment variable.
Once disabled, you can add your custom init container to the worker Pod template.
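For illustration, a custom replacement might look like the following sketch (the container name and the TCP-probe approach are assumptions, not a KubeRay-provided template):

```yaml
# Sketch: worker group with operator-level injection disabled and a
# hand-written init container added to the worker Pod template.
workerGroupSpecs:
  - groupName: small-group
    replicas: 1
    template:
      spec:
        initContainers:
          - name: my-custom-wait-gcs   # hypothetical name
            image: busybox:1.36
            # Assumption: wait until the head's GCS port accepts TCP
            # connections instead of running `ray health-check`.
            command: ["sh", "-c", "until nc -z $RAY_IP 6379; do sleep 5; done"]
        containers:
          - name: ray-worker
            image: rayproject/ray:2.5.0
```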

If none of the above reasons apply, you can troubleshoot by disabling the default worker init container injection and adding your test init container to the worker Pod template.
## Cluster domain

In KubeRay, we use Fully Qualified Domain Names (FQDNs) to establish connections between workers and the head.
The FQDN of the head service is `${HEAD_SVC}.${NAMESPACE}.svc.${CLUSTER_DOMAIN}`.
The default [cluster domain](https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#introduction) is `cluster.local`, which works for most Kubernetes clusters.
However, it's important to note that some clusters may have a different cluster domain.
You can check the cluster domain of your Kubernetes cluster by checking `/etc/resolv.conf` in a Pod.
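For example, the following sketch shows how the cluster domain can be read from a Pod's `/etc/resolv.conf` (simulated here with a local copy; in a real cluster you would `kubectl exec` into a Pod and inspect the file directly):

```shell
# Simulated /etc/resolv.conf from a Pod in the "default" namespace.
cat > /tmp/resolv.conf <<'EOF'
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
EOF

# The cluster domain is the last entry on the "search" line.
awk '/^search/ {print $NF}' /tmp/resolv.conf
# prints: cluster.local
```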

### I do not want to use the default worker init container, how can I disable the auto-injection and add my own?
To set a custom cluster domain, adjust the `CLUSTER_DOMAIN` environment variable in the KubeRay operator.
Helm chart users can make this modification [here](https://github.com/ray-project/kuberay/blob/ddb5e528c29c2e1fb80994f05b1bd162ecbaf9f2/helm-chart/kuberay-operator/values.yaml#L88-L91).
For more information, please refer to [#951](https://github.com/ray-project/kuberay/pull/951) and [#938](https://github.com/ray-project/kuberay/pull/938).
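For example, the Helm override might look like the following sketch (the value is illustrative, and the exact key layout should follow the linked `values.yaml`):

```yaml
# kuberay-operator values.yaml sketch: set a non-default cluster domain.
env:
  - name: CLUSTER_DOMAIN
    value: "mycompany.internal"   # assumption: your cluster's actual domain
```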

The default worker init container is used to wait for the GCS server in the head Pod to be ready. It is defined [here](https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L207). To disable the injection, set the `ENABLE_INIT_CONTAINER_INJECTION` environment variable in the KubeRay operator to `false` (applicable only for versions after 0.5.0). Helm chart users can make this change [here](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml#L74). Once disabled, you can add your custom init container to the worker Pod template. More details can be found in [PR 1069](https://github.com/ray-project/kuberay/pull/1069).
## RayService

RayService is a Custom Resource Definition (CRD) designed for Ray Serve. In KubeRay, creating a RayService first creates a RayCluster and then
creates Ray Serve applications once the RayCluster is ready. If the issue pertains to the data plane, specifically your Ray Serve scripts
or Ray Serve configurations (`serveConfigV2`), troubleshooting may be challenging. See [rayservice-troubleshooting](rayservice-troubleshooting.md) for more details.
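A minimal RayService skeleton might look like the following sketch (names, versions, and the API version are illustrative; note that `serveConfigV2` is a free-form YAML string):

```yaml
# Sketch of a RayService CR: KubeRay first creates the RayCluster described
# under rayClusterConfig, then deploys the applications in serveConfigV2.
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: example-rayservice   # hypothetical name
spec:
  serveConfigV2: |
    applications:
      - name: example_app
        import_path: example_module:app   # hypothetical module
        route_prefix: /
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.5.0
```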

### How can I set the custom cluster domain if mine is not `cluster.local`?

To set a custom cluster domain, adjust the `CLUSTER_DOMAIN` environment variable in the KubeRay operator. Helm chart users can make this modification [here](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml#L78).
## Questions

### Why are my changes to RayCluster/RayJob CR not taking effect?

41 changes: 34 additions & 7 deletions docs/guidance/rayservice-troubleshooting.md
kubectl port-forward $RAY_POD -n $YOUR_NAMESPACE --address 0.0.0.0 8265:8265

For more details about Ray Serve observability on the dashboard, you can refer to [the documentation](https://docs.ray.io/en/latest/ray-observability/getting-started.html#serve-view) and [the YouTube video](https://youtu.be/eqXfwM641a4).

### Common issues
## Common issues

#### Issue 1: Ray Serve script is incorrect.
### Issue 1: Ray Serve script is incorrect.

We strongly recommend that you test your Ray Serve script locally or in a RayCluster before
deploying it to a RayService. [TODO: https://github.com/ray-project/kuberay/issues/1176]

#### Issue 2: `serveConfigV2` is incorrect.
### Issue 2: `serveConfigV2` is incorrect.

For the sake of flexibility, we have set `serveConfigV2` as a YAML multi-line string in the RayService CR.
This implies that there is no strict type checking for the Ray Serve configurations in the `serveConfigV2` field.
Some tips to help you debug the `serveConfigV2` field:
the Ray Serve Multi-application API `PUT "/api/serve/applications/"`.
* Unlike `serveConfig`, `serveConfigV2` adheres to the snake case naming convention. For example, `numReplicas` is used in `serveConfig`, while `num_replicas` is used in `serveConfigV2`.
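To illustrate the naming difference, here is the same replica setting in the two schema versions (deployment and application names are hypothetical):

```yaml
# serveConfig (deprecated, camelCase):
#   deployments:
#     - name: ExampleDeployment
#       numReplicas: 2
#
# serveConfigV2 (snake_case):
applications:
  - name: example_app
    import_path: example_module:app
    deployments:
      - name: ExampleDeployment
        num_replicas: 2
```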

#### Issue 3-1: The Ray image does not include the required dependencies.
### Issue 3-1: The Ray image does not include the required dependencies.

You have two options to resolve this issue:

* For example, the MobileNet example requires `python-multipart`, which is not included in the Ray image `rayproject/ray-ml:2.5.0`.
Therefore, the YAML file includes `python-multipart` in the runtime environment. For more details, refer to [the MobileNet example](mobilenet-rayservice.md).

#### Issue 3-2: Examples for troubleshooting dependency issues.
### Issue 3-2: Examples for troubleshooting dependency issues.

> Note: We highly recommend testing your Ray Serve script locally or in a RayCluster before deploying it to a RayService. This helps identify any dependency issues in the early stages. [TODO: https://github.com/ray-project/kuberay/issues/1176]

The function `__call__()` will only be called when the Serve application receives a request.

```
ModuleNotFoundError: No module named 'tensorflow'
```

#### Issue 4: Incorrect `import_path`.
### Issue 4: Incorrect `import_path`.

You can refer to [the documentation](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.schema.ServeApplicationSchema.html#ray.serve.schema.ServeApplicationSchema.import_path) for more details about the format of `import_path`.
Taking [the MobileNet YAML file](../../ray-operator/config/samples/ray-service.mobilenet.yaml) as an example,
and `app` is the name of the variable representing the Ray Serve application within the file.

```
runtime_env:
  working_dir: "https://github.com/ray-project/serve_config_examples/archive/b393e77bbd6aba0881e3d94c05f968f05a387b96.zip"
  pip: ["python-multipart==0.0.6"]
```
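The `module:attribute` structure of `import_path` can be sketched as follows (this is a simplified illustration, not Ray's actual loader):

```python
# Hypothetical sketch of how an import_path string such as
# "mobilenet.mobilenet:app" is interpreted: the part before ":" is a Python
# module path, and the part after it is an attribute within that module.
import importlib

def resolve_import_path(import_path: str):
    module_name, _, attr_name = import_path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)

# For example, "os.path:join" resolves to the os.path.join function.
```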

### Issue 5: Failure to create or update Serve applications.

You may encounter the following error message when KubeRay tries to create or update Serve applications:

```
Put "http://${HEAD_SVC_FQDN}:52365/api/serve/applications/": dial tcp $HEAD_IP:52365: connect: connection refused
```

For RayService, the KubeRay operator submits a request to the RayCluster for creating Serve applications once the head Pod is ready.
It's important to note that the Dashboard and GCS may take a few seconds to start up after the head Pod is ready.
As a result, the request may fail a few times initially before the necessary components are fully operational.

If you continue to encounter this issue after 1 minute, there are several possible causes:

* The Dashboard and dashboard agent failed to start up for some reason. Check the `dashboard.log` and `dashboard_agent.log` files located at `/tmp/ray/session_latest/logs/` on the head Pod for more information.

* There is a Kubernetes NetworkPolicy blocking the traffic between the KubeRay operator and the dashboard agent port (i.e., 52365). Please review your NetworkPolicy configuration.
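As an illustrative sketch (the label selectors are assumptions about your setup, not KubeRay defaults), a NetworkPolicy that admits operator traffic to the dashboard agent port could look like:

```yaml
# Sketch: allow ingress to the Ray Pods on the dashboard agent port (52365)
# from Pods labeled as the KubeRay operator. Selectors are hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-kuberay-to-dashboard-agent
spec:
  podSelector:
    matchLabels:
      ray.io/is-ray-node: "yes"   # assumption: label on Ray Pods
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: kuberay-operator   # assumption
      ports:
        - protocol: TCP
          port: 52365
```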

### Issue 6: `runtime_env`

In `serveConfigV2`, you can specify the runtime environment for the Ray Serve applications via `runtime_env`.
Some common issues related to `runtime_env`:

* The `working_dir` points to a private AWS S3 bucket, but the Ray Pods do not have the necessary permissions to access the bucket.

* The NetworkPolicy blocks the traffic between the Ray Pods and the external URLs specified in `runtime_env`.
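For example, a `runtime_env` whose `working_dir` points at a private S3 bucket (bucket name hypothetical) only works if the Ray Pods carry AWS credentials, e.g. via an IAM role for the service account or mounted secrets:

```yaml
# Sketch: runtime_env referencing a private S3 working_dir. Without S3 read
# permissions on the Ray Pods, the download fails at application start.
runtime_env:
  working_dir: "s3://my-private-bucket/serve_app.zip"   # hypothetical bucket
  pip: ["requests"]
```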
9 changes: 1 addition & 8 deletions docs/index.md
by some organizations to back user interfaces for KubeRay resource management.

## Security

If you discover a potential security issue in this project, or think you may
have discovered a security issue, we ask that you notify KubeRay Security via our
[Slack Channel](https://ray-distributed.slack.com/archives/C02GFQ82JPM).
Please do **not** create a public GitHub issue.

## License

This project is licensed under the [Apache-2.0 License](LICENSE).
Please report security issues to [email protected].

## The Ray docs

6 changes: 3 additions & 3 deletions mkdocs.yml
nav:
- Best Practices:
- Executing Commands: guidance/pod-command.md
- Worker Reconnection: best-practice/worker-head-reconnection.md
- Troubleshooting: troubleshooting.md
- Design:
- Core API and Backend Service: design/protobuf-grpc-service.md
- Troubleshooting:
- FAQ: guidance/FAQ.md
Contributor:
Nit: Would it be better to make the FAQ top-level instead of under Troubleshooting? Currently it seems all the questions are troubleshooting-related, but my guess is we'll probably add some non-troubleshooting questions to the FAQ in the future.

- RayService Troubleshooting: guidance/rayservice-troubleshooting.md
- Development:
- Developer Guide: development/development.md
- Release Process: