
[Doc] Improve FAQ page and RayService troubleshooting guide #1225

Merged 4 commits on Jul 10, 2023
65 changes: 36 additions & 29 deletions docs/guidance/FAQ.md
Welcome to the Frequently Asked Questions page for KubeRay. This document addresses common inquiries.
If you don't find an answer to your question here, please don't hesitate to connect with us via our [community channels](https://github.com/ray-project/kuberay#getting-involved).

## Contents
- [Worker Init Container](#worker-init-container)
- [cluster domain](#cluster-domain)
# Contents
- [Worker init container](#worker-init-container)
- [Cluster domain](#cluster-domain)
- [RayService](#rayservice)

### Worker Init Container
## Worker init container

When starting a RayCluster, the worker Pod needs to wait until the head Pod is started in order to connect to the head successfully.
To achieve this, the KubeRay operator will automatically inject an [init container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) into the worker Pod to wait for the head Pod to be ready before starting the worker container. The init container will continuously check if the head's GCS server is ready or not.
The KubeRay operator will inject a default [init container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) into every worker Pod.
This init container is responsible for waiting until the Global Control Service (GCS) on the head Pod is ready before establishing a connection to the head.
The init container will use `ray health-check` to check the GCS server status continuously.
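As a rough sketch (the container name, image, and command below are illustrative assumptions, not the operator's literal output), the injected init container looks something like this in the worker Pod spec:

```yaml
# Hypothetical sketch of the injected init container; the container actually
# generated by the KubeRay operator may differ in name, image, and command.
initContainers:
  - name: wait-gcs-ready
    image: rayproject/ray:2.5.0   # assumption: mirrors the worker container's image
    command: ["/bin/bash", "-lc", "--"]
    args:
      - |
        # Keep polling the head's GCS until `ray health-check` succeeds.
        until ray health-check --address $FQ_RAY_IP:6379 > /dev/null 2>&1; do
          echo "Waiting for GCS to be ready."
          sleep 5
        done
```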

Related questions:
- [Why are my worker Pods stuck in `Init:0/1` status, how can I troubleshoot the worker init container?](#why-are-my-worker-pods-stuck-in-init01-status-how-can-i-troubleshoot-the-worker-init-container)
- [I do not want to use the default worker init container, how can I disable the auto-injection and add my own?](#i-do-not-want-to-use-the-default-worker-init-container-how-can-i-disable-the-auto-injection-and-add-my-own)
The default worker init container may not work for all use cases, or users may want to customize the init container.

### Cluster Domain
### 1. Init container troubleshooting

Each Kubernetes cluster is assigned a unique cluster domain during installation. This domain helps differentiate between names local to the cluster and external names. The `cluster_domain` can be customized as outlined in the [Kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#introduction). The default value for `cluster_domain` is `cluster.local`.
Some common causes for the worker init container being stuck in `Init:0/1` status are:

The cluster domain plays a critical role in service discovery and inter-service communication within the cluster. It is part of the Fully Qualified Domain Name (FQDN) for services within the cluster. See [here](https://github.com/kubernetes/website/blob/main/content/en/docs/concepts/services-networking/dns-pod-service.md#aaaaa-records-1) for examples. In the context of KubeRay, workers use the FQDN of the head service to establish a connection to the head.
* The GCS server process has failed in the head Pod. Please inspect the log directory `/tmp/ray/session_latest/logs/` in the head Pod for errors related to the GCS server.
* The `ray` executable is not included in the `$PATH` for the image, so the init container will fail to run `ray health-check`.
* The `CLUSTER_DOMAIN` environment variable is not set correctly. See the section [cluster domain](#cluster-domain) for more details.
* The worker init container shares the same `ImagePullPolicy`, `SecurityContext`, `Env`, `VolumeMounts`, and `Resources` as the worker Pod template. Sharing these settings can cause a deadlock. See [#1130](https://github.com/ray-project/kuberay/issues/1130) for more details.

Related questions:
- [How can I set a custom cluster domain if mine is not `cluster.local`?](#how-can-i-set-a-custom-cluster-domain-if-mine-is-not-clusterlocal)
If the init container remains stuck in `Init:0/1` status for 2 minutes, KubeRay stops redirecting the output messages to `/dev/null` and instead prints them to the worker Pod logs.
To troubleshoot further, inspect the logs with `kubectl logs`.

### 2. Disable the init container injection

## Questions

### Why are my worker Pods stuck in `Init:0/1` status, how can I troubleshoot the worker init container?

Worker Pods might be stuck in `Init:0/1` status for several reasons. The default worker init container only progresses when the GCS server in the head Pod is ready. Here are some common causes for the issue:
- The GCS server process failed in the head Pod. Inspect the head Pod logs for errors related to the GCS server.
- Ray is not included in the `$PATH` in the worker init container. The init container uses `ray health-check` to check the GCS server status.
- The cluster domain is not set correctly. See [cluster-domain](#cluster-domain) for more details. The init container uses the Fully Qualified Domain Name (FQDN) of the head service to connect to the GCS server.
- The worker init container shares the same ImagePullPolicy, SecurityContext, Env, VolumeMounts, and Resources as the worker Pod template. Any setting requiring a sidecar container could lead to a deadlock. Refer to [issue 1130](https://github.com/ray-project/kuberay/issues/1130) for additional details.
If you want to customize the worker init container, you can disable the init container injection and add your own.
To disable the injection, set the `ENABLE_INIT_CONTAINER_INJECTION` environment variable in the KubeRay operator to `false` (applicable from KubeRay v0.5.2).
Please refer to [#1069](https://github.com/ray-project/kuberay/pull/1069) and the [KubeRay Helm chart](https://github.com/ray-project/kuberay/blob/ddb5e528c29c2e1fb80994f05b1bd162ecbaf9f2/helm-chart/kuberay-operator/values.yaml#L83-L87) for instructions on how to set the environment variable.
Once disabled, you can add your custom init container to the worker Pod template.
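For illustration, a custom replacement might look like the following sketch (the container name and the TCP-probe approach are assumptions, not a KubeRay-provided template):

```yaml
# Sketch: worker group with operator-level injection disabled and a
# hand-written init container added to the worker Pod template.
workerGroupSpecs:
  - groupName: small-group
    replicas: 1
    template:
      spec:
        initContainers:
          - name: my-custom-wait-gcs   # hypothetical name
            image: busybox:1.36
            # Assumption: wait until the head's GCS port accepts TCP
            # connections instead of running `ray health-check`.
            command: ["sh", "-c", "until nc -z $RAY_IP 6379; do sleep 5; done"]
        containers:
          - name: ray-worker
            image: rayproject/ray:2.5.0
```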

If none of the above reasons apply, you can troubleshoot by disabling the default worker init container injection and adding your test init container to the worker Pod template.
## Cluster domain

In KubeRay, we use Fully Qualified Domain Names (FQDNs) to establish connections between workers and the head.
The FQDN of the head service is `${HEAD_SVC}.${NAMESPACE}.svc.${CLUSTER_DOMAIN}`.
The default [cluster domain](https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#introduction) is `cluster.local`, which works for most Kubernetes clusters.
However, it's important to note that some clusters may have a different cluster domain.
You can check the cluster domain of your Kubernetes cluster by checking `/etc/resolv.conf` in a Pod.
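For example, the following sketch shows how the cluster domain can be read from a Pod's `/etc/resolv.conf` (simulated here with a local copy; in a real cluster you would `kubectl exec` into a Pod and inspect the file directly):

```shell
# Simulated /etc/resolv.conf from a Pod in the "default" namespace.
cat > /tmp/resolv.conf <<'EOF'
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
EOF

# The cluster domain is the last entry on the "search" line.
awk '/^search/ {print $NF}' /tmp/resolv.conf
# prints: cluster.local
```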

### I do not want to use the default worker init container, how can I disable the auto-injection and add my own?
To set a custom cluster domain, adjust the `CLUSTER_DOMAIN` environment variable in the KubeRay operator.
Helm chart users can make this modification [here](https://github.com/ray-project/kuberay/blob/ddb5e528c29c2e1fb80994f05b1bd162ecbaf9f2/helm-chart/kuberay-operator/values.yaml#L88-L91).
For more information, please refer to [#951](https://github.com/ray-project/kuberay/pull/951) and [#938](https://github.com/ray-project/kuberay/pull/938).
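For example, the Helm override might look like the following sketch (the value is illustrative, and the exact key layout should follow the linked `values.yaml`):

```yaml
# kuberay-operator values.yaml sketch: set a non-default cluster domain.
env:
  - name: CLUSTER_DOMAIN
    value: "mycompany.internal"   # assumption: your cluster's actual domain
```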

The default worker init container is used to wait for the GCS server in the head Pod to be ready. It is defined [here](https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L207). To disable the injection, set the `ENABLE_INIT_CONTAINER_INJECTION` environment variable in the KubeRay operator to `false` (applicable only for versions after 0.5.0). Helm chart users can make this change [here](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml#L74). Once disabled, you can add your custom init container to the worker Pod template. More details can be found in [PR 1069](https://github.com/ray-project/kuberay/pull/1069).
## RayService

RayService is a Custom Resource Definition (CRD) designed for Ray Serve. In KubeRay, creating a RayService first creates a RayCluster and then
creates Ray Serve applications once the RayCluster is ready. If the issue pertains to the data plane, specifically your Ray Serve scripts
or Ray Serve configurations (`serveConfigV2`), troubleshooting may be challenging. See [rayservice-troubleshooting](rayservice-troubleshooting.md) for more details.
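A minimal RayService skeleton might look like the following sketch (names, versions, and the API version are illustrative; note that `serveConfigV2` is a free-form YAML string):

```yaml
# Sketch of a RayService CR: KubeRay first creates the RayCluster described
# under rayClusterConfig, then deploys the applications in serveConfigV2.
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: example-rayservice   # hypothetical name
spec:
  serveConfigV2: |
    applications:
      - name: example_app
        import_path: example_module:app   # hypothetical module
        route_prefix: /
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.5.0
```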

### How can I set the custom cluster domain if mine is not `cluster.local`?

To set a custom cluster domain, adjust the `CLUSTER_DOMAIN` environment variable in the KubeRay operator. Helm chart users can make this modification [here](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml#L78).
## Questions

### Why are my changes to RayCluster/RayJob CR not taking effect?

41 changes: 34 additions & 7 deletions docs/guidance/rayservice-troubleshooting.md
kubectl port-forward $RAY_POD -n $YOUR_NAMESPACE --address 0.0.0.0 8265:8265

For more details about Ray Serve observability on the dashboard, you can refer to [the documentation](https://docs.ray.io/en/latest/ray-observability/getting-started.html#serve-view) and [the YouTube video](https://youtu.be/eqXfwM641a4).

### Common issues
## Common issues

#### Issue 1: Ray Serve script is incorrect.
### Issue 1: Ray Serve script is incorrect.

We strongly recommend that you test your Ray Serve script locally or in a RayCluster before
deploying it to a RayService. [TODO: https://github.com/ray-project/kuberay/issues/1176]

#### Issue 2: `serveConfigV2` is incorrect.
### Issue 2: `serveConfigV2` is incorrect.

For the sake of flexibility, we have set `serveConfigV2` as a YAML multi-line string in the RayService CR.
This implies that there is no strict type checking for the Ray Serve configurations in the `serveConfigV2` field.
Some tips to help you debug the `serveConfigV2` field:
the Ray Serve Multi-application API `PUT "/api/serve/applications/"`.
* Unlike `serveConfig`, `serveConfigV2` adheres to the snake case naming convention. For example, `numReplicas` is used in `serveConfig`, while `num_replicas` is used in `serveConfigV2`.
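To illustrate the naming difference, here is the same replica setting in the two schema versions (deployment and application names are hypothetical):

```yaml
# serveConfig (deprecated, camelCase):
#   deployments:
#     - name: ExampleDeployment
#       numReplicas: 2
#
# serveConfigV2 (snake_case):
applications:
  - name: example_app
    import_path: example_module:app
    deployments:
      - name: ExampleDeployment
        num_replicas: 2
```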

#### Issue 3-1: The Ray image does not include the required dependencies.
### Issue 3-1: The Ray image does not include the required dependencies.

You have two options to resolve this issue:

* For example, the MobileNet example requires `python-multipart`, which is not included in the Ray image `rayproject/ray-ml:2.5.0`.
Therefore, the YAML file includes `python-multipart` in the runtime environment. For more details, refer to [the MobileNet example](mobilenet-rayservice.md).

#### Issue 3-2: Examples for troubleshooting dependency issues.
### Issue 3-2: Examples for troubleshooting dependency issues.

> Note: We highly recommend testing your Ray Serve script locally or in a RayCluster before deploying it to a RayService. This helps identify any dependency issues in the early stages. [TODO: https://github.com/ray-project/kuberay/issues/1176]

The function `__call__()` will only be called when the Serve application receives a request.

```
ModuleNotFoundError: No module named 'tensorflow'
```

#### Issue 4: Incorrect `import_path`.
### Issue 4: Incorrect `import_path`.

You can refer to [the documentation](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.schema.ServeApplicationSchema.html#ray.serve.schema.ServeApplicationSchema.import_path) for more details about the format of `import_path`.
Taking [the MobileNet YAML file](../../ray-operator/config/samples/ray-service.mobilenet.yaml) as an example,
and `app` is the name of the variable representing the Ray Serve application within the file.

```
runtime_env:
  working_dir: "https://github.com/ray-project/serve_config_examples/archive/b393e77bbd6aba0881e3d94c05f968f05a387b96.zip"
  pip: ["python-multipart==0.0.6"]
```
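The `module:attribute` structure of `import_path` can be sketched as follows (this is a simplified illustration, not Ray's actual loader):

```python
# Hypothetical sketch of how an import_path string such as
# "mobilenet.mobilenet:app" is interpreted: the part before ":" is a Python
# module path, and the part after it is an attribute within that module.
import importlib

def resolve_import_path(import_path: str):
    module_name, _, attr_name = import_path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)

# For example, "os.path:join" resolves to the os.path.join function.
```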

### Issue 5: Failure to create or update Serve applications.

You may encounter the following error message when KubeRay tries to create or update Serve applications:

```
Put "http://${HEAD_SVC_FQDN}:52365/api/serve/applications/": dial tcp $HEAD_IP:52365: connect: connection refused
```

For RayService, the KubeRay operator submits a request to the RayCluster for creating Serve applications once the head Pod is ready.
It's important to note that the Dashboard and GCS may take a few seconds to start up after the head Pod is ready.
As a result, the request may fail a few times initially before the necessary components are fully operational.

If you continue to encounter this issue after 1 minute, there are several possible causes:

* The Dashboard and dashboard agent failed to start up for some reason. Check the `dashboard.log` and `dashboard_agent.log` files located at `/tmp/ray/session_latest/logs/` on the head Pod for more information.

* There is a Kubernetes NetworkPolicy blocking the traffic between the KubeRay operator and the dashboard agent port (i.e., 52365). Please review your NetworkPolicy configuration.
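As an illustrative sketch (the label selectors are assumptions about your setup, not KubeRay defaults), a NetworkPolicy that admits operator traffic to the dashboard agent port could look like:

```yaml
# Sketch: allow ingress to the Ray Pods on the dashboard agent port (52365)
# from Pods labeled as the KubeRay operator. Selectors are hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-kuberay-to-dashboard-agent
spec:
  podSelector:
    matchLabels:
      ray.io/is-ray-node: "yes"   # assumption: label on Ray Pods
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: kuberay-operator   # assumption
      ports:
        - protocol: TCP
          port: 52365
```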

### Issue 6: `runtime_env`

In `serveConfigV2`, you can specify the runtime environment for the Ray Serve applications via `runtime_env`.
Some common issues related to `runtime_env`:

* The `working_dir` points to a private AWS S3 bucket, but the Ray Pods do not have the necessary permissions to access the bucket.

* The NetworkPolicy blocks the traffic between the Ray Pods and the external URLs specified in `runtime_env`.
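For example, a `runtime_env` whose `working_dir` points at a private S3 bucket (bucket name hypothetical) only works if the Ray Pods carry AWS credentials, e.g. via an IAM role for the service account or mounted secrets:

```yaml
# Sketch: runtime_env referencing a private S3 working_dir. Without S3 read
# permissions on the Ray Pods, the download fails at application start.
runtime_env:
  working_dir: "s3://my-private-bucket/serve_app.zip"   # hypothetical bucket
  pip: ["requests"]
```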
9 changes: 1 addition & 8 deletions docs/index.md
by some organizations to back user interfaces for KubeRay resource management.

## Security

If you discover a potential security issue in this project, or think you may
have discovered a security issue, we ask that you notify KubeRay Security via our
[Slack Channel](https://ray-distributed.slack.com/archives/C02GFQ82JPM).
Please do **not** create a public GitHub issue.

## License

This project is licensed under the [Apache-2.0 License](LICENSE).
Please report security issues to [email protected].

## The Ray docs

6 changes: 3 additions & 3 deletions mkdocs.yml
nav:
- Best Practices:
- Executing Commands: guidance/pod-command.md
- Worker Reconnection: best-practice/worker-head-reconnection.md
- Troubleshooting: troubleshooting.md
- Design:
- Core API and Backend Service: design/protobuf-grpc-service.md
- Troubleshooting:
- FAQ: guidance/FAQ.md
Contributor:
Nit: Would it be better to make the FAQ top-level instead of under Troubleshooting? Currently it seems all the questions are troubleshooting-related, but my guess is we'll probably add some non-troubleshooting questions to the FAQ in the future.

- RayService Troubleshooting: guidance/rayservice-troubleshooting.md
- Development:
- Developer Guide: development/development.md
- Release Process: