[Doc][KubeRay] Add description tables for RayCluster Status in the observability doc #47462

Merged
merged 7 commits into from
Sep 11, 2024
47 changes: 45 additions & 2 deletions doc/source/cluster/kubernetes/user-guides/observability.md
@@ -21,13 +21,56 @@ kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log

Use this command to redirect the operator's logs to a file called `operator-log`. Then search for errors in the file.

### Method 2: Check custom resource status
### Method 2: Check the status and events of custom resources

```bash
kubectl describe [raycluster|rayjob|rayservice] $CUSTOM_RESOURCE_NAME -n $YOUR_NAMESPACE
```

After running this command, check the status and events of the custom resource for any errors.
After running this command, check the events of the custom resource, along with the `state` and `conditions` fields in its status, for errors and progress.
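
When the `describe` output is long, listing only the events can be easier to scan. The following is a minimal sketch, not part of the original guide: the function name `raycluster_events` is hypothetical, and it assumes the events were recorded against a `RayCluster` object with the given name.

```bash
# Hypothetical helper: list events for one RayCluster, sorted by last timestamp.
# $1 = RayCluster name, $2 = namespace
raycluster_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.kind=RayCluster,involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}
```

For example, `raycluster_events "$CUSTOM_RESOURCE_NAME" "$YOUR_NAMESPACE"` prints the events for that cluster only.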


#### RayCluster `.Status.State`

The `.Status.State` field represents the cluster's current state, but it will be deprecated in the future because a single field has limited expressiveness. Use the new `Status.Conditions` instead.

| State | Description |
|-----------|----------------------------------------------------------------------------------------------------------------------------------------|
| Ready | KubeRay sets the state to `Ready` once all the Pods in the cluster are ready. The `State` remains `Ready` until KubeRay suspends the cluster. |
| Suspended | KubeRay sets the state to `Suspended` when `Spec.Suspend` is set to true and KubeRay has deleted all Pods in the cluster. |
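
To read only the state instead of the full `describe` output, a JSONPath query can help. This is a sketch under the assumption that the custom resource serializes `Status.State` as `.status.state`; the function name `raycluster_state` is hypothetical.

```bash
# Hypothetical helper: print only the state field of a RayCluster.
# $1 = RayCluster name, $2 = namespace
# Assumes Status.State is serialized as .status.state in the resource.
raycluster_state() {
  kubectl get raycluster "$1" -n "$2" -o jsonpath='{.status.state}'
}
```

For example, `raycluster_state "$CUSTOM_RESOURCE_NAME" "$YOUR_NAMESPACE"` prints the current state value, or nothing if KubeRay hasn't set the field yet.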



#### RayCluster `.Status.Conditions`
Member

I remember that .Status.Conditions has a feature flag. Can you explain how to enable this feature?

Contributor

Might help also add a warning header here that it's an alpha feature

Contributor Author

Both added.



Although `Status.State` can represent the cluster's situation, it's only a single field. By enabling the `RayClusterStatusConditions` feature gate in KubeRay v1.2.1, you can access the new `Status.Conditions` field for more detailed cluster history and states.

:::{warning}
`RayClusterStatusConditions` is still an alpha feature and may change in the future.
:::

If you deployed KubeRay with Helm, then enable the `RayClusterStatusConditions` gate in the `featureGates` of your Helm values.

```bash
helm upgrade kuberay-operator kuberay/kuberay-operator --version 1.2.1 \
--set featureGates\[0\].name=RayClusterStatusConditions \
--set featureGates\[0\].enabled=true
```

Alternatively, run the KubeRay operator executable with the `--feature-gates=RayClusterStatusConditions=true` argument.

| Type | Status | Reason | Description |
|--------------------------|--------|--------------------------------|----------------------------------------------------------------------------------------------------------------------|
| RayClusterProvisioned | True | AllPodRunningAndReadyFirstTime | Once all the Pods in the cluster are ready, this condition is set to `True` and remains `True` even if some Pods fail later. |
| | False | RayClusterPodsProvisioning | |
| RayClusterReplicaFailure | True   | FailedDeleteAllPods            | KubeRay sets this condition to `True` when there's a reconciliation error; otherwise, KubeRay clears the condition. |
| | True | FailedDeleteHeadPod | See the `Reason` and the `Message` of the condition for more detailed debugging information. |
| | True | FailedCreateHeadPod | |
| | True | FailedDeleteWorkerPod | |
| | True | FailedCreateWorkerPod | |
| HeadPodReady             | True   | HeadPodRunningAndReady         | This condition is `True` only if the HeadPod is currently ready; otherwise, it's `False`. |
| | False | HeadPodNotFound | |
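
The conditions in the table can also be inspected from the command line. The sketch below assumes the feature gate is enabled and that the conditions are serialized under `.status.conditions` with the usual `type`, `status`, and `reason` keys. The function names and the `300s` default timeout are hypothetical choices, not KubeRay defaults.

```bash
# Hypothetical helper: print each condition as "type<TAB>status<TAB>reason".
# $1 = RayCluster name, $2 = namespace
raycluster_conditions() {
  kubectl get raycluster "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}

# Hypothetical helper: block until RayClusterProvisioned is True, or time out.
# $1 = RayCluster name, $2 = namespace, $3 = optional timeout (assumed default: 300s)
wait_for_provisioned() {
  kubectl wait "raycluster/$1" -n "$2" \
    --for=condition=RayClusterProvisioned --timeout="${3:-300s}"
}
```

For example, `wait_for_provisioned "$CUSTOM_RESOURCE_NAME" "$YOUR_NAMESPACE" 120s` returns a non-zero exit code if the condition doesn't become `True` within two minutes, which makes it usable as a gate in a script.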


### Method 3: Check logs of Ray Pods
