[Doc][KubeRay] Add description tables for RayCluster Status in the observability doc #47462

Merged on Sep 11, 2024 (7 commits)
47 changes: 45 additions & 2 deletions doc/source/cluster/kubernetes/user-guides/observability.md
@@ -21,13 +21,56 @@ kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log

Use this command to redirect the operator's logs to a file called `operator-log`. Then search for errors in the file.

### Method 2: Check the status and events of custom resources

```bash
kubectl describe [raycluster|rayjob|rayservice] $CUSTOM_RESOURCE_NAME -n $YOUR_NAMESPACE
```

After running this command, check the events of the custom resource, along with the `state` and `conditions` fields in its status, for any errors and for signs of progress.
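
To read only these fields instead of scanning the full `kubectl describe` output, `kubectl get` with a JSONPath expression may help. The following is a minimal sketch: the function names are hypothetical helpers, not part of KubeRay, and it assumes the RayCluster status serializes these fields as `status.state` and `status.conditions`.

```bash
# Hypothetical helpers; the names are illustrative and not part of KubeRay.

# Print the state of a RayCluster.
# Usage: raycluster_state <name> <namespace>
raycluster_state() {
  kubectl get raycluster "$1" -n "$2" -o jsonpath='{.status.state}'
}

# Print one line per condition: type, status, and reason, tab-separated.
# Usage: raycluster_conditions <name> <namespace>
raycluster_conditions() {
  kubectl get raycluster "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}

# List the events recorded for a RayCluster.
# Usage: raycluster_events <name> <namespace>
raycluster_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.kind=RayCluster,involvedObject.name=$1"
}
```

For example, `raycluster_conditions $CUSTOM_RESOURCE_NAME $YOUR_NAMESPACE` prints nothing if the cluster has no conditions, which is expected when the `RayClusterStatusConditions` feature gate is disabled.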


#### RayCluster `.Status.State`

The `.Status.State` field currently represents the cluster's overall state, but it will be deprecated in the future because a single field can convey only limited information. Use the new `.Status.Conditions` field instead.

| State     | Description |
|-----------|-------------|
| Ready     | The state is set to `Ready` once all the Pods in the cluster are ready. The state remains `Ready` until the cluster is suspended. |
| Suspended | The state is set to `Suspended` when `Spec.Suspend` is set to true and all Pods in the cluster have been deleted. |



#### RayCluster `.Status.Conditions`

> Member: I remember that .Status.Conditions has a feature flag. Can you explain how to enable this feature?
>
> Contributor: Might help also add a warning header here that it's an alpha feature
>
> Contributor Author: Both added.

Although `.Status.State` can represent the cluster's state, it is only a single field. By enabling the `RayClusterStatusConditions` feature gate on KubeRay v1.2.1, you can access the new `.Status.Conditions` field for more detailed cluster history and states.

:::{warning}
`RayClusterStatusConditions` is still an alpha feature and may change in the future.
:::

If you deployed KubeRay with Helm, then enable the `RayClusterStatusConditions` gate in the `featureGates` of your Helm values.

```bash
helm upgrade kuberay-operator kuberay/kuberay-operator --version 1.2.1 \
--set featureGates\[0\].name=RayClusterStatusConditions \
--set featureGates\[0\].enabled=true
```
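
The same setting can also live in a values file rather than on the command line. The following is a minimal sketch: the helper name and the default file name `kuberay-values.yaml` are arbitrary choices for this example, and the YAML structure is assumed to mirror the `featureGates` list that the `--set` flags above imply.

```bash
# Hypothetical helper: write a Helm values file that enables the feature gate.
# Usage: write_feature_gate_values [path]   (defaults to kuberay-values.yaml)
write_feature_gate_values() {
  cat > "${1:-kuberay-values.yaml}" <<'EOF'
featureGates:
  - name: RayClusterStatusConditions
    enabled: true
EOF
}
```

After calling `write_feature_gate_values`, pass the file to the same upgrade command with `-f kuberay-values.yaml` in place of the two `--set` flags. Keeping the gate in a file makes it easier to review and to reapply on later upgrades.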

Alternatively, run the KubeRay operator executable with the `--feature-gates=RayClusterStatusConditions=true` argument.

| Type                     | Status | Reason                         | Description |
|--------------------------|--------|--------------------------------|-------------|
| RayClusterProvisioned    | True   | AllPodRunningAndReadyFirstTime | Once all the Pods in the cluster are ready, this condition is set to `True` and remains `True` even if some Pods fail later. |
|                          | False  | RayClusterPodsProvisioning     | |
| RayClusterReplicaFailure | True   | FailedDeleteAllPods            | This condition is set to `True` when there is a reconciliation error; otherwise, the condition is cleared. |
|                          | True   | FailedDeleteHeadPod            | Refer to the `Reason` and the `Message` of the condition for more detailed debugging information. |
|                          | True   | FailedCreateHeadPod            | |
|                          | True   | FailedDeleteWorkerPod          | |
|                          | True   | FailedCreateWorkerPod          | |
| HeadPodReady             | True   | HeadPodRunningAndReady         | This condition is `True` only if the head Pod is currently ready; otherwise, it is `False`. |
|                          | False  | HeadPodNotFound                | |
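
Individual conditions from the table can be queried or waited on from the command line. The following is a minimal sketch: the function names and the default timeout of `300s` are hypothetical choices for this example, and it assumes the conditions are serialized under `status.conditions` in the standard Kubernetes condition format.

```bash
# Hypothetical helpers; the names are illustrative and not part of KubeRay.

# Print the status string of one condition type (empty if the condition is absent).
# Usage: raycluster_condition_status <name> <namespace> <condition-type>
raycluster_condition_status() {
  kubectl get raycluster "$1" -n "$2" \
    -o jsonpath="{.status.conditions[?(@.type==\"$3\")].status}"
}

# Block until the HeadPodReady condition is True or the timeout expires.
# Usage: wait_for_head_pod <name> <namespace> [timeout]
wait_for_head_pod() {
  kubectl wait "raycluster/$1" -n "$2" \
    --for=condition=HeadPodReady --timeout="${3:-300s}"
}
```

For example, `raycluster_condition_status $CUSTOM_RESOURCE_NAME $YOUR_NAMESPACE RayClusterProvisioned` reports whether the cluster has ever had all Pods ready, while `wait_for_head_pod` can gate a script on the head Pod becoming ready.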


### Method 3: Check logs of Ray Pods
