
Improve the observability of the init container #1149

Conversation

@Yicheng-Lu-llll (Contributor) commented Jun 8, 2023

Why are these changes needed?

The current version makes it hard for users to troubleshoot issues related to the init container, mainly because the container produces no logs. Error messages from the health check are suppressed by design: while the GCS server is still starting, these messages would erroneously suggest a problem and confuse users.

This Pull Request addresses these concerns by:

  1. The init container produces logs only if the initialization process exceeds 120 seconds.

  2. If the prolonged initialization is caused not by a real error but by heavy commands executed before 'ray start' in the head pod, then once the GCS server is ready, a log message informs users that any preceding 'Connection refused' error messages can be safely ignored. This message helps users distinguish genuine errors from the normal output of a slow startup.
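The two behaviors above can be sketched roughly as follows. This is a hypothetical Python simplification of the logging policy shown in the sample output below, not the actual init container script; `check_health`, `sleep_fn`, `clock`, and `log` are injected stand-ins for the real health check and environment:

```python
import time

THRESHOLD_SECONDS = 120  # after this, health-check errors are no longer suppressed


def wait_for_gcs(check_health, sleep_fn=time.sleep, clock=time.monotonic, log=print):
    """Poll `check_health` until it succeeds.

    Before THRESHOLD_SECONDS, errors raised by the check are suppressed and only
    a short progress line is printed; afterwards, the error text is surfaced
    together with a pointer to the troubleshooting FAQ. If any errors were
    shown, the final message tells the user they can be safely ignored.
    """
    start = clock()
    errors_shown = False
    while True:
        elapsed = int(clock() - start)
        try:
            if check_health():
                if errors_shown:
                    log("GCS is ready. Any error messages above can be safely ignored.")
                else:
                    log("GCS is ready.")
                return
        except Exception as exc:
            if elapsed >= THRESHOLD_SECONDS:
                # Surface the real error only after the threshold has passed.
                log(str(exc))
                errors_shown = True
        if elapsed < THRESHOLD_SECONDS:
            log(f"{elapsed} seconds elapsed: Waiting for GCS to be ready.")
        else:
            log(f"{elapsed} seconds elapsed: Still waiting for GCS to be ready. "
                "For troubleshooting, refer to the FAQ at "
                "https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md.")
        sleep_fn(1)
```

Injecting the clock and sleep function keeps the loop testable without waiting 120 real seconds.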

Checks

kubectl apply -f /home/ubuntu/workspace/kuberay/ray-operator/config/samples/ray-cluster.complete.yaml
kubectl logs -f $(kubectl get pods -o=name | grep worker) wait-gcs-ready
# GCS is ready.
kubectl apply -f /home/ubuntu/workspace/kuberay/ray-operator/config/samples/ray-cluster.complete.yaml


# Open a new terminal and keep killing the GCS server, since the head pod may
# restart after repeatedly failing to connect to GCS.
# Stop killing gcs_server after 120 seconds.
while true; do
kubectl exec -it $(kubectl get pods -o=name | grep head) -- pkill gcs_server
sleep 1
done


kubectl logs -f $(kubectl get pods -o=name | grep worker) wait-gcs-ready
# 2 seconds elapsed: Waiting for GCS to be ready.
# 8 seconds elapsed: Waiting for GCS to be ready.
# 15 seconds elapsed: Waiting for GCS to be ready.
# 22 seconds elapsed: Waiting for GCS to be ready.
# 28 seconds elapsed: Waiting for GCS to be ready.
# 35 seconds elapsed: Waiting for GCS to be ready.
# 41 seconds elapsed: Waiting for GCS to be ready.
# 48 seconds elapsed: Waiting for GCS to be ready.
# 54 seconds elapsed: Waiting for GCS to be ready.
# 61 seconds elapsed: Waiting for GCS to be ready.
# 67 seconds elapsed: Waiting for GCS to be ready.
# 74 seconds elapsed: Waiting for GCS to be ready.
# 80 seconds elapsed: Waiting for GCS to be ready.
# 87 seconds elapsed: Waiting for GCS to be ready.
# 93 seconds elapsed: Waiting for GCS to be ready.
# 100 seconds elapsed: Waiting for GCS to be ready.
# 107 seconds elapsed: Waiting for GCS to be ready.
# 113 seconds elapsed: Waiting for GCS to be ready.
# 120 seconds elapsed: Waiting for GCS to be ready.
# Traceback (most recent call last):
#   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/gcs_utils.py", line 124, in check_health
#     resp = stub.CheckAlive(req, timeout=timeout)
#   File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
#     return _end_unary_response_blocking(state, call, False, None)
#   File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
#     raise _InactiveRpcError(state)
# grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
#         status = StatusCode.UNAVAILABLE
#         details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.96.22.131:6379: Failed to connect to remote host: Connection refused"
#         debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.96.22.131:6379: Failed to connect to remote host: Connection refused {created_time:"2023-07-08T21:34:31.546276638-07:00", grpc_status:14}"
# >
# 126 seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md.
# ......
# GCS is ready. Any error messages above can be safely ignored.
  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 (Member) left a comment


LGTM

@kevin85421 kevin85421 merged commit 753429d into ray-project:master Jul 11, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Improve the observability of the init container