[Bug] KubeRay operator fails to get serve deployment status due to 500 Internal Server Error #1173
Conversation
cc @anshulomar @msumitjain would you mind reviewing this PR? Thanks!
Hi @rickyyx,
Thanks!
Thanks for the detailed log of the debugging process.
The code looks good to me. I'll defer to @sihanwang41 / @zcin to confirm whether the overall approach makes sense (send requests to head node, not agent)
cc @sihanwang41 would you mind reviewing this PR? Thanks!
I think that might not happen since the agent should only be discoverable after node registration is done. (But I could be wrong.) What's the implication of this variance? I prob need to take a look at the code to be 100% sure.
Hi @rickyyx, thank you for the reply! Would you mind explaining more about "discoverable"? Context:
How does KubeRay know the worker node's agent address? Is this something derivable outside of Ray (e.g., from the node's IP address and a port being passed in)? If so, yes, I think it's possible for the request to arrive at the worker node before node registration is done (which will lead to the original error stack, where the GCS is not able to find the node).
KubeRay exposes the dashboard agents through a Kubernetes Service, which uses a round-robin algorithm to evenly distribute traffic among the available Pods, including the head Pod and worker Pods.
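For illustration, here is a minimal sketch (not KubeRay's actual code) that polls that Service from inside the cluster. The Service DNS name and namespace are assumptions; the port 52365 and the path /api/serve/deployments/status match the request described in the PR description below. Because the Service round-robins across Pods, some requests land on a worker whose agent has not registered with GCS yet and come back as 500.

```python
import time

import requests

# Hypothetical Service DNS name and namespace; only the port (52365) and the
# path (/api/serve/deployments/status) come from this PR discussion.
AGENT_SVC = "http://raycluster-sample-dashboard-agent-svc.default.svc.cluster.local:52365"

for i in range(10):
    try:
        resp = requests.get(f"{AGENT_SVC}/api/serve/deployments/status", timeout=5)
        # Round-robin means consecutive requests may hit different Pods:
        # a ready head Pod returns 200, while a worker whose Raylet has not
        # registered with GCS yet returns 500.
        print(f"request {i}: HTTP {resp.status_code}")
    except requests.RequestException as exc:
        print(f"request {i}: failed ({exc})")
    time.sleep(1)
```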
Thank you for the confirmation!
Thank you for picking up this much-needed fix. Apologies for being late to the review party.
LGTM
Thanks @kevin85421 for making this change! Looks good to me.
Why are these changes needed?
Trace code
#1125 provides the aforementioned error message. This user is using Ray 2.3.0, so I traced the code of Ray 2.3.0 instead of 2.5.0.
1. KubeRay sends a `GET $DASHBOARD_AGENT_SERVICE:52365/api/serve/deployments/status` request to retrieve the Serve status.
2. The request is handled by the `get_all_deployment_statuses` function in the Serve agent. You can find the code for this function here.
3. `get_all_deployment_statuses` uses `@optional_utils.init_ray_and_catch_exceptions()` as a decorator. This decorator checks whether Ray is initialized or not. If Ray is not initialized, the decorator will call `ray.init()` (a simplified sketch of this behavior follows the list).
4. `ray.init()` in worker.py invokes the following function to initialize a Ray node. In addition, we can determine that this node is a worker node based on the `head=False` parameter.
5. `class Node` (node.py) calls the function `ray._private.services.get_node_to_connect_for_driver` to retrieve address information from GCS.
6. `global_state.get_node_to_connect_for_driver(node_ip_address)`.
7. `get_node_to_connect_for_driver(self, node_ip_address)`
8. `get_node_to_connect_for_driver`
9. `GetNodeToConnectForDriver` => The error message is reported by this function.
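To illustrate step 3, here is a simplified sketch of what an "initialize Ray if needed" decorator does. This is not Ray's actual `init_ray_and_catch_exceptions` implementation; it only shows the behavior relevant to this bug: the first request handled by an agent triggers `ray.init()`, which asks GCS for the node's address information.

```python
import functools

import ray


def init_ray_if_needed_sketch(func):
    """Simplified stand-in for @optional_utils.init_ray_and_catch_exceptions().

    Not Ray's real implementation; it only demonstrates that handling the
    first request initializes Ray, which requires the node to be known to GCS.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not ray.is_initialized():
            # On a worker Pod, this step fails (and the handler returns 500)
            # if the Raylet has not yet finished registering with GCS.
            ray.init(address="auto")
        return func(*args, **kwargs)
    return wrapper
```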
Root cause of #1125
To summarize, the root cause is that KubeRay sends a request to the dashboard agent process on a worker node, but GCS does not yet have the address information for that Raylet. This means the dashboard agent process starts serving requests before the node registration process with GCS finishes. See the "Node management" section in the Ray architecture whitepaper for more details.
KubeRay starts sending requests to the dashboard agent processes as soon as the head Pod is running and ready. In other words, KubeRay does not check the status of workers, so the dashboard agent processes on workers may receive requests at any moment. See #1074 for more details.
Solution
Without this PR, the Kubernetes service for the dashboard agent uses a round-robin algorithm to evenly distribute traffic among the available Pods, including the head Pod and worker Pods. You can refer to this link for more details. Hence, this PR sends requests only to the head Pod, switching from the dashboard agent service to the head service.
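As a rough sketch of the effect of this change (the actual fix is in the KubeRay operator's Go code, not in anything a user writes), the status request now targets the head Service instead of the dashboard agent Service; the head Service DNS name below is a hypothetical example. Since KubeRay only starts polling after the head Pod is running and ready, the head node's Raylet is already registered with GCS, so the `ray.init()` call inside the agent can resolve the node's address.

```python
import requests

# Hypothetical head Service DNS name; the port (52365) and path come from
# the request described earlier in this PR description.
HEAD_SVC = "http://raycluster-sample-head-svc.default.svc.cluster.local:52365"

# After this PR, the operator queries only the head Pod's dashboard agent,
# avoiding worker Pods whose Raylets may not have registered with GCS yet.
resp = requests.get(f"{HEAD_SVC}/api/serve/deployments/status", timeout=5)
print(resp.status_code, resp.json())
```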
One thing still needs to be verified; see the experiment under "Checks" below.
Related issue number
Closes #1125
Checks
The following experiment is used to verify the statement "KubeRay will start sending requests to the dashboard agent processes once the head Pod is running and ready. In other words, KubeRay does not check the status of workers, so it is possible for the dashboard agent processes on workers to receive requests at any moment."