[Ray Client] - Client server failed with runtime_env container #29852
Comments
cc @SongGuyang in case there are any workarounds for container issue. As another possible workaround, you mentioned |
Hey @architkulkarni, thanks for your quick reply. Yes, it's an alternative. I'm curious, though: does preinstalling the conda environment on the head node make it shareable with new workers? If not, it would take a considerable amount of time to install the conda environment on new workers, unless it is already installed on the base image of the cluster. The problem at the moment is that we are trying to work with a more generic cluster that serves multiple projects through the use of runtime environments. |
Ah no, you would need the conda environment to be on all the nodes of the cluster and have the same name on all nodes. |
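A minimal sketch of the workaround discussed above, assuming a conda environment has already been created under the same name on every node of the cluster. The environment name `shared-env` and the helpers `named_conda_runtime_env` and `connect` are hypothetical, introduced only for illustration. Ray's `runtime_env` accepts the name of a locally installed conda environment under the `conda` key, but it is worth confirming the behavior against the Ray version in use.

```python
def named_conda_runtime_env(env_name: str) -> dict:
    """Build a runtime_env that refers to a preinstalled conda env by name.

    Assumption: an environment with this exact name exists on every node.
    """
    if not env_name or not env_name.strip():
        raise ValueError("env_name must be a non-empty string")
    return {"conda": env_name.strip()}


def connect(address: str, env_name: str):
    """Connect through Ray Client using the named conda environment."""
    import ray  # imported lazily so the helper above works without Ray installed

    return ray.init(address, runtime_env=named_conda_runtime_env(env_name))


# "shared-env" is a hypothetical environment name.
runtime_env = named_conda_runtime_env("shared-env")
print(runtime_env)
```

Because the environment is referenced by name, nothing is installed when a worker starts; the trade-off is that every node image has to ship the environment in advance.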
@architkulkarni I am experiencing issues when trying to test the Alpha Container Runtime feature. Is podman a necessary dependency? I noticed the container runtime is specified in the code (see this issue: #29665). We are using Kuberay + Cri-o as our container runtime on kubernetes. Is the expectation for this feature to have the autoscaler launch a new worker? Does this work natively with existing Kubernetes architectures? We don't have Podman installed nor use it: |
Hi @peterghaddad , I believe |
Thanks for the response @architkulkarni. So the worker is what pulls the actual image? I.e., an image runs within an image when using Kuberay? It may make sense to have an integration for Kuberay where it launches a new Pod with the specified image, installs the environment's dependencies, then kicks off a job. Food for thought, but I think this would be robust when running in K8s environments! |
Is there any update on this? I have exactly the same problem but don't have a workaround unfortunately |
Here's a bit more info. I'm attempting to start a job from a Python interactive environment. It's important to do it this way, as jobs will eventually be submitted by the Prefect job scheduler, which integrates with Ray via prefect-ray. Here is the Python code I am using:

import ray
import time
import logging
from ray.runtime_env import RuntimeEnv

logger = logging.getLogger()
env = RuntimeEnv(container={
    "image": "europe-west2-docker.pkg.dev/<GCP_PROJECT>/test-docker/test-prefect-ray:0.0.1b1",
    "run_options": ["--log-level=debug"]
})
ray.init("ray://<Server-IP>:10001", runtime_env=env)

@ray.remote
def square(x):
    logger.warning('Example log')
    return x * x

start = time.time()
object_references = [
    square.remote(item) for item in range(8)
]
data = ray.get(object_references)
print(data)

I have one node at the moment, the head node, which is a GCP Virtual Machine, started with
Logs
There's not too much useful information I can see in the logs. As far as I can tell, the container is being downloaded on the head node and from there is struggling to reach the Ray server on the VM (the same machine the container is running on). Starting this container manually, I am at least able to ping the host IP from a container bash session.
Output from
Output from
|
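A successful ping only shows that ICMP traffic reaches the host; it does not show that a TCP port is reachable. As a rough diagnostic, a check like the following could be run from inside the container. The helper name `can_connect` and the address `127.0.0.1` are placeholders chosen for illustration; port 10001 is the Ray Client port used in the script above. This is only a sketch and does not reproduce how Ray itself connects the client proxy to the client server.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: replace with the host IP as seen from the container.
    print(can_connect("127.0.0.1", 10001))
```

If the TCP check fails while ping succeeds, the cause is more likely a port binding or firewall rule than basic network reachability.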
I am facing the same issue with Is there any workaround or pending bug fix? |
@zcin could you take this one? |
What happened + What you expected to happen
Hi,
Even though runtime_env containers are still experimental, I've been having success using them at the job level in Ray applications launched inside the cluster with job submission, i.e. the script that runs on the cluster does ray.init(runtime_env={'container': ...}). Given that, I don't think there's anything wrong with the podman setup on my custom cluster images, inherited from rayproject/ray:2.0.0-py38.
However, using runtime_env containers with Ray Client for interactive development leads to the following errors in the initialization of the Ray client server.
The file ray_client_server_23000.err contains
I can find more info on ray_client_server.err,
Also on runtime_env_setup-ray_client_server_23000.log I could find
I think this issue is related to the connection between the client proxy and the client server that seems to run in the container; however, as stated in the logs, the container is created with the --net host flag. I wonder if someone from the Ray team could point me towards a workaround, or some documentation regarding the setup of the client servers, as I am willing to contribute.
Regarding issue severity, I'll leave it at Medium since my only alternatives are:
Thanks,
Versions / Dependencies
About ray
Podman installed on cluster base image
Reproduction script
Issue Severity
Medium: It is a significant difficulty but I can work around it.
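For reference, the job-level pattern that the report describes as working (a script running on the cluster that calls ray.init with a container runtime_env) can be sketched roughly as follows. The helper names `container_runtime_env` and `init_in_job` and the image string are hypothetical; the `image` and `run_options` keys are the ones used elsewhere in this thread. This is an illustration under those assumptions, not the reporter's actual script.

```python
from typing import List, Optional


def container_runtime_env(image: str, run_options: Optional[List[str]] = None) -> dict:
    """Build a runtime_env dict with a container section."""
    if not image:
        raise ValueError("image must be a non-empty string")
    container = {"image": image}
    if run_options:
        # Copy the list so later changes by the caller do not leak in.
        container["run_options"] = list(run_options)
    return {"container": container}


def init_in_job(image: str, run_options: Optional[List[str]] = None):
    """Call ray.init from a script that already runs on the cluster."""
    import ray  # imported lazily so the helper above works without Ray installed

    return ray.init(runtime_env=container_runtime_env(image, run_options))


# Hypothetical image name, for illustration only.
env = container_runtime_env("<registry>/<image>:<tag>", ["--log-level=debug"])
print(env)
```

The difference from the failing case is where ray.init runs: here it is called inside the cluster, whereas with Ray Client it is called from a remote session with a ray:// address.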