
[Ray Client] - Client server failed with runtime_env container #29852

Open
igorgad opened this issue Oct 31, 2022 · 10 comments
Labels:
  • bug: Something that is supposed to be working; but isn't
  • core: Issues that should be addressed in Ray Core
  • core-runtime-env: Issues related to Ray environment dependencies
  • P1.5: Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared


igorgad commented Oct 31, 2022

What happened + What you expected to happen

Hi,

Even though runtime_env containers are still experimental, I've been having success using them at the job level in Ray applications launched inside the cluster via job submission, i.e., the script that runs on the cluster calls ray.init(runtime_env={'container': ...}). That being said, I don't think there's anything wrong with the podman setup on my custom cluster images, which inherit from rayproject/ray:2.0.0-py38.
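
For reference, a minimal sketch of the job-level usage that works for me (same image and options as in the repro script below):

import ray

# Job-level container runtime_env: this pattern works when the script
# itself runs inside the cluster, e.g. submitted via `ray job submit`.
ray.init(runtime_env={
    'container': {
        'image': 'docker.io/rayproject/ray:2.0.0-py38',
        'run_options': ['--cgroups=enabled'],
    },
})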

However, using runtime_env containers with Ray Client for interactive development leads to the following error during initialization of the Ray client server.

---------------------------------------------------------------------------
ConnectionAbortedError                    Traceback (most recent call last)
Cell In [2], line 3
      1 import ray
----> 3 ray.init('ray://localhost:10001', runtime_env={
      4     'container': {
      5             'image': 'docker.io/rayproject/ray:2.0.0-py38',
      6             'run_options': ['--cgroups=enabled'],
      7         },
      8 })

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104         return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/_private/worker.py:1248, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, **kwargs)
   1246 passed_kwargs.update(kwargs)
   1247 builder._init_args(**passed_kwargs)
-> 1248 ctx = builder.connect()
   1249 from ray._private.usage import usage_lib
   1251 if passed_kwargs.get("allow_multiple") is True:

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/client_builder.py:178, in ClientBuilder.connect(self)
    175 if self._allow_multiple_connections:
    176     old_ray_cxt = ray.util.client.ray.set_context(None)
--> 178 client_info_dict = ray.util.client_connect.connect(
    179     self.address,
    180     job_config=self._job_config,
    181     _credentials=self._credentials,
    182     ray_init_kwargs=self._remote_init_kwargs,
    183     metadata=self._metadata,
    184 )
    185 get_dashboard_url = ray.remote(ray._private.worker.get_dashboard_url)
    186 dashboard_url = ray.get(get_dashboard_url.options(num_cpus=0).remote())

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client_connect.py:47, in connect(conn_str, secure, metadata, connection_retries, job_config, namespace, ignore_version, _credentials, ray_init_kwargs)
     42 _explicitly_enable_client_mode()
     44 # TODO(barakmich): https://github.com/ray-project/ray/issues/13274
     45 # for supporting things like cert_path, ca_path, etc and creating
     46 # the correct metadata
---> 47 conn = ray.connect(
     48     conn_str,
     49     job_config=job_config,
     50     secure=secure,
     51     metadata=metadata,
     52     connection_retries=connection_retries,
     53     namespace=namespace,
     54     ignore_version=ignore_version,
     55     _credentials=_credentials,
     56     ray_init_kwargs=ray_init_kwargs,
     57 )
     58 return conn

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client/__init__.py:252, in RayAPIStub.connect(self, *args, **kw_args)
    250 def connect(self, *args, **kw_args):
    251     self.get_context()._inside_client_test = self._inside_client_test
--> 252     conn = self.get_context().connect(*args, **kw_args)
    253     global _lock, _all_contexts
    254     with _lock:

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client/__init__.py:102, in _ClientContext.connect(self, conn_str, job_config, secure, metadata, connection_retries, namespace, ignore_version, _credentials, ray_init_kwargs)
     94 self.client_worker = Worker(
     95     conn_str,
     96     secure=secure,
   (...)
     99     connection_retries=connection_retries,
    100 )
    101 self.api.worker = self.client_worker
--> 102 self.client_worker._server_init(job_config, ray_init_kwargs)
    103 conn_info = self.client_worker.connection_info()
    104 self._check_versions(conn_info, ignore_version)

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client/worker.py:838, in Worker._server_init(self, job_config, ray_init_kwargs)
    830     response = self.data_client.Init(
    831         ray_client_pb2.InitRequest(
    832             job_config=serialized_job_config,
   (...)
    835         )
    836     )
    837     if not response.ok:
--> 838         raise ConnectionAbortedError(
    839             f"Initialization failure from server:\n{response.msg}"
    840         )
    842 except grpc.RpcError as e:
    843     raise decode_exception(e)

ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 685, in Datapath
    raise RuntimeError(
RuntimeError: Starting Ray client server failed. See ray_client_server_23000.err for detailed logs.

The file ray_client_server_23000.err contains

Trying to pull docker.io/rayproject/ray:2.0.0-py38...
Getting image source signatures
Copying blob sha256:d8135c8d3f0ebe84b529d185558505d5dd4b524e282c17b6152aba56b02ed31e
Copying blob sha256:f0d19e69127971cff8b7bfbbe024890de117604b5861e2b106da8cfd3fb81d53
Copying blob sha256:cde2dbf8dc867dda82c869f13f50d1d88a854128ab07916e9df3d45086b1aca3
Copying blob sha256:3b65ec22a9e96affe680712973e88355927506aa3f792ff03330f3a3eb601a98
Copying blob sha256:87f7a5ff197c9418519c096f1f7aa5afceac82f8ada0df33a21a384d55acde5f
Copying blob sha256:8a0031b53b4d14665f9c7ab891ece272998721af9b0d969924d88fc9408ed57c
Copying blob sha256:3b65ec22a9e96affe680712973e88355927506aa3f792ff03330f3a3eb601a98
Copying blob sha256:87f7a5ff197c9418519c096f1f7aa5afceac82f8ada0df33a21a384d55acde5f
Copying blob sha256:8a0031b53b4d14665f9c7ab891ece272998721af9b0d969924d88fc9408ed57c
Copying blob sha256:cde2dbf8dc867dda82c869f13f50d1d88a854128ab07916e9df3d45086b1aca3
Copying blob sha256:d8135c8d3f0ebe84b529d185558505d5dd4b524e282c17b6152aba56b02ed31e
Copying blob sha256:f0d19e69127971cff8b7bfbbe024890de117604b5861e2b106da8cfd3fb81d53
Copying blob sha256:57c67e634ccf3c72945b4da73023e28c0efaae0fa95c8c1644180bd9df46be68
Copying blob sha256:57c67e634ccf3c72945b4da73023e28c0efaae0fa95c8c1644180bd9df46be68
Copying blob sha256:aea4f35623b6f74ffaaf14a60cf010fa0c69942480aeeb34853366ad58fd4c00
Copying blob sha256:aea4f35623b6f74ffaaf14a60cf010fa0c69942480aeeb34853366ad58fd4c00
Copying blob sha256:78f7682f5042b61bad31612b833dde54498ffcebcd18057bcff8255687020ba7
Copying blob sha256:78f7682f5042b61bad31612b833dde54498ffcebcd18057bcff8255687020ba7
Copying config sha256:c3b4447db3d173fcc94d5736ee633a6223ef07efc15a2ba1c69a34f673f6c299
Writing manifest to image destination
Storing signatures
2022-10-31 05:37:33,217	INFO server.py:875 -- Starting Ray Client server on 0.0.0.0:23000
2022-10-31 05:37:38,239	INFO server.py:922 -- 25 idle checks before shutdown.
2022-10-31 05:37:43,249	INFO server.py:922 -- 20 idle checks before shutdown.
2022-10-31 05:37:48,260	INFO server.py:922 -- 15 idle checks before shutdown.
2022-10-31 05:37:53,272	INFO server.py:922 -- 10 idle checks before shutdown.
2022-10-31 05:37:58,282	INFO server.py:922 -- 5 idle checks before shutdown.

I can find more info in ray_client_server.err:

2022-10-31 05:36:33,435	INFO server.py:875 -- Starting Ray Client server on 0.0.0.0:10001
2022-10-31 05:36:48,552	INFO proxier.py:670 -- New data connection from client 71aa1ee5efa1441b937aecb493ed977f: 
2022-10-31 05:36:48,566	INFO proxier.py:229 -- Increasing runtime env reference for ray_client_server_23000.Serialized runtime env is {"container": {"image": "docker.io/rayproject/ray:2.0.0-py38", "run_options": ["--cgroups=enabled"]}}.
2022-10-31 05:38:03,708	ERROR proxier.py:332 -- SpecificServer startup failed for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:03,708	INFO proxier.py:340 -- SpecificServer started on port: 23000 with PID: 229 for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:03,708	ERROR proxier.py:681 -- Server startup failed for client: 71aa1ee5efa1441b937aecb493ed977f, using JobConfig: <ray.job_config.JobConfig object at 0x7f85ec1ee460>!
2022-10-31 05:38:03,709	INFO proxier.py:390 -- Specific server 71aa1ee5efa1441b937aecb493ed977f is no longer running, freeing its port 23000
2022-10-31 05:38:33,710	ERROR proxier.py:379 -- Timeout waiting for channel for 71aa1ee5efa1441b937aecb493ed977f
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 374, in get_channel
    grpc.channel_ready_future(server.channel).result(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-10-31 05:38:33,711	WARNING proxier.py:777 -- Retrying Logstream connection. 1 attempts failed.
2022-10-31 05:38:33,712	INFO proxier.py:742 -- 71aa1ee5efa1441b937aecb493ed977f last started stream at 1667219808.5511196. Current stream started at 1667219808.5511196.
2022-10-31 05:38:35,713	ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:35,713	WARNING proxier.py:777 -- Retrying Logstream connection. 2 attempts failed.
2022-10-31 05:38:37,715	ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:37,715	WARNING proxier.py:777 -- Retrying Logstream connection. 3 attempts failed.
2022-10-31 05:38:39,717	ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:39,717	WARNING proxier.py:777 -- Retrying Logstream connection. 4 attempts failed.
2022-10-31 05:38:41,719	ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:41,719	WARNING proxier.py:777 -- Retrying Logstream connection. 5 attempts failed

Also, in runtime_env_setup-ray_client_server_23000.log, I found:

2022-10-31 05:36:48,569	INFO container.py:47 -- start worker in container with prefix: podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --env-host --env RAY_RAYLET_PID=154 --cgroups=enabled --entrypoint python docker.io/rayproject/ray:2.0.0-py38

I think this issue is related to the connection between the client proxy and the client server, which seems to run in the container; however, as shown in the logs, the container is created with the --network=host flag. I wonder if someone from the Ray team could point me towards a workaround, or some documentation regarding the setup of the client servers, as I am willing to contribute.
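
For what it's worth, here is a minimal connectivity check (standard library only; port 23000 is taken from the logs above) that I can run on the head node to see whether the specific server's port is reachable:

import socket

# Hypothetical diagnostic, not part of Ray: test whether the specific
# server's gRPC port (from ray_client_server_23000.err) accepts connections.
def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("127.0.0.1", 23000))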

Regarding issue severity, I'll leave it at Medium since my only alternatives are:

  • Pack everything in the cluster image, which is a bit limiting for my setup
  • Use conda and wait up to 10 minutes for dependency install

Thanks.

Versions / Dependencies

About ray

ray[default]==2.0.0
kuberay-operator: kuberay/operator:v0.3.0

Podman installed on cluster base image

(base) ray@lany-cluster-head-bvkg6:~$ podman info
host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.1.2, commit: '
  cpus: 8
  distribution:
    codename: focal
    distribution: ubuntu
    version: "20.04"
  eventLogger: file
  hostname: lany-cluster-head-bvkg6
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 100
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.10.133+
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 27025526784
  memTotal: 33671999488
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version UNKNOWN
      commit: ea1fe3938eefa14eb707f1d22adff4db670645d6
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /tmp/podman-run-1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.8
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.4.3
  swapFree: 0
  swapTotal: 0
  uptime: 283h 18m 10.55s (Approximately 11.79 days)
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/ray/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 0
    stopped: 1
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: 'fuse-overlayfs: /usr/bin/fuse-overlayfs'
      Version: |-
        fusermount3 version: 3.9.0
        fuse-overlayfs: version 1.5
        FUSE library version 3.9.0
        using FUSE kernel interface version 7.31
  graphRoot: /home/ray/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: overlayfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 1
  runRoot: /tmp/podman-run-1000/containers
  volumePath: /home/ray/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.2
  Built: 0
  BuiltTime: Wed Dec 31 16:00:00 1969
  GitCommit: ""
  GoVersion: go1.15.2
  OsArch: linux/amd64
  Version: 3.4.2

Reproduction script

import ray

ray.init('ray://localhost:10001', runtime_env={
    'container': {
            'image': 'docker.io/rayproject/ray:2.0.0-py38',
            'run_options': ['--cgroups=enabled'],
        },
})

Issue Severity

Medium: It is a significant difficulty but I can work around it.

igorgad added labels bug and triage on Oct 31, 2022
architkulkarni added label P2 (Important issue, but not time-critical) and removed triage on Nov 1, 2022
architkulkarni added this to the runtime_env backlog milestone on Nov 1, 2022
architkulkarni (Contributor) commented Nov 1, 2022

cc @SongGuyang in case there are any workarounds for the container issue.

As another possible workaround: you mentioned conda takes 10 minutes to install. If the conda environment isn't changing often, would it fit your use case to preinstall the conda environment and then just specify the name of the existing environment in the runtime_env, e.g. runtime_env={"conda": "my-existing-env"}? Then it would just activate the existing environment at runtime instead of installing it, so it should be faster.
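
Something like this, as a sketch (it assumes an environment named my-existing-env already exists under that name on every node):

import ray

# Preinstalled-conda workaround: Ray activates the named environment at
# runtime instead of creating it, so startup should be much faster.
# "my-existing-env" is a placeholder for your preinstalled env name.
ray.init(
    "ray://localhost:10001",
    runtime_env={"conda": "my-existing-env"},
)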

igorgad (Author) commented Nov 1, 2022

Hey @architkulkarni, thanks for your quick reply.

Yes, it's an alternative. I'm curious, though: does preinstalling the conda environment on the head node make it shareable with new workers? If not, it would take a considerable amount of time to install the conda environment on new workers unless it is baked into the cluster's base image. The problem at the moment is that we try to work with a more generic cluster that serves multiple projects through the use of runtime environments.

architkulkarni (Contributor) commented

Ah no, you would need the conda environment to be on all the nodes of the cluster and have the same name on all nodes.

peterghaddad (Contributor) commented

@architkulkarni I am experiencing issues when trying to test the Alpha Container Runtime feature. Is podman a necessary dependency? I noticed the container runtime is specified in the code (see this issue: #29665).

We are using KubeRay + CRI-O as our container runtime on Kubernetes. Is the expectation for this feature to have the autoscaler launch a new worker? Does this work natively with existing Kubernetes architectures?

We don't have Podman installed nor use it: bash: line 0: exec: podman: not found

architkulkarni (Contributor) commented

Hi @peterghaddad, I believe podman is required. You might be able to find some more details in this thread, but support is limited at the moment: https://discuss.ray.io/t/how-to-use-container-in-runtime-environments/6175/11. I don't expect that this feature has any special compatibility with Kubernetes. Like other runtime_env fields such as conda, this feature is for worker processes, not nodes launched by the autoscaler (which are also, unfortunately, called "workers"), so it shouldn't have any interaction with the autoscaler.
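
To illustrate the worker-process scope with a generic (non-container) sketch: a runtime_env can be set per task or actor, and it configures the worker process that executes it, not any node:

import ray

ray.init()

# The runtime_env applies to the worker *process* that runs this task;
# no new node is launched or configured by the autoscaler.
@ray.remote(runtime_env={"pip": ["emoji"]})
def f():
    import emoji
    return emoji.emojize("Ray is :thumbs_up:")

print(ray.get(f.remote()))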

peterghaddad (Contributor) commented Dec 15, 2022

Thanks for the response @architkulkarni. So the worker is what pulls the actual image, i.e. an image runs within an image when using KubeRay? It may make sense to have an integration for KubeRay where it launches a new Pod with the specified image, installs environment dependencies, then kicks off a job. Food for thought, but I think this would be robust when running in K8s environments!

edoakes added the core-runtime-env label on Mar 23, 2023
jjyao added the core label on Mar 28, 2023
arneyjfs commented

Is there any update on this? I have exactly the same problem but unfortunately don't have a workaround.


arneyjfs commented Mar 26, 2024

Here's a bit more info.

I'm attempting to start a job from a Python interactive environment. It's important to do it this way, as jobs will eventually be submitted by the Prefect job scheduler, which integrates with Ray via prefect-ray. Here is the Python code I am using:

import ray
import time
import logging
from ray.runtime_env import RuntimeEnv

logger = logging.getLogger()

env = RuntimeEnv(container={
    "image": "europe-west2-docker.pkg.dev/<GCP_PROJECT>/test-docker/test-prefect-ray:0.0.1b1",
    "run_options": ["--log-level=debug"]
})


ray.init("ray://<Server-IP>:10001", runtime_env=env)


@ray.remote
def square(x):
    logger.warning('Example log')
    return x * x


start = time.time()
object_references = [
    square.remote(item) for item in range(8)
]
data = ray.get(object_references)
print(data)

I have one node at the moment, the head node, which is a GCP Virtual Machine, started with ray start --head --port=6379 --dashboard-host=<Server-IP>

Logs

There's not much useful information I can see in the logs. As far as I can tell, the container is being downloaded on the head node, and from there it is struggling to reach the Ray server on the VM (the same machine the container is running on). Starting this container manually, I am at least able to ping the host IP from a container bash session.

Output from ray_client_server_23000.err

... ^ truncated ^ ...
time="2024-03-26T11:02:16Z" level=debug msg="running conmon: /usr/libexec/podman/conmon" args="[--api-version 1 -c 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35 -u 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35 -r /usr/bin/crun -b /home/jamesarney/.local/share/containers/storage/overlay-containers/35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35/userdata -p /run/user/1006/containers/overlay-containers/35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35/userdata/pidfile -n focused_buck --exit-dir /run/user/1006/libpod/tmp/exits --full-attach -l journald --log-level debug --syslog --conmon-pidfile /run/user/1006/containers/overlay-containers/35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /home/jamesarney/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /run/user/1006/containers --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg cgroupfs --exit-command-arg --tmpdir --exit-command-arg /run/user/1006/libpod/tmp --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35]"
[conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied

time="2024-03-26T11:02:16Z" level=info msg="Failed to add conmon to cgroupfs sandbox cgroup: error creating cgroup for cpu: mkdir /sys/fs/cgroup/cpu/conmon: permission denied"
time="2024-03-26T11:02:16Z" level=debug msg="Received: 73931"
time="2024-03-26T11:02:16Z" level=info msg="Got Conmon PID as 73928"
time="2024-03-26T11:02:16Z" level=debug msg="Created container 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35 in OCI runtime"
time="2024-03-26T11:02:16Z" level=debug msg="Attaching to container 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35"
time="2024-03-26T11:02:16Z" level=debug msg="Starting container 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35 with command [python -m ray.util.client.server --address=10.128.0.52:6379 --host=0.0.0.0 --port=23000 --mode=specific-server]"
time="2024-03-26T11:02:16Z" level=debug msg="Started container 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35"
time="2024-03-26T11:02:16Z" level=debug msg="Enabling signal proxying"
2024-03-26 11:02:18,161	INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:23000, args Namespace(host='0.0.0.0', port=23000, mode='specific-server', address='10.128.0.52:6379', redis_password=None, runtime_env_agent_address=None)
2024-03-26 11:02:23,208	INFO server.py:930 -- 25 idle checks before shutdown.
2024-03-26 11:02:28,221	INFO server.py:930 -- 20 idle checks before shutdown.
2024-03-26 11:02:33,233	INFO server.py:930 -- 15 idle checks before shutdown.
2024-03-26 11:02:38,244	INFO server.py:930 -- 10 idle checks before shutdown.
2024-03-26 11:02:43,256	INFO server.py:930 -- 5 idle checks before shutdown.
time="2024-03-26T11:02:48Z" level=debug msg="Called run.PersistentPostRunE(podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --userns=keep-id --env RAY_RAYLET_PID=69972 --env RAY_JOB_ID= --env RAY_CLIENT_MODE=0 --env RAY_LD_PRELOAD=1 --env RAY_NODE_ID=f1810c0e0436d3671a5d97bfd1583d77408d9605b7a186f6be6bb733 --env RAY_enable_pipe_based_agent_to_parent_health_check=1 --log-level=debug --entrypoint python europe-west2-docker.pkg.dev/biocortex-project/test-docker/test-prefect-ray:0.0.1b1 -m ray.util.client.server --address=10.128.0.52:6379 --host=0.0.0.0 --port=23000 --mode=specific-server)"

Output from ray_client_server.err

2024-03-25 19:25:36,955	INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001, mode='proxy', address='10.128.0.52:6379', redis_password=None, runtime_env_agent_address='http://10.128.0.52:56619')
2024-03-26 11:02:15,537	INFO proxier.py:696 -- New data connection from client afda9a422aa8463fad3f5dcf1f09ebe3:
2024-03-26 11:02:15,553	INFO proxier.py:223 -- Increasing runtime env reference for ray_client_server_23000.Serialized runtime env is {"container": {"image": "europe-west2-docker.pkg.dev/biocortex-project/test-docker/test-prefect-ray:0.0.1b1", "run_options": ["--log-level=debug"]}}.
2024-03-26 11:02:48,668	ERROR proxier.py:333 -- SpecificServer startup failed for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:02:48,669	INFO proxier.py:341 -- SpecificServer started on port: 23000 with PID: 73886 for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:02:48,669	ERROR proxier.py:707 -- Server startup failed for client: afda9a422aa8463fad3f5dcf1f09ebe3, using JobConfig: <ray.job_config.JobConfig object at 0x7f0c26b8c490>!
2024-03-26 11:02:56,925	INFO proxier.py:391 -- Specific server afda9a422aa8463fad3f5dcf1f09ebe3 is no longer running, freeing its port 23000
2024-03-26 11:03:18,673	ERROR proxier.py:380 -- Timeout waiting for channel for afda9a422aa8463fad3f5dcf1f09ebe3
Traceback (most recent call last):
  File "/home/jamesarney/.cache/pypoetry/virtualenvs/jamesarney-Ei4ktb2p-py3.10/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 375, in get_channel
    grpc.channel_ready_future(server.channel).result(
  File "/home/jamesarney/.cache/pypoetry/virtualenvs/jamesarney-Ei4ktb2p-py3.10/lib/python3.10/site-packages/grpc/_utilities.py", line 162, in result
    self._block(timeout)
  File "/home/jamesarney/.cache/pypoetry/virtualenvs/jamesarney-Ei4ktb2p-py3.10/lib/python3.10/site-packages/grpc/_utilities.py", line 106, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2024-03-26 11:03:18,677	INFO proxier.py:768 -- afda9a422aa8463fad3f5dcf1f09ebe3 last started stream at 1711450935.384032. Current stream started at 1711450935.384032.
2024-03-26 11:03:18,678	WARNING proxier.py:804 -- Retrying Logstream connection. 1 attempts failed.
2024-03-26 11:03:20,680	ERROR proxier.py:351 -- Unable to find channel for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:03:20,681	WARNING proxier.py:804 -- Retrying Logstream connection. 2 attempts failed.
2024-03-26 11:03:22,683	ERROR proxier.py:351 -- Unable to find channel for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:03:22,683	WARNING proxier.py:804 -- Retrying Logstream connection. 3 attempts failed.
2024-03-26 11:03:24,685	ERROR proxier.py:351 -- Unable to find channel for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:03:24,686	WARNING proxier.py:804 -- Retrying Logstream connection. 4 attempts failed.
2024-03-26 11:03:26,688	ERROR proxier.py:351 -- Unable to find channel for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:03:26,689	WARNING proxier.py:804 -- Retrying Logstream connection. 5 attempts failed.

tanguy-s commented

I am facing the same issue with ray==2.22.0 on Ubuntu 22.04, Podman version 4.6.2.

Is there any workaround or pending bug fix?

anyscalesam added the triage label and removed P2 on Jun 18, 2024
jjyao (Collaborator) commented Jun 24, 2024

@zcin could you take this one?

jjyao added the P1.5 label and removed triage on Jun 24, 2024