Running RayDP on GPU machine stuck at GCS readiness #336

Open
chenya-zhang opened this issue Apr 19, 2023 · 1 comment


chenya-zhang commented Apr 19, 2023

Hi folks!

After multiple attempts, we found that RayDP does not seem to be compatible with the NVIDIA CUDA base image on GPU machines.

For example, below is our simple Dockerfile:

FROM nvidia/cuda:11.5.0-devel-ubuntu18.04

# https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

# Install Python and pip

RUN set -ex; \
  apt update && apt install curl gpg -y --force-yes 
RUN apt-get update && apt-get install -y --allow-unauthenticated python3.8-dev 
RUN apt-get install -y g++ python3-distutils
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.8 99  &&\
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 99
RUN curl https://bootstrap.pypa.io/get-pip.py -o /tmp/get-pip.py && python3.8 /tmp/get-pip.py
RUN mkdir -p /root

# Install Ray dependencies
RUN python3.8 -m  pip install ray raydp notebook

When deploying a RayCluster, the worker k8s pod gets stuck in init right after the event "Started container wait-gcs-ready".
The log of the wait-gcs-ready container shows:

mesg: ttyname failed: Inappropriate ioctl for device
wait for GCS to be ready
wait for GCS to be ready
wait for GCS to be ready
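
The log above only shows the retry message, so it does not say whether the head's GCS address is unreachable or whether the check itself fails inside this image. A minimal sketch for separating the two cases from inside the worker image follows. It assumes GCS listens on Ray's default port 6379, and the environment variable GCS_HOST is a hypothetical name used only for this illustration. A successful TCP connection only shows that the port is reachable; it is not a Ray health check.

```python
import os
import socket


def gcs_port_reachable(host, port=6379, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # GCS_HOST is a hypothetical variable: set it to the head service name.
    host = os.environ.get("GCS_HOST")
    if host:
        state = "reachable" if gcs_port_reachable(host) else "NOT reachable"
        print(f"{host}:6379 is {state}")
```

If the port is reachable from the worker image but the init container still loops, the problem is more likely in the check that runs inside the image than in the network.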

If we remove raydp from the pip install, Ray works without issue and the k8s pod runs well.

Are there any components in RayDP that might cause this incompatibility?

@kira-lin
Collaborator

Hi,
Sorry for the late reply. As mentioned before, we may not be able to give accurate suggestions here, because we haven't tried this kind of setup. I don't understand why it would get stuck on GCS, since that is a component of Ray and should have nothing to do with RayDP. Ray should be ready before you start RayDP, so installing RayDP alone should not cause a problem.

Maybe you can try installing raydp-nightly, and see if that makes any difference.
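
One way to test whether installing RayDP changes the environment that Ray runs in is to print the installed package versions in both images (with and without raydp, or with raydp-nightly) and compare them with the versions on the head node. The sketch below uses only the Python standard library. The package list is an assumption: pyspark is included on the guess that it is pulled in as a RayDP dependency, and the list can be adjusted as needed.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list of packages worth comparing between the two images.
    for name, version in installed_versions(["ray", "raydp", "pyspark"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in each image and on the head node would show whether the ray version differs once raydp is installed.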
