This repository has been archived by the owner on May 28, 2024. It is now read-only.

Unable to run aviary on V100 GPU. #44

Open
AmoghM opened this issue Aug 23, 2023 · 1 comment

Comments

AmoghM commented Aug 23, 2023

I was trying to run Llama 2 on a machine with a V100 GPU.

I ran `aviary run --model ~/models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml` inside the Docker container and got the following output:

(HTTPProxyActor pid=2448) INFO 2023-08-22 23:38:44,774 http_proxy 172.17.0.2 http_proxy.py:904 - Proxy actor f5a0692e60801e1b0ef45a8301000000 starting on node 57297f3255438333c74bdc7b75d3fd3aa4b1c48e7bdcf6d07db72a41.
[INFO 2023-08-22 23:38:44,824] api.py: 320  Started detached Serve instance in namespace "serve".
(HTTPProxyActor pid=2448) INFO:     Started server process [2448]
[INFO 2023-08-22 23:38:44,951] api.py: 300  Connecting to existing Serve app in namespace "serve". New http options will not be applied.
(ServeController pid=2420) INFO 2023-08-22 23:38:44,942 controller 2420 deployment_state.py:1319 - Deploying new version of deployment meta-llama--Llama-2-7b-chat-hf_meta-llama--Llama-2-7b-chat-hf.
(ServeController pid=2420) INFO 2023-08-22 23:38:45,046 controller 2420 deployment_state.py:1586 - Adding 1 replica to deployment meta-llama--Llama-2-7b-chat-hf_meta-llama--Llama-2-7b-chat-hf.
(ServeController pid=2420) INFO 2023-08-22 23:38:45,083 controller 2420 deployment_state.py:1319 - Deploying new version of deployment router_Router.
(ServeController pid=2420) INFO 2023-08-22 23:38:45,187 controller 2420 deployment_state.py:1586 - Adding 2 replicas to deployment router_Router.
(ServeReplica:router_Router pid=2480) There was a problem when trying to write in your cache folder (/home/jupyter/cache/data/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
(autoscaler +15s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +15s) Error: No available node types can fulfill resource request {'accelerator_type_a10': 0.01, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(ServeController pid=2420) WARNING 2023-08-22 23:39:15,112 controller 2420 deployment_state.py:1889 - Deployment "meta-llama--Llama-2-7b-chat-hf_meta-llama--Llama-2-7b-chat-hf" has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"accelerator_type_a10": 0.01, "CPU": 1}, resources available: {"CPU": 14.0}.
(ServeReplica:router_Router pid=2479) There was a problem when trying to write in your cache folder (/home/jupyter/cache/data/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
(autoscaler +50s) Error: No available node types can fulfill resource request {'accelerator_type_a10': 0.01, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
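Separately from the scheduling error, the router replicas warn that the Hugging Face cache folder (/home/jupyter/cache/data/hub) is not writable. A minimal workaround, run before starting aviary, is to point `TRANSFORMERS_CACHE` at a writable directory (the path `/tmp/hf_cache` below is just an example, not from the logs):

```shell
# Example only: pick any directory the container user can write to.
export TRANSFORMERS_CACHE=/tmp/hf_cache
mkdir -p "$TRANSFORMERS_CACHE"
```

This silences the cache warning but does not by itself fix the unschedulable-resource error above.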

Is aviary incompatible with V100 GPUs?

shahrukhx01 commented Oct 10, 2023

@AmoghM Could you try replacing `accelerator_type_a10: 0.01` with `accelerator_type_v100: 1` in the meta-llama--Llama-2-7b-chat-hf.yaml file? The error in your logs shows the deployment is requesting an A10 accelerator, which your cluster does not have.
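A sketch of what the change might look like in the model YAML. The surrounding keys here (`scaling_config`, `resources_per_worker`) are assumptions based on typical aviary model configs and may differ in your file; the essential edit is the accelerator key itself:

```yaml
# meta-llama--Llama-2-7b-chat-hf.yaml (fragment; surrounding keys are illustrative)
scaling_config:
  num_workers: 1
  resources_per_worker:
    accelerator_type_v100: 1   # was: accelerator_type_a10: 0.01
```

The `accelerator_type_*` resource tells Ray which GPU type to schedule the replica on, so it must match the hardware actually present on the cluster.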

For more details, please see #23

alanwguo pushed a commit that referenced this issue Jan 25, 2024
Refactor deploy logic