# vLLM Helm Chart

A Helm chart for deploying vLLM. vLLM is a fast and easy-to-use library for LLM inference and serving.

## Usage

Basic usage:

```bash
helm repo add substratusai https://substratusai.github.io/helm
helm install mistral-7b-instruct substratusai/vllm \
  --set model=mistralai/Mistral-7B-Instruct-v0.1 \
  --set resources.limits."nvidia\.com/gpu"=1
```
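
Once the pod is running, you can sanity-check the server. The sketch below assumes the chart exposes vLLM's OpenAI-compatible API behind a Kubernetes Service; the Service name depends on the chart's fullname template, so confirm it with `kubectl get svc` first:

```bash
# Forward the chart's Service (port 80 by default) to localhost.
# The Service name below is an assumption; replace it with the name
# reported by `kubectl get svc`.
kubectl port-forward svc/mistral-7b-instruct-vllm 8080:80

# In another terminal, send a test completion request.
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.1",
        "prompt": "San Francisco is a",
        "max_tokens": 32
      }'
```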

## Mistral 7B Instruct on GCP targeting 1 x L4 GPU

Create a file named `values.yaml` with the following content:

```yaml
model: mistralai/Mistral-7B-Instruct-v0.1
resources:
  limits:
    nvidia.com/gpu: 1
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
```

Install using Helm:

```bash
helm install mistral-7b-instruct substratusai/vllm \
    -f values.yaml
```
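
To confirm the `nodeSelector` took effect, you can check where the pod landed; this is a generic check, not specific to the chart:

```bash
# List GKE nodes that carry the L4 accelerator label used above.
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-l4

# Verify the vLLM pod was scheduled onto one of those nodes.
kubectl get pods -o wide
```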

## Mistral 7B Instruct on GKE Autopilot with a ReadOnlyMany PVC to store the model

Create a K8s Job to load the model into a PVC:

```bash
kubectl apply -f https://raw.githubusercontent.com/substratusai/helm/main/charts/vllm/examples/load-model-job-mistral-7b-instruct.yaml
```
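
Before installing the chart, wait for the Job to finish and confirm the PVC it wrote to exists. The Job name below is an assumption taken from the example manifest's filename; check the actual name with `kubectl get jobs`:

```bash
# Wait for the model download to complete (large models can take a while).
kubectl wait --for=condition=complete --timeout=30m \
  job/load-model-mistral-7b-instruct   # assumed Job name

# The PVC name should match the sourcePVC used in values.yaml below.
kubectl get pvc mistral-7b-instruct
```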

Create a file named `values.yaml` with the following content:

```yaml
# This example requires first running
# `kubectl apply -f load-model-job-mistral-7b-instruct.yaml` to load the model into a PVC.
model: /model
servedModelName: mistral-7b-instruct-v0.1

replicaCount: 0

deploymentAnnotations:
  lingo.substratus.ai/models: mistral-7b-instruct-v0.1
  lingo.substratus.ai/min-replicas: "0" # needs to be string
  lingo.substratus.ai/max-replicas: "3" # needs to be string

readManyPVC:
  enabled: true
  sourcePVC: "mistral-7b-instruct"
  mountPath: /model
  size: 20Gi

nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4

resources:
  requests:
    cpu: 7
    memory: 24Gi
    ephemeral-storage: 10Gi
  limits:
    nvidia.com/gpu: 1
```

Install using Helm:

```bash
helm install mistral-7b-instruct substratusai/vllm \
    -f values.yaml
```
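
Because `replicaCount` is 0 and the Lingo annotations set `min-replicas` to 0, the Deployment starts with no pods and is scaled up on demand by Lingo. To sanity-check the chart without Lingo, you can scale it manually; the Deployment name depends on the chart's fullname template, so confirm it first:

```bash
# Find the Deployment created by the release.
kubectl get deployments

# Temporarily scale it up by hand (name below is an assumption).
kubectl scale deployment mistral-7b-instruct-vllm --replicas=1
```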

## Mistral 7B Instruct quantized using AWQ

Create a file named `values.yaml` with the following content:

```yaml
model: TheBloke/Mistral-7B-Instruct-v0.1-AWQ
quantization: awq
dtype: half
maxModelLen: 4096
resources:
  limits:
    nvidia.com/gpu: 1
```

Install using Helm:

```bash
helm install mistral-7b-instruct-awq substratusai/vllm \
    -f values.yaml
```
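
To verify the quantized model is being served, you can list the models exposed by vLLM's OpenAI-compatible API; the Service name below is an assumption, so confirm it with `kubectl get svc`:

```bash
# Forward the Service and list the served models.
kubectl port-forward svc/mistral-7b-instruct-awq-vllm 8080:80
curl http://localhost:8080/v1/models
```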

## Default Values

Take a look at the default `values.yaml`:

```yaml
replicaCount: 1
# Change this if you want to serve another model
model: mistralai/Mistral-7B-Instruct-v0.1
# optional, defaults to model name
servedModelName: ""
# optional, choose awq or squeezellm
quantization: ""
# optional, only required to be set to half when using quantization
dtype: ""
# Optional, default is 0.90
gpuMemoryUtilization: ""

# this only works on GKE today
readManyPVC:
  enabled: false
  # provide the name of the PVC that has the model
  sourcePVC: ""
  accessModes:
  - ReadOnlyMany
  mountPath: /model
  size: 30Gi
  # storageClass needs to match the sourcePVC storageClass
  storageClass: ""

deploymentAnnotations: {}

# Override the resources if you need more
resources:
  requests:
    cpu: 500m
    memory: "512Mi"

# Override env variables
env: {}
port: 8080

# Add nodeSelectors to target specific GPU types
nodeSelector: {}
#  E.g. for GCP L4 cloud.google.com/gke-accelerator: nvidia-l4

image:
  repository: substratusai/vllm
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: ""

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

podAnnotations: {}

podSecurityContext: {}

securityContext: {}

service:
  type: ClusterIP
  port: 80

ingress:
  enabled: false
  className: ""
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

tolerations: []

affinity: {}
```
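
Any of these defaults can be overridden per release, either in a values file or with `--set`. A minimal sketch (release name and overrides are illustrative):

```bash
helm upgrade --install mistral-7b-instruct substratusai/vllm \
  -f values.yaml \
  --set gpuMemoryUtilization="0.95" \
  --set servedModelName=mistral-7b-instruct
```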