This is worker code that uses ExLlamaV2 for inference on Runpod Serverless.
- Clone this repository
- Build the Docker image
- Push the Docker image to your Docker registry
- Deploy to Runpod Serverless
Build the image:

```bash
docker build -t <your docker registry>/<your docker image name>:<your docker image tag> .
```
These are the build arguments:

Note: The model is downloaded on first run, so it is fetched at runtime rather than at build time. Make sure to attach a network volume to your serverless endpoint so the download happens only once and the cache is reused on subsequent runs (Runpod typically mounts network volumes at `/runpod-volume`, which makes it a sensible value for `MODEL_BASE_PATH`).
| Key | Value | Optional |
|---|---|---|
| `HUGGING_FACE_HUB_TOKEN` | your Hugging Face Hub token | true |
| `MODEL_NAME` | your model name | false |
| `MODEL_REVISION` | your model revision | true |
| `MODEL_BASE_PATH` | your model base path | true |
| `LORA_ADAPTER_NAME` | your LoRA adapter name | true |
| `LORA_ADAPTER_REVISION` | your LoRA adapter revision | true |
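For example, a build that pins a specific model might look like the following; the model name, revision, and token are placeholders, so substitute your own values:

```bash
# Hypothetical values shown for illustration only.
docker build \
  --build-arg MODEL_NAME=TheBloke/Llama-2-7B-GPTQ \
  --build-arg MODEL_REVISION=main \
  --build-arg HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxx \
  -t <your docker registry>/<your docker image name>:<your docker image tag> .
```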
Push the image:

```bash
docker push <your docker registry>/<your docker image name>:<your docker image tag>
```
Once the image is in your Docker registry, you can deploy it to Runpod Serverless.
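After deploying, you can verify the endpoint with a test request. Runpod Serverless exposes a `runsync` endpoint that takes a JSON body with an `input` object; the payload fields below (`prompt`, `max_new_tokens`) are assumptions about this worker's handler, so adjust them to match its actual schema:

```bash
# <your endpoint id> and <your Runpod API key> come from the Runpod console.
# The "input" fields are hypothetical; check the handler for the real schema.
curl -X POST "https://api.runpod.ai/v2/<your endpoint id>/runsync" \
  -H "Authorization: Bearer <your Runpod API key>" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "Hello, world!", "max_new_tokens": 128}}'
```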