Created first running version for llama2-chat deployment in kserve
Showing 9 changed files with 4,303 additions and 0 deletions.
@@ -0,0 +1,10 @@
FROM nvcr.io/nvidia/pytorch:23.09-py3
RUN apt-get update && apt-get install -y libglib2.0-0
RUN mkdir /app
WORKDIR /app
COPY pyproject.toml /app/
COPY llama2_chat /app/llama2_chat
RUN python -m pip install .
ADD main.py ./main.py
ENV PORT=8080
EXPOSE 8080
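
The Dockerfile does not set a `CMD` or `ENTRYPOINT` for serving; the command is supplied at deploy time (see the predictor manifest referenced in the README below). For a quick local smoke test of the image, something along these lines should work (a sketch; registry name, tag, and token are placeholders):
```sh
docker run --rm -p 8080:8080 \
  -e HF_ACCESS_TOKEN=<example-token> \
  <your-image-registry-name>/llama2-chat:<TAG> \
  python main.py --name "llama2-chat" --device "cpu" --hf-model-string "meta-llama/Llama-2-13b-chat-hf"
```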
@@ -0,0 +1,92 @@
# Llama2 Deployment

For deploying the Llama2 model, we will create a custom predictor using KServe. KServe provides a way to implement a custom model server, enabling you to bring your own models and pre/post-processing code. If you are new to creating custom predictors with KServe, you may read more about it in the [KServe documentation](https://kserve.github.io/website/0.8/modelserving/v1beta1/custom/custom_model/).

For a gentle introduction and detailed guide on working with Llama2, consider reading this blog post: [How to prompt Llama](https://replicate.com/blog/how-to-prompt-llama).

## Hugging Face Token

You'll need a Hugging Face token to use the transformers library for downloading
the model weights. Additionally, you'll need to ask Meta for permission to use the
Llama model itself. You can read more about that [here](https://huggingface.co/blog/llama2).

The code expects an environment variable named `HF_ACCESS_TOKEN` that holds the token as a string.

### Setting the Variable Locally

To set this variable for a local deployment, run:
```sh
export HF_ACCESS_TOKEN=<example-token>
```

### Creating a Kubernetes Secret

We've created a file called hf-token-secret.yaml in this directory to create a
Kubernetes secret that holds the access token (which we'll need for the KServe
deployment later on). First, encode the token to base64 by running:
```sh
echo -n <example-token> | base64
```

Now, the output of the above command can be used to create the Kubernetes secret:
```sh
sed 's/HF_ACCESS_TOKEN/<example-token-base64>/' hf-token-secret.yaml | kubectl apply -n <your-namespace> -f -
```
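
To confirm the secret was created correctly, a quick optional check (a sketch, assuming the `HF_TOKEN` key used in hf-token-secret.yaml):
```sh
# This should print the original (decoded) token.
kubectl get secret hf-token-secret -n <your-namespace> -o jsonpath='{.data.HF_TOKEN}' | base64 -d
```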

## Run/develop Locally

To run it locally, you'll first need to install the Python dependencies. To
do so, run the following from this directory:
```sh
poetry install
export HF_ACCESS_TOKEN=<example-token>
poetry run python main.py --name "llama2-chat" --device "cpu" --hf-model-string "meta-llama/Llama-2-13b-chat-hf"
```
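
If you prefer not to use Poetry, the dependencies can also be installed the same way the Dockerfile does it (a sketch; run from this directory):
```sh
python -m pip install .
export HF_ACCESS_TOKEN=<example-token>
python main.py --name "llama2-chat" --device "cpu" --hf-model-string "meta-llama/Llama-2-13b-chat-hf"
```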

### Run Query Against Local Deployment

Once the server is up and running, you might try something like:
```sh
curl localhost:8080/v1/models/llama2-chat:predict -d '{"top_k": 3, "max_length": 200, "instances": ["Why is MLOps so important?"]}'
```
We decided to let the user set the inference options `top_k` (the number of highest-probability tokens the model samples from at each step) and `max_length` in the API query, along with the prompt.
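
For reference, the same request from Python (a sketch using the `requests` library; the endpoint and payload mirror the curl call above):
```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/models/llama2-chat:predict",
    json={"top_k": 3, "max_length": 200, "instances": ["Why is MLOps so important?"]},
)
print(resp.text)  # raw response body; the predictor returns the generated text
```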

## Build and Push Image

To build the image using Docker, run:
```sh
docker build -t <your-image-registry-name>/llama2-chat:<TAG> .
```

To use this image in KServe, we'll also need to push it to an image registry.
First, log in:
```sh
docker login <example.registry.com>:4567 -u <token name> -p <token>
```

Note: This may require setting up a project access token in the GitLab UI first.
Navigate to: GitLab UI -> this Repository -> Settings -> Access Token -> Add a project access token.

Now, you can push the image using Docker:
```sh
docker push <your-image-registry-name>/llama2-chat:<TAG>
```

## Deploy on KServe

Ensure there is a secret called hf-token-secret in your namespace and run this
to create the KServe inference service:
```sh
sed 's/IMAGE_REGISTRY/<your-image-registry>/;s/TAG/<your-tag>/' predictor.yaml | kubectl create -n <your-namespace> -f -
```
Note that you might need to change the command in `predictor.yaml` to match your hardware (for example, `--device "cpu"` versus a GPU index).
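
The predictor.yaml itself is not shown in this commit view. As a rough idea of the shape such a manifest typically takes for a custom KServe predictor (a sketch only, not the actual file in this repo; the name, command, and args are assumptions):
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama2-chat
spec:
  predictor:
    containers:
      - name: kserve-container
        image: IMAGE_REGISTRY/llama2-chat:TAG  # placeholders replaced by the sed command above
        command: ["python", "main.py"]
        args: ["--name", "llama2-chat", "--device", "cpu", "--hf-model-string", "meta-llama/Llama-2-13b-chat-hf"]
        env:
          - name: HF_ACCESS_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token-secret
                key: HF_TOKEN
```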

### Run Query Against the KServe Deployment

First, wait for the API to be ready; you can observe the progress, logs, etc.,
in the Kubeflow UI under Models. Once it's ready, you can copy the internal URL from
the Kubeflow UI and try:
```sh
curl <internal-inference-url> -d '{"top_k": 3, "max_length": 100, "instances": ["Why is MLOps so important?"]}'
```
@@ -0,0 +1,8 @@
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
data:
  HF_TOKEN: HF_ACCESS_TOKEN
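  # HF_ACCESS_TOKEN above is a placeholder, not a literal value; the sed command in the
  # README substitutes the base64-encoded Hugging Face token here before `kubectl apply`.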
Empty file.
@@ -0,0 +1,106 @@
import json
import os
from typing import Dict, Union

import torch
import transformers
from huggingface_hub import login
from kserve import InferRequest, InferResponse, Model
from loguru import logger
from transformers import AutoTokenizer


class Llama2ChatPredictor(Model):
    def __init__(
        self,
        name: str,
        hf_model_string: str = "meta-llama/Llama-2-7b-chat-hf",
        loguru_loglevel: str = "INFO",
        device: Union[int, str] = "auto",
    ):
        """
        Initialize the Llama2ChatPredictor.

        Parameters:
            name: The name of the model.
            hf_model_string: The identifier of the Hugging Face model.
            loguru_loglevel: Logging level for loguru.
            device: The device to run the model on (0, 1, ..., 'cpu', 'cuda').
        """
        super().__init__(name=name)
        os.environ["LOGURU_LEVEL"] = loguru_loglevel
        self.device = device
        self.hf_model_string = hf_model_string
        self.load()
        self.ready = True

    def load(self):
        """
        Load the model weights and tokenizer from Hugging Face and create the pipeline.
        """
        # Authenticate against the Hugging Face Hub so the gated Llama2 weights can be downloaded.
        login(token=os.environ["HF_ACCESS_TOKEN"])
        logger.info(f"Loading weights and tokenizer for model: {self.hf_model_string}")
        self.tokenizer = AutoTokenizer.from_pretrained(self.hf_model_string)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        # Use half precision when a GPU is available, full precision on CPU.
        self.pipeline = transformers.pipeline(
            "text-generation",
            model=self.hf_model_string,
            torch_dtype=torch.float32 if not torch.cuda.is_available() else torch.float16,
            device=self.device,
        )

    def preprocess(
        self,
        payload: Union[bytes, Dict, InferRequest],
        headers: Dict[str, str] = None,
    ) -> Dict:
        """
        Preprocess the payload into the format expected by the predict method.

        Parameters:
            payload: The input payload, as a bytes string or a dict.
            headers: Additional headers (unused here, but expected by the base class).

        Returns:
            The payload as a dict.
        """
        return payload if isinstance(payload, dict) else json.loads(payload)

    def predict(
        self,
        payload: Dict,
        headers: Dict[str, str] = None,
    ) -> Union[str, InferResponse]:
        """
        Generate predictions using the input payload.

        Parameters:
            payload: The input payload, preprocessed to dict format.
            headers: Additional headers (unused here, but expected by the base class).

        Returns:
            A string containing the generated text.
        """
        # The list under "instances" is formatted directly into the prompt;
        # top_k and max_length come from the request payload.
        sequences = self.pipeline(
            f"{payload['instances']}\n",
            do_sample=True,
            top_k=payload["top_k"],
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            max_length=payload["max_length"],
        )
        return sequences[0]["generated_text"]
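
For a quick sanity check of the predictor class outside the model server, something like the following should work (a sketch; it assumes `HF_ACCESS_TOKEN` is set and will download the full model weights on first run):
```python
# Hypothetical direct usage of the predictor, bypassing ModelServer.
predictor = Llama2ChatPredictor(name="llama2-chat", device="cpu")
payload = predictor.preprocess(b'{"top_k": 3, "max_length": 100, "instances": ["Why is MLOps so important?"]}')
print(predictor.predict(payload))
```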
@@ -0,0 +1,47 @@
from typing import Union

import click
import torch
from kserve import ModelServer
from loguru import logger

from llama2_chat.predictor import Llama2ChatPredictor


@click.command()
@click.option("--name", type=str, default="llama2-chat", help="Model name.")
@click.option("--loglevel", type=str, default="INFO", help="Log level for loguru.")
@click.option(
    "--hf-model-string",
    type=str,
    default="meta-llama/Llama-2-13b-chat-hf",
    help="Identifier of the model on the Hugging Face Hub.",
)
@click.option(
    "--device",
    default="cpu",
    help="Device (0, 1, ..., 'cpu', 'cuda').",
)
def main(name: str, loglevel: str, hf_model_string: str, device: Union[str, int]):
    """
    Main function to initiate and start the KServe Model Server.

    Parameters:
        name: The name of the model.
        loglevel: The logging level for loguru.
        hf_model_string: Identifier for the Hugging Face model.
        device: Computational device to use (0, 1, 2, ..., 'cpu', 'cuda', 'mps').
    """
    logger.info(f"Cuda is available? -> {torch.cuda.is_available()}")
    # Numeric device strings (GPU indices) are cast to int; named devices are passed through.
    device = int(device) if device not in ("cpu", "cuda", "mps") else device
    kserve_model = Llama2ChatPredictor(
        name=name,
        hf_model_string=hf_model_string,
        loguru_loglevel=loglevel,
        device=device,
    )
    ModelServer().start([kserve_model])


if __name__ == "__main__":
    main()