Add LoRA dynamic switching for inference #71
Comments
@Jeevi10 Hey, can you link some resources on dynamic LoRA specifically for Whisper, mainly how this type of inference works and how to use LoRA to fine-tune Whisper?
@StephennFernandes Thank you for your reply.

Resources for dynamic LoRA: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#run-bart-with-lora. This is the example repo where I got the idea from. Unfortunately, I don't see any implementation specific to Whisper. Just to give you an idea, I created a running example using Hugging Face `transformers` and `peft`:

```python
import torch
from datasets import load_dataset
from peft import PeftModel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_id = "distil-whisper/distil-large-v3"

base_model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
base_model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Load the first LoRA adapter and register a second one for later switching
peft_model_id = "path to checkpoint adapter 1"
model = PeftModel.from_pretrained(base_model, peft_model_id, adapter_name="adapter 1")
model.load_adapter("path to checkpoint adapter 2", adapter_name="adapter 2")  # second adapter path is a placeholder

# Enable static cache for generation
model.generation_config.cache_implementation = "static"

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

def iterate_data(dataset):
    for item in dataset:
        yield item["audio"]

# set the batch size in accordance to your device
BATCH_SIZE = 16

# run streamed inference with adapter 1
predictions = []
for out in pipe(iterate_data(dataset), batch_size=BATCH_SIZE):
    predictions.append(out["text"])
print(predictions)

# switch adapters on the fly, then run streamed inference with adapter 2
pipe.model.set_adapter("adapter 2")
predictions = []
for out in pipe(iterate_data(dataset), batch_size=BATCH_SIZE):
    predictions.append(out["text"])
print(predictions)
```

Whisper fine-tuning with LoRA: https://github.com/Vaibhavs10/fast-whisper-finetuning
@Jeevi10 thanks for the heads up. I'll try to write an update for WhisperS2T for being able to use dynamic adapters |
@StephennFernandes I am looking forward to it.
Add dynamic LoRA (Low-Rank Adaptation) switching functionality, allowing users to change LoRA adapters on the fly during inference without reloading the entire model.
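For illustration, a minimal sketch of the core idea built on Hugging Face `peft` (this is not WhisperS2T's API; the adapter names and paths below are hypothetical placeholders):

```python
from peft import PeftModel
from transformers import AutoModelForSpeechSeq2Seq

# Load the base Whisper model once, then attach several LoRA adapters to it.
base = AutoModelForSpeechSeq2Seq.from_pretrained("distil-whisper/distil-large-v3")
model = PeftModel.from_pretrained(base, "adapters/domain_a", adapter_name="domain_a")  # hypothetical path
model.load_adapter("adapters/domain_b", adapter_name="domain_b")  # hypothetical path

def generate_with_adapter(inputs, adapter_name):
    # Activating an adapter is an in-place switch; the base weights are never reloaded.
    model.set_adapter(adapter_name)
    return model.generate(**inputs)
```

Because `set_adapter` only changes which adapter is active, switching per request is cheap compared to reloading a full model.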