load whisper in float16 or int8, no external dependencies required #1990
phineas-pta started this conversation in Show and tell
Replies: 1 comment
Great tip. I have tried this with different content (large-v2, qint8) and the quality has been essentially the same.
advantages of quantization (float16 or int8): usually it requires external libraries like
- faster-whisper
- transformers + bitsandbytes
- whisper.cpp

BUT recent torch already has quantization built in, so no external libraries are needed.

credit: https://github.com/MiscellaneousStuff/openai-whisper-cpu
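As a sketch of the built-in torch path: dynamic int8 quantization via `torch.quantization.quantize_dynamic`, and float16 via `.half()`. A small `nn.Linear` stack stands in for the Whisper model here so the example is self-contained; with openai-whisper installed you would load the real model with `whisper.load_model(...)` instead.

```python
import torch
import torch.nn as nn

# Tiny stand-in for the Whisper model (a stack of Linear layers);
# with openai-whisper you would use whisper.load_model("base") here.
model = nn.Sequential(nn.Linear(80, 512), nn.ReLU(), nn.Linear(512, 80))

# int8: dynamic quantization stores weights as int8 and quantizes
# activations on the fly; only the listed module types are converted.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 80)
print(qmodel(x).shape)  # float32 in/out, int8 weights internally

# float16: just cast the weights (most useful on GPU, where it halves
# VRAM; float16 inference on CPU is typically slow).
hmodel = model.half()
print(next(hmodel.parameters()).dtype)
```

Note that `quantize_dynamic` only converts the module types you list (here `nn.Linear`, which holds most of Whisper's weights); everything else stays in float32.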
in case you don't have enough RAM/VRAM: quantize sequentially
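One reading of "quantize sequentially" (an assumption, not the author's exact recipe): avoid holding a full-precision and a quantized copy of the model in memory at the same time. `quantize_dynamic` deep-copies the model by default; passing `inplace=True` converts the existing modules instead, which keeps peak memory close to a single copy of the model:

```python
import torch
import torch.nn as nn

# Stand-in model; with openai-whisper you would load the real model here.
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))

# By default quantize_dynamic deep-copies the model, so the float32 and
# int8 versions briefly coexist in memory. inplace=True converts the
# existing modules instead, avoiding the second full copy.
torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8, inplace=True
)

out = model(torch.randn(2, 64))
print(out.shape)
```

The same idea applies to float16 on a small GPU: keep the model on CPU and move it over submodule by submodule after casting, rather than casting a whole second copy at once.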