Exllamav2 production prospect #355

Answered by turboderp
luisfrentzen-cc asked this question in Q&A
It depends on what you mean by production. If you mean running on a large inference server with many concurrent users, then no, it's not especially well suited for that. I would consider paged attention an essential feature, for instance (for efficient continuous batching). That may be coming soon, but this is still largely a solo project and I only have so much time to dedicate to each feature.
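To illustrate why paged attention matters for continuous batching: instead of reserving one contiguous KV-cache region per sequence, the cache is carved into fixed-size blocks that sequences claim on demand from a shared pool, and a finished request's blocks go straight back to the pool. This is a minimal sketch of that idea (class name, block size, and API are all hypothetical, not ExLlamaV2 code):

```python
BLOCK_SIZE = 16  # tokens per cache block (illustrative value)

class PagedKVCache:
    """Toy block allocator sketching the paged-attention memory scheme."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, pos: int) -> int:
        """Return the physical block holding token `pos` of sequence `seq_id`,
        grabbing a new block from the pool when a block boundary is crossed."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE >= len(table):
            if not self.free_blocks:
                raise MemoryError("KV-cache pool exhausted")
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE]

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool, so a new request
        can be admitted immediately (the core of continuous batching)."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

Because waste is bounded by one partially filled block per sequence, many concurrent requests of very different lengths can share one pool instead of each reserving worst-case context up front.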

What's more, as you go up in batch size, the benefits of quantization start to matter less and less. The amount of VRAM required for context scales with the number of concurrent users you want to support, while the weights stay the same size. So when you eventually have to reserve 200 GB of VRAM for…
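A back-of-envelope calculation makes the scaling argument concrete. The numbers below are illustrative for a generic 70B-class model with grouped-query attention and an fp16 KV cache, not ExLlamaV2-specific figures:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Memory for one sequence's KV cache: a K and a V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class config: 80 layers, 8 KV heads, head_dim 128, 4K context.
per_user = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)
print(f"KV cache per user:   {per_user / 2**30:.2f} GiB")        # 1.25 GiB
print(f"KV cache, 100 users: {100 * per_user / 2**30:.0f} GiB")  # 125 GiB

# Weight memory is fixed regardless of how many users you serve:
params = 70e9
print(f"weights at fp16:  {params * 2.0 / 2**30:.0f} GiB")  # ~130 GiB
print(f"weights at 4-bit: {params * 0.5 / 2**30:.0f} GiB")  # ~33 GiB
```

With these assumed numbers, quantization saves a fixed ~100 GiB on weights, while the cache grows by ~1.25 GiB per concurrent user, so at large batch sizes the cache dominates and the relative benefit of quantized weights shrinks.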

Answer selected by luisfrentzen-cc