Upgrade to exllama v2 #1016
Maybe @SunMarc could give some advice?
Hi @flozi00, happy to see that you are interested in adding support for the exllamav2 kernel to TGI. I would be happy to review the PR. The integration with transformers and optimum is practically done too. On the kernel side, everything should work based on my tests with autogptq. Make sure to add tests in TGI. You can take inspiration from @fxmarty's integration of exllama in TGI.
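For reference, enabling the exllamav2 kernels for a GPTQ checkpoint on the transformers side looks roughly like the minimal sketch below, based on the GPTQConfig documentation of the time (the model id is a placeholder):

```python
# Minimal sketch, based on the transformers GPTQConfig docs of the time;
# the model id below is a placeholder, swap in any GPTQ checkpoint.
from transformers import AutoModelForCausalLM, GPTQConfig

# exllama_config={"version": 2} selects the exllamav2 kernels for GPTQ weights
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # placeholder GPTQ model id
    device_map="auto",
    quantization_config=gptq_config,
)
```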
I believe this issue can be closed now.
@OlivierDehaene @Narsil I see you were both in the last commit to this. If it's currently not supported, then I think this should be reopened?
exllama v2 is super fast but also super finicky. It is activated by default if you are not sharding your model, as we could not make it work with TP yet.
@OlivierDehaene Does this mean that we can load exl2 models with TGI? Or is this only for running GPTQ models with the exllama runtime/kernels? (I'm not sure how that works, but IIRC there is no ...) The former would be great, because then we'd get Mixtral on a single 3090: https://twitter.com/turboderp_/status/1741232488301596674 (I've tested this using the "raw" exllama runtime and it works great, but I'm not sure how to do it with TGI)
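For context, loading an exl2 quant with the raw exllamav2 runtime looks roughly like the sketch below; the class and method names follow the exllamav2 example scripts of the time and may have changed, and the model directory is a placeholder:

```python
# Minimal sketch of running an exl2 quant with the raw exllamav2 runtime.
# Names follow the exllamav2 example scripts of the time and may have changed;
# the model directory is a placeholder.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Mixtral-8x7B-exl2"  # placeholder exl2 checkpoint
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocate the cache as layers load
model.load_autosplit(cache)               # split weights across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

print(generator.generate_simple("Hello, my name is", settings, 64))
```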
Yes, it's only the GPTQ versions of exl2. The exl2 layout is a bit more finicky to add, although probably not impossible. Technically it's using exactly the same kernels; only the names are different: https://github.com/turboderp/exllamav2/blob/master/exllamav2/module.py#L82 and https://github.com/turboderp/exllamav2/blob/master/exllamav2/ext.py#L184. Although the fact that exl2 crashes pretty badly with TP>1 is quite concerning, and I don't really want to debug exl2 kernels for now (I've spent quite some time trying to find why the kernels segfault but couldn't find why they were hitting so far out of where they should; I'm guessing some overly manual pointer logic to hit the "scratch" buffers). If you want to look into it, that'd be great tbh.
@Narsil This pull request of yours looks exciting. Does it pave the way for loading the exl2 model format in TGI? Or is it not something that the team is too interested in right now? Side note: There are actually more exl2 models (3.2k) on the hub now than GPTQ (2.6k), though this is somewhat due to some prolific users doing lots of quants, since the number of unique users who have published an exl2 quant is 132, whereas there are 511 for GPTQ¹. Still, GPTQ had a 9 month head start, and it does seem like exl2 is becoming more popular recently for its ability to fit very large models (>100B) into GPUs. Heavy quantization does seem like it's going to be the "future" to some extent - https://twitter.com/tri_dao/status/1757331306260922515
¹ Data collection code:

```js
let rows = [];
// Collect exl2 models from the Hugging Face Hub listing (120 pages as of writing)
for (let p = 0; p <= 120; p++) {
  let data = await fetch(`https://huggingface.co/models-json?p=${p}&sort=modified&search=exl2`).then(r => r.json());
  rows.push(...data.models);
  console.log(`Page: ${p}`);
}
// Collect GPTQ models (80 pages as of writing)
for (let p = 0; p <= 80; p++) {
  let data = await fetch(`https://huggingface.co/models-json?p=${p}&sort=modified&search=gptq`).then(r => r.json());
  rows.push(...data.models);
  console.log(`Page: ${p}`);
}
// Sort by last-modified time, then count the unique authors per quant format
rows.forEach(r => r.lastModifiedEpochMs = new Date(r.lastModified).getTime());
rows.sort((a, b) => a.lastModifiedEpochMs - b.lastModifiedEpochMs);
let gotAuthor = new Set();
for (let str of ["exl2", "gptq"]) {
  let times = rows
    .filter(m => m.id.includes(str))
    .filter(m => gotAuthor.has(m.author) ? false : (gotAuthor.add(m.author), true))
    .map(m => m.lastModifiedEpochMs);
  console.log(str, times.length);
}
```
Hi @Narsil, thanks!
Tangential note: At least with Llama 3 70B, under the constraint of fitting within 48GB VRAM, the community seems to be leaning toward EXL2. E.g. the EXL2 quants came out on top in WolframRavenwolf's most recent tests: https://www.reddit.com/r/LocalLLaMA/comments/1cal17l/llm_comparisontest_llama_3_instruct_70b_8b/ The best inference engine for keeping up with quantization (including EXL2) right now seems to be https://github.com/PygmalionAI/aphrodite-engine and it worked well in my tests a couple of months ago, but I haven't put it into production yet. Second place in Wolfram's tests was AWQ, which TGI does currently have support for.
AFAIK EXL2 is on the roadmap for TGI. |
@houmie exl2 is very nice; it would be my go-to for <4bit models (and the reason why we want to add support). For 4bit quants I'd say AWQ/GPTQ are both great (the problem is that GPTQ comes with different flavors which have different performance profiles; with the right options GPTQ has better latency and slightly worse throughput than AWQ, but they're pretty much the same overall).
Feature request
https://github.com/turboderp/exllamav2
Motivation
Overview of differences compared to V1
Faster, better kernels
Cleaner and more versatile codebase
Support for a new quant format
Your contribution
I could take a look at the actual exllama implementation and what it takes to upgrade, if wanted.