Upgrade to exllama v2 #1016
Maybe @SunMarc could give some advice?
Hi @flozi00, happy to see that you are interested in adding support for the exllamav2 kernel to TGI. I would be happy to review the PR. The integration with transformers and optimum is practically done too. On the kernel side, everything should work based on my tests with autogptq. Make sure to add tests in TGI. You can take inspiration from @fxmarty's integration of exllama in TGI.
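For reference, enabling the exllamav2 kernels for a GPTQ checkpoint on the transformers side looks roughly like the minimal sketch below, based on the GPTQConfig documentation of the time (the model id is a placeholder):

```python
# Minimal sketch, based on the transformers GPTQConfig docs of the time;
# the model id below is a placeholder, swap in any GPTQ checkpoint.
from transformers import AutoModelForCausalLM, GPTQConfig

# exllama_config={"version": 2} selects the exllamav2 kernels for GPTQ weights
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # placeholder GPTQ model id
    device_map="auto",
    quantization_config=gptq_config,
)
```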
I believe this issue can be closed now.
@OlivierDehaene @Narsil I see you were both in the last commit to this. If it's currently not supported, then I think this should be reopened?
exllama v2 is super fast but also super finicky. It is activated by default if you are not sharding your model, as we could not make it work with TP yet.
@OlivierDehaene Does this mean that we can load exl2 models with TGI? Or is this only for running GPTQ models with the exllama runtime/kernels? (I'm not sure how that works, but IIRC there is no ...) The former would be great, because then we'd get Mixtral on a single 3090: https://twitter.com/turboderp_/status/1741232488301596674 (I've tested this using the "raw" exllama runtime and it works great, but I'm not sure how to do it with TGI)
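For context, loading an exl2 quant with the raw exllamav2 runtime looks roughly like the sketch below; the class and method names follow the exllamav2 example scripts of the time and may have changed, and the model directory is a placeholder:

```python
# Minimal sketch of running an exl2 quant with the raw exllamav2 runtime.
# Names follow the exllamav2 example scripts of the time and may have changed;
# the model directory is a placeholder.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Mixtral-8x7B-exl2"  # placeholder exl2 checkpoint
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocate the cache as layers load
model.load_autosplit(cache)               # split weights across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

print(generator.generate_simple("Hello, my name is", settings, 64))
```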
Yes, it's only the GPTQ versions of exl2. The exl2 layout is a bit more finicky to add, although probably not impossible. Technically it's using exactly the same kernels; only the names are different: https://github.com/turboderp/exllamav2/blob/master/exllamav2/module.py#L82 and https://github.com/turboderp/exllamav2/blob/master/exllamav2/ext.py#L184. Although the fact that exl2 crashes pretty badly with TP>1 is quite concerning, and I don't really want to debug exl2 kernels for now (I've spent quite some time trying to find why the kernels segfault but couldn't find why they were hitting so far out of where they should; I'm guessing some overly manual pointer logic to hit the "scratch" buffers). If you want to look into it, that'd be great tbh.
@Narsil This pull request of yours looks exciting. Does it pave the way for loading the exl2 model format in TGI? Or is it not something that the team is too interested in right now? Side note: There are actually more exl2 models (3.2k) on the hub now than GPTQ (2.6k), though this is somewhat due to some prolific users doing lots of quants, since the number of unique users who have published an exl2 quant is 132, whereas there are 511 for GPTQ¹. Still, GPTQ had a 9 month head start, and it does seem like exl2 is becoming more popular recently for its ability to fit very large models (>100B) into GPUs. Heavy quantization does seem like it's going to be the "future" to some extent - https://twitter.com/tri_dao/status/1757331306260922515
¹ Data collection code:

```js
let rows = [];
// Collect exl2 models from the Hugging Face Hub listing (120 pages as of writing)
for (let p = 0; p <= 120; p++) {
  let data = await fetch(`https://huggingface.co/models-json?p=${p}&sort=modified&search=exl2`).then(r => r.json());
  rows.push(...data.models);
  console.log(`Page: ${p}`);
}
// Collect GPTQ models (80 pages as of writing)
for (let p = 0; p <= 80; p++) {
  let data = await fetch(`https://huggingface.co/models-json?p=${p}&sort=modified&search=gptq`).then(r => r.json());
  rows.push(...data.models);
  console.log(`Page: ${p}`);
}
// Sort by last-modified time, then count the unique authors per quant format
rows.forEach(r => r.lastModifiedEpochMs = new Date(r.lastModified).getTime());
rows.sort((a, b) => a.lastModifiedEpochMs - b.lastModifiedEpochMs);
let gotAuthor = new Set();
for (let str of ["exl2", "gptq"]) {
  let times = rows
    .filter(m => m.id.includes(str))
    .filter(m => gotAuthor.has(m.author) ? false : (gotAuthor.add(m.author), true))
    .map(m => m.lastModifiedEpochMs);
  console.log(str, times.length);
}
```
Hi @Narsil, thanks!
Tangential note: At least with Llama 3 70B, under the constraint of fitting within 48GB VRAM, the community seems to be leaning toward EXL2. E.g. the EXL2 quants came out on top in WolframRavenwolf's most recent tests: https://www.reddit.com/r/LocalLLaMA/comments/1cal17l/llm_comparisontest_llama_3_instruct_70b_8b/ The best inference engine for keeping up with quantization (including EXL2) right now seems to be https://github.com/PygmalionAI/aphrodite-engine and it worked well in my tests a couple of months ago, but I haven't put it into production yet. Second place in Wolfram's tests was AWQ, which TGI does currently have support for.
AFAIK EXL2 is on the roadmap for TGI. |
@houmie exl2 is very nice; it would be my go-to for <4bit models (and the reason why we want to add support). For 4bit quants I'd say AWQ/GPTQ are both great (the problem is that GPTQ comes with different flavors which have different performance profiles; with the right options GPTQ has better latency and slightly worse throughput than AWQ, but they're pretty much the same overall).
Feature request
https://github.com/turboderp/exllamav2
Motivation
Overview of differences compared to V1
Faster, better kernels
Cleaner and more versatile codebase
Support for a new quant format
Your contribution
I could take a look at the actual exllama implementation and what it takes to upgrade, if wanted.