A big improvement for dtype casting system with fp8 storage type and manual cast #14031
Conversation
Since the norm layers need fp32, I only convert the linear-operation layers (Conv2d/Linear). The TE also has some PyTorch functions that don't support bf16 amp on CPU, so I added a condition to indicate whether the autocast is for the UNet.
Great work.
Could the LoRA issue be solved by merging the LoRA weights before converting the model to FP8 for inference? It would require the model to be reloaded every time a LoRA is changed, but if the fp16 version is cached in RAM I think it could still be fast enough to be worth it.
Yes, it could be solved.
Sorry, I was wrong; it is actually quite hard to solve, since the LoRA extension doesn't know when it should load the fp16 weights. One idea is to cache the fp16 weights in the fp8 layers directly (on CPU), but it will require more system RAM. Let me know if you think that is acceptable. (It will definitely be an option.)
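A minimal sketch of the caching idea described above, with hypothetical helper names (this is not the PR's actual code; it assumes PyTorch >= 2.1 for the fp8 dtype):

```python
import torch
import torch.nn as nn

def cache_fp16_weight(layer: nn.Module):
    # Keep an fp16 copy of the weight in system RAM before quantizing to fp8.
    layer.fp16_weight = layer.weight.data.to("cpu", torch.float16)

def quantize_to_fp8(layer: nn.Module):
    layer.weight.data = layer.weight.data.to(torch.float8_e4m3fn)

def restore_fp16_weight(layer: nn.Module, device="cuda"):
    # Bring the cached fp16 weight back (e.g. before applying/merging a LoRA).
    layer.weight.data = layer.fp16_weight.to(device)
```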
@BetaDoggo I have added a mechanism to cache/restore the fp16 weights when needed. This requires about 5 GB more system RAM on SDXL, but it does give us results much closer to fp16 when using LoRA with fp8:
M2 Pro, 16 GB RAM. Same prompt + sampler + seed + plugins (too tired to describe everything). Main branch: using 16 GB + 10 GB swap. Bravo!
Just wanted to give some feedback that I've been using your branch for a couple of weeks now and have had no problems at all (CUDA 4080). I'm finding minimal quality loss versus BF16, and generation speeds are pretty much on par with LCM/Turbo models and LoRAs at 16-bit.
Hi @KohakuBlueleaf, is fp8 going to be in v1.7.0? I also tried the fp8 settings option, and I get a non-reproducible image during the first switch from fp16 to fp8 when applying changes in settings:
(vid.mp4) When I have fp8 enabled in settings, then exit and reopen the program, the random seed distribution is entirely different. The GPU seed is affected, not the CPU seed.
This is quite interesting, will check it.
I want to check if I understand this correctly. But if "fp8" is enabled "after" startup (startup with fp16), it will be normal.
@ClashSAN I have investigated a similar but different effect: cond caching.
I confirmed your conjecture by setting a breakpoint at the UNetModel's forward and doing a switch of fp16 -> fp8 -> fp16. The result shows that the context cache is permanently changed after enabling fp8, so the problem can be effectively fixed by invalidating the cond cache at switching time. Here is the crucial evidence:
I keep getting this error: How can I fix this? Sorry if this is the wrong place to ask, I'm somewhat new to GitHub. Open to trying any troubleshooting steps you think may help! Thanks for all your work on this! Looking forward to playing around with it.
@Manchovies You probably have an older version of torch. The fp8 options were added in torch 2.1; if you haven't reinstalled recently, you likely have 2.0 or below. I believe that if you delete or rename your venv folder, the webui will install 2.1 automatically.
That worked! Thank you so much. Now, how can I get xformers installed? I remember trying in the past, and it would always uninstall the version of torch I had installed and install an older version with a matching xformers. Maybe put "xformers==0.0.22.post7" in the requirements.txt file, or pip install xformers==0.0.22.post7 in the venv folder? Edit: that seems to have done it. I went to /venv/scripts, ran "activate" in cmd, and typed "pip install xformers==0.0.22.post7", and it installed correctly without uninstalling or tampering with the torch install I had set up. Thanks again! Happy to be playing around with FP8 and enjoying the VRAM savings :)
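For anyone hitting the same issue, a quick illustrative check (not part of the webui itself) for whether the installed torch build has the fp8 storage dtypes:

```python
import torch

print(torch.__version__)
# The float8 storage dtypes only exist in torch >= 2.1;
# this prints False on 2.0 or below.
print(hasattr(torch, "float8_e4m3fn"))
```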
Hm, I thought I could run SDXL completely in VRAM with this (6 GB), but it needs --medvram to not OOM. Comparing it to Comfy without FP8: Comfy takes around 12 seconds, WebUI with FP8 and --medvram around 9 seconds, so it's a decent improvement. But I don't know why it won't fit completely into VRAM.
For 6 GB or smaller cards, you can wait for the next big update to lowvram, which will be as fast as comfy/forge for very-low-VRAM cards. At least the author has tested it on a 3060 6G.
Description
Since PyTorch 2.1.0, there are two new dtypes usable as storage types: float8_e5m2 and float8_e4m3fn.[1][2]
Based on the papers that discuss using fp8 for parameters/gradients when training/running NN models, I think it is worth doing some optimization with the fp8 format.[3][4]
Also, some extensions already support this feature.[5]
Mechanism
Although PyTorch 2.1.0 starts supporting fp8 as a storage type, there are only a few hidden methods (H100-only) for computing matmul in fp8 dtypes.[6] This means that even though we can store model weights in FP8, we still need to use fp16 or bf16 to compute the results (a.k.a. upcasting).
Fortunately, PyTorch's autocast can do this for us without any other changes. We just need to avoid some modules that do not support fp8 storage, for example nn.Embedding. A minimal sketch of the idea follows.
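A hedged sketch of the store-in-fp8 / compute-in-fp16 idea (illustrative only; the helper name and the toy model are not from the PR, and it assumes PyTorch >= 2.1 on a CUDA device):

```python
import torch
import torch.nn as nn

def to_fp8_storage(model: nn.Module) -> nn.Module:
    # Only convert layers whose compute path tolerates fp8 storage;
    # skip modules such as nn.Embedding and the norm layers.
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            module.weight.data = module.weight.data.to(torch.float8_e4m3fn)
    return model

model = to_fp8_storage(nn.Linear(320, 320).cuda().half())
x = torch.randn(2, 320, device="cuda", dtype=torch.float16)

with torch.autocast("cuda", dtype=torch.float16):
    # autocast upcasts the fp8-stored weight to fp16 before the matmul,
    # so only the storage is fp8 -- the arithmetic stays fp16.
    y = model(x)
```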
For devices that do not support autocast, I also implemented a manual-cast hook, which allows GTX 16xx (or even older) cards to utilize the fp8 features.
Manual Cast
The idea is pretty simple: when parameters and inputs have a different dtype than the target dtype (defined in devices.py), cast them to the target dtype.
I hooked the Linear, Conv, MultiHeadAttention, GroupNorm, and LayerNorm layers, which covers almost all the layers we need. Since the hook is very simple, more layers can easily be patched in the future. This feature also allows the GTX 16xx series or older cards to use fp16 (or even fp8) weights to reduce VRAM usage significantly. A sketch of the hook is shown below.
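An illustrative sketch of the manual-cast idea (hypothetical names; not the exact devices.py implementation): wrap the forward of the patched layer classes so that parameters and inputs are cast to the target dtype whenever they differ.

```python
import torch
import torch.nn as nn

target_dtype = torch.float16  # in the PR this comes from devices.py

patched_classes = (nn.Linear, nn.Conv2d, nn.MultiheadAttention,
                   nn.GroupNorm, nn.LayerNorm)

def manual_cast_forward(self, *args, **kwargs):
    org_dtype = next(self.parameters()).dtype
    if org_dtype != target_dtype:
        self.to(target_dtype)   # upcast fp8/other weights for this call
    args = [a.to(target_dtype) if torch.is_tensor(a) and a.is_floating_point() else a
            for a in args]
    result = self.org_forward(*args, **kwargs)
    if org_dtype != target_dtype:
        self.to(org_dtype)      # restore the original storage dtype
    return result

def patch_manual_cast():
    for cls in patched_classes:
        cls.org_forward = cls.forward
        cls.forward = manual_cast_forward
```

The hook would be applied once at startup, only on devices where autocast is unavailable.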
List of implemented things
Performance on VRAM usage/Speed/Quality
VRAM usage
Theoretically, FP8 can cut the "VRAM usage introduced by weights" by 50% (FP8 vs FP16). So for SD1.x with --medvram it can save about 800 MB of VRAM during inference, or about 2.5 GB for SDXL.
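As a rough back-of-the-envelope check (the parameter counts below are approximations I am assuming, not numbers from the PR):

```python
# fp16 stores 2 bytes per weight, fp8 stores 1 byte per weight,
# so the saving is roughly one byte per UNet parameter.
sd1x_unet_params = 0.86e9   # ~860M parameters (approximate)
sdxl_unet_params = 2.6e9    # ~2.6B parameters (approximate)

def fp8_saving_gb(n_params: float) -> float:
    return n_params * (2 - 1) / 1024**3

print(f"SD1.x: ~{fp8_saving_gb(sd1x_unet_params):.1f} GB saved")  # ~0.8 GB
print(f"SDXL:  ~{fp8_saving_gb(sdxl_unet_params):.1f} GB saved")  # ~2.4 GB
```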
Here are some measurements with HWiNFO64 on Windows 11 23H2 / RTX 4090.
Launch args:
--medvram --opt-channelslast --xformers
Table 1. VRAM usage comparison between FP16/FP8
First spike is the FP16 run; second spike is the FP8 run
Upper bound: FP16 VRAM usage
Lower bound: idle VRAM usage
Cursor box: FP8 VRAM usage
We can see that FP8 saves 2.5~2.8 GB of VRAM on SDXL and 0.8~0.9 GB on SD1.x, which almost matches the theory.
Speed
Since using FP8 storage with FP16 computation needs some extra operations to cast the dtype, it will decrease the speed (especially for smaller batch sizes).
Table 2. Inference speed comparison between FP16/FP8
We can see that although FP8 is slower than FP16, the speed difference is smaller than 5%, which is acceptable.
Quality
Surprisingly, FP8 barely decreases the quality of the generated images; sometimes it may even improve the quality. Normally the differences are subtle.
Some comparisons here:
Image 1. Image comparison between FP16/FP8 on SDXL
Image 2. Image comparison between FP16/FP8 on SD1.x
Interestingly (or, actually, expectedly), using LoRA/LyCORIS with FP8 enabled will "weaken" the effect of the LoRA, which means you may need a higher weight (1.5x ~ 3x) to reach the same effect.
For example, here is a comparison on LCM-LoRA + SD1.x models with Euler A / CFG 2 / 8 steps:
Image 3. LoRA effect on FP8
Some reported improvements
According to the animatediff extension, using FP8 on UNet+CN+AN can achieve 1024x1024 i2i on a 4090, which was impossible in the past.
Users of animatediff have also reported that using fp8 + LCM LoRA can improve the quality of the results.
Conclusion
FP8 is a good, almost zero-cost improvement in VRAM usage, which is good news for SDXL users.
Moreover, we can even store the model weights in fp8 directly, which reduces the RAM requirement as well (an fp8 checkpoint + fp8 storage enabled can achieve an 8 GB system RAM + 4 GB VRAM requirement with --medvram).
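A hedged sketch of what producing such an fp8 checkpoint could look like (the file names and flat state-dict layout are assumptions, not the PR's tooling):

```python
import torch

# Load an fp16 checkpoint and store its floating-point tensors in fp8,
# roughly halving the file size and the RAM needed to load it.
# In practice you would likely skip norm/embedding weights, as discussed above.
sd = torch.load("model_fp16.ckpt", map_location="cpu")
sd_fp8 = {
    k: v.to(torch.float8_e4m3fn) if torch.is_tensor(v) and v.is_floating_point() else v
    for k, v in sd.items()
}
torch.save(sd_fp8, "model_fp8.ckpt")
```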
However, the "weakening effect" on LoRA also reveals some possible problems with this method, so we may need more feedback on this feature. The good news is that simply disabling it avoids any of these effects.
This feature also requires PyTorch 2.1.0, which "may" be unstable. (Since PyTorch 2.1.1 is already stable, maybe we should consider waiting for xformers' update for PyTorch 2.1.1 and then merging this PR.)
Reference
Appendix
Training with FP8 has also been implemented based on the kohya-ss/sd-scripts codebase in my fork, and I provide some example models for it. With my implementation, users can train SDXL LoRA/LyCORIS on a 6 GB VRAM card with the TE/latents cached, or on an 8 GB VRAM card with nothing cached, at 1024x1024 ARB.[7]
Checklist: