Reporting Performance Differences with Updates #1181

lllyasviel · 2024-08-16T10:07:22Z

Maintainer

Some people reported performance degradation with updates.

However, most of these reports are just caused by people forgetting that they are using different models (these days models change a lot and many people have downloaded lots of versions ... ). Note that different model formats (NF/GGUF/FP) are expected to have different performance. Other reports come from old GPUs that are not very stable <- for example, this guy's Forge became slow after one line of text was added to the readme🤯

If you are sure that some update caused a slowdown, and you know how to use git checkout, you can post in this thread the full console logs from before and after the commit where the speed changed.
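For anyone comparing two such logs, a minimal Python sketch like the one below can pull the sampling speed out of each log so the two commits can be compared by number. It assumes the sampler prints default tqdm-style progress lines (such as `20/20 [00:33<00:00,  1.68s/it]`, as in the logs posted later in this thread) and that each console log has been saved as text. The function names are made up for illustration and are not part of Forge.

```python
import re

# Tail of a finished tqdm bar that starts its own line, e.g.
# "100%|#####| 20/20 [00:33<00:00,  1.68s/it]". Lines with a description
# prefix (such as "Patching LoRAs for KModel: 100%|...") are not matched.
_FINISHED_BAR = re.compile(
    r"^\s*100%\|[^|]*\|\s*(\d+)/(\d+) \[[^\]]*,\s*([\d.]+)(s/it|it/s)\]",
    re.MULTILINE,
)


def seconds_per_iteration(log_text):
    """Return the s/it value of every finished, unlabeled progress bar."""
    speeds = []
    for done, total, value, unit in _FINISHED_BAR.findall(log_text):
        if done != total:
            continue
        rate = float(value)
        # tqdm reports it/s when running faster than one iteration per second
        speeds.append(1.0 / rate if unit == "it/s" else rate)
    return speeds


def slowdown_factor(before_log, after_log):
    """Ratio of mean s/it after vs. before; above 1.0 means 'after' is slower."""
    before = seconds_per_iteration(before_log)
    after = seconds_per_iteration(after_log)
    if not before or not after:
        raise ValueError("no finished sampling progress bar found in one of the logs")
    return (sum(after) / len(after)) / (sum(before) / len(before))
```

A factor close to 1.0 suggests the two commits run at the same speed. The full logs are still needed alongside the number, because the model type and memory settings are what explain a difference.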

We also noticed (with strong evidence) that some people are spreading misinformation about Forge being slow in order to promote their custom workflows. <- if you are doing this right now, please stop.

Finally, remember that giving full console logs from before and after the commit where the speed changed (if it really did) helps us a lot. Also, it would be better if you have screenshots like this:

(these are statistics for generation)

PS: some people seem to get better performance with --cuda-malloc in their CMD args. Although that never happened on my 5 different test devices, you may try it and see what happens.

Do Not Set GPU WEIGHT to Max Value!

Some people think that setting GPU weight to max will fit everything into the GPU and make it faster. No, it will not. If you set GPU weight to the max value, your model is in the GPU, but you have no free GPU memory left to do computation, and the speed may be 10x slower.
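As a rough illustration, the sketch below treats VRAM as split between weights and computation, which is what the `[GPU Setting]` log line quoted later in this thread reports (19980.00 MB for weights plus 4595.00 MB for computation). The function name and the simple subtraction are assumptions for illustration, not Forge's actual code.

```python
def gpu_memory_split(total_mb, gpu_weight_mb):
    """Return (weights_pct, compute_pct, compute_mb) for a GPU Weight setting.

    Assumption: whatever is not reserved for weights is left for computation.
    """
    if not 0 <= gpu_weight_mb <= total_mb:
        raise ValueError("GPU weight must be between 0 and total VRAM")
    compute_mb = total_mb - gpu_weight_mb
    weights_pct = 100.0 * gpu_weight_mb / total_mb
    compute_pct = 100.0 * compute_mb / total_mb
    return weights_pct, compute_pct, compute_mb


# Values from the log in this thread: 19980 MB weights + 4595 MB compute
print(gpu_memory_split(24575, 19980))

# GPU weight at max: nothing is left for computation
print(gpu_memory_split(24575, 24575))
```

The second call is the case described above: 100% of the memory holds weights and 0 MB remains for computation.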

Typical Example of User Mistake

Below is a typical example of a user randomly setting options to worse values and then complaining about performance:

His screenshot is

There are 3 problems in this screenshot

This means this user will use 100% GPU memory to load weights with 0% memory to compute. So, it is expected to be about 10x slower.

This means this user will use two workers to move layers and compute at the same time. This can make things faster if it works, but considering that 0% of GPU memory can be used for computation, it will not work because there is no free VRAM to compute with.

This means the user will use shared GPU memory, and the Forge in the screenshot is doing that. This can make things faster if it works, but considering that 0% of GPU memory can be used for computation, it will not work in this case.

In this screenshot, everything works perfectly. It is slow because the user sets it to run in a slow mode.

After reading the original instructions again, the user got normal speed back.

So, make sure to read the instructions!

misterlillo60 · 2024-08-16T14:35:35Z

Hi, I have a problem with ControlNet and the Pony models. I don't know where to post this, so I apologize in advance.
It doesn't really work well, as you will see in this photo. It doesn't reproduce exactly the same pose. Why?

0 replies

Greywolf665 · 2024-08-16T22:14:06Z

I'd like to report a performance difference with Flux - it's gotten insanely good and I don't understand why ^^
I have a 10GB RTX 3080 and I'm running the (git checkout 6e6e5c2) version.

1 image 832x1216 Euler simple 30 steps with:
NF4 v2 model - 54 seconds (like before)
FP8 model - 54 seconds (no more out of memory error)
FP8 model with FP16 clip - 54 seconds (no more out of memory error)
Lora loading times differ, but at most 120 seconds before a batch so far - after that, 54 seconds per image.
I have to restart Forge every now and then, when memory management messes up.

But.. how the hell am I suddenly running a full fp16 clip model on my 10GB gpu? o.o
And how is it just as fast as the NF4 model? o.o
I don't know how and why - but I'm grateful. Amazing job.
With the fp8 model Lora loading isn't an issue - with NF 4 only a few small Loras work.. but hell, I'm running FP8 with FP16 clip now ^^

(just because I still don't understand this magic:
[Memory Management] Current Free GPU Memory: 8930.62 MB
[Memory Management] Required Model Memory: 11350.07 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: -3443.45 MB
[Memory Management] Loaded to Shared Swap: 5275.26 MB (blocked method)
[Memory Management] Loaded to GPU: 6074.79 MB)
it works!
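The numbers in those `[Memory Management]` lines can be checked by hand. The sketch below only reproduces the arithmetic visible in the log (free memory minus model memory minus inference memory). How Forge decides the split between GPU and shared swap is not stated in this thread, so the code does not try to model it, and the function name is hypothetical.

```python
def estimated_remaining_mb(free_mb, model_mb, inference_mb):
    """Free GPU memory left after loading the model and reserving compute space."""
    return free_mb - model_mb - inference_mb


# Values copied from the log lines above
free_mb, model_mb, inference_mb = 8930.62, 11350.07, 1024.00
remaining = estimated_remaining_mb(free_mb, model_mb, inference_mb)
print(f"Estimated remaining: {remaining:.2f} MB")

# The two "Loaded to ..." figures add up to the required model memory
# (within rounding), so the whole model is placed somewhere.
swap_mb, gpu_mb = 5275.26, 6074.79
print(f"Swap + GPU: {swap_mb + gpu_mb:.2f} MB of {model_mb:.2f} MB")
```

A negative remainder means the model does not fit next to the reserved inference memory. In this log, that is the case where part of the model is reported as loaded to shared swap instead of raising an out of memory error.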

Maybe relevant from the webui user bat:
set COMMANDLINE_ARGS= --theme dark --cuda-malloc --cuda-stream --pin-shared-memory --disable-xformers

0 replies

Woukim · 2024-08-17T16:57:38Z

set COMMANDLINE_ARGS= --theme dark --cuda-malloc --cuda-stream --pin-shared-memory --disable-xformers
for comfyui is that relevant?😅

0 replies

ucukertz · 2024-08-19T02:03:56Z

If setting GPU weight to max is not recommended, then what is the recommended safe value that guarantees the 10x slowdown never happens? Maybe something like max value minus X?

2 replies

safzanpirani Aug 19, 2024

Leave about 1-1.5 GB of VRAM free.

ucukertz Aug 20, 2024

1.5GB it is. Thank you!
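Turned into a number for the GPU Weight slider, that rule of thumb might look like the sketch below. The 1-1.5 GB headroom is the suggestion from the reply above, not an official guarantee, and the helper name is made up for illustration.

```python
def suggested_gpu_weight_mb(total_vram_mb, headroom_gb=1.5):
    """GPU Weight (MB) that leaves `headroom_gb` of VRAM free for computation."""
    headroom_mb = int(headroom_gb * 1024)
    return max(0, total_vram_mb - headroom_mb)


# A 10 GB card (10240 MB) and a 24 GB card (24576 MB)
for total in (10240, 24576):
    print(total, "->", suggested_gpu_weight_mb(total), "MB")
```

Larger images or extra modules may need more than this, so treat the result as a starting point and check the `[Memory Management]` lines in the console.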

evanheckert · 2024-08-22T03:05:46Z

Would love to just hear general reports of what speeds people are getting with their hardware.

I think I may be underperforming, for example:
12600k
96GB DDR4 3200
RTX 3090 w/GR Driver v560.70
Windows 11 latest
1024 x 1024

TL;DR: 1.7s/it

Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-395-g0d8eb4c5
Commit hash: 0d8eb4c5ba211ab468e270989a81c57c3783f465
Launching Web UI with arguments: --listen --port 7862 --cuda-stream --pin-shared-memory --cuda-malloc
Using cudaMallocAsync backend.
Total VRAM 24576 MB, total RAM 98045 MB
pytorch version: 2.3.1+cu121
Set vram state to: NORMAL_VRAM
Always pin shared GPU memory
Device: cuda:0 NVIDIA GeForce RTX 3090 : cudaMallocAsync
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: True
Using pytorch cross attention
Using pytorch attention for VAE
[-] ADetailer initialized. version: 24.1.2, num models: 9
2024-08-21 21:50:51,704 - ControlNet - INFO - ControlNet UI callback registered.
Model selected: {
'checkpoint_info': { 'filename': '{snip}\flux1-dev-Q8_0.gguf', 'hash': 'b44b9b8a'}, 
'additional_modules': ['{snip}\clip_l.safetensors', '{snip}\ae.safetensors', '{snip}\t5xxl_fp16.safetensors'], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Running on local URL:  http://0.0.0.0:7862
[GPU Setting] You will use 81.30% GPU memory (19980.00 MB) to load weights, and use 18.70% GPU memory (4595.00 MB) to do matrix computation.

[Unload] Trying to free 953674316406250018963456.00 MB for cuda:0 with 0 models keep loaded ...
StateDict Keys: {'transformer': 780, 'vae': 244, 'text_encoder': 196, 'text_encoder_2': 220, 'ignore': 0}
Using Default T5 Data Type: torch.float16
Using Detected UNet Type: gguf
Using pre-quant state dict!
Using GGUF state dict: {'F16': 476, 'Q8_0': 304}
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
K-Model Created: {'storage_dtype': 'gguf', 'computation_dtype': torch.bfloat16}
Model loaded in 6.3s (unload existing model: 0.2s, forge model load: 6.1s).
[LORA] Loaded {snip}\mylora.safetensors for KModel-UNet with 304 keys at weight 1.0 (skipped 0 keys)
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
To load target model JointTextEncoder
Begin to load 1 model
[Unload] Trying to free 17035.34 MB for cuda:0 with 0 models keep loaded ...
[Memory Management] Current Free GPU Memory: 23293.05 MB
[Memory Management] Required Model Memory: 9569.49 MB
[Memory Management] Required Inference Memory: 4595.00 MB
[Memory Management] Estimated Remaining GPU Memory: 9128.55 MB
Moving model(s) has taken 4.03 seconds
Distilled CFG Scale: 2.5
To load target model KModel
Begin to load 1 model
[Unload] Trying to free 20350.41 MB for cuda:0 with 0 models keep loaded ...
[Unload] Current free memory is 13581.71 MB ...
[Unload] Unload model JointTextEncoder
[Memory Management] Current Free GPU Memory: 23223.69 MB
[Memory Management] Required Model Memory: 12119.55 MB
[Memory Management] Required Inference Memory: 4595.00 MB
[Memory Management] Estimated Remaining GPU Memory: 6509.14 MB
Patching LoRAs for KModel: 100%|█████████████████████████████████████████████████████| 304/304 [00:05<00:00, 57.11it/s]
LoRA patching has taken 5.32 seconds
Moving model(s) has taken 7.21 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:33<00:00,  1.68s/it]
To load target model IntegratedAutoencoderKL                                          | 20/460 [00:51<12:29,  1.70s/it]

1 reply

HMRMike Aug 22, 2024

Getting about the same speed with my 3090 and similar settings. It's about 1.65s/it on just a text prompt, and it slows to 1.8-1.9s/it depending on the LoRA loaded. Async swap can sometimes misbehave and load only 7GB to VRAM, and then speed drops to 2s/it; other times it fills up all 24GB and complains. GGUF isn't really aiming for speed after all.
