AMD generating takes 25 minutes #1958

Closed
mpirescarvalho opened this issue Jan 17, 2024 · 12 comments
Labels
bug (AMD) Something isn't working (AMD specific) duplicate This issue or pull request already exists

Comments


mpirescarvalho commented Jan 17, 2024

Read Troubleshoot

[x] I admit that I have read the Troubleshoot before making this issue.

Describe the problem
It's working, but it's taking SUPER long to generate the images.

CPU: AMD Ryzen 7 5700X
RAM: 16 GB
SWAP: 44GB on M.2 SSD
GPU: AMD Radeon RX 6700 XT 12 GB VRAM

Full Console Log
C:\www\stable-diffusion\Fooocus>.\python_embeded\python.exe -s Fooocus\entry_with_update.py --directml
Already up-to-date
Update succeeded.
[System ARGV] ['Fooocus\entry_with_update.py', '--directml']
Python 3.10.9 (tags/v3.10.9:1dd9be6, Dec 6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)]
Fooocus version: 2.1.862
Running on local URL: http://127.0.0.1:7865

To create a public link, set share=True in launch().
Using directml with device:
Total VRAM 1024 MB, total RAM 16310 MB
Set vram state to: NORMAL_VRAM
Always offload VRAM
Device: privateuseone
VAE dtype: torch.float32
Using sub quadratic optimization for cross attention, if you have memory or speed issues try using: --attention-split
Refiner unloaded.
model_type EPS
UNet ADM Dimension 2816
Using split attention in VAE
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
Using split attention in VAE
extra {'cond_stage_model.clip_l.logit_scale', 'cond_stage_model.clip_g.transformer.text_model.embeddings.position_ids', 'cond_stage_model.clip_l.text_projection'}
Base model loaded: C:\www\stable-diffusion\Fooocus\Fooocus\models\checkpoints\juggernautXL_version6Rundiffusion.safetensors
Request to load LoRAs [['sd_xl_offset_example-lora_1.0.safetensors', 0.1], ['None', 1.0], ['None', 1.0], ['None', 1.0], ['None', 1.0]] for model [C:\www\stable-diffusion\Fooocus\Fooocus\models\checkpoints\juggernautXL_version6Rundiffusion.safetensors].
Loaded LoRA [C:\www\stable-diffusion\Fooocus\Fooocus\models\loras\sd_xl_offset_example-lora_1.0.safetensors] for UNet [C:\www\stable-diffusion\Fooocus\Fooocus\models\checkpoints\juggernautXL_version6Rundiffusion.safetensors] with 788 keys at weight 0.1.
Fooocus V2 Expansion: Vocab with 642 words.
Fooocus Expansion engine loaded for cpu, use_fp16 = False.
Requested to load SDXLClipModel
Requested to load GPT2LMHeadModel
Loading 2 new models
App started successful. Use the app with http://127.0.0.1:7865/ or 127.0.0.1:7865
[Parameters] Adaptive CFG = 7
[Parameters] Sharpness = 2
[Parameters] ADM Scale = 1.5 : 0.8 : 0.3
[Parameters] CFG = 4.0
[Parameters] Seed = 3435339128246104584
[Parameters] Sampler = dpmpp_2m_sde_gpu - karras
[Parameters] Steps = 30 - 15
[Fooocus] Initializing ...
[Fooocus] Loading models ...
Refiner unloaded.
[Fooocus] Processing prompts ...
[Fooocus] Preparing Fooocus text #1 ...
[Prompt Expansion] cat in spacesuit, light shining, intricate, elegant, sharp focus, professional color, highly detailed, sublime, innocent, dramatic, cinematic, new classic, beautiful, dynamic, attractive, cute, epic, stunning, brilliant, creative, positive, artistic, awesome, confident, colorful, shiny, iconic, cool, best, pure, quiet, lovely, great, relaxed
[Fooocus] Preparing Fooocus text #2 ...
[Prompt Expansion] cat in spacesuit, light flowing colors, extremely detailed, beautiful, intricate, elegant, sharp focus, highly detail, dramatic cinematic perfect, open color, inspired, rich deep vivid vibrant scenic full atmosphere, professional composition, stunning, magical, amazing, creative, wonderful, epic, hopeful, awesome, brilliant, surreal, symmetry, ambient, best, pure, fine, very
[Fooocus] Encoding positive #1 ...
[Fooocus] Encoding positive #2 ...
[Fooocus] Encoding negative #1 ...
[Fooocus] Encoding negative #2 ...
[Parameters] Denoising Strength = 1.0
[Parameters] Initial Latent shape: Image Space (896, 1152)
Preparation time: 12.77 seconds
[Sampler] refiner_swap_method = joint
[Sampler] sigma_min = 0.0291671771556139, sigma_max = 14.614643096923828
Requested to load SDXL
Loading 1 new model
loading in lowvram mode 64.0
[Fooocus Model Management] Moving model(s) has taken 70.08 seconds
0%| | 0/30 [00:00<?, ?it/s]C:\www\stable-diffusion\Fooocus\Fooocus\modules\anisotropic.py:132: UserWarning: The operator 'aten::std_mean.correction' is not currently supported on the DML backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at D:\a_work\1\s\pytorch-directml-plugin\torch_directml\csrc\dml\dml_cpu_fallback.cpp:17.)
s, m = torch.std_mean(g, dim=(1, 2, 3), keepdim=True)
100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [11:59<00:00, 23.98s/it]
Requested to load AutoencoderKL
Loading 1 new model
loading in lowvram mode 64.0
[Fooocus Model Management] Moving model(s) has taken 1.60 seconds
Image generated with private log at: C:\www\stable-diffusion\Fooocus\Fooocus\outputs\2024-01-17\log.html
Generating and saving time: 795.84 seconds
[Sampler] refiner_swap_method = joint
[Sampler] sigma_min = 0.0291671771556139, sigma_max = 14.614643096923828
Requested to load SDXL
Loading 1 new model
loading in lowvram mode 64.0
[Fooocus Model Management] Moving model(s) has taken 58.36 seconds
100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [24:17<00:00, 48.57s/it]
Requested to load AutoencoderKL
Loading 1 new model
loading in lowvram mode 64.0
[Fooocus Model Management] Moving model(s) has taken 1.25 seconds
Image generated with private log at: C:\www\stable-diffusion\Fooocus\Fooocus\outputs\2024-01-17\log.html
Generating and saving time: 1519.52 seconds


f0n51 commented Jan 17, 2024

Your Fooocus is generating images with the CPU only. That's the reason it takes so long:
The operator 'aten::std_mean.correction' is not currently supported on the DML backend and will fall back to run on the CPU

Have you read the readme concerning AMD GPUs?
https://github.com/lllyasviel/Fooocus?tab=readme-ov-file#windowsamd-gpus
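
One quick way to sanity-check which adapter torch-directml picks up (a minimal sketch using the embedded Python that run.bat already calls; device_count() and device_name() are, to my knowledge, standard torch-directml helpers):

.\python_embeded\python.exe -c "import torch_directml; print(torch_directml.device_count()); print(torch_directml.device_name(0))"

If the RX 6700 XT shows up there, DirectML at least sees the right adapter.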

@mpirescarvalho
Copy link
Author

mpirescarvalho commented Jan 17, 2024

Yes, I followed the instructions. This is my run.bat:

.\python_embeded\python.exe -m pip uninstall torch torchvision torchaudio torchtext functorch xformers -y
.\python_embeded\python.exe -m pip install torch-directml
.\python_embeded\python.exe -s Fooocus\entry_with_update.py --directml
pause

Note: GPU memory is being used during the generation process.

I've seen others running on AMD GPUs on Windows, so I'm not sure what is happening.
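
The startup log also suggests --attention-split for memory or speed issues, so one variant of the launch line worth trying (just a guess on my part, untested) would be:

.\python_embeded\python.exe -s Fooocus\entry_with_update.py --directml --attention-split
pause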


f0n51 commented Jan 17, 2024

I'm having the same problem on Windows with my AMD GPU and still couldn't find out what's wrong. There's an open issue in the DirectML project here:
https://github.com/microsoft/DirectML/issues/536

I switched over to Google Colab as I couldn't get good results with my AMD GPU, neither on Windows nor on Linux.

@mpirescarvalho

I'll keep an eye on that issue, thanks


f0n51 commented Jan 17, 2024

Maybe someone from the dev team will join in and have a solution for that. I personally gave up on harassing my AMD GPU :-D


pscheit commented Jan 17, 2024

@mpirescarvalho it looks like it is using a GPU, but one with only 1024 MB of VRAM (it says it uses low-VRAM mode).
So maybe this is your onboard graphics?
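
A quick way to list which display adapters Windows actually sees (standard command, output differs per machine):

wmic path win32_VideoController get name

If only the Radeon RX 6700 XT shows up, there is no onboard graphics in play.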

@mpirescarvalho

Negative, my processor doesn't have onboard graphics.

@patientx

VRAM being reported as "1024" is normal; it's the same in ComfyUI. That's how DirectML reports it.

*** First of all, I have to thank the devs for finally building an app I can use on Windows to generate with SDXL models without crashing instantly or, at best, on the second try. I tried sd-webui, SD.Next, and ComfyUI; only with SD.Next was I able to generate at all, and the other apps just give out-of-memory errors instantly or, in the best scenario, on the second try. With Fooocus, if I change models a lot the same out-of-memory errors eventually pop up, but if I stick to one or two models I just get slow generation, no crashes at all...

I am using an RX 6600 8 GB and did various things to speed up generation.

  • First, I enabled the low-VRAM option from the command line. Since you have 12 GB you should be better off in this regard, but it could still help.

  • Second, my Windows swap file was on my SSD, which is my C: drive. Because of our VRAM problems the app just moves the models to system RAM, which eventually fills up and switches over to the swap file, so I moved the swap file to my NVMe drive, which I normally use for other sd-webui stuff and games (a rough command-line sketch for this is at the end of this comment). Just because of this, moving the models FELL FROM 80-100 seconds to 30-40 seconds.

  • Third, use SDXL Turbo models with 8-10 steps, or use any SDXL model with the LCM LoRA. With SD 1.5, the results I am able to get after upscaling 2-3 times with various techniques, while trying not to trigger out-of-memory errors, are far beneath what I get here instantly. OK, the time I spend there with SD 1.5 and all the upscaling is around the same, but here it is far less worrisome to do.

ALSO, that error "the operator 'aten::std_mean.correction' is not currently supported on the DML backend and will fall back to run on the CPU" pops up from time to time in ComfyUI too, and as far as I know it didn't actually affect the speed there, or maybe only a bit. I am not sure how much it affects SDXL generation, though; as far as I know that is already slow.

*** Finally, I also have to note that I am running both my GPU and CPU in underpowered, underclocked states: the 6600 is power-limited to 48 W and my 3600 is power-limited to 40 W. With these limits in mind:

Before I moved my swap to NVMe, an 8-step LCM-sampled "Extreme Speed" run using the default Juggernaut model:
3-image generation, preparation time 18-20 seconds, model offloading to system memory 80 to 100 seconds (smaller VAEs and such take 2-3 seconds), 8-step generation around 10 s/it, total time 485 seconds.

After moving the swap and making it 2 times my system memory (16 GB RAM, 32 GB swap file on NVMe):
3-image generation, preparation time 17-18 seconds, model offloading to system memory around 30 seconds each time, more or less (smaller VAEs and such take 2-3 seconds), 8-step generation around 7 s/it, total time 280 seconds.

So, a swap file on NVMe at two times system RAM is very effective, and so is LCM or using Turbo models. At the moment I am using normal models with the LCM LoRA and LCM sampler at around 10-12 steps, CFG 2 for Turbo and 1.5 for LCM so that I can use negatives.
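
For the page-file move mentioned above, a rough command-line sketch (run from an elevated prompt; the D: drive letter and the 32768 MB size are placeholders, and the usual GUI under System Properties > Advanced > Virtual memory does the same thing):

wmic computersystem where name="%computername%" set AutomaticManagedPagefile=False
wmic pagefileset create name="D:\pagefile.sys"
wmic pagefileset where name="D:\\pagefile.sys" set InitialSize=32768,MaximumSize=32768

A reboot is needed before the new page file takes effect.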


mashb1t commented Jan 17, 2024

@f0n51 thank you for providing the reference to DirectML and @patientx for your insights.

microsoft/DirectML#536 (comment) already references #1321, which I closed 2 weeks ago in #1321 (comment), as this is not an issue with Fooocus.

We can still keep this issue open, but I'd suggest closing it as there's nothing we can actively do.
@mpirescarvalho, this is your call.

mashb1t added the duplicate and bug (AMD) labels Jan 17, 2024

patientx commented Jan 17, 2024

One thing to do: once the first generation starts and that error pops up, just skip or stop it; the next ones won't have it. I just tested a bit more and the first run always has that error, with a step time around 40 seconds here too, but if I cancel it and start again the step time starts around 30 and drops to around 20 seconds for me (remember, a 48 W power-limited 6600), so with 12 GB of VRAM and a full-power 6700 XT everything would probably be much faster.

@mpirescarvalho

@mashb1t agreed

mpirescarvalho closed this as not planned Jan 18, 2024

mpirescarvalho commented Jan 25, 2024

Update:

After adding another 16 GB of RAM to my setup, generation time went down to 2 minutes per image.

