AI Worker Bridge Crashing After Some Time - Likely Due to GPU Usage #297
Comments
It is likely that your configuration needs to be adjusted. Features such as ControlNets and LoRAs will very likely not work with your card. Further, max threads should be 1, max power shouldn't be much higher than 16, and the VRAM-to-keep-free option should be left at 80%. Support can better be provided in the local workers channel on the official Discord (https://discord.gg/hzgR8cc67P). If you are unable to use Discord, I advise you to check logs/trace.log for errors and/or logs/bridge.log for other information.
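For reference, a small sketch of how you might sanity-check a worker config against the values suggested above. The key names used here (max_threads, max_power, vram_to_leave_free) are assumptions and may differ between worker versions; check your own bridgeData.yaml for the actual names.

```python
# Rough sanity check of a worker config against the suggested values.
# Key names below are assumptions -- verify them against your bridgeData.yaml.
import yaml  # pip install pyyaml

RECOMMENDED = {
    "max_threads": 1,            # one job at a time on low-VRAM cards
    "max_power": 16,             # treat as an upper bound, not a target
    "vram_to_leave_free": "80%", # leave headroom so allocations don't fail
}

with open("bridgeData.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f) or {}

for key, suggested in RECOMMENDED.items():
    print(f"{key}: current={config.get(key)!r}, suggested={suggested!r}")
```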
The max power is at the default, 8; max threads is 1; and the VRAM-to-keep-free option is 80%, but that option seems to be irrelevant to how the program functions. I don't have LoRAs on, but when I checked the logs it was all Python trying to allocate VRAM when there was none available, which led to a crash. Here is the most common error message:
I run Stable Diffusion just fine with the --medvram flag. Is that not something that can be implemented?
No, that will make it too slow to be used on the AI Horde. Try to disable post-processing. If it's still crashing, your card just doesn't have enough VRAM to run SD fast enough for the horde.
Too slow to be used on the horde? Medvram drastically reduces the requirements to run SD stably, at a very minor performance cost. I did a benchmark of generating 6 images with the same seed, step count, and model; the results are below:

6 batch count, 1 batch size, medvram: 74.40s

It's only about 14% slower when run on single batches. Computers with a 'slow' image speed are already listed as 'slow' workers, but queue times for image generation are very long, much longer than the time it takes to generate the images. This indicates that the horde does not have enough processing resources available to meet requests; having a way to lower the barrier of entry, at a minor individual cost and only when necessary, will result in a net overall gain. In addition, medvram reduces the operating requirements so much that you can use other optimizations, like increasing the batch size or resolution. Below is the same benchmark, just with a batch size of 6:

1 batch count, 6 batch size, medvram: 58.15s

I've always dreamed of making something like this, but I don't have the technical expertise to realize it. I hope you can at least consider this idea, as I believe it to be a good one.
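As an aside, this kind of timing comparison can be reproduced outside the horde worker. The sketch below uses the diffusers library, which is a different stack from both A1111 and the worker; the model name, prompt, step count, and seed are placeholders, and enable_model_cpu_offload() is only roughly analogous to A1111's --medvram.

```python
# Rough timing comparison sketch (diffusers, not the horde worker itself).
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
# Roughly analogous to --medvram: keep only the submodule currently running
# on the GPU, offloading the rest to system RAM (requires `accelerate`).
pipe.enable_model_cpu_offload()

prompt = "a photo of an astronaut riding a horse"

# 6 images as 6 batches of 1; compare against a single call with
# num_images_per_prompt=6 for the "batch size 6" case.
start = time.perf_counter()
for _ in range(6):
    generator = torch.Generator("cuda").manual_seed(1234)  # same seed each run
    pipe(prompt, num_inference_steps=20, generator=generator).images
print(f"6x batch size 1: {time.perf_counter() - start:.2f}s")
```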
It sounds like you probably need to adjust some of the configuration options already present in the horde worker. The settings you're talking about are specific to a program that is not used by the horde worker; something similar (but not identical) is in fact already present. If you come to the Discord, more real-time troubleshooting can be provided.
There is slow_workers, but there's also the generic stale timer, which is around 120 seconds for one image at 512x512. If your worker can stay below this threshold, you should be fine. Otherwise, as tazlin said, you can join us on Discord for easier troubleshooting.
OK, I'll do that.
When running 'horde-bridge.cmd' on my computer (Using device: CUDA 0: NVIDIA GeForce GTX 1660 SUPER), the process loads up properly and runs normally except for these two occasional errors:

"Model name requested SDXL_beta::stability.ai#6901 in bridgeData is unknown to us. Please check your configuration. Aborting!"

and

"This job took longer than average to process. Please consider lowering your max_power."
It continually runs at minimal CPU load, around 3%, while the GPU is almost always at 99% load. My GPU is the one stated above, a GTX 1660 Super. It's not the best performance-wise, but it is much better than what most computers run and meets the minimum VRAM requirements listed. However, when I use Stable Diffusion (and now your program) on the basic settings, it will occasionally crash. The workaround I've found for AUTOMATIC1111's web UI for Stable Diffusion is launching with the --medvram flag. This flag splits the work of generating the image across three areas: your computer's RAM, the processor, and the GPU. It allows me and many others to generate AI images consistently (and quickly) without having to worry about random crashes due to some GPU issue. I was not able to find this feature in this project, however. Unfortunately, this also means I am unable to leave my PC running as a worker for any significant period of time, and this likely acts as a barrier for many others as well.
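To illustrate the general idea behind that kind of offloading, here is a minimal, self-contained sketch: keep a submodule's weights in system RAM and move them to the GPU only while that submodule is running. This is a conceptual illustration only, not A1111's or the horde worker's actual implementation, and the helper name offload_between_calls is hypothetical.

```python
# Conceptual sketch of --medvram-style offloading: weights stay in system RAM
# except while their module is actually running on the GPU.
import torch
import torch.nn as nn

def offload_between_calls(module: nn.Module, device: str = "cuda") -> nn.Module:
    """Wrap module.forward so its weights live on the CPU except during a call."""
    original_forward = module.forward

    def forward(*args, **kwargs):
        module.to(device)              # load weights into VRAM
        try:
            return original_forward(*args, **kwargs)
        finally:
            module.to("cpu")           # release VRAM for the next stage
            torch.cuda.empty_cache()

    module.forward = forward
    return module

if torch.cuda.is_available():
    layer = offload_between_calls(nn.Linear(4096, 4096))
    x = torch.randn(1, 4096, device="cuda")
    y = layer(x)                       # weights are on the GPU only during this call
    print(y.shape, next(layer.parameters()).device)  # parameters are back on cpu
```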
The random crashing could be due to any number of issues, but I have a strong hunch it has to do with my GPU running at 100% capacity the whole time your program is active. Implementing something like medvram would drastically lower the barrier of entry to the 'horde' and as such multiply its strength. I appreciate your approach of crowdsourcing AI processing to make it accessible to everyone, and I hope to one day be a part of bringing this to life.