
conversion can fail at ControlNet on machines with 16GB of system memory #369

Closed

ssube opened this issue May 1, 2023 · 3 comments

@ssube (Owner) commented May 1, 2023

Converting models on a system with 16GB of system memory, without the onnx-fp16 or torch-fp16 optimizations, can lead to an out-of-memory error:

[screenshot: out-of-memory error during conversion]

This appears to be related to the ControlNet conversion, which makes some sense, because the ControlNet is effectively a copy of the UNet and the UNet is already the largest model. However, @HoopyFreud reports that this was not happening with the previous conversion method, so it may also be related to #337.

I have not documented the minimum specs yet, but would like to support 4c/16GB machines with 4GB of VRAM.
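
For context, the onnx-fp16 optimization mentioned above post-processes an already-exported fp32 graph into fp16, roughly halving both the file size and the memory needed to reload it. A minimal sketch using onnxconverter-common; the paths are illustrative, not the converter's actual layout:

```python
import onnx
from onnxconverter_common import float16

# Load an already-exported fp32 graph and rewrite its tensors to fp16.
# This roughly halves the file size and the memory needed to reload it
# during the optimization pass.
model = onnx.load("unet/model.onnx")  # illustrative path
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "unet/model.fp16.onnx")
```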

@ssube added the status/planned, scope/convert, model/diffusion, and pipeline/controlnet labels on May 1, 2023
@ssube added this to the v0.10 milestone on May 1, 2023
@ssube (Owner, Author) commented May 15, 2023

I've confirmed this on a 16GB laptop, where conversion almost always fails with an Aborted or out-of-memory error.

Looking at the memory profiler, the usage during conversion can easily hit 15GB:

[screenshot: memory profiler showing conversion peaking near 15GB]

That doesn't leave much headroom for the rest of the system, and conversion will typically fail on a machine with 16GB. According to the profiler, some 6GB of that is held by Torch and/or CUDA.
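
For anyone reproducing this, a simple way to capture the peak is to sample the converter process's RSS from a background thread. This is a generic psutil sketch, not part of onnx-web:

```python
import os
import threading
import time

import psutil

def watch_rss(interval: float = 1.0) -> None:
    # Print this process's resident set size once per interval; on a 16GB
    # machine the peaks reported above sit near 15GB.
    proc = psutil.Process(os.getpid())
    peak = 0
    while True:
        rss = proc.memory_info().rss
        peak = max(peak, rss)
        print(f"rss={rss / 2**30:.2f}GiB peak={peak / 2**30:.2f}GiB")
        time.sleep(interval)

threading.Thread(target=watch_rss, daemon=True).start()
# ...run the conversion on the main thread...
```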

When converting diffusion models, the full pipeline is loaded, each model is exported to ONNX, and then the ONNX models are reloaded to be optimized (single external tensor file, fp16, etc.). Recent changes for ControlNet added some additional load_model calls before the UNet has been fully unloaded, which can load a second copy of the CNet/UNet and take another 3-4GB of memory.
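
The fix should look roughly like the pattern below: drop every reference to the UNet and force a collection before the ControlNet is loaded, so only one copy of the largest model is resident at a time. load_unet, load_cnet, and export_onnx are placeholders here, not the converter's actual functions:

```python
import gc

import torch

def convert_sequentially(load_unet, load_cnet, export_onnx):
    # Export the UNet first, then release it completely before the
    # ControlNet (effectively a second UNet-sized model) is loaded.
    unet = load_unet()
    export_onnx(unet, "unet")
    del unet
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached CUDA blocks to the driver

    cnet = load_cnet()
    export_onnx(cnet, "cnet")
    del cnet
    gc.collect()
```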

@ssube (Owner, Author) commented May 16, 2023

This is working better: I was able to convert the base models on a 16GB laptop, but it can still freeze during conversion, and converting multiple models in a row seems to make that more likely. Based on the current info, it appears to get stuck during the cnet conversion: there is a valid unet model, and the cnet folder exists but is empty.
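
A quick way to spot the stuck case is to check for the output files rather than the folders, since the freeze leaves the cnet directory present but empty. This is a hypothetical check; the model.onnx file names follow the usual ONNX pipeline layout and may not match the converter's exactly:

```python
from pathlib import Path

def conversion_complete(model_dir: str) -> bool:
    # A frozen conversion leaves a valid unet/model.onnx behind while the
    # cnet directory exists but contains no model file.
    unet = Path(model_dir, "unet", "model.onnx")
    cnet = Path(model_dir, "cnet", "model.onnx")
    return unet.is_file() and cnet.is_file()
```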

@ssube added the status/progress label and removed the status/planned label on Jun 9, 2023
@ssube (Owner, Author) commented Dec 25, 2023

The new optimum-based converter should fully unload the unet before converting the cnet, with no option to share them. That should reduce memory use to the minimum possible through that code path.
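
For anyone following along, this is roughly the optimum entry point the new converter builds on. The model ID and output path are illustrative, and the ControlNet handling itself is onnx-web-specific and not shown:

```python
from optimum.exporters.onnx import main_export

# Sketch: export a diffusion pipeline to ONNX via optimum's exporter.
# onnx-web's new converter builds on this path and, per the comment above,
# fully unloads the unet before the cnet is converted.
main_export(
    "runwayml/stable-diffusion-v1-5",  # illustrative model ID
    output="out/sd15-onnx",
    task="stable-diffusion",
)
```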

@ssube closed this as completed on Dec 25, 2023
@ssube added the status/fixed label and removed the status/progress label on Dec 25, 2023