RuntimeError: CUDA out of memory. Tried to allocate 146.00 MiB (GPU 0; 8.00 GiB total capacity; 7.21 GiB already allocated; 0 bytes free; 7.32 GiB reserved in total by PyTorch) #623

Closed
Cynaxia opened this issue Apr 15, 2023 · 2 comments

Comments

Cynaxia commented Apr 15, 2023

Folder 100_Cynaxia : 1500 steps
max_train_steps = 1500
stop_text_encoder_training = 0
lr_warmup_steps = 0
accelerate launch --num_cpu_threads_per_process=2 "train_db.py" --pretrained_model_name_or_path="E:/stable-diffusion/stable-diffusion-webui/models/Stable-diffusion/WaifuDiffusion.ckpt" --train_data_dir="E:/LORA Training/Cynaxia Live2D w Captions/Cynaxia Live2D LoRA/image" --resolution=512,512 --output_dir="E:/LORA Training/Cynaxia Live2D w Captions/Cynaxia Live2D LoRA/model" --logging_dir="E:/LORA Training/Cynaxia Live2D w Captions/Cynaxia Live2D LoRA/model" --save_model_as=safetensors --output_name="Cynaxialive2d" --max_data_loader_n_workers="1" --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="1500" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --seed="1234" --caption_extension=".txt" --cache_latents --optimizer_type="AdamW" --max_data_loader_n_workers="1" --clip_skip=2 --bucket_reso_steps=64 --mem_eff_attn --gradient_checkpointing --xformers --bucket_no_upscale
prepare tokenizer
prepare images.
found directory E:\LORA Training\Cynaxia Live2D w Captions\Cynaxia Live2D LoRA\image\100_Cynaxia contains 15 image files
1500 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
batch_size: 1
resolution: (512, 512)
enable_bucket: False

[Subset 0 of Dataset 0]
image_dir: "E:\LORA Training\Cynaxia Live2D w Captions\Cynaxia Live2D LoRA\image\100_Cynaxia"
image_count: 15
num_repeats: 100
shuffle_caption: False
keep_tokens: 0
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: Cynaxia
caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 2499.19it/s]
prepare dataset
prepare accelerator
Using accelerator 0.15.0 or above.
load StableDiffusion checkpoint
loading u-net:
loading vae:
loading text encoder:
Replace CrossAttention.forward to use FlashAttention (not xformers)
[Dataset 0]
caching latents.
100%|██████████████████████████████████████████████████████████████████████████████████| 15/15 [00:03<00:00, 3.79it/s]
prepare optimizer, data loader etc.
use AdamW optimizer | {}
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 1500
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 1500
num epochs / epoch数: 1
batch size per device / バッチサイズ: 1
total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1500
steps: 0%| | 0/1500 [00:00<?, ?it/s]epoch 1/1
Traceback (most recent call last):
File "E:\Kohya\kohya_ss\train_db.py", line 435, in
train(args)
File "E:\Kohya\kohya_ss\train_db.py", line 315, in train
accelerator.backward(loss)
File "E:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1314, in backward
self.scaler.scale(loss).backward(**kwargs)
File "E:\Kohya\kohya_ss\venv\lib\site-packages\torch_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "E:\Kohya\kohya_ss\venv\lib\site-packages\torch\autograd_init_.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 146.00 MiB (GPU 0; 8.00 GiB total capacity; 7.21 GiB already allocated; 0 bytes free; 7.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps: 0%| | 0/1500 [00:05<?, ?it/s]
Traceback (most recent call last):
File "C:\Users\Cynax\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Cynax\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "E:\Kohya\kohya_ss\venv\Scripts\accelerate.exe_main
.py", line 7, in
File "E:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
args.func(args)
File "E:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
simple_launcher(args)
File "E:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['E:\Kohya\kohya_ss\venv\Scripts\python.exe', 'train_db.py', '--pretrained_model_name_or_path=E:/stable-diffusion/stable-diffusion-webui/models/Stable-diffusion/WaifuDiffusion.ckpt', '--train_data_dir=E:/LORA Training/Cynaxia Live2D w Captions/Cynaxia Live2D LoRA/image', '--resolution=512,512', '--output_dir=E:/LORA Training/Cynaxia Live2D w Captions/Cynaxia Live2D LoRA/model', '--logging_dir=E:/LORA Training/Cynaxia Live2D w Captions/Cynaxia Live2D LoRA/model', '--save_model_as=safetensors', '--output_name=Cynaxialive2d', '--max_data_loader_n_workers=1', '--learning_rate=0.0001', '--lr_scheduler=constant', '--train_batch_size=1', '--max_train_steps=1500', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1234', '--caption_extension=.txt', '--cache_latents', '--optimizer_type=AdamW', '--max_data_loader_n_workers=1', '--clip_skip=2', '--bucket_reso_steps=64', '--mem_eff_attn', '--gradient_checkpointing', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.

I'm struggling to fix this issue. The first time, I managed to run 768x768 training; it got to roughly 15% before I closed the CMD window because I didn't have time to wait for it to finish.
When I launched exactly the same run later, the CUDA out-of-memory error appeared and training wouldn't even start.
I tried a lower resolution, 512x512, and it's the same thing: it won't even start.
I've read posts on Stack Overflow suggesting running:
import torch
torch.cuda.empty_cache()
but I'm a newbie and don't really know how or where to do that.
Any suggestions/help?
Thanks in advance!
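
For reference, a minimal sketch of what those Stack Overflow posts are suggesting, run from a Python prompt inside the kohya_ss venv (the 128 MiB value and the use of os.environ are illustrative assumptions, not settings from this thread). Note that torch.cuda.empty_cache() only releases memory cached by the process it runs in, so it cannot reclaim VRAM still held by a stale python.exe left over from an interrupted run; that process has to exit first.

# Sketch only: run inside the kohya_ss venv's Python interpreter (assumption).
import os

# The error message's own hint: cap allocator block splitting to reduce
# fragmentation. Must be set before CUDA is first initialized; 128 is just
# an example value, not a recommendation from this thread.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

if torch.cuda.is_available():
    torch.cuda.empty_cache()  # frees cached blocks held by *this* process only
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MiB, "
          f"reserved: {torch.cuda.memory_reserved() / 2**20:.0f} MiB")

To apply max_split_size_mb to the training run itself, the same variable can be set in the CMD window before the accelerate launch command, e.g. set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128.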

@electricbee

Looks like you're trying to train a LoRA but accidentally started the Dreambooth trainer instead.
I have done the same thing an embarrassing number of times.


Cynaxia commented Apr 15, 2023

Looks like you're trying to train a LoRA but accidentally started the Dreambooth trainer instead. I have done the same thing an embarrassing number of times.

Yeah, apparently so; the late-night rush to fix the problem did its thing (:
Made sure to run the LoRA trainer this time and it worked on the first try, thanks for your help!

Cynaxia closed this as completed on Apr 15, 2023
bmaltais pushed a commit that referenced this issue Aug 4, 2023
…windows (#623)

* ADD libbitsandbytes.dll for 0.38.1

* Delete libbitsandbytes_cuda116.dll

* Delete cextension.py

* add main.py

* Update requirements.txt for bitsandbytes 0.38.1

* Update README.md for bitsandbytes-windows

* Update README-ja.md  for bitsandbytes 0.38.1

* Update main.py for return cuda118

* Update train_util.py for lion8bit

* Update train_README-ja.md for lion8bit

* Update train_util.py for add DAdaptAdan and DAdaptSGD

* Update train_util.py for DAdaptadam

* Update train_network.py for dadapt

* Update train_README-ja.md for DAdapt

* Update train_util.py for DAdapt

* Update train_network.py for DAdaptAdaGrad

* Update train_db.py for DAdapt

* Update fine_tune.py for DAdapt

* Update train_textual_inversion.py for DAdapt

* Update train_textual_inversion_XTI.py for DAdapt

* Revert "Merge branch 'qinglong' into main"

This reverts commit b65c023083d6d1e8a30eb42eddd603d1aac97650, reversing
changes made to f6fda20caf5e773d56bcfb5c4575c650bb85362b.

* Revert "Update requirements.txt for bitsandbytes 0.38.1"

This reverts commit 83abc60dfaddb26845f54228425b98dd67997528.

* Revert "Delete cextension.py"

This reverts commit 3ba4dfe046874393f2a022a4cbef3628ada35391.

* Revert "Update README.md for bitsandbytes-windows"

This reverts commit 4642c52086b5e9791233007e2fdfd97f832cd897.

* Revert "Update README-ja.md  for bitsandbytes 0.38.1"

This reverts commit fa6d7485ac067ebc49e6f381afdb8dd2f12caa8f.

* Update train_util.py for DAdaptLion

* Update train_README-zh.md for dadaptlion

* Update train_README-ja.md for DAdaptLion

* add DAdatpt V3

* Alignment

* Update train_util.py for experimental

* Update train_util.py V3

* Update train_util.py

* Update requirements.txt

* Update train_README-zh.md

* Update train_README-ja.md

* Update train_util.py fix

* Update train_util.py

* support Prodigy

* add lower

* Update main.py

* support PagedAdamW8bit/PagedLion8bit

* Update requirements.txt

* update for PageAdamW8bit and PagedLion8bit

* Revert

* revert main

* Update train_util.py

* update for bitsandbytes 0.39.1

* Update requirements.txt

* vram leak fix

---------

Co-authored-by: Pam <[email protected]>