SOLVED: the problem was not Kohya or the update, but incorrect settings that had been changed unintentionally.
I had Kohya configured for LoRA training on an 8GB GeForce 2070.
I left all the basic settings and only enabled Gradient checkpointing and Memory efficient attention so that I could train with 8GB, as I had seen in some guides.
Today I updated Kohya and since then I get this error, which, if I understand correctly, is due to running out of VRAM.
Has anything changed?
I have tried increasing the Train batch size, but it doesn't change the result. Is there anything else to enable now for those of us with so little VRAM?
Of course I tried reinstalling everything, but nothing changed...
Thanks
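(For context, here is an illustrative PyTorch sketch, not kohya_ss code, of what the Gradient checkpointing option does under the hood: activations inside the checkpointed block are dropped after the forward pass and recomputed during backward, trading extra compute for lower VRAM. That recomputation in torch/utils/checkpoint.py is the frame where the error below is raised. All names here are made up for the example.)

```python
# Illustrative sketch only (not kohya_ss code): gradient checkpointing in PyTorch.
import torch
from torch.utils.checkpoint import checkpoint

# A small stand-in block; in training this would be part of the UNet/text encoder.
block = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
)

x = torch.randn(4, 512, requires_grad=True)
y = checkpoint(block, x)   # forward pass without storing intermediate activations
y.sum().backward()         # activations are recomputed here, then gradients flow
```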
The error:
Traceback (most recent call last):
File "D:\AI\kohya_ss\train_db.py", line 469, in
train(args)
File "D:\AI\kohya_ss\train_db.py", line 325, in train
accelerator.backward(loss)
File "D:\AI\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1681, in backward
self.scaler.scale(loss).backward(**kwargs)
File "D:\AI\kohya_ss\venv\lib\site-packages\torch_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "D:\AI\kohya_ss\venv\lib\site-packages\torch\autograd_init_.py", line 173, in backward
Variable.execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "D:\AI\kohya_ss\venv\lib\site-packages\torch\autograd\function.py", line 253, in apply
return user_fn(self, *args)
File "D:\AI\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py", line 146, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "D:\AI\kohya_ss\venv\lib\site-packages\torch\autograd_init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 8.00 GiB total capacity; 7.17 GiB already allocated; 0 bytes free; 7.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps: 0%| | 0/2000 [00:02<?, ?it/s]
Traceback (most recent call last):
File "C:\Users\danie\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\danie\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "D:\AI\kohya_ss\venv\Scripts\accelerate.exe_main.py", line 7, in
File "D:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
args.func(args)
File "D:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 923, in launch_command
simple_launcher(args)
File "D:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 579, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\AI\kohya_ss\venv\Scripts\python.exe', 'train_db.py', '--enable_bucket', '--pretrained_model_name_or_path=D:/AI/stable-diffusion-webui/models/Stable-diffusion/SD_v1.5_ema-pruned.safetensors', '--train_data_dir=C:/Users/danie/Desktop/Fantozzi/image', '--resolution=512,512', '--output_dir=C:/Users/danie/Desktop/Fantozzi/model', '--logging_dir=C:/Users/danie/Desktop/Fantozzi/log', '--save_model_as=safetensors', '--output_name=last', '--max_data_loader_n_workers=0', '--learning_rate=1e-05', '--lr_scheduler=cosine', '--lr_warmup_steps=200', '--train_batch_size=1', '--max_train_steps=2000', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--mem_eff_attn', '--gradient_checkpointing', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.
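For what it's worth, the CUDA out of memory message itself suggests trying max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF to avoid fragmentation. A minimal sketch of how that variable could be set is below; the 128 value is only an example and not something from this thread, and the variable has to be in the environment before the training process makes any CUDA allocation (for instance, set in the shell before running accelerate / train_db.py).

```python
# Minimal sketch, values are examples only. PYTORCH_CUDA_ALLOC_CONF must be set
# before any CUDA allocation happens, so in practice it is set in the shell or at
# the very top of the entry script, not somewhere mid-training.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # free / total VRAM on GPU 0
    print(f"VRAM free: {free_bytes / 1024**3:.2f} GiB of {total_bytes / 1024**3:.2f} GiB")
```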