
Training stopped in the middle #688

Closed
shayvidas opened this issue Apr 30, 2023 · 1 comment
shayvidas commented Apr 30, 2023

I get this error every time I try to train my model.

What am I doing wrong?

Here is the stack trace:

```

System Information:
System: Windows, Release: 10, Version: 10.0.19045, Machine: AMD64, Processor: Intel64 Family 6 Model 191 Stepping 2, GenuineIntel

Python Information:
Version: 3.10.11, Implementation: CPython, Compiler: MSC v.1929 64 bit (AMD64)

Virtual Environment Information:
Path: C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\venv

GPU Information:
Name: NVIDIA GeForce RTX 4070 Ti, VRAM: 12282 MiB

Validating that requirements are satisfied.
All requirements satisfied.
Load CSS...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Loading config...
Save...
Save...
Save...
Folder 40_aevt woman: 14 images found
Folder 40_aevt woman: 560 steps
max_train_steps = 5600
stop_text_encoder_training = 0
lr_warmup_steps = 560
accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket --pretrained_model_name_or_path="C:/Users/Pancake and Berry/Downloads/realisticVisionV20_v20.safetensors" --train_data_dir="C:\Users\Pancake and Berry\Documents\AI\koya\Lora training Images\angelicaThreeTraining\img" --reg_data_dir="C:\Users\Pancake and Berry\Documents\AI\koya\Lora training Images\angelicaThreeTraining\reg" --resolution=512,512 --output_dir="C:\Users\Pancake and Berry\Documents\AI\koya\Lora training Images\angelicaThreeTraining\model" --logging_dir="C:\Users\Pancake and Berry\Documents\AI\koya\Lora training Images\angelicaThreeTraining\log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-5 --unet_lr=0.0001 --network_dim=128 --output_name="aevt1" --lr_scheduler_num_cycles="10" --learning_rate="0.0001" --lr_scheduler="cosine" --lr_warmup_steps="560" --train_batch_size="1" --max_train_steps="5600" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="0" --resume="D:\lora_training_states" --bucket_reso_steps=64 --save_state --xformers --bucket_no_upscale
prepare tokenizer
Use DreamBooth method.
prepare images.
found directory C:\Users\Pancake and Berry\Documents\AI\koya\Lora training Images\angelicaThreeTraining\img\40_aevt woman contains 14 image files
found directory C:\Users\Pancake and Berry\Documents\AI\koya\Lora training Images\angelicaThreeTraining\reg\1_woman contains 560 image files
560 train images with repeating.
560 reg images.
[Dataset 0]
  batch_size: 1
  resolution: (512, 512)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 64
  bucket_no_upscale: True

  [Subset 0 of Dataset 0]
    image_dir: "C:\Users\Pancake and Berry\Documents\AI\koya\Lora training Images\angelicaThreeTraining\img\40_aevt woman"
    image_count: 14
    num_repeats: 40
    shuffle_caption: False
    keep_tokens: 0
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: aevt woman
    caption_extension: .caption

  [Subset 1 of Dataset 0]
    image_dir: "C:\Users\Pancake and Berry\Documents\AI\koya\Lora training Images\angelicaThreeTraining\reg\1_woman"
    image_count: 560
    num_repeats: 1
    shuffle_caption: False
    keep_tokens: 0
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: True
    class_tokens: woman
    caption_extension: .caption


[Dataset 0]
loading image sizes.
100%|█████████████████████████████████████████████████████████████████████████████| 574/574 [00:00<00:00, 17937.73it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (384, 640), count: 560
bucket 1: resolution (512, 512), count: 560
mean ar error (without repeats): 0.0009146341463414629
prepare accelerator
Using accelerator 0.15.0 or above.
loading model for process 0/1
load StableDiffusion checkpoint
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
Replace CrossAttention.forward to use xformers
[Dataset 0]
caching latents.
100%|████████████████████████████████████████████████████████████████████████████████| 574/574 [00:27<00:00, 20.59it/s]
import network module: networks.lora
create LoRA network. base dim (rank): 128, alpha: 1.0
create LoRA for Text Encoder: 72 modules.
create LoRA for U-Net: 192 modules.
enable LoRA for text encoder
enable LoRA for U-Net
prepare optimizer, data loader etc.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA SETUP: Loading binary C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit AdamW optimizer | {}
resume training from local state: D:\lora_training_states
Traceback (most recent call last):
  File "C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\train_network.py", line 773, in <module>
    train(args)
  File "C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\train_network.py", line 317, in train
    train_util.resume_from_local_or_hf_if_specified(accelerator, args)
  File "C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\library\train_util.py", line 2354, in resume_from_local_or_hf_if_specified
    accelerator.load_state(args.resume)
  File "C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1722, in load_state
    load_accelerator_state(input_dir, models, optimizers, schedulers, self.state.process_index, self.scaler)
  File "C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\venv\lib\site-packages\accelerate\checkpointing.py", line 134, in load_accelerator_state
    models[i].load_state_dict(torch.load(input_model_file, map_location="cpu"))
  File "C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\venv\lib\site-packages\torch\serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\venv\lib\site-packages\torch\serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\venv\lib\site-packages\torch\serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\lora_training_states\\pytorch_model.bin'
Traceback (most recent call last):
  File "C:\Users\Pancake and Berry\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Pancake and Berry\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "C:\Users\Pancake and Berry\Documents\AI\koya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\Pancake and Berry\\Documents\\AI\\koya\\kohya_ss\\venv\\Scripts\\python.exe', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=C:/Users/Pancake and Berry/Downloads/realisticVisionV20_v20.safetensors', '--train_data_dir=C:\\Users\\Pancake and Berry\\Documents\\AI\\koya\\Lora training Images\\angelicaThreeTraining\\img', '--reg_data_dir=C:\\Users\\Pancake and Berry\\Documents\\AI\\koya\\Lora training Images\\angelicaThreeTraining\\reg', '--resolution=512,512', '--output_dir=C:\\Users\\Pancake and Berry\\Documents\\AI\\koya\\Lora training Images\\angelicaThreeTraining\\model', '--logging_dir=C:\\Users\\Pancake and Berry\\Documents\\AI\\koya\\Lora training Images\\angelicaThreeTraining\\log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-5', '--unet_lr=0.0001', '--network_dim=128', '--output_name=aevt1', '--lr_scheduler_num_cycles=10', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=560', '--train_batch_size=1', '--max_train_steps=5600', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_data_loader_n_workers=0', '--resume=D:\\lora_training_states', '--bucket_reso_steps=64', '--save_state', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.

```
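The failing call is `accelerator.load_state(args.resume)`: accelerate expects the checkpoint files written by a previous `--save_state` run (`pytorch_model.bin`, `optimizer.bin`, and so on) to sit directly inside the `--resume` directory, and the kohya_ss scripts write those state folders under `output_dir`, not at an arbitrary path. A minimal pre-flight sketch, assuming those standard file names (the helper is mine, not part of kohya_ss):

```python
from pathlib import Path

def missing_state_files(state_dir):
    """Return the accelerate checkpoint files absent from state_dir.

    accelerate's load_state (invoked when --resume is given) looks for the
    files written by save_state directly inside the given directory and
    raises FileNotFoundError otherwise, exactly as in the traceback above.
    """
    expected = ["pytorch_model.bin", "optimizer.bin"]
    directory = Path(state_dir)
    return [name for name in expected if not (directory / name).is_file()]

# The directory passed via --resume in the log above; if it is empty or is
# only the parent of the real state folder, resuming fails like this.
print(missing_state_files(r"D:\lora_training_states"))
```

If the printed list is non-empty, point `--resume` at the actual state folder that `--save_state` created, or drop `--resume` to start fresh.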

bmaltais (Owner) commented May 6, 2023

Just for fun, try the plain AdamW optimizer. It looks like your card has an issue with bitsandbytes.

Are you using the latest release? It includes a new bitsandbytes version. Do this:

git pull
.\setup.bat

Uninstall the previous torch version when asked, then install the torch 2 version.
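The suggested optimizer switch is a one-flag change to the `accelerate launch` command from the log: `--optimizer_type="AdamW"` instead of `--optimizer_type="AdamW8bit"`, which avoids loading bitsandbytes at all. A sketch of the substitution (the helper name is mine, not from kohya_ss):

```python
def use_plain_adamw(launch_args):
    """Replace the 8-bit optimizer flag with plain AdamW in an argument list,
    so the bitsandbytes DLL is never loaded."""
    return ["--optimizer_type=AdamW" if arg.startswith("--optimizer_type=")
            else arg for arg in launch_args]

# Abbreviated version of the command from the log above
cmd = ["--train_batch_size=1", "--optimizer_type=AdamW8bit", "--xformers"]
print(use_plain_adamw(cmd))
# -> ['--train_batch_size=1', '--optimizer_type=AdamW', '--xformers']
```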

bmaltais pushed a commit that referenced this issue Jul 29, 2023