Change base image to nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 #12
Not working yet due to an error from the bitsandbytes library, possibly related to CUDA 11.8. (This is just an experimental training log; ignore the missing-caption-file warnings.)
$ sudo docker build -t aoirint/sd_scripts .
$ sudo docker run --rm --gpus all \
-v "./base_model:/base_model" \
-v "./work:/work" \
-v "./cache/huggingface/hub:/home/user/.cache/huggingface/hub" \
aoirint/sd_scripts \
train_network.py \
--config_file /work/train_config/train_20231103.1/config.toml
2023-11-03 09:57:13.728649: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-03 09:57:13.856718: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-03 09:57:14.544165: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-11-03 09:57:14.544257: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-11-03 09:57:14.544266: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-11-03 09:57:17.093996: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-03 09:57:17.218575: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-03 09:57:17.889476: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-11-03 09:57:17.889556: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-11-03 09:57:17.889567: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Loading settings from /work/train_config/train_20231103.1/config.toml...
/work/train_config/train_20231103.1/config
prepare tokenizer
update token length: 150
Using DreamBooth method.
prepare images.
found directory /work/my_dataset-20230715.1/train_img/10_shs girl contains 51 image files
No caption file found for 51 images. Training will continue without captions for these images. If class token exists, it will be used. / 51枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します。class tokenが存在する場合はそれを使います。
/work/my_dataset-20230715.1/train_img/10_shs girl/0001.png
/work/my_dataset-20230715.1/train_img/10_shs girl/0002.png
/work/my_dataset-20230715.1/train_img/10_shs girl/0003.png
/work/my_dataset-20230715.1/train_img/10_shs girl/0004.png
/work/my_dataset-20230715.1/train_img/10_shs girl/0005.png
/work/my_dataset-20230715.1/train_img/10_shs girl/0006.png... and 46 more
found directory /work/my_dataset-20230715.1/reg_img/1_1girl contains 500 image files
No caption file found for 500 images. Training will continue without captions for these images. If class token exists, it will be used. / 500枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します。class tokenが存在する場合はそれを使います。
/work/my_dataset-20230715.1/reg_img/1_1girl/transparent_1.png
/work/my_dataset-20230715.1/reg_img/1_1girl/transparent_10.png
/work/my_dataset-20230715.1/reg_img/1_1girl/transparent_100.png
/work/my_dataset-20230715.1/reg_img/1_1girl/transparent_101.png
/work/my_dataset-20230715.1/reg_img/1_1girl/transparent_102.png
/work/my_dataset-20230715.1/reg_img/1_1girl/transparent_103.png... and 495 more
510 train images with repeating.
500 reg images.
[Dataset 0]
batch_size: 2
resolution: (512, 512)
enable_bucket: True
min_bucket_reso: 320
max_bucket_reso: 960
bucket_reso_steps: 64
bucket_no_upscale: False
[Subset 0 of Dataset 0]
image_dir: "/work/my_dataset-20230715.1/train_img/10_shs girl"
image_count: 51
num_repeats: 10
shuffle_caption: False
keep_tokens: 0
caption_dropout_rate: 0.05
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: shs girl
caption_extension: .txt
[Subset 1 of Dataset 0]
image_dir: "/work/my_dataset-20230715.1/reg_img/1_1girl"
image_count: 500
num_repeats: 1
shuffle_caption: False
keep_tokens: 0
caption_dropout_rate: 0.05
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: True
class_tokens: 1girl
caption_extension: .txt
[Dataset 0]
loading image sizes.
0%| | 0/551 [00:00<?, ?it/s]
make buckets
100%|██████████| 551/551 [00:00<00:00, 6029.49it/s]
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (320, 704), count: 20
bucket 1: resolution (320, 768), count: 20
bucket 2: resolution (384, 640), count: 60
bucket 3: resolution (448, 576), count: 140
bucket 4: resolution (512, 512), count: 780
mean ar error (without repeats): 0.002443827835159254
preparing accelerator
Using accelerator 0.15.0 or above.
loading model for process 0/1
load StableDiffusion checkpoint: /base_model/wd-1-5-beta2-fp32.safetensors
/home/user/.local/lib/python3.10/site-packages/safetensors/torch.py:98: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
CrossAttention.forward has been replaced to enable xformers.
import network module: lycoris.kohya
[Dataset 0]
caching latents.
0it [00:00, ?it/s]
Using rank adaptation algo: full
Disable conv layer
Use Dropout value: 0.0
Create LyCORIS Module
create LyCORIS for Text Encoder: 138 modules.
Create LyCORIS Module
create LyCORIS for U-Net: 256 modules.
module type table: {'FullModule': 330, 'NormModule': 64}
enable LyCORIS for text encoder
enable LyCORIS for U-Net
preparing optimizer, data loader etc.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
/home/user/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64'), PosixPath('/home/user/.local/lib/python3.10/site-packages/cv2/../../lib64')}
warn(
/home/user/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:105: UserWarning: /home/user/.local/lib/python3.10/site-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain libcudart.so as expected! Searching further paths...
warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA SETUP: Loading binary /home/user/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
/home/user/.local/lib/python3.10/site-packages/bitsandbytes/cextension.py:48: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn(
use 8-bit AdamW optimizer | {'weight_decay': 0.1, 'betas': (0.9, 0.99)}
override steps. steps for 2 epochs is / 指定エポックまでのステップ数: 1020
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 510
num reg images / 正則化画像の数: 500
num batches per epoch / 1epochのバッチ数: 510
num epochs / epoch数: 2
batch size per device / バッチサイズ: 2
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1020
steps: 0%| | 0/1020 [00:00<?, ?it/s]
epoch 1/2
/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py:459: UserWarning: Applied workaround for CuDNN issue, install nvrtc.so (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:80.)
return F.conv2d(input, weight, bias, self.stride,
Blocksparse is not available: the current GPU does not expose Tensor cores
Traceback (most recent call last):
File "/code/sd-scripts/train_network.py", line 873, in <module>
train(args)
File "/code/sd-scripts/train_network.py", line 688, in train
optimizer.step()
File "/home/user/.local/lib/python3.10/site-packages/accelerate/optimizer.py", line 134, in step
self.scaler.step(self.optimizer, closure)
File "/home/user/.local/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 374, in step
retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 290, in _maybe_opt_step
retval = optimizer.step(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 265, in step
self.update_step(group, p, gindex, pindex)
File "/home/user/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 506, in update_step
F.optimizer_update_8bit_blockwise(
File "/home/user/.local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 858, in optimizer_update_8bit_blockwise
str2optimizer8bit_blockwise[optimizer_name][0](
NameError: name 'str2optimizer8bit_blockwise' is not defined
steps: 0%| | 0/1020 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/home/user/.local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/user/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/user/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
simple_launcher(args)
File "/home/user/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/python/bin/python3.10', 'train_network.py', '--config_file', '/work/train_config/train_20231103.1/config.toml']' returned non-zero exit status 1.
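The log above shows bitsandbytes failing to find libcudart.so, loading libbitsandbytes_cpu.so instead, and then raising the NameError inside the 8-bit optimizer step. A minimal sketch for checking, from inside the container, whether any CUDA runtime library is visible on the paths the log mentions. The helper names `find_libcudart` and `default_search_dirs` are hypothetical, and the glob pattern `libcudart.so*` (which also matches versioned file names) is an assumption, not necessarily what bitsandbytes itself searches for.

```python
import glob
import os


def find_libcudart(search_dirs):
    """Return sorted paths of libcudart shared objects found in search_dirs."""
    matches = []
    for directory in search_dirs:
        if not os.path.isdir(directory):
            # Same situation as the "non-existent directories" warning in the log.
            continue
        matches.extend(glob.glob(os.path.join(directory, "libcudart.so*")))
    return sorted(matches)


def default_search_dirs(env=None):
    """Directories from LD_LIBRARY_PATH plus /usr/local/cuda/lib64 (both appear in the log)."""
    env = os.environ if env is None else env
    dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(":") if d]
    dirs.append("/usr/local/cuda/lib64")
    return dirs


if __name__ == "__main__":
    found = find_libcudart(default_search_dirs())
    print("\n".join(found) if found else "no libcudart.so* found")
```

If this prints nothing useful inside the runtime image, the CPU-only fallback in the log is at least consistent with a missing or differently named CUDA runtime library rather than with the training config.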
On the sd-scripts side, …
There is no …
Updating bitsandbytes to 0.41.1 made the error go away, so this looks good.
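A small sketch for guarding against a regression to an older bitsandbytes in the image. It only compares the installed version against 0.41.1, the version named in the comment above; the function names are hypothetical and the version parsing is deliberately simplistic (numeric dotted components only).

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.41.1' into (0, 41, 1); stop at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed, minimum="0.41.1"):
    """True if the installed version is the minimum version or newer."""
    return parse_version(installed) >= parse_version(minimum)


def installed_bitsandbytes_version():
    """Return the installed bitsandbytes version string, or None if it is not installed."""
    try:
        return metadata.version("bitsandbytes")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_bitsandbytes_version()
    if version is None:
        print("bitsandbytes is not installed")
    elif is_at_least(version):
        print(f"bitsandbytes {version} is 0.41.1 or newer")
    else:
        print(f"bitsandbytes {version} is older than 0.41.1")
```

This could run as a step at the end of the Docker build so that a stale pin is noticed before a training run starts.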
runtime base image for size reduction #11