IPEX update to PyTorch 2.1 and Bundle-in MKL & DPCPP #1772

Merged · 1 commit into bmaltais:dev2 on Dec 16, 2023

Conversation

@Disty0 (Contributor) commented Dec 14, 2023

Needs the dev branch of sd-scripts. (This PR is needed: kohya-ss/sd-scripts#1003)

Updates IPEX and PyTorch to 2.1.
Training a LoRA is ~50% faster compared to PyTorch 2.0 with IPEX.

Bundles in MKL & DPCPP so we don't have to manually install the whole OneAPI BaseKit. (Saves 20 GB of disk space.)

If you get a file not found error after this update, run setup.sh --use-ipex or activate your system OneAPI manually with source /opt/intel/oneapi/setvars.sh.
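
That is, from the kohya_ss folder (the setvars.sh path assumes a default system-wide OneAPI BaseKit install; adjust it if yours lives elsewhere):

./setup.sh --use-ipex
# or fall back to the system OneAPI environment:
source /opt/intel/oneapi/setvars.sh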

Also a side note:
IPEX 2.1 should have proper Windows support, but I don't have a device with Windows installed and I don't have much experience with .bat scripting either.
It would be nice if someone else stepped in to add --use-ipex to gui.bat and setup.bat. I don't plan to install Windows in the near future.

@bmaltais merged commit 1508bf9 into bmaltais:dev2 on Dec 16, 2023 (1 check failed)
@Cbender86

This one looks really nice. I'm using an Intel Arc A770 XPU, and this should run a lot faster without the need for all the libs (and without pointing LD_LIBRARY_PATH at the correct version of the libs). But when I try to train a LoRA, I get a runtime error:

Traceback (most recent call last):
File "/home/cbender/kohya_ss/./train_network.py", line 1012, in
trainer.train(args)
File "/home/cbender/kohya_ss/./train_network.py", line 262, in train
vae.to(accelerator.device, dtype=vae_dtype)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: PyTorch is not linked with support for cuda devices

Which is correct...

2023-12-19 21:59:26.719764: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-19 21:59:26.734095: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-19 21:59:26.734140: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-19 21:59:26.734153: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-19 21:59:26.737536: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-19 21:59:26.737711: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-19 21:59:27.274643: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-12-19 21:59:27.491865: I itex/core/wrapper/itex_cpu_wrapper.cc:70] Intel Extension for Tensorflow* AVX2 CPU backend is loaded.
2023-12-19 21:59:27.703009: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
2023-12-19 21:59:27.737384: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
2023-12-19 21:59:27.737434: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.

What am I missing?

@Disty0 (Contributor, Author) commented Dec 19, 2023

As I stated in the PR, the sd-scripts dev branch is required, but it's not merged into kohya_ss yet.
This PR, to be specific: kohya-ss/sd-scripts#1003

You can replace the library/ipex folder with the one from the sd-scripts dev branch.
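
Roughly something like this (the checkout paths are just an example; adjust them to wherever your kohya_ss install and a dev checkout of sd-scripts live):

git clone -b dev https://github.com/kohya-ss/sd-scripts /tmp/sd-scripts
rm -rf ~/kohya_ss/library/ipex
cp -r /tmp/sd-scripts/library/ipex ~/kohya_ss/library/ipex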

@Cbender86

Thanks for your comment, I have copied over the library/ipex folder as you suggested.
With the dev2 branch combined with the library/ipex folder from the sd-scripts dev branch, I get some free()-related errors, namely 'double free or corruption (out)' when trying to cache latents or, when I disable caching latents, 'free(): invalid pointer'.

Stack Trace and Params:

22:08:46-494555 INFO Start training LoRA Standard ...
22:08:46-495409 INFO Checking for duplicate image filenames in training data directory...
22:08:46-496280 INFO Valid image folder names found in: /home/cbender/kohya_ss/model/v6/name
22:08:46-496962 INFO Valid image folder names found in: /home/cbender/kohya_ss/model/v6/woman
22:08:46-497635 INFO Folder 15_name: 9 images found
22:08:46-498154 INFO Folder 15_name: 135 steps
22:08:46-498622 WARNING Regularisation images are used... Will double the number of steps required...
22:08:46-499137 INFO Total steps: 135
22:08:46-499579 INFO Train batch size: 1
22:08:46-500056 INFO Gradient accumulation steps: 3
22:08:46-500506 INFO Epoch: 20
22:08:46-500945 INFO Regulatization factor: 2
22:08:46-501397 INFO max_train_steps (135 / 1 / 3 * 20 * 2) = 1800
22:08:46-501950 INFO stop_text_encoder_training = 0
22:08:46-502431 INFO lr_warmup_steps = 0
22:08:46-502950 INFO Saving training config to
/home/cbender/kohya_ss/model/v6/output/loranamev6_20231220-220846.json...
22:08:46-503642 INFO accelerate launch --num_cpu_threads_per_process=2 "./train_network.py"
--pretrained_model_name_or_path="/home/cbender/kohya_ss/model/v6/icbinpICantBelieveIts_final.sa
fetensors" --train_data_dir="/home/cbender/kohya_ss/model/v6/name"
--reg_data_dir="/home/cbender/kohya_ss/model/v6/woman" --resolution="768,768"
--output_dir="/home/cbender/kohya_ss/model/v6/output" --network_alpha="128"
--save_model_as=safetensors --network_module=networks.lora --network_dim=128
--gradient_accumulation_steps=3 --output_name="loranamev6" --lr_scheduler_num_cycles="20"
--no_half_vae --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="1"
--max_train_steps="1800" --save_every_n_epochs="1" --mixed_precision="bf16"
--save_precision="bf16" --caption_extension="txt" --cache_latents --optimizer_type="AdamW"
--max_grad_norm="1" --max_data_loader_n_workers="1" --clip_skip=2 --bucket_reso_steps=64
--mem_eff_attn --sdpa --bucket_no_upscale --noise_offset=0.0
/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
2023-12-20 22:08:50.817566: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-20 22:08:50.831474: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-20 22:08:50.831510: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-20 22:08:50.831523: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-20 22:08:50.834612: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-20 22:08:50.834767: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-20 22:08:51.367252: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-12-20 22:08:51.587139: I itex/core/wrapper/itex_cpu_wrapper.cc:70] Intel Extension for Tensorflow* AVX2 CPU backend is loaded.
2023-12-20 22:08:51.792074: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
2023-12-20 22:08:51.825652: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
2023-12-20 22:08:51.825698: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
prepare tokenizer
Using DreamBooth method.
prepare images.
found directory /home/cbender/kohya_ss/model/v6/name/15_name contains 9 image files
found directory /home/cbender/kohya_ss/model/v6/woman/1_woman contains 50 image files
No caption file found for 50 images. Training will continue without captions for these images. If class token exists, it will be used. / 50枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を 続行します。class tokenが存在する場合はそれを使います。
/home/cbender/kohya_ss/model/v6/woman/1_woman/woman_0001.jpg
/home/cbender/kohya_ss/model/v6/woman/1_woman/woman_0002.jpg
/home/cbender/kohya_ss/model/v6/woman/1_woman/woman_0003.jpg
/home/cbender/kohya_ss/model/v6/woman/1_woman/woman_0004.jpg
/home/cbender/kohya_ss/model/v6/woman/1_woman/woman_0005.jpg
/home/cbender/kohya_ss/model/v6/woman/1_woman/woman_0006.jpg... and 45 more
135 train images with repeating.
50 reg images.
[Dataset 0]
batch_size: 1
resolution: (768, 768)
enable_bucket: False

[Subset 0 of Dataset 0]
image_dir: "/home/cbender/kohya_ss/model/v6/name/15_name"
image_count: 9
num_repeats: 15
shuffle_caption: False
keep_tokens: 0
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: name
caption_extension: .txt

[Subset 1 of Dataset 0]
image_dir: "/home/cbender/kohya_ss/model/v6/woman/1_woman"
image_count: 50
num_repeats: 1
shuffle_caption: False
keep_tokens: 0
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: True
class_tokens: woman
caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|████████████████████████████████████████████████████████████████████████████████| 59/59 [00:00<00:00, 15390.51it/s]
prepare dataset
preparing accelerator
loading model for process 0/1
load StableDiffusion checkpoint: /home/cbender/kohya_ss/model/v6/icbinpICantBelieveIts_final.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net:
loading vae:
loading text encoder:
Enable memory efficient attention for U-Net
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|██████████████████████████████████████████████████████████████████████████████| 59/59 [00:00<00:00, 1195477.95it/s]
caching latents...
0%| | 0/59 [00:00<?, ?it/s]double free or corruption (out)

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: hsw [AMD Ryzen 7 5700X 8-Core Processor]
Registry and code: 13 MB
Command: /home/cbender/kohya_ss/venv/bin/python ./train_network.py --pretrained_model_name_or_path=/home/cbender/kohya_ss/model/v6/icbinpICantBelieveIts_final.safetensors --train_data_dir=/home/cbender/kohya_ss/model/v6/name --reg_data_dir=/home/cbender/kohya_ss/model/v6/woman --resolution=768,768 --output_dir=/home/cbender/kohya_ss/model/v6/output --network_alpha=128 --save_model_as=safetensors --network_module=networks.lora --network_dim=128 --gradient_accumulation_steps=3 --output_name=loranamev6 --lr_scheduler_num_cycles=20 --no_half_vae --learning_rate=0.0001 --lr_scheduler=constant --train_batch_size=1 --max_train_steps=1800 --save_every_n_epochs=1 --mixed_precision=bf16 --save_precision=bf16 --caption_extension=txt --cache_latents --optimizer_type=AdamW --max_grad_norm=1 --max_data_loader_n_workers=1 --clip_skip=2 --bucket_reso_steps=64 --mem_eff_attn --sdpa --bucket_no_upscale --noise_offset=0.0
Uptime: 7.495695 s
Traceback (most recent call last):
File "/home/cbender/kohya_ss/venv/bin/accelerate", line 8, in
sys.exit(main())
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/cbender/kohya_ss/venv/bin/python', './train_network.py', '--pretrained_model_name_or_path=/home/cbender/kohya_ss/model/v6/icbinpICantBelieveIts_final.safetensors', '--train_data_dir=/home/cbender/kohya_ss/model/v6/name', '--reg_data_dir=/home/cbender/kohya_ss/model/v6/woman', '--resolution=768,768', '--output_dir=/home/cbender/kohya_ss/model/v6/output', '--network_alpha=128', '--save_model_as=safetensors', '--network_module=networks.lora', '--network_dim=128', '--gradient_accumulation_steps=3', '--output_name=loranamev6', '--lr_scheduler_num_cycles=20', '--no_half_vae', '--learning_rate=0.0001', '--lr_scheduler=constant', '--train_batch_size=1', '--max_train_steps=1800', '--save_every_n_epochs=1', '--mixed_precision=bf16', '--save_precision=bf16', '--caption_extension=txt', '--cache_latents', '--optimizer_type=AdamW', '--max_grad_norm=1', '--max_data_loader_n_workers=1', '--clip_skip=2', '--bucket_reso_steps=64', '--mem_eff_attn', '--sdpa', '--bucket_no_upscale', '--noise_offset=0.0']' died with <Signals.SIGABRT: 6>.

@Disty0 (Contributor, Author) commented Dec 20, 2023

Run gui.sh like this:

DISABLE_IPEXRUN=1 ./gui.sh --use-ipex
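
If you want to double-check that the venv really ended up on PyTorch 2.1 with a visible XPU device, something like this should work from the kohya_ss folder (assuming the IPEX XPU build that setup.sh installs):

./venv/bin/python -c "import torch, intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__, torch.xpu.is_available())"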

@Cbender86 commented Dec 20, 2023

Thank you, that did it. I can confirm a significant speedup over the previous version, from 8.8 s/step to 6.3 s/step.
However, there is still a problem with the RAM usage on WSL2: at about 300 steps, 28 GB of RAM are used up completely (see the attached screenshot). Does this also happen on a native Linux system, or is it WSL2-related?
[Screenshot: 2023-12-21 000208]

@Disty0 (Contributor, Author) commented Dec 21, 2023

The memory leak is a known issue and happens on Linux too; I couldn't figure out where it comes from. IPEX itself has this issue as well.

ipexrun reduces the leaks, but it can cause errors on some systems, like the ones you had.

@Disty0 (Contributor, Author) commented Jan 9, 2024

This should fix the memory leaks: #1858

You can replicate it by exporting STARTUP_CMD_ARGS="--multi-task-manager taskset --memory-allocator tcmalloc" if you don't want to wait for the PR to be merged.
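
In other words, before launching the GUI, roughly (this assumes tcmalloc and taskset are actually installed on the system):

export STARTUP_CMD_ARGS="--multi-task-manager taskset --memory-allocator tcmalloc"
./gui.sh --use-ipex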
