IPEX update to PyTorch 2.1 and Bundle-in MKL & DPCPP #1772

Merged · 1 commit into bmaltais:dev2 on Dec 16, 2023

Conversation

@Disty0 (Contributor) commented Dec 14, 2023

Needs the dev branch of sd-scripts. (This PR is needed: kohya-ss/sd-scripts#1003)

Updates IPEX and PyTorch to 2.1.
Training a LoRA is ~50% faster compared to PyTorch 2.0 with IPEX.

Bundles in MKL & DPCPP so we don't have to manually install the whole OneAPI BaseKit. (Saves 20 GB of disk space.)

If you get a file not found error after this update, run setup.sh --use-ipex or activate your system OneAPI manually with source /opt/intel/oneapi/setvars.sh.
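
That is, from the kohya_ss folder (the setvars.sh path assumes a default system-wide OneAPI BaseKit install; adjust it if yours lives elsewhere):

./setup.sh --use-ipex
# or fall back to the system OneAPI environment:
source /opt/intel/oneapi/setvars.sh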

Also a side note:
IPEX 2.1 should have proper Windows support, but I don't have a device with Windows installed and I don't have much experience with .bat scripting either.
It would be nice if someone else stepped in to add --use-ipex to gui.bat and setup.bat. I don't plan to install Windows in the near future.

@bmaltais merged commit 1508bf9 into bmaltais:dev2 on Dec 16, 2023 (1 check failed)
@Cbender86

This one looks really nice. I'm using an Intel Arc A770 XPU, and this should run a lot faster without the need for all the libs (and without pointing LD_LIBRARY_PATH at the correct version of the libs). But when I try to train a LoRA, I get a runtime error:

Traceback (most recent call last):
File "/home/cbender/kohya_ss/./train_network.py", line 1012, in
trainer.train(args)
File "/home/cbender/kohya_ss/./train_network.py", line 262, in train
vae.to(accelerator.device, dtype=vae_dtype)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: PyTorch is not linked with support for cuda devices

Which is correct...

2023-12-19 21:59:26.719764: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-19 21:59:26.734095: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-19 21:59:26.734140: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-19 21:59:26.734153: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-19 21:59:26.737536: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-19 21:59:26.737711: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-19 21:59:27.274643: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-12-19 21:59:27.491865: I itex/core/wrapper/itex_cpu_wrapper.cc:70] Intel Extension for Tensorflow* AVX2 CPU backend is loaded.
2023-12-19 21:59:27.703009: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
2023-12-19 21:59:27.737384: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
2023-12-19 21:59:27.737434: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.

What am I missing?

@Disty0 (Contributor, Author) commented Dec 19, 2023

As I stated in the PR, the sd-scripts dev branch is required, but it's not merged into kohya_ss yet.
This PR, to be specific: kohya-ss/sd-scripts#1003

You can replace the library/ipex folder with the one from the sd-scripts dev branch.
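
Roughly something like this (the checkout paths are just an example; adjust them to wherever your kohya_ss install and a dev checkout of sd-scripts live):

git clone -b dev https://github.com/kohya-ss/sd-scripts /tmp/sd-scripts
rm -rf ~/kohya_ss/library/ipex
cp -r /tmp/sd-scripts/library/ipex ~/kohya_ss/library/ipex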

@Cbender86

Thanks for your comment, I have copied over the library/ipex folder as you suggested.
With the dev2 branch combined with the library/ipex folder from the sd-scripts dev branch, I get some free()-related errors, namely 'double free or corruption (out)' when trying to cache latents or, when I disable caching latents, 'free(): invalid pointer'.

Stack Trace and Params:

22:08:46-494555 INFO Start training LoRA Standard ...
22:08:46-495409 INFO Checking for duplicate image filenames in training data directory...
22:08:46-496280 INFO Valid image folder names found in: /home/cbender/kohya_ss/model/v6/name
22:08:46-496962 INFO Valid image folder names found in: /home/cbender/kohya_ss/model/v6/woman
22:08:46-497635 INFO Folder 15_name: 9 images found
22:08:46-498154 INFO Folder 15_name: 135 steps
22:08:46-498622 WARNING Regularisation images are used... Will double the number of steps required...
22:08:46-499137 INFO Total steps: 135
22:08:46-499579 INFO Train batch size: 1
22:08:46-500056 INFO Gradient accumulation steps: 3
22:08:46-500506 INFO Epoch: 20
22:08:46-500945 INFO Regulatization factor: 2
22:08:46-501397 INFO max_train_steps (135 / 1 / 3 * 20 * 2) = 1800
22:08:46-501950 INFO stop_text_encoder_training = 0
22:08:46-502431 INFO lr_warmup_steps = 0
22:08:46-502950 INFO Saving training config to
/home/cbender/kohya_ss/model/v6/output/loranamev6_20231220-220846.json...
22:08:46-503642 INFO accelerate launch --num_cpu_threads_per_process=2 "./train_network.py"
--pretrained_model_name_or_path="/home/cbender/kohya_ss/model/v6/icbinpICantBelieveIts_final.sa
fetensors" --train_data_dir="/home/cbender/kohya_ss/model/v6/name"
--reg_data_dir="/home/cbender/kohya_ss/model/v6/woman" --resolution="768,768"
--output_dir="/home/cbender/kohya_ss/model/v6/output" --network_alpha="128"
--save_model_as=safetensors --network_module=networks.lora --network_dim=128
--gradient_accumulation_steps=3 --output_name="loranamev6" --lr_scheduler_num_cycles="20"
--no_half_vae --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="1"
--max_train_steps="1800" --save_every_n_epochs="1" --mixed_precision="bf16"
--save_precision="bf16" --caption_extension="txt" --cache_latents --optimizer_type="AdamW"
--max_grad_norm="1" --max_data_loader_n_workers="1" --clip_skip=2 --bucket_reso_steps=64
--mem_eff_attn --sdpa --bucket_no_upscale --noise_offset=0.0
/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
2023-12-20 22:08:50.817566: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-20 22:08:50.831474: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-20 22:08:50.831510: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-20 22:08:50.831523: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-20 22:08:50.834612: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-20 22:08:50.834767: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-20 22:08:51.367252: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-12-20 22:08:51.587139: I itex/core/wrapper/itex_cpu_wrapper.cc:70] Intel Extension for Tensorflow* AVX2 CPU backend is loaded.
2023-12-20 22:08:51.792074: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
2023-12-20 22:08:51.825652: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
2023-12-20 22:08:51.825698: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
prepare tokenizer
Using DreamBooth method.
prepare images.
found directory /home/cbender/kohya_ss/model/v6/name/15_name contains 9 image files
found directory /home/cbender/kohya_ss/model/v6/woman/1_woman contains 50 image files
No caption file found for 50 images. Training will continue without captions for these images. If class token exists, it will be used. / 50枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を 続行します。class tokenが存在する場合はそれを使います。
/home/cbender/kohya_ss/model/v6/woman/1_woman/woman_0001.jpg
/home/cbender/kohya_ss/model/v6/woman/1_woman/woman_0002.jpg
/home/cbender/kohya_ss/model/v6/woman/1_woman/woman_0003.jpg
/home/cbender/kohya_ss/model/v6/woman/1_woman/woman_0004.jpg
/home/cbender/kohya_ss/model/v6/woman/1_woman/woman_0005.jpg
/home/cbender/kohya_ss/model/v6/woman/1_woman/woman_0006.jpg... and 45 more
135 train images with repeating.
50 reg images.
[Dataset 0]
batch_size: 1
resolution: (768, 768)
enable_bucket: False

[Subset 0 of Dataset 0]
image_dir: "/home/cbender/kohya_ss/model/v6/name/15_name"
image_count: 9
num_repeats: 15
shuffle_caption: False
keep_tokens: 0
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: name
caption_extension: .txt

[Subset 1 of Dataset 0]
image_dir: "/home/cbender/kohya_ss/model/v6/woman/1_woman"
image_count: 50
num_repeats: 1
shuffle_caption: False
keep_tokens: 0
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: True
class_tokens: woman
caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|████████████████████████████████████████████████████████████████████████████████| 59/59 [00:00<00:00, 15390.51it/s]
prepare dataset
preparing accelerator
loading model for process 0/1
load StableDiffusion checkpoint: /home/cbender/kohya_ss/model/v6/icbinpICantBelieveIts_final.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net:
loading vae:
loading text encoder:
Enable memory efficient attention for U-Net
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|██████████████████████████████████████████████████████████████████████████████| 59/59 [00:00<00:00, 1195477.95it/s]
caching latents...
0%| | 0/59 [00:00<?, ?it/s]double free or corruption (out)

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: hsw [AMD Ryzen 7 5700X 8-Core Processor]
Registry and code: 13 MB
Command: /home/cbender/kohya_ss/venv/bin/python ./train_network.py --pretrained_model_name_or_path=/home/cbender/kohya_ss/model/v6/icbinpICantBelieveIts_final.safetensors --train_data_dir=/home/cbender/kohya_ss/model/v6/name --reg_data_dir=/home/cbender/kohya_ss/model/v6/woman --resolution=768,768 --output_dir=/home/cbender/kohya_ss/model/v6/output --network_alpha=128 --save_model_as=safetensors --network_module=networks.lora --network_dim=128 --gradient_accumulation_steps=3 --output_name=loranamev6 --lr_scheduler_num_cycles=20 --no_half_vae --learning_rate=0.0001 --lr_scheduler=constant --train_batch_size=1 --max_train_steps=1800 --save_every_n_epochs=1 --mixed_precision=bf16 --save_precision=bf16 --caption_extension=txt --cache_latents --optimizer_type=AdamW --max_grad_norm=1 --max_data_loader_n_workers=1 --clip_skip=2 --bucket_reso_steps=64 --mem_eff_attn --sdpa --bucket_no_upscale --noise_offset=0.0
Uptime: 7.495695 s
Traceback (most recent call last):
File "/home/cbender/kohya_ss/venv/bin/accelerate", line 8, in
sys.exit(main())
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/home/cbender/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/cbender/kohya_ss/venv/bin/python', './train_network.py', '--pretrained_model_name_or_path=/home/cbender/kohya_ss/model/v6/icbinpICantBelieveIts_final.safetensors', '--train_data_dir=/home/cbender/kohya_ss/model/v6/name', '--reg_data_dir=/home/cbender/kohya_ss/model/v6/woman', '--resolution=768,768', '--output_dir=/home/cbender/kohya_ss/model/v6/output', '--network_alpha=128', '--save_model_as=safetensors', '--network_module=networks.lora', '--network_dim=128', '--gradient_accumulation_steps=3', '--output_name=loranamev6', '--lr_scheduler_num_cycles=20', '--no_half_vae', '--learning_rate=0.0001', '--lr_scheduler=constant', '--train_batch_size=1', '--max_train_steps=1800', '--save_every_n_epochs=1', '--mixed_precision=bf16', '--save_precision=bf16', '--caption_extension=txt', '--cache_latents', '--optimizer_type=AdamW', '--max_grad_norm=1', '--max_data_loader_n_workers=1', '--clip_skip=2', '--bucket_reso_steps=64', '--mem_eff_attn', '--sdpa', '--bucket_no_upscale', '--noise_offset=0.0']' died with <Signals.SIGABRT: 6>.

@Disty0 (Contributor, Author) commented Dec 20, 2023

Run gui.sh like this:

DISABLE_IPEXRUN=1 ./gui.sh --use-ipex
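
If you want to double-check that the venv really ended up on PyTorch 2.1 with a visible XPU device, something like this should work from the kohya_ss folder (assuming the IPEX XPU build that setup.sh installs):

./venv/bin/python -c "import torch, intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__, torch.xpu.is_available())"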

@Cbender86 commented Dec 20, 2023

Thank you, that did it. I can confirm a significant speedup over the previous version, from 8.8 s/step to 6.3 s/step.
However, there is still a problem with the RAM usage on WSL2: at about 300 steps, 28 GB of RAM are used up completely (see the attached screenshot). Does this also happen on a native Linux system, or is it WSL2-related?
[Screenshot: 2023-12-21 000208]

@Disty0 (Contributor, Author) commented Dec 21, 2023

The memory leak is a known issue and happens on Linux too; I couldn't figure out where it comes from. IPEX itself has this issue as well.

ipexrun reduces the leaks, but it can cause errors on some systems, like the ones you had.

@Disty0 (Contributor, Author) commented Jan 9, 2024

This should fix the memory leaks: #1858

You can replicate it by exporting STARTUP_CMD_ARGS="--multi-task-manager taskset --memory-allocator tcmalloc" if you don't want to wait for the PR to be merged.
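
In other words, before launching the GUI, roughly (this assumes tcmalloc and taskset are actually installed on the system):

export STARTUP_CMD_ARGS="--multi-task-manager taskset --memory-allocator tcmalloc"
./gui.sh --use-ipex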
