v21.8.6 #1332

Merged
merged 70 commits on Aug 5, 2023

Commits (70)
8a073ee
vram leak fix
pamparamm Jul 17, 2023
d131bde
Support for bitsandbytes 0.39.1 with Paged Optimizer(AdamW8bit and Li…
sdbds Jul 22, 2023
bb167f9
init unet with empty weights
Isotr0py Jul 23, 2023
eec6aad
fix safetensors error: device invalid
Isotr0py Jul 23, 2023
4849ea8
Merge branch 'dev' of https://github.com/kohya-ss/sd-scripts into sdx…
bmaltais Jul 23, 2023
fcf087c
Fix finetuner gui code issues
bmaltais Jul 24, 2023
e83ee21
format by black
kohya-ss Jul 24, 2023
2b969e9
support sdxl
kohya-ss Jul 24, 2023
12f7ca8
Merge branch 'sdxl' of https://github.com/kohya-ss/sd-scripts into sd…
bmaltais Jul 24, 2023
b78c0e2
remove unused func
kohya-ss Jul 25, 2023
101d263
Merge branch 'sdxl' of https://github.com/kohya-ss/sd-scripts into sd…
bmaltais Jul 25, 2023
a9ec90c
Update extract LoRA and add sdpa
bmaltais Jul 26, 2023
334a551
increased max norms to 10
M4X1K02 Jul 27, 2023
32aaa40
appended relevant infor to info text
M4X1K02 Jul 27, 2023
50544b7
fix pipeline dtype
Isotr0py Jul 27, 2023
96a52d9
add dtype to u-net loading
Isotr0py Jul 27, 2023
a6a66dd
Merge pull request #1272 from M4X1K02/max_norm_adjust
bmaltais Jul 28, 2023
7de217a
Merge branch 'master' into dev2
bmaltais Jul 28, 2023
f4941ca
Merge branch 'dev2' into sdxl-dev
bmaltais Jul 28, 2023
272dd99
Merge branch 'sdxl' into sdxl
Isotr0py Jul 28, 2023
4a1b92d
Update README.md
Noyii Jul 28, 2023
315fbc1
refactor model loading to catch error
Isotr0py Jul 28, 2023
fdb58b0
fix mismatch dtype
Isotr0py Jul 28, 2023
1199eac
fix typo
Isotr0py Jul 28, 2023
38b59a9
Merge branch 'main' into dev
kohya-ss Jul 29, 2023
3a7326a
Merge pull request #693 from kohya-ss/dev
kohya-ss Jul 29, 2023
1e4512b
support ckpt without position id in sd v1 #687
kohya-ss Jul 29, 2023
fb1054b
Merge pull request #694 from kohya-ss/dev
kohya-ss Jul 29, 2023
cf80210
Merge pull request #688 from Noyii/main
kohya-ss Jul 29, 2023
4072f72
Merge branch 'main' into sdxl
kohya-ss Jul 29, 2023
d9180c0
fix typos for _load_state_dict
Isotr0py Jul 29, 2023
814191b
Merge branch 'sdxl' of https://github.com/kohya-ss/sd-scripts into dev2
bmaltais Jul 29, 2023
6f6bf88
Merge branch 'sdxl-dev' into dev2
bmaltais Jul 29, 2023
c55606e
Updated to bitsandbytes version for linux and windows
bmaltais Jul 29, 2023
72f9fdc
Allow DB training on SDXL models.
bmaltais Jul 29, 2023
e20b6ac
Merge pull request #676 from Isotr0py/sdxl
kohya-ss Jul 30, 2023
9ec7025
Add Paged/ adam8bit/lion8bit for Sdxl bitsandbytes 0.39.1 cuda118 on …
sdbds Jul 30, 2023
54a4aa2
Merge pull request #658 from pamparamm/cache_latents_leak_fix
kohya-ss Jul 30, 2023
e6034b7
move releasing cache outside of the loop
kohya-ss Jul 30, 2023
b62185b
change method name, add comments
kohya-ss Jul 30, 2023
a296654
refactor optimizer selection for bnb
kohya-ss Jul 30, 2023
2a4ae88
format by black
kohya-ss Jul 30, 2023
0eacadf
fix ControlNet not working
kohya-ss Jul 30, 2023
8856c19
fix batch generation not working
kohya-ss Jul 30, 2023
496c3f2
arbitrary args for diffusers lr scheduler
kohya-ss Jul 30, 2023
f61996b
remove dependency for albumenations
kohya-ss Jul 30, 2023
7e474d2
fix recorded seed in highres fix
kohya-ss Jul 30, 2023
0636399
add adding v-pred like loss for noise pred
kohya-ss Jul 30, 2023
89aae3e
fix vae crashes in large reso
kohya-ss Jul 31, 2023
ebcb6ef
Update requirements and others
bmaltais Aug 1, 2023
db80c5a
format by black
kohya-ss Aug 3, 2023
6b1cf6c
fix ControlNet with regional LoRA, add shuffle cap
kohya-ss Aug 3, 2023
cf68328
fix ControlNet with regional LoRA
kohya-ss Aug 3, 2023
c6d52fd
Add workaround for clip's bug for pooled output
kohya-ss Aug 3, 2023
9d7619d
remove debug print
kohya-ss Aug 3, 2023
f3be995
remove debug print
kohya-ss Aug 3, 2023
9d85509
make bitsandbytes optional
kohya-ss Aug 4, 2023
f4935dd
Merge pull request #714 from kohya-ss/dev
kohya-ss Aug 4, 2023
78bae16
Updates
bmaltais Aug 4, 2023
90d2160
Merge branch 'sdxl' of https://github.com/kohya-ss/sd-scripts into dev2
bmaltais Aug 4, 2023
8b46fe3
Update release
bmaltais Aug 4, 2023
25d8cd4
fix sdxl_gen_img not working
kohya-ss Aug 5, 2023
2dfa26c
Merge pull request #716 from kohya-ss/dev
kohya-ss Aug 5, 2023
7c55ca3
Merge branch 'sdxl' of https://github.com/kohya-ss/sd-scripts into dev2
bmaltais Aug 5, 2023
e5f9772
fix training textencoder in sdxl not working
kohya-ss Aug 5, 2023
cd54af0
Merge pull request #720 from kohya-ss/dev
kohya-ss Aug 5, 2023
08eb3ad
Merge branch 'sdxl' of https://github.com/kohya-ss/sd-scripts into dev2
bmaltais Aug 5, 2023
3e64386
Updates
bmaltais Aug 5, 2023
5eef71e
Update presets
bmaltais Aug 5, 2023
5042975
Update readme
bmaltais Aug 5, 2023

Files changed
2 changes: 1 addition & 1 deletion .release
@@ -1 +1 @@
v21.8.5
v21.8.6
10 changes: 10 additions & 0 deletions README-ja.md
@@ -125,6 +125,16 @@ pip install https://github.com/jllllll/bitsandbytes-windows-webui/raw/main/bitsa

When upgrading, update this repository with `pip install .` and upgrade the other packages as needed.

### Optional: Using PagedAdamW8bit and PagedLion8bit

To use PagedAdamW8bit and PagedLion8bit, `bitsandbytes` must be upgraded to 0.39.0 or later. Uninstall `bitsandbytes` and, on Windows, install a Windows wheel, for example from [here](https://github.com/jllllll/bitsandbytes-windows-webui). The steps look like this:

```powershell
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.39.1-py3-none-win_amd64.whl
```

When upgrading, update this repository with `pip install .` and upgrade the other packages as needed.
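Once the newer `bitsandbytes` is installed, the paged optimizers are selected through the existing `--optimizer_type` argument of the training scripts. A minimal sketch follows; all paths and hyperparameters are placeholders, not recommendations:

```powershell
# Illustrative only: select the paged 8-bit optimizer via --optimizer_type
accelerate launch --num_cpu_threads_per_process=2 train_network.py `
  --pretrained_model_name_or_path="D:\models\model.safetensors" `
  --train_data_dir="D:\dataset" --output_dir="D:\output" `
  --network_module=networks.lora --optimizer_type="PagedAdamW8bit" `
  --learning_rate=1e-4
```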

## Upgrading

When there is a new release, you can update with the following commands.
42 changes: 27 additions & 15 deletions README.md
@@ -72,6 +72,17 @@ First SDXL Tutorial: [First Ever SDXL Training With Kohya LoRA - Stable Diffusio

The feature of SDXL training is now available in sdxl branch as an experimental feature.

Aug 4, 2023: The feature will be merged into the main branch soon. Following are the changes from the previous version.

- `bitsandbytes` is now optional. Please install it if you want to use it. The installation instructions are in a later section.
- `albumentations` is no longer required.
- An issue with the pooled output for Textual Inversion training is fixed.
- `--v_pred_like_loss ratio` option is added. This option adds a v-prediction-like loss during SDXL training. `0.1` means that 10% of the v-prediction-like loss is added to the regular loss. The default value is None (disabled); an illustrative command using the new options is shown after this list.
- In v-prediction, the loss is higher in the early timesteps (near the noise). This option can be used to increase the loss in the early timesteps.
- Arbitrary options can be used for Diffusers' schedulers. For example `--lr_scheduler_args "lr_end=1e-8"`.
- `sdxl_gen_img.py` supports batch size > 1.
- Fix ControlNet to work with attention couple and regional LoRA in `gen_img_diffusers.py`.
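As a rough illustration of how the new options above fit together (paths and values are placeholders, not recommendations), an SDXL training invocation might look like this:

```powershell
# Illustrative only: add 10% of a v-prediction-like loss and pass an extra
# argument (lr_end) through to the Diffusers polynomial scheduler
accelerate launch --num_cpu_threads_per_process=2 sdxl_train.py `
  --pretrained_model_name_or_path="D:\models\sd_xl_base_1.0.safetensors" `
  --train_data_dir="D:\dataset" --output_dir="D:\output" `
  --v_pred_like_loss=0.1 `
  --lr_scheduler="polynomial" --lr_scheduler_args "lr_end=1e-8"
```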

Summary of the feature:

- `tools/cache_latents.py` is added. This script can be used to cache the latents to disk in advance.
@@ -115,12 +126,17 @@ Summary of the feature:
#### Tips for SDXL training

- The default resolution of SDXL is 1024x1024.
- The fine-tuning can be done with 24GB GPU memory with the batch size of 1. For 24GB GPU, the following options are recommended:
- The fine-tuning can be done with 24GB GPU memory with a batch size of 1. The following options are recommended __for fine-tuning with 24GB GPU memory__:
- Train U-Net only.
- Use gradient checkpointing.
- Use `--cache_text_encoder_outputs` option and caching latents.
- Use Adafactor optimizer. RMSprop 8bit or Adagrad 8bit may work. AdamW 8bit doesn't seem to work.
- The LoRA training can be done with 12GB GPU memory.
- The LoRA training can be done with 8GB GPU memory (10GB recommended). To reduce GPU memory usage, the following options are recommended:
- Train U-Net only.
- Use gradient checkpointing.
- Use `--cache_text_encoder_outputs` option and caching latents.
- Use one of 8bit optimizers or Adafactor optimizer.
- Use lower dim (-8 for 8GB GPU).
- `--network_train_unet_only` option is highly recommended for SDXL LoRA. Because SDXL has two text encoders, the result of training them can be unexpected. A sketch of the low-memory setups above appears after this list.
- PyTorch 2 seems to use slightly less GPU memory than PyTorch 1.
- `--bucket_reso_steps` can be set to 32 instead of the default value 64. Smaller values than 32 will not work for SDXL training.
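To make the memory-saving tips above concrete, here is a hedged sketch of both setups; every path, dim, and precision value below is a placeholder to adapt to your own data:

```powershell
# ~24GB fine-tuning: U-Net only, gradient checkpointing, cached latents and
# text encoder outputs, Adafactor optimizer
accelerate launch --num_cpu_threads_per_process=2 sdxl_train.py `
  --pretrained_model_name_or_path="D:\models\sd_xl_base_1.0.safetensors" `
  --train_data_dir="D:\dataset" --output_dir="D:\output" --resolution="1024,1024" `
  --train_batch_size=1 --gradient_checkpointing --cache_latents --cache_text_encoder_outputs `
  --optimizer_type="Adafactor" --mixed_precision="bf16"

# ~8-10GB LoRA: U-Net only, low network dim, 8-bit optimizer
accelerate launch --num_cpu_threads_per_process=2 sdxl_train_network.py `
  --pretrained_model_name_or_path="D:\models\sd_xl_base_1.0.safetensors" `
  --train_data_dir="D:\dataset" --output_dir="D:\output" --resolution="1024,1024" `
  --network_module=networks.lora --network_dim=8 --network_train_unet_only `
  --gradient_checkpointing --cache_latents --cache_text_encoder_outputs `
  --optimizer_type="AdamW8bit" --mixed_precision="fp16"
```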
@@ -480,16 +496,12 @@ If you come across a `FileNotFoundError`, it is likely due to an installation is

## Change History

* 2023/07/27 (v21.8.4)
- Relocate LR number of cycles and LR power options
- Add missing LR number of cycles and LR power to Dreambooth and TI scripts
- Fix issue with conv_block_dims and conv_block_alphas
- Fix 0 noise offset issue
- Implement Stop training button on LoRA and other training tabs
- Update LyCORIS network release to fix an issue with the LoCon extraction.

* 2023/07/18 (v21.8.3)
- Update to latest sd-scripts sdxl code base
- Fix typo: https://github.com/bmaltais/kohya_ss/issues/1205
- Add min and max resolution parameter for buckets
- Add colab notebook from https://github.com/camenduru/kohya_ss-colab
* 2023/08/05 (v21.8.6)
- Merge latest sd-scripts updates.
- Allow DB training on SDXL models. Unsupported, but it appears to work.
- Fix fine-tuning latent caching issue when processing SDXL models in fp16
- Add SDXL merge lora support. You can now merge LoRAs into an SDXL checkpoint (see the sketch after this list).
- Add SDPA CrossAttention option to trainers.
- Merge latest kohya_ss sd-scripts code
- Fix Dreambooth support for SDXL training
- Update to latest bitsandbytes release. New optional install option for bitsandbytes versions.
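For the new SDXL LoRA merge support, the corresponding sd-scripts utility can be run along these lines; the script location and file names are assumptions for illustration:

```powershell
# Illustrative: merge one LoRA into an SDXL checkpoint at full strength (ratio 1.0)
./venv/Scripts/python.exe networks/sdxl_merge_lora.py `
  --sd_model "D:\models\sd_xl_base_1.0.safetensors" `
  --models "D:\loras\my_lora.safetensors" --ratios 1.0 `
  --save_to "D:\models\sdxl_merged.safetensors" --save_precision fp16
```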
Binary file added bitsandbytes_windows/libbitsandbytes_cuda118.dll
Binary file not shown.
2 changes: 1 addition & 1 deletion bitsandbytes_windows/main.py
@@ -163,4 +163,4 @@ def get_binary_name():

binary_name = get_binary_name()

return binary_name
return binary_name
2 changes: 2 additions & 0 deletions docs/train_README-ja.md
@@ -609,10 +609,12 @@ masterpiece, best quality, 1boy, in business suit, standing at street, looking b
- AdamW : [torch.optim.AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html)
- Same as when no option was specified in past versions
- AdamW8bit : arguments are the same as above
- PagedAdamW8bit : arguments are the same as above
- Same as specifying --use_8bit_adam in past versions
- Lion : https://github.com/lucidrains/lion-pytorch
- Same as specifying --use_lion_optimizer in past versions
- Lion8bit : arguments are the same as above
- PagedLion8bit : arguments are the same as above
- SGDNesterov : [torch.optim.SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html), nesterov=True
- SGDNesterov8bit : arguments are the same as above
- DAdaptation(DAdaptAdamPreprint) : https://github.com/facebookresearch/dadaptation
5 changes: 4 additions & 1 deletion docs/train_README-zh.md
@@ -546,9 +546,12 @@ masterpiece, best quality, 1boy, in business suit, standing at street, looking b
-- Specify the optimizer type. You can specify:
- AdamW : [torch.optim.AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html)
- Same as when no option was specified in past versions
- AdamW8bit : same as above
- AdamW8bit : arguments are the same as above
- PagedAdamW8bit : arguments are the same as above
- Same as specifying --use_8bit_adam in past versions
- Lion : https://github.com/lucidrains/lion-pytorch
- Lion8bit : arguments are the same as above
- PagedLion8bit : arguments are the same as above
- Same as specifying --use_lion_optimizer in past versions
- SGDNesterov : [torch.optim.SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html), nesterov=True
- SGDNesterov8bit : arguments are the same as above
29 changes: 22 additions & 7 deletions dreambooth_gui.py
@@ -79,6 +79,7 @@ def save_configuration(
enable_bucket,
gradient_checkpointing,
full_fp16,
full_bf16,
no_token_padding,
stop_text_encoder_training,
min_bucket_reso,
@@ -192,6 +193,7 @@ def open_configuration(
enable_bucket,
gradient_checkpointing,
full_fp16,
full_bf16,
no_token_padding,
stop_text_encoder_training,
min_bucket_reso,
@@ -304,6 +306,7 @@ def train_model(
enable_bucket,
gradient_checkpointing,
full_fp16,
full_bf16,
no_token_padding,
stop_text_encoder_training_pct,
min_bucket_reso,
@@ -410,12 +413,12 @@
):
return

if sdxl:
output_message(
msg='TI training is not compatible with an SDXL model.',
headless=headless_bool,
)
return
# if sdxl:
# output_message(
# msg='Dreambooth training is not compatible with SDXL models yet..',
# headless=headless_bool,
# )
# return

# if optimizer == 'Adafactor' and lr_warmup != '0':
# output_message(
@@ -520,7 +523,13 @@ def train_model(
lr_warmup_steps = round(float(int(lr_warmup) * int(max_train_steps) / 100))
log.info(f'lr_warmup_steps = {lr_warmup_steps}')

run_cmd = f'accelerate launch --num_cpu_threads_per_process={num_cpu_threads_per_process} "train_db.py"'
# run_cmd = f'accelerate launch --num_cpu_threads_per_process={num_cpu_threads_per_process} "train_db.py"'
run_cmd = f'accelerate launch --num_cpu_threads_per_process={num_cpu_threads_per_process}'
if sdxl:
run_cmd += f' "./sdxl_train.py"'
else:
run_cmd += f' "./train_db.py"'

if v2:
run_cmd += ' --v2'
if v_parameterization:
@@ -551,6 +560,8 @@ def train_model(
# run_cmd += f' --resume={resume}'
if not float(prior_loss_weight) == 1.0:
run_cmd += f' --prior_loss_weight={prior_loss_weight}'
if full_bf16:
run_cmd += ' --full_bf16'
if not vae == '':
run_cmd += f' --vae="{vae}"'
if not output_name == '':
@@ -696,6 +707,9 @@ def dreambooth_tab(
lr_scheduler_value='cosine',
lr_warmup_value='10',
)
full_bf16 = gr.Checkbox(
label='Full bf16', value = False
)
with gr.Accordion('Advanced Configuration', open=False):
advanced_training = AdvancedTraining(headless=headless)
advanced_training.color_aug.change(
Expand Down Expand Up @@ -765,6 +779,7 @@ def dreambooth_tab(
basic_training.enable_bucket,
advanced_training.gradient_checkpointing,
advanced_training.full_fp16,
full_bf16,
advanced_training.no_token_padding,
basic_training.stop_text_encoder_training,
basic_training.min_bucket_reso,
11 changes: 11 additions & 0 deletions examples/finetune_latent.ps1
@@ -0,0 +1,11 @@
# Command 1: merge_captions_to_metadata.py
$captionExtension = "--caption_extension=.txt"
$sourceDir1 = "d:\test\1_1960-1969"
$targetFile1 = "d:\test\1_1960-1969/meta_cap.json"

# Command 2: prepare_buckets_latents.py
$targetLatentFile = "d:\test\1_1960-1969/meta_lat.json"
$modelFile = "E:\models\sdxl\sd_xl_base_0.9.safetensors"

./venv/Scripts/python.exe finetune/merge_captions_to_metadata.py $captionExtension $sourceDir1 $targetFile1 --full_path
./venv/Scripts/python.exe finetune/prepare_buckets_latents.py $sourceDir1 $targetFile1 $targetLatentFile $modelFile --batch_size=4 --max_resolution=1024,1024 --min_bucket_reso=64 --max_bucket_reso=2048 --mixed_precision=bf16 --full_path
17 changes: 14 additions & 3 deletions finetune_gui.py
@@ -423,8 +423,13 @@ def train_model(
run_cmd += f' --full_path'

log.info(run_cmd)

executor.execute_command(run_cmd=run_cmd)

if not print_only_bool:
# Run the command
if os.name == 'posix':
os.system(run_cmd)
else:
subprocess.run(run_cmd)

# create images buckets
if generate_image_buckets:
@@ -442,12 +447,18 @@
# run_cmd += f' --flip_aug'
if full_path:
run_cmd += f' --full_path'
if sdxl_no_half_vae:
log.info('Using mixed_precision = no because no half vae is selected...')
run_cmd += f' --mixed_precision="no"'

log.info(run_cmd)

if not print_only_bool:
# Run the command
executor.execute_command(run_cmd=run_cmd)
if os.name == 'posix':
os.system(run_cmd)
else:
subprocess.run(run_cmd)

image_num = len(
[
28 changes: 23 additions & 5 deletions gen_img_diffusers.py
@@ -937,18 +937,24 @@ def __call__(
if self.control_nets:
guided_hints = original_control_net.get_guided_hints(self.control_nets, num_latent_input, batch_size, clip_guide_images)

if reginonal_network:
num_sub_and_neg_prompts = len(text_embeddings) // batch_size
# last subprompt and negative prompt
text_emb_last = []
for j in range(batch_size):
text_emb_last.append(text_embeddings[(j + 1) * num_sub_and_neg_prompts - 2])
text_emb_last.append(text_embeddings[(j + 1) * num_sub_and_neg_prompts - 1])
text_emb_last = torch.stack(text_emb_last)
else:
text_emb_last = text_embeddings

for i, t in enumerate(tqdm(timesteps)):
# expand the latents if we are doing classifier free guidance
latent_model_input = latents.repeat((num_latent_input, 1, 1, 1))
latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

# predict the noise residual
if self.control_nets and self.control_net_enabled:
if reginonal_network:
num_sub_and_neg_prompts = len(text_embeddings) // batch_size
text_emb_last = text_embeddings[num_sub_and_neg_prompts - 2 :: num_sub_and_neg_prompts] # last subprompt
else:
text_emb_last = text_embeddings
noise_pred = original_control_net.call_unet_and_control_net(
i,
num_latent_input,
@@ -958,6 +964,7 @@
i / len(timesteps),
latent_model_input,
t,
text_embeddings,
text_emb_last,
).sample
else:
@@ -2746,6 +2753,10 @@ def resize_images(imgs, size):
print(f"iteration {gen_iter+1}/{args.n_iter}")
iter_seed = random.randint(0, 0x7FFFFFFF)

# shuffle prompt list
if args.shuffle_prompts:
random.shuffle(prompt_list)

# batch processing function
def process_batch(batch: List[BatchData], highres_fix, highres_1st=False):
batch_size = len(batch)
@@ -2963,6 +2974,8 @@ def process_batch(batch: List[BatchData], highres_fix, highres_1st=False):
for i, (image, prompt, negative_prompts, seed, clip_prompt) in enumerate(
zip(images, prompts, negative_prompts, seeds, clip_prompts)
):
if highres_fix:
seed -= 1 # record original seed
metadata = PngInfo()
metadata.add_text("prompt", prompt)
metadata.add_text("seed", str(seed))
@@ -3319,6 +3332,11 @@ def setup_parser() -> argparse.ArgumentParser:
action="store_true",
help="use same seed for all prompts in iteration if no seed specified / 乱数seedの指定がないとき繰り返し内はすべて同じseedを使う(プロンプト間の差異の比較用)",
)
parser.add_argument(
"--shuffle_prompts",
action="store_true",
help="shuffle prompts in iteration / 繰り返し内のプロンプトをシャッフルする",
)
parser.add_argument("--fp16", action="store_true", help="use fp16 / fp16を指定し省メモリ化する")
parser.add_argument("--bf16", action="store_true", help="use bfloat16 / bfloat16を指定し省メモリ化する")
parser.add_argument("--xformers", action="store_true", help="use xformers / xformersを使用し高速化する")
12 changes: 9 additions & 3 deletions library/class_advanced_training.py
@@ -19,8 +19,12 @@ def noise_offset_type_change(noise_offset_type):
self.no_token_padding = gr.Checkbox(
label='No token padding', value=False
)
self.gradient_accumulation_steps = gr.Number(
label='Gradient accumulate steps', value='1'
self.gradient_accumulation_steps = gr.Slider(
label='Gradient accumulate steps',
info='Number of updates steps to accumulate before performing a backward/update pass',
value='1',
minimum=1, maximum=120,
step=1
)
self.weighted_captions = gr.Checkbox(
label='Weighted captions', value=False
@@ -101,13 +105,15 @@ def noise_offset_type_change(noise_offset_type):
# use_8bit_adam = gr.Checkbox(
# label='Use 8bit adam', value=False, visible=False
# )
self.xformers = gr.Checkbox(label='Use xformers', value=True)
# self.xformers = gr.Checkbox(label='Use xformers', value=True, info='Use xformers for CrossAttention')
self.xformers = gr.Dropdown(label='CrossAttention', choices=["none", "sdpa", "xformers"], value='xformers')
self.color_aug = gr.Checkbox(label='Color augmentation', value=False)
self.flip_aug = gr.Checkbox(label='Flip augmentation', value=False)
self.min_snr_gamma = gr.Slider(
label='Min SNR gamma', value=0, minimum=0, maximum=20, step=1
)
with gr.Row():
# self.sdpa = gr.Checkbox(label='Use sdpa', value=False, info='Use sdpa for CrossAttention')
self.bucket_no_upscale = gr.Checkbox(
label="Don't upscale bucket resolution", value=True
)
2 changes: 2 additions & 0 deletions library/class_basic_training.py
@@ -107,6 +107,8 @@ def __init__(
'DAdaptSGD',
'Lion',
'Lion8bit',
"PagedAdamW8bit",
"PagedLion8bit",
'Prodigy',
'SGDNesterov',
'SGDNesterov8bit',
3 changes: 3 additions & 0 deletions library/class_dreambooth_gui.py
@@ -36,6 +36,9 @@ def __init__(
lr_scheduler_value='cosine',
lr_warmup_value='10',
)
self.full_bf16 = gr.Checkbox(
label='Full bf16', value = False
)
with gr.Accordion('Advanced Configuration', open=False):
self.advanced_training = AdvancedTraining(headless=headless)
self.advanced_training.color_aug.change(