
LoRA weight underflows #49

Closed
CCRcmcpe opened this issue Jan 20, 2023 · 13 comments · Fixed by #51
Labels
bug Something isn't working

Comments

@CCRcmcpe

I encountered the same problem as #41 ("LoRAs have no effects").

Background

I'm using SSDT to train LoRAs. The LoRA layer implementation comes from loralib, which applies a weight scaling of (alpha / rank).

So if I want to use an alpha = 1, rank-16 LoRA produced by SSDT in AddNet, the scale has to be set to 1/16.

Some users found this extra scaling inconvenient, so I added an "unscale weight" option that multiplies the weights by (alpha / rank) when converting an SSDT checkpoint to the AddNet format, as sketched below.
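A minimal sketch of what that conversion option does (hypothetical helper, not the actual SSDT converter): the factor alpha / rank is baked into the up-projection weights so the LoRA can be used with scale = 1.

import torch

def unscale_lora_state_dict(sd, alpha, rank):
  """Bake the loralib scaling (alpha / rank) into the saved LoRA weights."""
  factor = alpha / rank
  out = {}
  for key, tensor in sd.items():
    if key.endswith('lora_up.weight'):
      out[key] = tensor * factor    # multiplying by e.g. 1/16 here is where fp16 values can underflow
    else:
      out[key] = tensor
  return out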

Investigation

I inspected the state dict after unscaling.


All of the tensors have very small values, which hurts numerical stability. About 20% of them contain zero values; of those, 15% are in the text encoder and 85% in the UNet.
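The zero ratio can be checked directly on the saved file; a rough sketch (assuming a safetensors checkpoint and the lora_te_ / lora_unet_ key prefixes used by this format):

from safetensors.torch import load_file

sd = load_file('my_lora.safetensors')        # hypothetical file name
zero_ratio = {k: (v == 0).float().mean().item()
              for k, v in sd.items() if k.endswith('.weight')}
affected = [k for k, r in zero_ratio.items() if r > 0]
print(f'tensors containing zeros: {len(affected)} / {len(zero_ratio)}')
print(f'  text encoder: {sum(k.startswith("lora_te_") for k in affected)}')
print(f'  UNet:         {sum(k.startswith("lora_unet_") for k in affected)}')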

Experiment

With one LoRA (rank=16, alpha=1) that I trained:


  • All unscaled LoRAs have basically no effect.
  • Without unscaling, scale = 0.0625 (1 / 16) works as normal.

Conclusion and Solution

I suspect those zeros are the result of underflow, which is probably the cause of #41.

These underflows happen more often when the rank is high, since the unscaling factor (alpha / rank) becomes smaller.
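For reference, fp16 flushes anything smaller than the smallest subnormal (about 6e-8) to exactly zero, so already tiny weights can vanish after the (alpha / rank) multiplication when the checkpoint is saved in fp16. A quick illustration with made-up magnitudes:

import torch

w = torch.tensor(3e-4)                  # an illustrative, already small LoRA weight
print((w / 16).half())                  # ~1.9e-05, still representable in fp16
print((w / 128).half())                 # ~2.3e-06, deep in the subnormal range, heavy precision loss
print(torch.tensor(1e-8).half())        # tensor(0., dtype=torch.float16) -- underflows to zero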

At training time, add an "alpha" option that scales the LoRA like loralib does, and save alpha to the LoRA metadata.

At inference time, add a "scale weight" option that scales the LoRA weights by alpha / rank.

Backward Compatibility

Unfortunately, as you can imagine, almost all existing LoRAs have already underflowed.

If "scale weight" is enabled, for still using old LoRAs, if a LoRA have no alpha in metadata, do not scale.

Additional: NaNs

Since AUTOMATIC1111/stable-diffusion-webui@9991967, those underflowed LoRAs sometimes produce NaN errors when generating images.

Some users reported loss=NaN when using https://github.com/Linaqruf/kohya-trainer/ and https://github.com/Mikubill/naifu-diffusion/, especially at high rank. I suspect that's related to this issue.

@kohya-ss kohya-ss added the bug Something isn't working label Jan 21, 2023
@kohya-ss
Owner

Thank you for the detailed report! This issue is important.

I had noticed the issue that some LoRA modules are not trained at higher rank with 'fp16', but I did not realize it was caused by underflow.

The solution seems good.
I wonder if it is possible to store the alpha value as a non-trainable parameter of the LoRA module. If that is possible, it would be better than depending on the metadata, and .ckpt files could also carry the alpha.

I am not very familiar with PyTorch and would appreciate any suggestions you may have.

For backward compatibility, I think your suggestion is best.

@CCRcmcpe
Author

Putting alpha into the module as a non-trainable parameter (e.g. a registered buffer, sketched below) is indeed better.
Still, these fixes also need to be applied on the training side; I hope the issue will be resolved soon to prevent more underflowed LoRAs from being trained and published.
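For reference, register_buffer does exactly that: the tensor ends up in the state_dict (so it survives saving to safetensors or .ckpt) but is not returned by parameters(), so the optimizer never updates it. A minimal sketch:

import torch

class WithAlpha(torch.nn.Module):
  def __init__(self, alpha):
    super().__init__()
    self.lora_down = torch.nn.Linear(16, 4, bias=False)
    self.register_buffer('alpha', torch.tensor(alpha))   # saved, but non-trainable

m = WithAlpha(1.0)
print('alpha' in dict(m.named_parameters()))   # False -- the optimizer never sees it
print('alpha' in m.state_dict())               # True  -- it is saved with the weights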

@kohya-ss
Owner

Thank you for your comment.

I will update the training script and the extension as soon as possible.

@kohya-ss
Owner

I've implemented the scaling feature like this. If you have any comments, I would appreciate it.

import math
import torch


class LoRAModule(torch.nn.Module):
  """
  replaces forward method of the original Linear, instead of replacing the original Linear module.
  """

  def __init__(self, lora_name, org_module: torch.nn.Module, multiplier=1.0, lora_dim=4, alpha=1):
    """ if alpha == 0 or None, alpha is rank (no scaling). """
    super().__init__()
    self.lora_name = lora_name
    self.lora_dim = lora_dim

    if org_module.__class__.__name__ == 'Conv2d':
      in_dim = org_module.in_channels
      out_dim = org_module.out_channels
      self.lora_down = torch.nn.Conv2d(in_dim, lora_dim, (1, 1), bias=False)
      self.lora_up = torch.nn.Conv2d(lora_dim, out_dim, (1, 1), bias=False)
    else:
      in_dim = org_module.in_features
      out_dim = org_module.out_features
      self.lora_down = torch.nn.Linear(in_dim, lora_dim, bias=False)
      self.lora_up = torch.nn.Linear(lora_dim, out_dim, bias=False)

    alpha = lora_dim if alpha is None or alpha == 0 else alpha
    self.scale = alpha / self.lora_dim
    self.register_buffer('alpha', torch.tensor([alpha]))                    # can be treated as a constant

    # same as microsoft's
    torch.nn.init.kaiming_uniform_(self.lora_down.weight, a=math.sqrt(5))
    torch.nn.init.zeros_(self.lora_up.weight)

    self.multiplier = multiplier
    self.org_module = org_module                  # removed in apply_to()

  def apply_to(self):
    self.org_forward = self.org_module.forward
    self.org_module.forward = self.forward
    del self.org_module

  def forward(self, x):
    return self.org_forward(x) + self.lora_up(self.lora_down(x)) * self.multiplier * self.scale
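A quick usage sketch for the module above (illustrative only; the real script creates one LoRAModule per target Linear/Conv2d in the text encoder and UNet):

linear = torch.nn.Linear(768, 768)
lora = LoRAModule('lora_te_demo', linear, multiplier=1.0, lora_dim=16, alpha=1)
lora.apply_to()                     # linear.forward now routes through the LoRA path

x = torch.randn(1, 768)
print(linear(x).shape)              # torch.Size([1, 768])
print(lora.scale)                   # 0.0625 == alpha / lora_dim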

@CCRcmcpe
Author

I see no problem. torch.tensor([alpha]) can be replaced by torch.tensor(alpha) (a scalar instead of a 1-element vector), but that's mostly a coding style change.

@kohya-ss
Owner

Thank you for your comment!
I will replace it with torch.tensor(alpha) to simplify things, and then work on the extension.

@kohya-ss
Owner

I think this issue has been resolved. Thank you again for your work!

Please re-open if you have any questions :)

@AUTOMATIC1111

Does anyone have an example of an SD1 LoRA with this for me to test and add support for?

@kohya-ss
Owner

Hi @AUTOMATIC1111, thank you for supporting LoRA in the web UI!

I've uploaded the LoRA model to my blog. The post is written in Japanese, but you will find cjgg_frog.safetensors here:
https://note.com/kohya_ss/n/nb20c5187e15a#551bd752-78f3-468f-b48f-e8f78f6d399b

The LoRA is trained with SD 1.5, and the activation word is usu frog. It will produce a comical frog like this:
[example image: comical frog generated with the LoRA]

The state_dict now has an alpha value for each LoRA module, like this:

>>> print("\n".join([f"{key}\t{value.size()}" for key, value in list(sd.items())[:6]]))
lora_te_text_model_encoder_layers_0_mlp_fc1.alpha       torch.Size([])
lora_te_text_model_encoder_layers_0_mlp_fc1.lora_down.weight    torch.Size([4, 768])
lora_te_text_model_encoder_layers_0_mlp_fc1.lora_up.weight      torch.Size([3072, 4])
lora_te_text_model_encoder_layers_0_mlp_fc2.alpha       torch.Size([])
lora_te_text_model_encoder_layers_0_mlp_fc2.lora_down.weight    torch.Size([4, 3072])
lora_te_text_model_encoder_layers_0_mlp_fc2.lora_up.weight      torch.Size([768, 4])
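For reference, sd in the listing above can be obtained with the standard safetensors loader (assuming the file name from the blog post):

from safetensors.torch import load_file
sd = load_file('cjgg_frog.safetensors')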

And the LoRA is scaled by alpha / dim like this:

    if type(alpha) == torch.Tensor:
      alpha = alpha.detach().float().numpy()                           # without casting, bf16 causes error
    alpha = lora_dim if alpha is None or alpha == 0 else alpha
    self.scale = alpha / self.lora_dim
    self.register_buffer('alpha', torch.tensor(alpha))                 # can be treated as a constant

    # same as microsoft's
    torch.nn.init.kaiming_uniform_(self.lora_down.weight, a=math.sqrt(5))
    torch.nn.init.zeros_(self.lora_up.weight)

    self.multiplier = multiplier
    self.org_forward = org_module.forward
    self.org_module = org_module                                       # removed in apply_to()

  def apply_to(self):
    self.org_forward = self.org_module.forward
    self.org_module.forward = self.forward
    del self.org_module

  def forward(self, x):
    """
    may be cascaded.
    """
    return self.org_forward(x) + self.lora_up(self.lora_down(x)) * self.multiplier * self.scale

I hope this helps!

@AUTOMATIC1111

Thanks! Great work on your training repos!

@AUTOMATIC1111

Got it working.

[comparison images: 00318-1 vs 00319-1]

While you're here, what's the reason you decided to implement those layers differently for SD2, merging the weights into the model instead of what you did for all the other layers, finding the relevant layer and overriding its forward method?

@kohya-ss
Owner

Looks good!

> While you're here, what's the reason you decided to implement those layers differently for SD2, merging the weights into the model instead of what you did for all the other layers, finding the relevant layer and overriding its forward method?

Because OpenClip (the text encoder used in SD2) uses torch.nn.MultiheadAttention in the ResidualAttentionBlock of its Transformer, instead of processing the Q/K/V/out projections as independent Linear layers.

https://github.com/mlfoundations/open_clip/blob/694554495aedf97ac046e53a690ecd86aee96274/src/open_clip/transformer.py#L176

It would be possible to override the forward of the ResidualAttentionBlock, but I'm not sure how to do it, so I merge the weights for MultiheadAttention instead.
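Conceptually, merging just adds the low-rank product into the original weight matrix. A rough sketch for a single weight (not the actual extension code; MultiheadAttention's packed in_proj_weight additionally needs the Q/K/V slices handled at the right offsets):

import torch

def merge_lora_into_weight(weight, down_weight, up_weight, multiplier, alpha, rank):
  # weight: [out_dim, in_dim], up_weight: [out_dim, rank], down_weight: [rank, in_dim]
  # W' = W + multiplier * (alpha / rank) * (up @ down)
  return weight + multiplier * (alpha / rank) * (up_weight @ down_weight)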

By the way, I found a bug where alpha was not used when merging the weights; I have fixed it.
