
LoRA weight underflows #49

Closed
CCRcmcpe opened this issue Jan 20, 2023 · 13 comments · Fixed by #51
Labels
bug Something isn't working

Comments

@CCRcmcpe

I encountered the same problem as #41 ("LoRAs have no effects").

Background

I'm using SSDT to train LoRAs. The LoRA layer implementation comes from loralib, which applies a weight scaling of (alpha / rank).

So if I want to use an alpha = 1, rank-16 LoRA produced by SSDT in AddNet, the scale has to be set to 1/16.

Some users found this extra scaling inconvenient, so I added an "unscale weight" option that multiplies the weights by (alpha / rank) when converting an SSDT checkpoint to the AddNet format, as sketched below.
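A minimal sketch of what that conversion option does (hypothetical helper, not the actual SSDT converter): the factor alpha / rank is baked into the up-projection weights so the LoRA can be used with scale = 1.

import torch

def unscale_lora_state_dict(sd, alpha, rank):
  """Bake the loralib scaling (alpha / rank) into the saved LoRA weights."""
  factor = alpha / rank
  out = {}
  for key, tensor in sd.items():
    if key.endswith('lora_up.weight'):
      out[key] = tensor * factor    # multiplying by e.g. 1/16 here is where fp16 values can underflow
    else:
      out[key] = tensor
  return out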

Investigation

I inspected the state dict after unscaling.


All of the tensors have very small values, which hurts numerical stability. About 20% of them contain zero values; of those, 15% are in the text encoder and 85% in the UNet.
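The zero ratio can be checked directly on the saved file; a rough sketch (assuming a safetensors checkpoint and the lora_te_ / lora_unet_ key prefixes used by this format):

from safetensors.torch import load_file

sd = load_file('my_lora.safetensors')        # hypothetical file name
zero_ratio = {k: (v == 0).float().mean().item()
              for k, v in sd.items() if k.endswith('.weight')}
affected = [k for k, r in zero_ratio.items() if r > 0]
print(f'tensors containing zeros: {len(affected)} / {len(zero_ratio)}')
print(f'  text encoder: {sum(k.startswith("lora_te_") for k in affected)}')
print(f'  UNet:         {sum(k.startswith("lora_unet_") for k in affected)}')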

Experiment

With one LoRA (rank=16, alpha=1) that I trained:


  • All unscaled LoRAs have basically no effect.
  • Without unscaling, scale = 0.0625 (1 / 16) works as normal.

Conclusion and Solution

I suspect those zeros are the result of underflow, which is probably the cause of #41.

These underflows happen more often when the rank is high, since the unscaling factor (alpha / rank) becomes smaller.
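For reference, fp16 flushes anything smaller than the smallest subnormal (about 6e-8) to exactly zero, so already tiny weights can vanish after the (alpha / rank) multiplication when the checkpoint is saved in fp16. A quick illustration with made-up magnitudes:

import torch

w = torch.tensor(3e-4)                  # an illustrative, already small LoRA weight
print((w / 16).half())                  # ~1.9e-05, still representable in fp16
print((w / 128).half())                 # ~2.3e-06, deep in the subnormal range, heavy precision loss
print(torch.tensor(1e-8).half())        # tensor(0., dtype=torch.float16) -- underflows to zero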

At training time, add an "alpha" option that scales the LoRA like loralib does, and save alpha to the LoRA metadata.

At inference time, add a "scale weight" option that scales the LoRA weights by alpha / rank.

Backward Compatibility

Unfortunately, as you can imagine, almost all existing LoRAs have already underflowed.

If "scale weight" is enabled, for still using old LoRAs, if a LoRA have no alpha in metadata, do not scale.

Additional: NaNs

Since AUTOMATIC1111/stable-diffusion-webui@9991967, those underflowed LoRAs sometimes produce NaN errors when generating images.

Some users reported loss=NaN when using https://github.com/Linaqruf/kohya-trainer/ and https://github.com/Mikubill/naifu-diffusion/, especially at high rank. I suspect that's related to this issue.

@kohya-ss kohya-ss added the bug Something isn't working label Jan 21, 2023
@kohya-ss
Owner

Thank you for the detailed report! This issue is important.

I had noticed the issue that some LoRA modules are not trained at higher rank with 'fp16', but I did not realize it was caused by underflow.

The solution seems good.
I wonder if it is possible to store the alpha value as a non-trainable parameter of the LoRA module. If that is possible, it would be better than depending on the metadata, and .ckpt files could also carry the alpha.

I am not very familiar with PyTorch and would appreciate any suggestions you may have.

For backward compatibility, I think your suggestion is best.

@CCRcmcpe
Author

Putting alpha into the module as a non-trainable parameter (e.g. a registered buffer, sketched below) is indeed better.
Still, these fixes also need to be applied on the training side; I hope the issue will be resolved soon to prevent more underflowed LoRAs from being trained and published.
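For reference, register_buffer does exactly that: the tensor ends up in the state_dict (so it survives saving to safetensors or .ckpt) but is not returned by parameters(), so the optimizer never updates it. A minimal sketch:

import torch

class WithAlpha(torch.nn.Module):
  def __init__(self, alpha):
    super().__init__()
    self.lora_down = torch.nn.Linear(16, 4, bias=False)
    self.register_buffer('alpha', torch.tensor(alpha))   # saved, but non-trainable

m = WithAlpha(1.0)
print('alpha' in dict(m.named_parameters()))   # False -- the optimizer never sees it
print('alpha' in m.state_dict())               # True  -- it is saved with the weights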

@kohya-ss
Owner

Thank you for your comment.

I will update the training script and the extension as soon as possible.

@kohya-ss
Owner

I've implemented the scaling feature like this. If you have any comments, I would appreciate it.

import math
import torch


class LoRAModule(torch.nn.Module):
  """
  replaces forward method of the original Linear, instead of replacing the original Linear module.
  """

  def __init__(self, lora_name, org_module: torch.nn.Module, multiplier=1.0, lora_dim=4, alpha=1):
    """ if alpha == 0 or None, alpha is rank (no scaling). """
    super().__init__()
    self.lora_name = lora_name
    self.lora_dim = lora_dim

    if org_module.__class__.__name__ == 'Conv2d':
      in_dim = org_module.in_channels
      out_dim = org_module.out_channels
      self.lora_down = torch.nn.Conv2d(in_dim, lora_dim, (1, 1), bias=False)
      self.lora_up = torch.nn.Conv2d(lora_dim, out_dim, (1, 1), bias=False)
    else:
      in_dim = org_module.in_features
      out_dim = org_module.out_features
      self.lora_down = torch.nn.Linear(in_dim, lora_dim, bias=False)
      self.lora_up = torch.nn.Linear(lora_dim, out_dim, bias=False)

    alpha = lora_dim if alpha is None or alpha == 0 else alpha
    self.scale = alpha / self.lora_dim
    self.register_buffer('alpha', torch.tensor([alpha]))                    # can be treated as a constant

    # same as microsoft's
    torch.nn.init.kaiming_uniform_(self.lora_down.weight, a=math.sqrt(5))
    torch.nn.init.zeros_(self.lora_up.weight)

    self.multiplier = multiplier
    self.org_module = org_module                  # removed in apply_to()

  def apply_to(self):
    self.org_forward = self.org_module.forward
    self.org_module.forward = self.forward
    del self.org_module

  def forward(self, x):
    return self.org_forward(x) + self.lora_up(self.lora_down(x)) * self.multiplier * self.scale
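A quick usage sketch for the module above (illustrative only; the real script creates one LoRAModule per target Linear/Conv2d in the text encoder and UNet):

linear = torch.nn.Linear(768, 768)
lora = LoRAModule('lora_te_demo', linear, multiplier=1.0, lora_dim=16, alpha=1)
lora.apply_to()                     # linear.forward now routes through the LoRA path

x = torch.randn(1, 768)
print(linear(x).shape)              # torch.Size([1, 768])
print(lora.scale)                   # 0.0625 == alpha / lora_dim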

@CCRcmcpe
Author

I see no problem. torch.tensor([alpha]) can be replaced by torch.tensor(alpha) (a scalar instead of a 1-element vector), but that's mostly a coding style change.

@kohya-ss
Owner

Thank you for your comment!
I will replace it with torch.tensor(alpha) to simplify things, and then work on the extension.

@kohya-ss
Owner

I think this issue has been resolved. Thank you again for your work!

Please re-open if you have any questions :)

@AUTOMATIC1111

Does anyone have an example of an SD1 LoRA with this for me to test and add support for?

@kohya-ss
Owner

Hi @AUTOMATIC1111, thank you for supporting LoRA in the web UI!

I've uploaded the LoRA model to my blog. The post is written in Japanese, but you will find cjgg_frog.safetensors here:
https://note.com/kohya_ss/n/nb20c5187e15a#551bd752-78f3-468f-b48f-e8f78f6d399b

The LoRA is trained with SD 1.5, and the activation word is usu frog. It will produce a comical frog like this:
[example image: comical frog generated with the LoRA]

The state_dict now has an alpha value for each LoRA module, like this:

>>> print("\n".join([f"{key}\t{value.size()}" for key, value in list(sd.items())[:6]]))
lora_te_text_model_encoder_layers_0_mlp_fc1.alpha       torch.Size([])
lora_te_text_model_encoder_layers_0_mlp_fc1.lora_down.weight    torch.Size([4, 768])
lora_te_text_model_encoder_layers_0_mlp_fc1.lora_up.weight      torch.Size([3072, 4])
lora_te_text_model_encoder_layers_0_mlp_fc2.alpha       torch.Size([])
lora_te_text_model_encoder_layers_0_mlp_fc2.lora_down.weight    torch.Size([4, 3072])
lora_te_text_model_encoder_layers_0_mlp_fc2.lora_up.weight      torch.Size([768, 4])
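For reference, sd in the listing above can be obtained with the standard safetensors loader (assuming the file name from the blog post):

from safetensors.torch import load_file
sd = load_file('cjgg_frog.safetensors')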

And the LoRA is scaled by alpha / dim like this:

    if type(alpha) == torch.Tensor:
      alpha = alpha.detach().float().numpy()                           # without casting, bf16 causes error
    alpha = lora_dim if alpha is None or alpha == 0 else alpha
    self.scale = alpha / self.lora_dim
    self.register_buffer('alpha', torch.tensor(alpha))                 # can be treated as a constant

    # same as microsoft's
    torch.nn.init.kaiming_uniform_(self.lora_down.weight, a=math.sqrt(5))
    torch.nn.init.zeros_(self.lora_up.weight)

    self.multiplier = multiplier
    self.org_forward = org_module.forward
    self.org_module = org_module                                       # removed in apply_to()

  def apply_to(self):
    self.org_forward = self.org_module.forward
    self.org_module.forward = self.forward
    del self.org_module

  def forward(self, x):
    """
    may be cascaded.
    """
    return self.org_forward(x) + self.lora_up(self.lora_down(x)) * self.multiplier * self.scale

I hope this helps!

@AUTOMATIC1111

Thanks! Great work on your training repos!

@AUTOMATIC1111

Got it working.

[comparison images: 00318-1 vs 00319-1]

While you're here, what's the reason you decided to implement those layers differently for SD2, merging the weights into the model instead of what you did for all the other layers, finding the relevant layer and overriding its forward method?

@kohya-ss
Owner

Looks good!

> While you're here, what's the reason you decided to implement those layers differently for SD2, merging the weights into the model instead of what you did for all the other layers, finding the relevant layer and overriding its forward method?

Because OpenClip (the text encoder used in SD2) uses torch.nn.MultiheadAttention in the ResidualAttentionBlock of its Transformer, instead of processing the Q/K/V/out projections as independent Linear layers.

https://github.com/mlfoundations/open_clip/blob/694554495aedf97ac046e53a690ecd86aee96274/src/open_clip/transformer.py#L176

It would be possible to override the forward of the ResidualAttentionBlock, but I'm not sure how to do it, so I merge the weights for MultiheadAttention instead.
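Conceptually, merging just adds the low-rank product into the original weight matrix. A rough sketch for a single weight (not the actual extension code; MultiheadAttention's packed in_proj_weight additionally needs the Q/K/V slices handled at the right offsets):

import torch

def merge_lora_into_weight(weight, down_weight, up_weight, multiplier, alpha, rank):
  # weight: [out_dim, in_dim], up_weight: [out_dim, rank], down_weight: [rank, in_dim]
  # W' = W + multiplier * (alpha / rank) * (up @ down)
  return weight + multiplier * (alpha / rank) * (up_weight @ down_weight)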

By the way, I found a bug where alpha was not used when merging the weights; I have fixed it.
