Unused parameters #4

Open
zaptrem opened this issue Jun 13, 2024 · 1 comment

Comments


zaptrem commented Jun 13, 2024

```python
self.normC2 = Fp32LayerNorm(dim, bias=False)
self.w1o = nn.Linear(dim, dim, bias=False)
```

These are not used in the last layer and should be moved inside an `if not last` branch. Unused parameters make some distributed algorithms slow and sad: https://pytorch.org/docs/stable/notes/ddp.html#internal-design
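Something like this would avoid it (rough sketch only; I'm assuming the block takes an `is_last` flag and using `nn.LayerNorm` as a stand-in for your `Fp32LayerNorm`):

```python
import torch.nn as nn

Fp32LayerNorm = nn.LayerNorm  # stand-in here for the repo's fp32 norm class

class Block(nn.Module):
    def __init__(self, dim, is_last=False):
        super().__init__()
        self.is_last = is_last
        if not is_last:
            # Only created when actually used, so DDP never sees
            # parameters that receive no gradient.
            self.normC2 = Fp32LayerNorm(dim, bias=False)
            self.w1o = nn.Linear(dim, dim, bias=False)
```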

Edit: Also, (unless I misread your code) you only seem to feed the timestep embedding into the AdaLN scale/shift, but the SD3 paper also adds a pooled vector computed from the image caption. Did you find the former worked better?
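For reference, a rough sketch of the SD3-style conditioning I mean (hypothetical names, not taken from your code):

```python
import torch
import torch.nn as nn

dim = 256
adaln_mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim, bias=True))

t_emb = torch.randn(4, dim)            # timestep embedding
pooled_text_emb = torch.randn(4, dim)  # pooled caption embedding (e.g. CLIP)

# SD3 sums the two conditioning vectors before the AdaLN projection;
# here only t_emb appears to be used.
cond = t_emb + pooled_text_emb
scale, shift = adaln_mlp(cond).chunk(2, dim=-1)
```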

Edit 2: Also, did your muP optimization really land that far from a 1e-4 learning rate? Can you share the results of your hparam search?

cloneofsimo (Owner) commented

Ah yes, you are correct.

> Edit: Also, (unless I misread your code) you only seem to feed the timestep embedding into the AdaLN scale/shift, but the SD3 paper also adds a pooled vector computed from the image caption. Did you find the former worked better?

I just don't find the CLIP embedding useful when I run inference with it. Kinda a personal thing.
Because muP divides the global learning rate by the input dimension, the effective rate ends up more like 1e-4 in practice for the fat layers.
For biases or the input layer, it's much larger, which is the rationale behind muP.
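Roughly, the scaling works like this (illustrative sketch with made-up widths and a made-up base rate, not this repo's actual optimizer setup):

```python
import torch
import torch.nn as nn

base_lr, dim = 1.0, 1024
model = nn.ModuleDict({
    "hidden": nn.Linear(dim, dim, bias=False),  # wide ("fat") matrix-like layer
    "input": nn.Embedding(10, dim),             # input/bias-like layer
})

param_groups = [
    # Hidden weight matrices: global lr divided by fan-in,
    # so 1.0 becomes ~1e-3 here and ~1e-4 for wider models.
    {"params": model["hidden"].parameters(), "lr": base_lr / dim},
    # Input embeddings / biases: keep the (much larger) global rate.
    {"params": model["input"].parameters(), "lr": base_lr},
]
optimizer = torch.optim.AdamW(param_groups)
```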
