Unused parameters #4

Open
zaptrem opened this issue Jun 13, 2024 · 1 comment

Comments


zaptrem commented Jun 13, 2024

```python
self.normC2 = Fp32LayerNorm(dim, bias=False)
self.w1o = nn.Linear(dim, dim, bias=False)
```

These are not used in the last layer and should be moved inside an `if not last` branch. Unused parameters make some distributed algorithms slow and sad: https://pytorch.org/docs/stable/notes/ddp.html#internal-design
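Something like this would avoid it (rough sketch only; I'm assuming the block takes an `is_last` flag and using `nn.LayerNorm` as a stand-in for your `Fp32LayerNorm`):

```python
import torch.nn as nn

Fp32LayerNorm = nn.LayerNorm  # stand-in here for the repo's fp32 norm class

class Block(nn.Module):
    def __init__(self, dim, is_last=False):
        super().__init__()
        self.is_last = is_last
        if not is_last:
            # Only created when actually used, so DDP never sees
            # parameters that receive no gradient.
            self.normC2 = Fp32LayerNorm(dim, bias=False)
            self.w1o = nn.Linear(dim, dim, bias=False)
```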

Edit: Also, (unless I misread your code) you only seem to feed the timestep embedding into the AdaLN scale/shift, but the SD3 paper also adds a pooled vector computed from the image caption. Did you find the former worked better?
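For reference, a rough sketch of the SD3-style conditioning I mean (hypothetical names, not taken from your code):

```python
import torch
import torch.nn as nn

dim = 256
adaln_mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim, bias=True))

t_emb = torch.randn(4, dim)            # timestep embedding
pooled_text_emb = torch.randn(4, dim)  # pooled caption embedding (e.g. CLIP)

# SD3 sums the two conditioning vectors before the AdaLN projection;
# here only t_emb appears to be used.
cond = t_emb + pooled_text_emb
scale, shift = adaln_mlp(cond).chunk(2, dim=-1)
```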

Edit 2: Also, did your muP optimization really land that far from a 1e-4 learning rate? Can you share the results of your hparam search?

cloneofsimo (Owner) commented

Ah yes, you are correct.

> Edit: Also, (unless I misread your code) you only seem to feed the timestep embedding into the AdaLN scale/shift, but the SD3 paper also adds a pooled vector computed from the image caption. Did you find the former worked better?

I just don't find the CLIP embedding useful when I run inference with it. Kinda a personal thing.
Because muP divides the global learning rate by the input dimension, the effective rate ends up more like 1e-4 in practice for the fat layers.
For biases or the input layer, it's much larger, which is the rationale behind muP.
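Roughly, the scaling works like this (illustrative sketch with made-up widths and a made-up base rate, not this repo's actual optimizer setup):

```python
import torch
import torch.nn as nn

base_lr, dim = 1.0, 1024
model = nn.ModuleDict({
    "hidden": nn.Linear(dim, dim, bias=False),  # wide ("fat") matrix-like layer
    "input": nn.Embedding(10, dim),             # input/bias-like layer
})

param_groups = [
    # Hidden weight matrices: global lr divided by fan-in,
    # so 1.0 becomes ~1e-3 here and ~1e-4 for wider models.
    {"params": model["hidden"].parameters(), "lr": base_lr / dim},
    # Input embeddings / biases: keep the (much larger) global rate.
    {"params": model["input"].parameters(), "lr": base_lr},
]
optimizer = torch.optim.AdamW(param_groups)
```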
