Rewrite initialization #607
Conversation
No major concerns. I'm glad we're cleaning this up.
Why do we scale the embedding with the following factor if `scale_logits=True`?
`emb_std_factor = (0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0`
This was another "trick" we heard works from someone else (not sure who).
Wouldn't this make more sense if we did this when […]
Co-authored-by: Pete <[email protected]>
Yea, I'm guessing that's the only scenario where we tried it? It might have come from PaLM.
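For concreteness, here's a quick sketch of why that factor matters. It assumes the `mitchell` init uses an embedding std of `1 / math.sqrt(d_model)` (a common convention, but not stated in this thread):

```python
import math

def emb_std(d_model: int, scale_logits: bool) -> float:
    """Sketch of the embedding-std computation under discussion.

    `base_std` is an assumption: mitchell-style init typically uses
    1 / sqrt(d_model) for the embedding layer.
    """
    base_std = 1.0 / math.sqrt(d_model)
    # The factor quoted in the review comment:
    emb_std_factor = (0.5 * math.sqrt(d_model)) if scale_logits else 1.0
    return base_std * emb_std_factor

# The sqrt(d_model) terms cancel, so with scale_logits=True the effective
# std is 0.5 regardless of model width:
for d in (512, 1024, 4096):
    assert math.isclose(emb_std(d, scale_logits=True), 0.5)
```

Under that assumption, the `0.5 * sqrt(d_model)` factor exactly cancels the width-dependence of the base std, which matches the "std ends up always being 0.5" observation in the description.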
LGTM
Simplifies our inscrutable initialization `init_weights`, with its complex if-else logic, into `init_normal`, which only takes the module, the std, and optionally a `cutoff_factor`. Removes `reset_parameters()` and the `kaiming_normal` and `fan_in` `InitFnType`s, as these aren't being used anywhere. They can be added later if needed.

Potential bugs found in initialization as a result of the refactoring (these will be fixed after feedback):

- `OLMoBlock.ff_out`'s `normal` initialization multiplies std by an extra factor of `1 / math.sqrt(2 * self.config.n_layers)`. This potentially came from trying to incorporate `full_megatron` into the same function.
- `mitchell` hardcodes a `cutoff_factor` of 3.0 (always `truncated_normal_` with 3.0). `full_megatron` hardcodes a default `cutoff_factor` of 3.0 (`truncated_normal_` with `config.init_cutoff_factor or 3.0`). Again, this may be a result of trying to incorporate multiple inits into the same function. Ideally, the `cutoff_factor` should always come from the configurable `config.init_cutoff_factor`; do we want to always set this value to 3.0 for `mitchell` and `megatron`?
- Why do we scale the embedding with `emb_std_factor = (0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0` when `scale_logits=True`? In the `mitchell` init, due to supplying the factor at multiple places in the old code, std ends up always being 0.5 when `scale_logits=True`!
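To illustrate the cutoff-factor fallback mentioned above, here is a minimal sketch. `InitConfig` is a hypothetical stand-in for the real config object; the point is only how `config.init_cutoff_factor or 3.0` behaves:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InitConfig:
    # Hypothetical stand-in for the real config; only the field under
    # discussion is modeled here.
    init_cutoff_factor: Optional[float] = None

def effective_cutoff(config: InitConfig) -> float:
    # full_megatron-style fallback quoted in the description:
    return config.init_cutoff_factor or 3.0

# Unset -> hardcoded default of 3.0
assert effective_cutoff(InitConfig()) == 3.0
# Explicitly configured values win
assert effective_cutoff(InitConfig(init_cutoff_factor=2.0)) == 2.0
# Caveat of `or`: a configured 0.0 is falsy and silently falls back to 3.0
assert effective_cutoff(InitConfig(init_cutoff_factor=0.0)) == 3.0
```

If the cutoff should always come from the config, an explicit `if config.init_cutoff_factor is not None` check would avoid the falsy-zero pitfall of `or`.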