
Rewrite initialization #607

Merged: 26 commits from rewrite-init into main on Jun 10, 2024
Conversation

@AkshitaB (Contributor) commented Jun 10, 2024

Simplifies our inscrutable initialization

  • IMPORTANT: currently, the implementation matches the old buggy values for init in several places. See below.
  • Removes init_weights with its complex if-else logic.
  • Adds init_normal which only takes the module, the std, and optionally a cutoff_factor.
  • std and cutoff_factor computation is now handled in each module's reset_parameters() (a minimal sketch follows this list).
  • Adds unit tests for initialization.
  • Removes implementation for kaiming_normal and fan_in InitFnType as these aren't being used anywhere. Can be added later if needed.
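
For reference, here is a minimal sketch of what the new helper and a module-level reset_parameters() might look like, based only on the description above. The names init_normal, init_cutoff_factor, and reset_parameters come from this PR; the exact signature, the FeedForwardSketch class, and the example std formula are illustrative assumptions, not the actual implementation.

```python
import math
from typing import Optional

import torch.nn as nn


def init_normal(module: nn.Module, std: float, init_cutoff_factor: Optional[float] = None) -> None:
    """Initialize module.weight from N(0, std^2), truncated at cutoff_factor * std if a cutoff is given."""
    if init_cutoff_factor is not None:
        cutoff = init_cutoff_factor * std
        nn.init.trunc_normal_(module.weight, mean=0.0, std=std, a=-cutoff, b=cutoff)
    else:
        nn.init.normal_(module.weight, mean=0.0, std=std)
    if getattr(module, "bias", None) is not None:
        nn.init.zeros_(module.bias)


class FeedForwardSketch(nn.Module):
    """Illustrative module (not from the PR): each module computes its own std/cutoff in reset_parameters()."""

    def __init__(self, d_model: int, init_cutoff_factor: Optional[float] = None):
        super().__init__()
        self.ff_out = nn.Linear(d_model, d_model)
        self.d_model = d_model
        self.init_cutoff_factor = init_cutoff_factor
        self.reset_parameters()

    def reset_parameters(self) -> None:
        # The module itself, not a central if-else, decides its std (example formula only).
        std = 1.0 / math.sqrt(self.d_model)
        init_normal(self.ff_out, std, self.init_cutoff_factor)
```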

Potential bugs found in initialization as a result of the refactoring (these will be fixed after feedback):

  • OLMoBlock.ff_out's normal initialization multiplies std by an extra factor of 1 / math.sqrt(2 * self.config.n_layers). This potentially came from trying to incorporate full_megatron into the same function.
  • Hardcoded values: mitchell hardcodes a cutoff_factor of 3.0 (always truncated_normal_ with 3.0). full_megatron hardcodes a default cutoff_factor of 3.0 (truncated_normal_ with config.init_cutoff_factor or 3.0). Again, this may be a result of trying to incorporate multiple inits into the same function. Ideally, the cutoff_factor should always come from the configurable config.init_cutoff_factor; do we want to always set this value to 3.0 for mitchell and megatron?
  • Need clarification: Why do we scale the embedding with the following factor if scale_logits=True?
    emb_std_factor = (0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0
  • Additionally, in the case of mitchell init, because the factor is supplied in multiple places in the old code, the embedding std always ends up being 0.5 when scale_logits=True! (A worked example follows this list.)
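
To make the last point concrete, a small worked example. It assumes the old mitchell-style embedding std of 1 / math.sqrt(d_model); the emb_std_factor expression is quoted from the PR description, and the concrete d_model value is only illustrative.

```python
import math

d_model = 4096                             # illustrative; any value gives the same result
base_std = 1.0 / math.sqrt(d_model)        # assumed mitchell-style embedding std
emb_std_factor = 0.5 * math.sqrt(d_model)  # factor applied when scale_logits=True (from the old code)
print(base_std * emb_std_factor)           # 0.5 -- the sqrt(d_model) terms cancel
```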

@AkshitaB AkshitaB requested review from dirkgr and epwalsh June 10, 2024 06:56
@epwalsh (Member) left a comment
No major concerns. I'm glad we're cleaning this up.

> Why do we scale the embedding with the following factor if scale_logits=True?
> emb_std_factor = (0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0

This was another "trick" we heard works from someone else (not sure who).

(Five inline review comments on olmo/model.py, all resolved.)
@AkshitaB (Contributor, Author) commented

> No major concerns. I'm glad we're cleaning this up.
>
> > Why do we scale the embedding with the following factor if scale_logits=True?
> > emb_std_factor = (0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0
>
> This was another "trick" we heard works from someone else (not sure who).

Wouldn't this make more sense if we did this when weight_tying was on? I'm trying to get a sense of intuition for some of these choices/tricks.

@epwalsh (Member) commented Jun 10, 2024


> Wouldn't this make more sense if we did this when weight_tying was on? I'm trying to get a sense of intuition for some of these choices/tricks.

Yea I'm guessing that's the only scenario where we tried it? It might have come from PaLM.

@epwalsh (Member) left a comment

LGTM

(One further inline review comment on olmo/model.py, resolved.)
@AkshitaB merged commit c2cedbc into main on Jun 10, 2024 — 12 checks passed.
@AkshitaB deleted the rewrite-init branch on June 10, 2024 at 23:52.