thinking fast and slow

labels: experimental

Let W denote the weights.

Decompose W as W = W1 + W2 such that W1 and W2 have the same dimensions as W.

Set a (alpha) to be a mixing rate, which starts at zero.

Learn W as W = W1 + a * W2, increasing a throughout training in proportion to the learning rate (lr).
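
A minimal sketch of this mixing rule, under assumptions not in the note: a single weight matrix, alpha capped at 1.0, and a hypothetical `alpha_gain` constant tying alpha's growth to the learning rate. W1 is kept as a plain dense parameter here; its hyperlora parameterization comes next.

```python
import torch

# Sketch of W = W1 + a * W2 (assumptions: single weight matrix, alpha capped
# at 1.0, hypothetical alpha_gain constant linking alpha's growth to the lr).
d_out, d_in = 64, 128
W1 = torch.nn.Parameter(0.02 * torch.randn(d_out, d_in))  # "fast" weights (dense here for simplicity)
W2 = torch.nn.Parameter(torch.zeros(d_out, d_in))         # "slow" weights, learned conventionally
alpha, lr, alpha_gain = 0.0, 1e-3, 0.1                     # alpha (a) starts at zero

def effective_weight():
    # The weight actually used in the forward pass.
    return W1 + alpha * W2

# After each optimizer step, grow alpha in proportion to the learning rate.
alpha = min(1.0, alpha + alpha_gain * lr)
```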

W2 are the "slow" weights and are learned conventionally.

W1 will be parameterized and learned as a hyperlora, and so serves as our "fast" weights.

Let W1 = VZ, where V is a learnable vector and Z is a fixed, randomly initialized orthonormal matrix (i.e., random projections).
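
One possible concrete form of W1 = VZ, sketched below; the rank k, the QR construction of Z, and the reshape back to W's shape are assumptions.

```python
import torch

# Sketch of W1 = V Z (assumptions: V is a learnable length-k vector, Z is a
# fixed k x (d_out*d_in) matrix with orthonormal rows built via QR, and the
# product is reshaped back to the shape of W).
d_out, d_in, k = 64, 128, 16

# Fixed random projection: QR of a Gaussian gives orthonormal columns;
# transposing gives Z orthonormal rows. Z is frozen and never trained.
Z = torch.linalg.qr(torch.randn(d_out * d_in, k)).Q.T     # (k, d_out * d_in)

V = torch.nn.Parameter(0.02 * torch.randn(k))             # learnable "fast" vector

def fast_weight():
    # W1, reconstructed from the low-dimensional V through the fixed projection.
    return (V @ Z).reshape(d_out, d_in)
```

Only the k entries of V are trained, so the "fast" weights live in a k-dimensional subspace fixed by the random projection.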

the "slow" weights are essentially a residual.

we could "stack" residuals if we wanted higher-order granularity