Add the option to recompute the matmul in the MLP part.
Since this matmul's output is the largest remaining single tensor, we get substantial improvements in memory consumption.
With a 350M model at batch size 20, activation memory drops from 12042 MiB to 8362 MiB; equivalently, on 2x 4060 Ti, I can instead increase the batch size from 20 to 28 without OOM. Overall, there is still a slowdown, from 30 ktok/s to 28 ktok/s, but
a) we cannot increase gradient accumulation arbitrarily due to bf16 rounding problems, so fitting more tokens into each fwd/bwd pass allows larger effective batch sizes, and
b) for smaller cards, larger models, or longer contexts, this may make the difference between being able to run at all or not,
so I think that, despite the noticeable drop in tok/s, this is an option we want to have.
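For reference, here is a minimal PyTorch-style sketch of the general trade-off; the `MLP` module and `recompute` flag below are purely illustrative and not this PR's actual API, assuming a standard 4x-expansion MLP block:

```python
# Illustrative sketch only -- this repo's implementation and flag names may differ.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class MLP(nn.Module):
    """Transformer MLP block with optional recompute of the expansion matmul."""

    def __init__(self, d_model: int, recompute: bool = False):
        super().__init__()
        self.fc = nn.Linear(d_model, 4 * d_model)   # the expensive expansion matmul
        self.act = nn.GELU()
        self.proj = nn.Linear(4 * d_model, d_model)
        self.recompute = recompute  # hypothetical flag mirroring this PR's option

    def _expand(self, x: torch.Tensor) -> torch.Tensor:
        # This intermediate is 4x the model width -- the largest single
        # activation tensor in the block, hence the biggest memory win.
        return self.act(self.fc(x))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.recompute and self.training:
            # Don't store the 4*d_model intermediate; recompute it in backward.
            h = checkpoint(self._expand, x, use_reentrant=False)
        else:
            h = self._expand(x)
        return self.proj(h)
```

With recompute enabled, the 4x-width intermediate is freed after the forward pass and recomputed once during backward, which matches the numbers above: lower activation memory at the cost of some extra compute per step.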