fix ac 'checkpointing' spelling, minor spacing tweaks (pytorch#265)
This PR is mainly to fix the spelling where activation checkpointing is
missing an n... (**checkpoiting**).
Not sure how I missed it earlier, but it's glaring when you see the charts in visual form (vs. text).

<img width="578" alt="Screenshot 2024-04-24 at 2 45 25 PM"
src="https://github.com/pytorch/torchtitan/assets/46302957/a81727b2-07b1-4d69-a0c1-743d74d2aa5a">

fixed:
<img width="592" alt="Screenshot 2024-04-24 at 3 10 30 PM"
src="https://github.com/pytorch/torchtitan/assets/46302957/769e51db-4aa6-4dbd-99d8-7e691658e280">


It also adds a couple of line breaks to help with the layout, and makes one or two minor grammar updates.
lessw2020 authored Apr 24, 2024
1 parent 779a7c9 commit ab69602
12 changes: 7 additions & 5 deletions docs/performance.md
@@ -1,11 +1,12 @@
-To demonstrate the effectiveness of techniques used in torchtitan, we report both the infra metrics and loss curves of Llama 2 (13B and 70B) and Llama 3 (8B and 70B) training on 64 A100 (80GB memory) GPUs. We report infra metrics achieved by [FSDP2](fsdp.md) (1D parallelism) under various configurations, and loss curves for both 1D parallelism (FSDP2) and 2D parallelism (FSDP2 + Tensor Parallel) training.
+To demonstrate the effectiveness of PyTorch distributed training techniques used in torchtitan, we report both the infra metrics and loss curves of Llama 2 (13B and 70B) and Llama 3 (8B and 70B) training on 64 A100 (80GB memory) GPUs.
+We report infra metrics achieved by [FSDP2](fsdp.md) (1D parallelism) under various configurations, and loss curves for both 1D parallelism (FSDP2) and 2D parallelism (FSDP2 + Tensor Parallel) training.


## Llama 3 performance numbers

Below are the WPS (word per second, or more accurately, token per second) and MFU (model FLOPS utilization) results which torchtitan achieves on Llama 3 models with FSDP2 on 64 A100 (80GB) GPUs. The way we compute WPS and MFU can be found in `train.py`.

-| Model size | Batch size | Activation checkpoiting | WPS | MFU |
+| Model size | Batch size | Activation checkpointing | WPS | MFU |
| ----- | ----- | ----- | ----- | ----- |
| 8B | 1 | selective layer | 2904 | 56.8% |
| 8B | 1 | selective op | 2973 | 58.2% |
@@ -22,17 +23,18 @@ Next we show the loss curves for Llama 3 8B and Llama 3 70B training with both 1

Below are the WPS and MFU results which torchtitan achieves on Llama 2 models with FSDP2 on 64 A100 (80GB) GPUs.

-| Model size | Batch size | Activation checkpoiting | WPS | MFU |
+| Model size | Batch size | Activation checkpointing | WPS | MFU |
| ----- | ----- | ----- | ----- | ----- |
| 13B | 2 | no | 2162 | 61.1% |
| 13B | 2 | selective layer | 1914 | 54.1% |
| 13B | 2 | selective op | 1904 | 53.8% |
| 70B | 1[^1] | selective op | 355 | 50.8% |
| 70B | 2 | full | 353 | 50.5% |

-We primarily use local batch size 2 (global batch size 128) in the experiments, to keep the same number of tokens per training iteration between Llama 2 and Llama 3 (since the default sequence length in Llama 2 is 4096 which is halved compared with Llama 3). In fact, for Llama 2 70B model with full activation checkpointing, the MFU can go up to 54% when local batch size is higher (but before OOM happens).
+We primarily use local batch size 2 (global batch size 128) in the experiments, to keep the same number of tokens per training iteration between Llama 2 and Llama 3 (since the default sequence length in Llama 2 is 4096 which is halved compared with Llama 3). In fact, for Llama 2 70B model with full activation checkpointing, the MFU can go up to 54% when local batch size is higher (but before an OOM happens).

-Next we show the loss curves for Llama 2 13B and Llama 2 70B training with both 1D parallelism (FSDP2) and 2D parallelism (FSDP2 + Tensor Parallel). All four models are trained 3000 steps with global batch size 128. In terms of activation checkpointing (AC) configs, the Llama 2 13B training jobs use selective op AC, whereas the Llama 70B training jobs use full AC. The results are shown in the picture (a TensorBoard screenshot) below[^2].
+Next we show the loss curves for Llama 2 13B and Llama 2 70B training with both 1D parallelism (FSDP2) and 2D parallelism (FSDP2 + Tensor Parallel). All four models are trained 3000 steps with global batch size 128.
+In terms of activation checkpointing (AC) configs, the Llama 2 13B training jobs use selective op AC, whereas the Llama 70B training jobs use full AC. The results are shown in the picture (a TensorBoard screenshot) below[^2].

![image](../assets/images/llama2_loss_curves.png)
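
As a rough sanity check on the tables in the diff above: the MFU column can be approximately reproduced from the reported WPS. The sketch below is not the code in `train.py` (which remains the authoritative source); it assumes WPS is tokens per second per GPU, an A100 (80GB) BF16 peak of 312 TFLOPS, approximate Llama 3 8B shape values (about 8.03B parameters, 32 layers, 32 attention heads, head dim 128, sequence length 8192), and the common 6·N + 12·L·h·d·s FLOPs-per-token estimate.

```python
# Hypothetical back-of-the-envelope check, not torchtitan's actual implementation.
def approx_mfu(wps_per_gpu, n_params, n_layers, n_heads, head_dim, seq_len,
               peak_flops=312e12):  # assumed A100 (80GB) BF16 peak FLOPS
    # Common estimate: 6 FLOPs per parameter per token, plus attention FLOPs.
    flops_per_token = 6 * n_params + 12 * n_layers * n_heads * head_dim * seq_len
    achieved_flops_per_sec = wps_per_gpu * flops_per_token  # per GPU
    return achieved_flops_per_sec / peak_flops

# Llama 3 8B row (selective layer AC, WPS 2904): prints roughly 57%,
# in line with the 56.8% in the table above.
print(f"{approx_mfu(2904, 8.03e9, 32, 32, 128, 8192):.1%}")

# Token-count parity that motivates local batch size 2 for Llama 2:
#   Llama 2: 64 GPUs * 2 * 4096 = 524,288 tokens per iteration
#   Llama 3: 64 GPUs * 1 * 8192 = 524,288 tokens per iteration
```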
