fix ac 'checkpointing' spelling, minor spacing tweaks (pytorch#265)
This PR is mainly to fix the spelling where activation checkpointing is
missing an n... (**checkpoiting**).
Not sure how I missed it earlier, but it's glaring when you see the charts in visual form (vs. text).

<img width="578" alt="Screenshot 2024-04-24 at 2 45 25 PM"
src="https://github.com/pytorch/torchtitan/assets/46302957/a81727b2-07b1-4d69-a0c1-743d74d2aa5a">

fixed:
<img width="592" alt="Screenshot 2024-04-24 at 3 10 30 PM"
src="https://github.com/pytorch/torchtitan/assets/46302957/769e51db-4aa6-4dbd-99d8-7e691658e280">


It also adds a couple of line breaks to help with the layout, and makes one or two minor grammar updates.
lessw2020 authored Apr 24, 2024
1 parent 779a7c9 commit ab69602
12 changes: 7 additions & 5 deletions docs/performance.md
@@ -1,11 +1,12 @@
-To demonstrate the effectiveness of techniques used in torchtitan, we report both the infra metrics and loss curves of Llama 2 (13B and 70B) and Llama 3 (8B and 70B) training on 64 A100 (80GB memory) GPUs. We report infra metrics achieved by [FSDP2](fsdp.md) (1D parallelism) under various configurations, and loss curves for both 1D parallelism (FSDP2) and 2D parallelism (FSDP2 + Tensor Parallel) training.
+To demonstrate the effectiveness of PyTorch distributed training techniques used in torchtitan, we report both the infra metrics and loss curves of Llama 2 (13B and 70B) and Llama 3 (8B and 70B) training on 64 A100 (80GB memory) GPUs.
+We report infra metrics achieved by [FSDP2](fsdp.md) (1D parallelism) under various configurations, and loss curves for both 1D parallelism (FSDP2) and 2D parallelism (FSDP2 + Tensor Parallel) training.


## Llama 3 performance numbers

Below are the WPS (word per second, or more accurately, token per second) and MFU (model FLOPS utilization) results which torchtitan achieves on Llama 3 models with FSDP2 on 64 A100 (80GB) GPUs. The way we compute WPS and MFU can be found in `train.py`.

-| Model size | Batch size | Activation checkpoiting | WPS | MFU |
+| Model size | Batch size | Activation checkpointing | WPS | MFU |
| ----- | ----- | ----- | ----- | ----- |
| 8B | 1 | selective layer | 2904 | 56.8% |
| 8B | 1 | selective op | 2973 | 58.2% |
@@ -22,17 +23,18 @@ Next we show the loss curves for Llama 3 8B and Llama 3 70B training with both 1

Below are the WPS and MFU results which torchtitan achieves on Llama 2 models with FSDP2 on 64 A100 (80GB) GPUs.

-| Model size | Batch size | Activation checkpoiting | WPS | MFU |
+| Model size | Batch size | Activation checkpointing | WPS | MFU |
| ----- | ----- | ----- | ----- | ----- |
| 13B | 2 | no | 2162 | 61.1% |
| 13B | 2 | selective layer | 1914 | 54.1% |
| 13B | 2 | selective op | 1904 | 53.8% |
| 70B | 1[^1] | selective op | 355 | 50.8% |
| 70B | 2 | full | 353 | 50.5% |

-We primarily use local batch size 2 (global batch size 128) in the experiments, to keep the same number of tokens per training iteration between Llama 2 and Llama 3 (since the default sequence length in Llama 2 is 4096 which is halved compared with Llama 3). In fact, for Llama 2 70B model with full activation checkpointing, the MFU can go up to 54% when local batch size is higher (but before OOM happens).
+We primarily use local batch size 2 (global batch size 128) in the experiments, to keep the same number of tokens per training iteration between Llama 2 and Llama 3 (since the default sequence length in Llama 2 is 4096 which is halved compared with Llama 3). In fact, for Llama 2 70B model with full activation checkpointing, the MFU can go up to 54% when local batch size is higher (but before an OOM happens).

-Next we show the loss curves for Llama 2 13B and Llama 2 70B training with both 1D parallelism (FSDP2) and 2D parallelism (FSDP2 + Tensor Parallel). All four models are trained 3000 steps with global batch size 128. In terms of activation checkpointing (AC) configs, the Llama 2 13B training jobs use selective op AC, whereas the Llama 70B training jobs use full AC. The results are shown in the picture (a TensorBoard screenshot) below[^2].
+Next we show the loss curves for Llama 2 13B and Llama 2 70B training with both 1D parallelism (FSDP2) and 2D parallelism (FSDP2 + Tensor Parallel). All four models are trained 3000 steps with global batch size 128.
+In terms of activation checkpointing (AC) configs, the Llama 2 13B training jobs use selective op AC, whereas the Llama 70B training jobs use full AC. The results are shown in the picture (a TensorBoard screenshot) below[^2].

![image](../assets/images/llama2_loss_curves.png)
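
As a rough sanity check on the tables in the diff above: the MFU column can be approximately reproduced from the reported WPS. The sketch below is not the code in `train.py` (which remains the authoritative source); it assumes WPS is tokens per second per GPU, an A100 (80GB) BF16 peak of 312 TFLOPS, approximate Llama 3 8B shape values (about 8.03B parameters, 32 layers, 32 attention heads, head dim 128, sequence length 8192), and the common 6·N + 12·L·h·d·s FLOPs-per-token estimate.

```python
# Hypothetical back-of-the-envelope check, not torchtitan's actual implementation.
def approx_mfu(wps_per_gpu, n_params, n_layers, n_heads, head_dim, seq_len,
               peak_flops=312e12):  # assumed A100 (80GB) BF16 peak FLOPS
    # Common estimate: 6 FLOPs per parameter per token, plus attention FLOPs.
    flops_per_token = 6 * n_params + 12 * n_layers * n_heads * head_dim * seq_len
    achieved_flops_per_sec = wps_per_gpu * flops_per_token  # per GPU
    return achieved_flops_per_sec / peak_flops

# Llama 3 8B row (selective layer AC, WPS 2904): prints roughly 57%,
# in line with the 56.8% in the table above.
print(f"{approx_mfu(2904, 8.03e9, 32, 32, 128, 8192):.1%}")

# Token-count parity that motivates local batch size 2 for Llama 2:
#   Llama 2: 64 GPUs * 2 * 4096 = 524,288 tokens per iteration
#   Llama 3: 64 GPUs * 1 * 8192 = 524,288 tokens per iteration
```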
