8-bit Adam #463

Merged: 46 commits from the 8bit_adam branch into pytorch:main on Jul 3, 2024
Conversation

@gau-nernst (Collaborator) commented:

To fine-tune a pre-trained ViT-Base on the resisc45 dataset with BF16 AMP, using the default Adam optimizer from PyTorch core:

python train.py \
  --model "timm/vit_base_patch16_224.augreg_in21k" \
  --amp bf16 \
  --optim Adam

To use the bnb 8-bit optimizer, set --optim AdamBnb8bit. To use the 8-bit optimizer implemented in this PR, set --optim AdamDTQ8bit.

| Adam impl | max memory (GB) | training time | accuracy |
|-----------|-----------------|---------------|----------|
| PyTorch   | 5.26            | 9m 11s        | 93.62%   |
| bnb 8-bit | 4.78            | 9m 10s        | 93.06%   |
| ao 8-bit  | 4.78            | 9m 15s        | 94.14%   |

To use wandb logging, set --project AdamInt8 and --run_name vit_base_bf16_amp (change as needed).
To profile and export a Chrome trace, set --profile.
To enable the cosine learning rate scheduler, set --cosine_lr_scheduler.
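
For context, a minimal sketch of using the new optimizer outside the benchmark script, assuming it is exposed as Adam8bit under torchao.prototype.low_bit_optim (the exact class name and import path at this commit are assumptions, not confirmed by this PR text):

import torch
# Assumed import path and class name for the 8-bit Adam added in this PR;
# they may differ at this commit, so treat this as illustrative only.
from torchao.prototype.low_bit_optim import Adam8bit

model = torch.nn.Linear(768, 10).cuda()
optim = Adam8bit(model.parameters(), lr=1e-4)   # drop-in for torch.optim.Adam

loss = model(torch.randn(8, 768, device="cuda")).sum()
loss.backward()
optim.step()
optim.zero_grad()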

Known limitation: when the learning rate is updated every step (e.g. when using the cosine learning rate scheduler), training speed decreases significantly. This is because we have to convert the learning rate to a CUDA tensor (which incurs an expensive memory transfer cost), since torch.compile() will treat a Python float as a constant and trigger a recompile whenever the value changes.
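
A standalone sketch of the workaround described above (not the PR's code; assumes a CUDA device): keep the learning rate as a 0-dim tensor that the compiled step reads as a graph input, and update its value in place from the schedule.

import torch

# 0-dim CUDA tensor for the learning rate, created once and reused.
lr = torch.tensor(1e-3, device="cuda")

@torch.compile
def sgd_like_update(p, grad, lr):
    # lr is traced as a tensor input, so updating its value in place does not
    # change the graph. A Python float here would be baked in as a constant,
    # triggering a recompile whenever the schedule changes it.
    p.sub_(lr * grad)

p = torch.randn(16, device="cuda")
g = torch.randn(16, device="cuda")
for step in range(3):
    lr.copy_(1e-3 * 0.9 ** step)   # schedule writes the new value in place (small host-to-device copy)
    sgd_like_update(p, g, lr)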

@pytorch-bot (bot) commented Jul 2, 2024:

🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/463

✅ No Failures as of commit d86ec5e with merge base d1e15b4. Looks good so far; there are no failures yet.

@facebook-github-bot added the CLA Signed label on Jul 2, 2024.
@msaroufim (Member) commented:

Regarding the LR scheduler limitation, perhaps @mlazos knows what might be going on.

@mlazos commented Jul 2, 2024:

@gau-nernst are you compiling the optimizer? If not, the LR does not need to be a tensor. Can you share a profile? I'm not sure that copying a scalar to CUDA memory should have much impact on performance end-to-end, although it's possible that we are launching more kernels than necessary within the scheduler itself; this could possibly be rectified by converting to a scalar before performing the calculations within the scheduler.

# if it is a python float. practically, only lr is changed during training.
# NOTE: if lr is changed at every step, moving lr to CUDA will be a bottleneck.
if not isinstance(group["lr"], Tensor):
    group["lr"] = torch.tensor(group["lr"], device=p.device)
@janeyx99 (Contributor) commented Jul 2, 2024:

If I understand this code correctly, the code right now initializes lr as a Tensor on the device of the current param just once per group, yes? And this fixes the recompile problem when lr changes value?

This assumes that all params in one param group have the same device, yes?

@gau-nernst (Collaborator, Author) replied:

> If I understand this code correctly, the code right now initializes lr as a Tensor on the device of the current param just once per group, yes? And this fixes the recompile problem when lr changes value?

Yes.

> This assumes that all params in one param group have the same device, yes?

Yes. I hope this is a reasonable assumption. In what scenario would this assumption not be valid? e.g. FSDP?

train.py Outdated
@@ -0,0 +1,201 @@
# pip install timm wandb tqdm datasets
Member commented:

let's put this in the benchmarks folder

@gau-nernst (Collaborator, Author) replied:

Moved to benchmarks/benchmark_adam_8bit.py



# this will work with any optim state tensor subclass that implements aten.lerp.Scalar and aten.copy_.default
@torch.compile(fullgraph=True, dynamic=True)
@gau-nernst (Collaborator, Author) commented:

@mlazos I'm compiling the optimizer step for each param here (called vertical fusion?). This is necessary to fuse dequant+adam_step+quant together.
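
For illustration, a hedged sketch of such a per-parameter compiled step (hypothetical names and math layout, not the PR's actual implementation): with the 8-bit optimizer state stored in a tensor subclass, the dequantize, Adam update, and re-quantize all happen inside one compiled region, which is what allows them to be fused.

import torch

# Hypothetical per-parameter Adam step compiled as a single graph ("vertical
# fusion"). When exp_avg / exp_avg_sq are 8-bit tensor subclasses, their
# aten.lerp.Scalar / aten.copy_.default overrides run inside this region, so
# dequant + Adam math + re-quant can be fused by the compiler.
@torch.compile(fullgraph=True, dynamic=True)
def single_param_adam(p, grad, exp_avg, exp_avg_sq, step, lr, beta1, beta2, eps):
    exp_avg.lerp_(grad, 1 - beta1)               # update 1st moment
    exp_avg_sq.lerp_(grad.square(), 1 - beta2)   # update 2nd moment
    bias_corr1 = 1 - beta1 ** step
    bias_corr2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias_corr2).sqrt() + eps
    p.sub_(lr / bias_corr1 * exp_avg / denom)    # in-place parameter update

The optimizer's step() would call a function like this once per parameter, as described above.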

@msaroufim the profile is quite big, 200 MB when I ran it (or I can try compressing it). Perhaps you can run it, get the profile, and share it with @mlazos internally? The following command will profile the first 50 iterations. Let me know if you have problems running the code:

python train.py \
  --model "timm/vit_base_patch16_224.augreg_in21k" \
  --amp bf16 \
  --optim AdamDTQ8bit \
  --profile

train.py Outdated
grad_scaler.scale(loss).backward()

if args.cosine_lr_scheduler:
    lr = lr_schedule.get_lr(step)
@gau-nernst (Collaborator, Author) commented:

@mlazos @msaroufim To add context, this is the way I normally do the LR schedule: compute the LR (as a Python float) and set it directly on the param groups. I don't use the LR scheduler from torch.optim because I prefer my LR schedule to be stateless (rather than stateful like torch.optim.lr_scheduler.LRScheduler, which tracks the step count internally).

The LR schedulers from timm are also stateless; they compute the LR as a Python float and should have the same problem:
https://github.com/huggingface/pytorch-image-models/blob/20fe56bd9072af61d9f5404ce8b08e24ff10a807/timm/scheduler/cosine_lr.py#L81-L109
https://github.com/huggingface/pytorch-image-models/blob/20fe56bd9072af61d9f5404ce8b08e24ff10a807/timm/scheduler/scheduler.py#L77-L98
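
For illustration, a minimal sketch of such a stateless schedule (hypothetical helper, not the PR's exact code): the LR is a pure function of the step, returned as a Python float and written into the param groups each iteration, which is exactly the pattern shown in the next excerpt.

import math

# Hypothetical stateless cosine schedule: the LR depends only on the step index,
# so no internal state or step counter is needed.
def get_lr(step, base_lr, total_steps, warmup_steps=0, min_lr=0.0):
    if step < warmup_steps:
        return base_lr * step / max(warmup_steps, 1)   # linear warmup
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))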

train.py Outdated
if args.cosine_lr_scheduler:
    lr = lr_schedule.get_lr(step)
    for param_group in optim.param_groups:
        param_group["lr"] = lr
@gau-nernst (Collaborator, Author) commented Jul 2, 2024:

I will try changing this to param_group["lr"].copy_(lr) instead. Maybe it helps.

Reply:

Yes, this was going to be my suggestion. Where does the lr get wrapped in a tensor after being retrieved from the scheduler? I don't see it in this code. If you were calling torch.tensor(lr, ...), this would cause an allocation of a scalar on every iteration, which is probably not good. copy_ is a better solution since it populates the existing memory with the current value, which should have minimal impact on performance.
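
A tiny standalone illustration of the two update styles being compared (assuming the LR already lives on the GPU as a 0-dim tensor):

import torch

lr = torch.tensor(1e-3, device="cuda")   # created once, reused every step

# In-place update: reuses the existing allocation; only a small
# host-to-device copy of the new scalar value each iteration.
lr.copy_(8e-4)

# Re-wrapping with torch.tensor(...): allocates a fresh scalar tensor on the
# GPU every iteration, which is the pattern to avoid.
lr = torch.tensor(8e-4, device="cuda")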

@gau-nernst (Collaborator, Author) replied:

Wrapping the LR as a tensor is done on the fly inside the optimizer (you can scroll up to see where Jane commented).
I don't know whether doing the LR schedule like this is common; it's how I normally do it for my projects. We can state this as a known limitation (and the trick to solve it is as you outlined here, once I confirm it helps).

@gau-nernst (Collaborator, Author) commented:

It doesn't seem to help much, though the slowdown is less than I remember:

  • Without LR schedule: 6.99 it/s
  • With LR schedule using tensor.copy_(lr) (update every step): 6.72 it/s
  • With LR schedule using python float (update every step): 6.73 it/s

@mlazos commented Jul 3, 2024:

Hmm, interesting. A 4% end-to-end slowdown does seem like a little much, though not totally unexpected since I do expect some slowdown. Perhaps the profile will shed more light on this. A screenshot of the relevant section is also fine, or perhaps you can narrow the profiled region to the optimizer + LR scheduler and share that to lower the file size. Also try running a single iteration and seeing if the trace is still large.
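
For reference, a minimal sketch of narrowing the profiled region to just the LR update and optimizer step (a standalone torch.profiler example with a toy model, not the benchmark script's --profile code; assumes a CUDA device):

import torch
from torch.profiler import profile, ProfilerActivity

# Toy setup so the snippet is self-contained; substitute the real model/optimizer.
model = torch.nn.Linear(10, 10).cuda()
optim = torch.optim.Adam(model.parameters())

# Profile only a few iterations of backward + LR update + optimizer step,
# which keeps the exported Chrome trace small.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step in range(5):
        loss = model(torch.randn(8, 10, device="cuda")).sum()
        loss.backward()
        for param_group in optim.param_groups:
            param_group["lr"] = 1e-4 * (step + 1)   # stand-in for the LR schedule
        optim.step()
        optim.zero_grad()

prof.export_chrome_trace("optim_only_trace.json")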

@gau-nernst (Collaborator, Author) replied:

trace.json.gz
I gzipped the file so it's more manageable. Let me know if you prefer another format instead (for security reasons).
From my beginner's eyes, most of the optimizer's time is spent on aten::to, aten::copy, and CUDA stream synchronization.

@gau-nernst marked this pull request as ready for review on July 3, 2024, 00:54.
@msaroufim self-requested a review on July 3, 2024, 03:16.
@msaroufim (Member) left a comment:

🚀

@msaroufim merged commit 739952b into pytorch:main on Jul 3, 2024. 13 checks passed.
@gau-nernst deleted the 8bit_adam branch on July 3, 2024, 04:01.
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request on Jul 31, 2024.