
Sync with torchtitan #2

Closed
wants to merge 263 commits into from

This pull request is big! We’re only showing the most recent 250 commits.

Commits on Feb 24, 2024

  1. update readme (pytorch#74)

    mostly testing if new repo works or not
    wanchaol authored Feb 24, 2024
    Commit: 3d1e9ea
  2. move config folder to root and adjust options (pytorch#83)

    as titled, move the config files to the root folder, which decouples them
    from the torchtrain package build and allows easier navigation
    wanchaol authored Feb 24, 2024
    Commit: 98a0f79

Commits on Feb 26, 2024

  1. add iter time tracking via cuda events, add data loading times, add columnar display to show both, show avg iter & data loading times at end of training (pytorch#87)
    
    This PR adds basic perf timing and display for 'per iter' and 'final
    iter average' timings (in part based on Andrew's comment about having
    to open the trace to compare iter timing).
    
    1. tracking list is housed in TrainState, but I do not save it as part
    of the state dict as I view this as useful but not saveable info.
    2. iter times are tracked after dataloading is done each iter and after
    optimizer step. The idea is to make this timing expressly the model
    training iter (not data loading or post iter other metrics calcs).
    
    3. 'time' is now displayed at each iter along with the usual loss and
    lr.
    
    4. at the end of training, assuming more than 3 iters were run, the
    average iter time is calculated by ignoring the first three iters
    (consider these warmup, especially as the CUDA caching allocator gets warmed up)
    and displayed.
    5. based on @tianyu-l's feedback: I have added data loading times as well.
    I used the same timeit.default_timer() from timeit to be consistent.
    (CPU side, so no syncs needed :)
    
    6 - after fiddling with printf width formatting options, added beautiful
    aligned columnar display for the per iter updates:
    Now: 
    <img width="1282" alt="Screenshot 2024-02-26 at 9 39 25 AM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/9ee2ea7b-5c28-4d41-ba91-d4176c64fc66">
    
    before: 
    <img width="1282" alt="Screenshot 2024-02-26 at 8 39 46 AM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/37cbfa20-7f1d-4d94-be94-3505ef4498c0">
    lessw2020 authored Feb 26, 2024
    Commit: 629652b
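    A minimal sketch of the timing approach described above (the helper names are
    illustrative, not the actual torchtrain code): data loading and the model step
    are timed separately with timeit.default_timer(), and the averages skip the
    first few warmup iterations.

    ```
    import time
    import timeit

    WARMUP_ITERS = 3  # skip these when averaging (allocator warmup, etc.)

    def fake_train_step(batch):
        """Stand-in for forward/backward/optimizer.step()."""
        time.sleep(0.01)

    iter_times, data_times = [], []
    data_iter = iter(range(10))  # stand-in for a real dataloader

    for step in range(10):
        t0 = timeit.default_timer()
        batch = next(data_iter)  # data loading time (CPU side, no sync needed)
        data_times.append(timeit.default_timer() - t0)

        t1 = timeit.default_timer()
        fake_train_step(batch)   # the model training iteration being measured
        iter_times.append(timeit.default_timer() - t1)

    if len(iter_times) > WARMUP_ITERS:
        n = len(iter_times) - WARMUP_ITERS
        print(f"Average iter time: {sum(iter_times[WARMUP_ITERS:]) / n:.4f} seconds")
        print(f"Average data load time: {sum(data_times[WARMUP_ITERS:]) / n:.4f} seconds")
    ```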
  2. Fill missing options in toml file with argparse defaults (pytorch#91)

    Summary:
    Follow up on config unification: options not available in the config file
    are picked from command-line defaults.
    
    Test Plan:
    ============================= test session starts ==============================
    platform linux -- Python 3.10.13, pytest-8.0.1, pluggy-1.4.0 -- /home/gnadathur/local/a/pytorch-env/bin/python
    cachedir: .pytest_cache
    rootdir: /data/users/gnadathur/a/torchtrain
    configfile: pyproject.toml
    plugins: cov-4.1.0
    collecting ... collected 3 items

    test/test_job_config.py::TestJobConfig::test_command_line_args PASSED [ 33%]
    test/test_job_config.py::TestJobConfig::test_job_config_file PASSED [ 66%]
    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist PASSED [100%]

    ---------- coverage: platform linux, python 3.10.13-final-0 ----------
    Coverage XML written to file coverage.xml

    ============================= slowest 20 durations =============================
    0.00s call     test/test_job_config.py::TestJobConfig::test_job_config_file
    0.00s call     test/test_job_config.py::TestJobConfig::test_command_line_args
    0.00s call     test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    0.00s setup    test/test_job_config.py::TestJobConfig::test_command_line_args
    0.00s teardown test/test_job_config.py::TestJobConfig::test_command_line_args
    0.00s setup    test/test_job_config.py::TestJobConfig::test_job_config_file
    0.00s setup    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    0.00s teardown test/test_job_config.py::TestJobConfig::test_job_config_file
    0.00s teardown test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    ============================== 3 passed in 0.06s ===============================
    
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Feb 26, 2024
    Commit: c866a64
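    A sketch of the fallback behavior described above, assuming a flat
    "section.option" key scheme (the real JobConfig plumbing differs): any option
    missing from the .toml file is filled in from the argparse default.

    ```
    import argparse
    import tomllib  # Python 3.11+; older versions can use the third-party `toml` package

    parser = argparse.ArgumentParser()
    parser.add_argument("--training.steps", type=int, default=100)
    parser.add_argument("--training.batch_size", type=int, default=8)
    defaults = vars(parser.parse_args([]))  # command-line defaults only

    def load_config(path: str) -> dict:
        with open(path, "rb") as f:
            file_cfg = tomllib.load(f)
        # flatten [section] tables into "section.option" keys (assumption about layout)
        flat = {f"{sec}.{k}": v for sec, kv in file_cfg.items() for k, v in kv.items()}
        # options absent from the file are picked up from the argparse defaults
        return {**defaults, **flat}

    # usage: cfg = load_config("./train_configs/debug_model.toml")
    ```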

Commits on Feb 27, 2024

  1. support infinite loop over alpaca dataset

    ghstack-source-id: 38cbc277e2a177bc0baf35450a661835b97a7f22
    Pull Request resolved: pytorch#92
    tianyu-l committed Feb 27, 2024
    Commit: 78a1643
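    A minimal sketch (not the exact torchtrain implementation) of the infinite-loop
    idea above: when the underlying samples are exhausted, start over instead of
    ending the epoch and killing a long training run.

    ```
    from torch.utils.data import DataLoader, IterableDataset

    class InfiniteDataset(IterableDataset):
        """Yields samples forever by restarting the underlying iterable."""

        def __init__(self, samples, infinite: bool = True):
            self.samples = samples
            self.infinite = infinite

        def __iter__(self):
            while True:
                yield from self.samples
                if not self.infinite:
                    break

    loader = DataLoader(InfiniteDataset(["a", "b", "c"]), batch_size=2)
    ```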
  2. Add color to console output if local logging, auto avoid color logging on slurm (pytorch#93)
    
    This PR adds the ability to do colored console outputs in order to
    highlight the training data outputs.
    It also adds a check to not use this color formatting on slurm, where it
    would print '33=' escape artifacts instead of the color if not avoided.
    
    Note that I've just added some color to highlight the main training
    data. Users that fork/clone can use it to enhance their outputs as
    desired.
    
    <img width="1372" alt="Screenshot 2024-02-26 at 10 20 15 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/44849821-1677-40bf-896c-39344cd661d6">
    
    
    Note that on slurm it remains plain:
    <img width="847" alt="Screenshot 2024-02-26 at 10 46 24 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/172eaa58-4f5c-48f5-8ec1-bc349e3e82f2">
    
    if you don't check this, it would otherwise look like this (this
    does not happen with this PR; just showing what happens without the check, and credit
    to Yifu for noting this would be an issue):
    <img width="847" alt="Screenshot 2024-02-26 at 10 39 23 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/4a87fb9a-dd3a-417c-a29e-286ded069358">
    lessw2020 authored Feb 27, 2024
    Commit: 6d9e4e6
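    A sketch of the color handling described above (the class names and the SLURM
    check are assumptions): use ANSI escape codes for local interactive runs, but
    print plain text under SLURM or when stdout is not a terminal, which is exactly
    the situation that otherwise produces stray escape artifacts in the logs.

    ```
    import os
    import sys

    class Color:
        blue, green, yellow, reset = "\x1b[34m", "\x1b[32m", "\x1b[33m", "\x1b[39m"

    class NoColor:
        blue = green = yellow = reset = ""

    def pick_color():
        on_slurm = "SLURM_JOB_ID" in os.environ
        return NoColor if on_slurm or not sys.stdout.isatty() else Color

    color = pick_color()
    print(f"{color.green}loss: 10.92{color.reset}  {color.yellow}lr: 0.0003{color.reset}")
    ```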
  3. update GPU metrics logging to GiB (gibibytes) (pytorch#95)

    this PR updates the GPU metrics to label them as GiB - we were
    calculating GiB but calling it GB.
    (credit to @awgu for flagging this - issue
    pytorch#94)
    
    function names and member vars in metrics.py have been updated to _gib
    instead of _gb for clarity, and the logging output now labels as GiB:
    <img width="851" alt="Screenshot 2024-02-27 at 11 28 23 AM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/85eb260a-77e9-4c49-be8a-b1aaa10dc3e2">
    lessw2020 authored Feb 27, 2024
    Commit: e987ac3
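    The relabeling above boils down to dividing byte counts by 1024**3 (binary
    gibibytes) and labeling the result GiB rather than GB; a small sketch:

    ```
    GIB = 1024 ** 3  # gibibyte

    def bytes_to_gib(num_bytes: int) -> float:
        return num_bytes / GIB

    # e.g. torch.cuda.get_device_properties(0).total_memory returns bytes;
    # a device with ~102.0e9 bytes of memory reports as ~95.04 GiB, not "102 GB"
    print(f"{bytes_to_gib(102_048_000_000):.2f} GiB")
    ```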
  4. improve TensorBoard instructions in README

    ghstack-source-id: 7dc4a80cf9c32f4dca3d00bcef019d256bdf58f7
    Pull Request resolved: pytorch#96
    tianyu-l committed Feb 27, 2024
    Commit: 62ff09d

Commits on Feb 28, 2024

  1. Enable libUV for torchtrain (pytorch#98)

    Enable libUV for torchtrain.
    
    Test:
    ```
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0,1
    + CONFIG_FILE=./train_configs/debug_model.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
    W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] 
    W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
    W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
    [rank0]:2024-02-28 09:12:04,581 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
    [rank1]:2024-02-28 09:12:04,708 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
    [rank0]:2024-02-28 09:12:05,647 - root - INFO - Building llama
    [rank0]:2024-02-28 09:12:05,655 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-02-28 09:12:05,655 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
    [rank1]:2024-02-28 09:12:07,299 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank1]:2024-02-28 09:12:07,299 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
    [rank0]:2024-02-28 09:12:07,565 - root - INFO - Model fully initialized via reset_params
    [rank0]:2024-02-28 09:12:07,566 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-02-28 09:12:07,566 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-02-28 09:12:07,567 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
    [rank0]:2024-02-28 09:12:08,769 - root - INFO - Applied FSDP to the model...
    [rank0]:2024-02-28 09:12:08,770 - root - INFO - Gradient scaling not enabled.
    [rank0]:2024-02-28 09:12:08,770 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240228-0912.
    [rank0]:2024-02-28 09:12:08,977 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
    [rank0]:2024-02-28 09:12:10,956 - root - INFO - step:  1  loss: 10.9229  iter:  1.9386  data: 0.0368  lr: 0.00026667
    [rank0]:2024-02-28 09:12:11,045 - root - INFO - step:  2  loss: 10.8673  iter:  0.0562  data: 0.0316  lr: 0.00053333
    [rank0]:2024-02-28 09:12:11,130 - root - INFO - step:  3  loss: 10.7145  iter:  0.0523  data: 0.0322  lr: 0.0008
    [rank0]:2024-02-28 09:12:11,219 - root - INFO - step:  4  loss: 10.5038  iter:  0.0559  data: 0.0319  lr: 0.0007
    [rank0]:2024-02-28 09:12:11,304 - root - INFO - step:  5  loss: 10.2228  iter:  0.0537  data: 0.031  lr: 0.0006
    [rank0]:2024-02-28 09:12:11,391 - root - INFO - step:  6  loss:  9.9677  iter:  0.0562  data: 0.0302  lr: 0.0005
    [rank0]:2024-02-28 09:12:11,478 - root - INFO - step:  7  loss:  9.7762  iter:  0.0544  data: 0.0317  lr: 0.0004
    [rank0]:2024-02-28 09:12:11,676 - root - INFO - step:  8  loss:  9.4359  iter:  0.0509  data: 0.0322  lr: 0.0003
    [rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
    [rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
    [rank0]:2024-02-28 09:12:11,813 - root - INFO - step:  9  loss:  9.2326  iter:  0.1007  data: 0.0321  lr: 0.0002
    [rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:320] Completed Stage: Collection
    [rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
    [rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:320] Completed Stage: Collection
    [rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
    [rank0]:2024-02-28 09:12:12,195 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
    [rank0]:2024-02-28 09:12:12,207 - root - INFO - step: 10  loss:  9.1641  iter:  0.0971  data: 0.031  lr: 0.0001
    [rank0]:2024-02-28 09:12:12,207 - root - INFO - Average iter time: 0.0670 seconds
    [rank0]:2024-02-28 09:12:12,207 - root - INFO - Average data load time: 0.0314 seconds
    [rank0]:2024-02-28 09:12:12,208 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
    [rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
    [rank0]:num retries: 0, num ooms: 0
    [rank0]:NCCL version 2.19.3+cuda12.0
    ```
    
    ---------
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Feb 28, 2024
    Commit: 60f6b0d

Commits on Feb 29, 2024

  1. use warmup steps for lr scheduler, ban steps == -1 (pytorch#99)

    as titled, we don't want to allow the steps == -1 case, as it would blow up
    the lr scheduler
    wanchaol authored Feb 29, 2024
    Commit: 7acab70
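    A sketch of a linear-warmup schedule of the kind referenced above (the shape
    after warmup is a placeholder; torchtrain's actual schedule may decay). It also
    shows why steps == -1 has to be banned: the scheduler needs a finite step count.

    ```
    import torch

    def build_lr_scheduler(optimizer, warmup_steps: int, total_steps: int):
        assert total_steps > 0, "steps == -1 (run forever) would blow up the lr scheduler"

        def lr_lambda(step: int) -> float:
            if step < warmup_steps:
                return (step + 1) / warmup_steps  # linear warmup
            return 1.0  # constant afterwards (assumption; the real schedule may decay)

        return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    model = torch.nn.Linear(8, 8)
    opt = torch.optim.AdamW(model.parameters(), lr=8e-4)
    sched = build_lr_scheduler(opt, warmup_steps=2, total_steps=10)
    for _ in range(10):
        opt.step()
        sched.step()
    ```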
  2. Add llama 7B config (pytorch#100)

    Add 7b config and adjust options to be more realistic
    
    didn't add this to the train scripts as the default since it's expensive to
    init; whoever uses it can adjust it accordingly
    wanchaol authored Feb 29, 2024
    Commit: d5c27a9
  3. add selective activation checkpointing

    ghstack-source-id: f7ee3c867bfcdcae5dbb490982920606191e6f40
    Pull Request resolved: pytorch#97
    tianyu-l committed Feb 29, 2024
    Commit: 2c8cec2

Commits on Mar 1, 2024

  1. Add job description field in toml (pytorch#101)

    Summary:
    Adding a description field, useful for integration tests to describe the
    test.
    
    Test Plan:
    ```
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0,1
    + CONFIG_FILE=./train_configs/debug_model.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
    W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] 
    W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
    W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
    [rank1]:2024-02-29 17:05:04,269 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
    [rank0]:2024-02-29 17:05:04,510 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
    [rank0]:2024-02-29 17:05:05,327 - root - INFO - Starting job: debug training
    [rank0]:2024-02-29 17:05:05,327 - root - INFO - Building llama
    [rank0]:2024-02-29 17:05:05,335 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-02-29 17:05:05,335 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
    [rank1]:2024-02-29 17:05:06,782 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank1]:2024-02-29 17:05:06,782 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
    [rank0]:2024-02-29 17:05:07,347 - root - INFO - Model fully initialized via reset_params
    [rank0]:2024-02-29 17:05:07,349 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-02-29 17:05:07,349 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-02-29 17:05:07,349 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
    [rank0]:2024-02-29 17:05:08,375 - root - INFO - Applied FSDP to the model...
    [rank0]:2024-02-29 17:05:08,376 - root - INFO - Gradient scaling not enabled.
    [rank0]:2024-02-29 17:05:08,376 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240229-1705.
    [rank0]:2024-02-29 17:05:08,610 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
    [rank0]:2024-02-29 17:05:10,570 - root - INFO - step:  1  loss: 10.9183  iter:  1.9258  data: 0.0303  lr: 0.00026667
    [rank0]:2024-02-29 17:05:10,653 - root - INFO - step:  2  loss: 10.8347  iter:  0.0487  data: 0.0336  lr: 0.00053333
    [rank0]:2024-02-29 17:05:10,733 - root - INFO - step:  3  loss: 10.6861  iter:   0.045  data: 0.0334  lr: 0.0008
    [rank0]:2024-02-29 17:05:10,812 - root - INFO - step:  4  loss: 10.4672  iter:  0.0453  data: 0.0336  lr: 0.0007
    [rank0]:2024-02-29 17:05:10,893 - root - INFO - step:  5  loss: 10.2154  iter:  0.0466  data: 0.033  lr: 0.0006
    [rank0]:2024-02-29 17:05:10,975 - root - INFO - step:  6  loss:  9.9573  iter:  0.0496  data: 0.0314  lr: 0.0005
    [rank0]:2024-02-29 17:05:11,056 - root - INFO - step:  7  loss:  9.7627  iter:  0.0486  data: 0.0321  lr: 0.0004
    [rank0]:2024-02-29 17:05:11,201 - root - INFO - step:  8  loss:   9.437  iter:  0.0457  data: 0.0333  lr: 0.0003
    [rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
    [rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
    [rank0]:2024-02-29 17:05:11,317 - root - INFO - step:  9  loss:  9.2446  iter:  0.0794  data: 0.0324  lr: 0.0002
    [rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:320] Completed Stage: Collection
    [rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
    [rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:320] Completed Stage: Collection
    [rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
    [rank0]:2024-02-29 17:05:11,748 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
    [rank0]:2024-02-29 17:05:11,762 - root - INFO - step: 10  loss:  9.1772  iter:  0.0893  data: 0.0324  lr: 0.0001
    [rank0]:2024-02-29 17:05:11,763 - root - INFO - Average iter time: 0.0578 seconds
    [rank0]:2024-02-29 17:05:11,763 - root - INFO - Average data load time: 0.0326 seconds
    [rank0]:2024-02-29 17:05:11,763 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
    [rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
    [rank0]:num retries: 0, num ooms: 0
    [rank0]:NCCL version 2.19.3+cuda12.0
    ```
    
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Mar 1, 2024
    Commit: 452baee

Commits on Mar 2, 2024

  1. fix 2D parallel crash caused by all-reduce on 2D world_mesh

    ghstack-source-id: 1c5bf790d7473f6a24124051fcfa1fd2585a56f9
    Pull Request resolved: pytorch#105
    tianyu-l committed Mar 2, 2024
    Commit: eb3fdd0

Commits on Mar 5, 2024

  1. Load missing keys default from argparse (pytorch#111)

    ```
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0,1
    + CONFIG_FILE=./train_configs/debug_model.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
    W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] 
    W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
    W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
    [rank0]:2024-03-04 17:01:28,834 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
    [rank1]:2024-03-04 17:01:28,857 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
    [rank0]:2024-03-04 17:01:29,712 - root - INFO - Starting job: debug training
    [rank0]:2024-03-04 17:01:29,712 - root - INFO - Building llama
    [rank0]:2024-03-04 17:01:29,719 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-03-04 17:01:29,719 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
    [rank1]:2024-03-04 17:01:31,187 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank1]:2024-03-04 17:01:31,188 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
    [rank0]:2024-03-04 17:01:31,346 - root - INFO - Model fully initialized via reset_params
    [rank0]:2024-03-04 17:01:31,346 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-03-04 17:01:31,347 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-03-04 17:01:31,347 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
    [rank0]:2024-03-04 17:01:32,502 - root - INFO - Applied FSDP to the model...
    [rank0]:2024-03-04 17:01:32,503 - root - INFO - Gradient scaling not enabled.
    [rank0]:2024-03-04 17:01:32,504 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240304-1701.
    [rank0]:2024-03-04 17:01:32,901 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
    [rank0]:2024-03-04 17:01:34,806 - root - INFO - step:  1  loss: 10.8424  iter:  1.8688  data: 0.0316  lr: 0.00026667
    [rank0]:2024-03-04 17:01:34,891 - root - INFO - step:  2  loss: 10.7581  iter:  0.0476  data: 0.0357  lr: 0.00053333
    [rank0]:2024-03-04 17:01:34,970 - root - INFO - step:  3  loss: 10.6239  iter:   0.045  data: 0.0333  lr: 0.0008
    [rank0]:2024-03-04 17:01:35,048 - root - INFO - step:  4  loss: 10.4163  iter:  0.0455  data: 0.0323  lr: 0.0007
    [rank0]:2024-03-04 17:01:35,127 - root - INFO - step:  5  loss: 10.1529  iter:  0.0459  data: 0.032  lr: 0.0006
    [rank0]:2024-03-04 17:01:35,206 - root - INFO - step:  6  loss:  9.8899  iter:  0.0468  data: 0.0311  lr: 0.0005
    [rank0]:2024-03-04 17:01:35,284 - root - INFO - step:  7  loss:  9.7204  iter:  0.0461  data: 0.0312  lr: 0.0004
    [rank0]:2024-03-04 17:01:35,425 - root - INFO - step:  8  loss:  9.3757  iter:  0.0457  data: 0.0319  lr: 0.0003
    [rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
    [rank0]:2024-03-04 17:01:35,537 - root - INFO - step:  9  loss:  9.1883  iter:  0.0762  data: 0.0318  lr: 0.0002
    [rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
    [rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:320] Completed Stage: Collection
    [rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
    [rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:320] Completed Stage: Collection
    [rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
    [rank0]:2024-03-04 17:01:35,958 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
    [rank0]:2024-03-04 17:01:35,971 - root - INFO - step: 10  loss:  9.1212  iter:  0.0808  data: 0.0319  lr: 0.0001
    [rank0]:2024-03-04 17:01:35,972 - root - INFO - Average iter time: 0.0553 seconds
    [rank0]:2024-03-04 17:01:35,972 - root - INFO - Average data load time: 0.0317 seconds
    [rank0]:2024-03-04 17:01:35,972 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
    [rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
    [rank0]:num retries: 0, num ooms: 0
    [rank0]:NCCL version 2.19.3+cuda12.0
    ```
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Mar 5, 2024
    Commit: 2682144
  2. Add meta_init, enable it as default init process (pytorch#84)

    This PR enables meta_init functionality to avoid OOM'ing on cpu for
    larger models.
    The core functionality is in meta_init.py, and a few changes in
    parallelization and train.py.
    Key items:
    1 - this is largely the same as the earlier PR I had for meta_init, but
    I did a new one b/c faster than reworking it with all the interim
    changes.
    2 - to address feedback in previous PR:
    a - why do we need meta_init.py, can't we just do:
    ~~~
    with torch.device("meta"):
        model = Model.from_args(...)
    ~~~
    Unfortunately this does not work b/c the rope embeddings are treated
    differently (buffer) and thus the simple lambda call from param_init_fn
    in FSDP (lambda module: module.to_device('cuda') ) will not invoke or
    move the rope embeddings and the model will fail on first forward.
    This issue relates to the nn.embeddings not being moved, and that the
    device is referenced in the forward pass for the current rope class.
    I have opened pytorch#110 to track
    this and investigate, while not holding up the working meta init from
    landing.
    
    b - per earlier feedback - meta init is now 'not optional' but simply
    the default. This should ensure all models leverage it and ensure we
    aren't missing things for future meta_init aspects.
    
    3 - misc change - I switched the model_params to just do the normal all
    params count instead of 'unique params' b/c it does not mesh with what
    people perceive model size as.
    
    Testing:
    tested both debugmodel and 26B model with and without meta init to
    confirm same loss curves.
    Note for future reference - if you get a bad init (meta init failure)
    you will simply not train (loss is same every iter).
    If you fail to call reset params after FSDP, then you will train (b/c we
    default to torch.randn_like) but your starting loss will be 5x+ higher
    (telling you that you have not properly init'ed the model).
    lessw2020 authored Mar 5, 2024
    Commit: afbf62a
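    A generic sketch of the meta-device pattern discussed above (this is not
    torchtrain's meta_init.py; the module here is a stand-in): parameters are
    created on the meta device so nothing is allocated, then materialized on the
    target device and explicitly re-initialized. Buffers such as the rope
    embeddings need the same treatment, which is the subtlety the PR calls out.

    ```
    import torch
    import torch.nn as nn

    def build_model_meta_then_materialize(device: str = "cpu") -> nn.Module:
        # construct on the meta device: shapes only, no real memory allocated
        with torch.device("meta"):
            model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024))
        # allocate real (uninitialized) storage on the target device
        model = model.to_empty(device=device)
        # parameters/buffers now hold garbage and must be re-initialized
        for module in model.modules():
            if hasattr(module, "reset_parameters"):
                module.reset_parameters()
        return model

    model = build_model_meta_then_materialize()
    ```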
  3. Fix feedback from PR 111 (pytorch#113)

    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Mar 5, 2024
    Commit: f91f97a

Commits on Mar 6, 2024

  1. fix SP minor issues

    ghstack-source-id: 5133a8d97ad209b569e0fc528e58daafdd31d80d
    Pull Request resolved: pytorch#114
    tianyu-l committed Mar 6, 2024
    Commit: 1a180ee
  2. enable loss parallel in SP

    ghstack-source-id: a0c8b4454f75ad1cd9824ac89a1df0182f6a7d8c
    Pull Request resolved: pytorch#112
    tianyu-l committed Mar 6, 2024
    Commit: ed04380
  3. Commit: 41f5172

Commits on Mar 7, 2024

  1. add miniPile dataset for pretraining, 1M entries (solves the 'out of data' at 40 iters issue) (pytorch#88)
    
    This PR adds the minipile (1M, 6GB) dataset as an option for pretraining
    with torchtrain.
    It resolves the issue where we run out of data after 40 iterations with
    the default alpaca dataset.
    Per @tianyu-l's excellent suggestion, have refactored to have a single
    hf_datasets.py file that supports both minipile and alpaca, since it
    turned out there is no need for a different tokenizer, etc.
    Also cleaned up the datasets package so that create_tokenizer is exposed
    directly, and thus all public apis can be used directly from
    torchtrain.datasets.
    Lastly - added warning if/when a dataset is being re-looped so users
    don't get burned by overfitting:
    <img width="1294" alt="Screenshot 2024-03-06 at 5 11 09 AM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/82480b6f-c677-4794-80c5-5c10b037732a">
    
    
    Adds a color highlight to showcase what dataloader was built:
    <img width="1360" alt="Screenshot 2024-03-05 at 9 19 10 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/4717ec6a-14bb-4283-a3ae-fa40c27deee0">
    and
    <img width="1360" alt="Screenshot 2024-03-05 at 9 22 01 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/dbf32d51-2dd4-4526-8855-9b33b627559e">
    
    
    Usage:
    just add "minipile" or "alpaca" as the dataset in the training config
    toml file.
    <img width="439" alt="Screenshot 2024-02-25 at 12 35 26 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/1afbaed1-07f8-4e37-b8cc-80190db7fb27">
    
    Testing:
    verified training loss is improving and ran for 100 iters to verify there is
    no longer an out-of-data issue with minipile.
    reran with alpaca and saw the expected out-of-data at 40 iters without the
    infinite loop option; it runs to 100 with infinite.
    
    Notes:
    I did not make this a default dataset since for debugmodel, mostly
    running 10 iters is fine and there's 6GB to pull down.
    <img width="869" alt="Screenshot 2024-02-25 at 12 30 29 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/1070a80a-ad20-4f0f-a860-e13caa3120a0">
    lessw2020 authored Mar 7, 2024
    Commit: 680f1aa
  2. add data loading option to load from local file system

    ghstack-source-id: 3c930054d3b04faf3866048740a2ef887d066dd6
    Pull Request resolved: pytorch#117
    tianyu-l committed Mar 7, 2024
    Commit: 85263f7

Commits on Mar 9, 2024

  1. add llama 13B configs

    ghstack-source-id: 733bf85716cda3a5b9af780eba79c9b5dd66abad
    Pull Request resolved: pytorch#121
    wanchaol committed Mar 9, 2024
    Commit: 3c51744
  2. add llama 70B toml

    ghstack-source-id: d7cd26d84aa2442ac45223992e1766397e52c8d8
    Pull Request resolved: pytorch#122
    wanchaol committed Mar 9, 2024
    Commit: 649cf0b
  3. set betas and weight decay for optimizers

    according to suggestions in pytorch#118 (comment)
    
    ghstack-source-id: 357f0872cd1c9bad2c4c256d47adbd3f716a7651
    Pull Request resolved: pytorch#123
    wanchaol committed Mar 9, 2024
    Commit: ab05f66
  4. Add c4 dataset (177M, streaming), update multi-node support for latest job configs (pytorch#124)
    
    This PR:
    1 - adds the english language portion of c4 dataset, which has 177M
    entries. (https://huggingface.co/datasets/allenai/c4)
    
    Due to the size, streaming is enabled as the default.  
    This is the allen-ai/c4, as apparently the original c4 is being
    deprecated and HF advises to use allen-ai now.
    
    For comparison per @tianyu-l request - 40 iterations avg time:
    alpaca cached loading: Average data load time: 0.0279 seconds
    c4 streaming loading: Average data load time: 0.0290 seconds
    
    There is a longer initial delay during the 'preparing c4' vs alpaca
    (i.e. 45 seconds vs 10 seconds), but after that speed is similar.
    
    Dataset sample (not displayed in training, just an excerpt I pulled to
    double check the data flow):
    <img width="1233" alt="Screenshot 2024-03-08 at 5 31 06 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/94915f80-da70-48d1-8c43-43f874fef121">
    
    2 - I also updated the multi-node slurm file to account for the new job
    config.
    
    Test:
    verified no looping with 100 iterations, 
    sampled data streamed to verify.
    lessw2020 authored Mar 9, 2024
    Commit: 66c196b
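    A sketch of streaming the allenai/c4 English config as described above
    (requires the Hugging Face `datasets` package and network access):

    ```
    from datasets import load_dataset

    # streaming=True avoids downloading the full 177M-entry dataset up front
    ds = load_dataset("allenai/c4", name="en", split="train", streaming=True)

    for i, sample in enumerate(ds):
        print(sample["text"][:80])
        if i == 2:
            break
    ```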

Commits on Mar 12, 2024

  1. Add openwebtext dataset for larger scale training without shuffling (pytorch#130)
    
    This PR adds the openwebtext 1M dataset.
    This is a homogeneous dataset, so we are able to train successfully while
    not having any shuffling in our dataset loader.

    1 - adds the dataset to hf_datasets
    2 - makes openwebtext the default dataset for 13b and 70b, since that
    is the preferred choice for larger scale training.
    
    Testing - ran 5K iters (9 nodes) to verify no spiking issues:
    
    <img width="787" alt="Screenshot 2024-03-12 at 9 50 57 AM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/420fa1fc-50f8-47bc-9b07-02c8fa132e7c">
    lessw2020 authored Mar 12, 2024
    Commit: 10229d6
  2. [TorchTrain][Checkpoint] Fix TrainState state_dict to unblock loading (pytorch#131)
    
    This fix would temporarily unblock loading. So we won't run into the
    issue of:
    
    ```
    [rank0]:[rank0]:     train_state.losses.append(train_state.current_loss)
    [rank0]:[rank0]: AttributeError: 'float' object has no attribute 'append'
    ```
    
    However, current_loss and losses are still not correct, since with the current
    setup, losses and current_loss would be different across different
    ranks. Also, we don't know the size of losses because it is based on
    the # of steps. So loading still works, but the values of current_loss and
    losses are not being loaded correctly.
    
    I will follow up with further fixes.
    wz337 authored Mar 12, 2024
    Commit: 7fee3cf

Commits on Mar 13, 2024

  1. improve logging

    ghstack-source-id: de61ec093b43a2ccbf1156c76ba81ecd698a6a8a
    Pull Request resolved: pytorch#132
    tianyu-l committed Mar 13, 2024
    Commit: 7cd2725
  2. use SequenceParallel style in tp/sp (pytorch#133)

    simplify things given we already have SequenceParallel style landed in
    main
    wanchaol authored Mar 13, 2024
    Commit: 3161ffb

Commits on Mar 14, 2024

  1. support TP-only parallelism

    ghstack-source-id: c13ebb8de8e8e9203624b5dd710a046d17311b0f
    Pull Request resolved: pytorch#137
    tianyu-l committed Mar 14, 2024
    Commit: e39ee7e
  2. disable verbose print from profiling

    ghstack-source-id: ca6eb8f42bf3c2a59d8e6389e7fe94ed85103099
    Pull Request resolved: pytorch#136
    tianyu-l committed Mar 14, 2024
    Commit: 5d18bf0
  3. add Selective layer activation checkpointing, single control for turning AC on or off. (pytorch#125)
    
    This PR:
    1 - adds selective layer checkpointing - this lets the user select every
    x layer to checkpoint:
    i.e. 2 = every other layer is checkpointed.
    
    spec for config was updated by Wanchao - so we now have this layout for
    AC, which is hopefully self-explanatory (covers None, full, Selective Op
    or Selective Layer, and the layer filtering policy):
    <img width="941" alt="Screenshot 2024-03-13 at 6 09 52 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/4b992286-1fbd-4a14-957a-4325f81a9ab4">
    
    
    Thus, it lets the user toggle between the traditional 'all layers' and more and
    more fine-grained checkpointing.
    Note that I implemented this for IBM last summer and in their llama
    testing, every 2nd layer was the best bang/buck so I have made that the
    default.
    
    2 - the config file has been updated to make a new
    [activation_checkpointing] section and make it easier to modify vs being
    dumped into the training section.
    
    Testing and results:
    I tested all the AC options to ensure all options are working, and that
    we fail if both types are set to true in config:
    <img width="608" alt="Screenshot 2024-03-09 at 3 43 52 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/e3c20fbf-73e2-492d-9fb9-f32e772e239e">
    lessw2020 authored Mar 14, 2024
    Commit: 0d415d7
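    A sketch of the every-x-layer policy described above, using PyTorch's
    checkpoint_wrapper utility (the torchtrain config plumbing and defaults
    differ): with ac_freq=2, every other block gets activation checkpointing.

    ```
    import torch.nn as nn
    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        checkpoint_wrapper,
    )

    def apply_selective_layer_ac(layers: nn.ModuleList, ac_freq: int = 2) -> nn.ModuleList:
        wrapped = nn.ModuleList()
        for idx, layer in enumerate(layers):
            if ac_freq > 0 and idx % ac_freq == 0:
                # recompute this block's activations during backward
                layer = checkpoint_wrapper(layer)
            wrapped.append(layer)
        return wrapped

    blocks = nn.ModuleList(nn.Linear(64, 64) for _ in range(4))
    blocks = apply_selective_layer_ac(blocks, ac_freq=2)  # wraps blocks 0 and 2
    ```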
  4. remove per iter syncronize

    ghstack-source-id: 581c9115e89d3de57e558175b527c12c06a6808c
    Pull Request resolved: pytorch#134
    tianyu-l committed Mar 14, 2024
    Commit: cc2061a

Commits on Mar 15, 2024

  1. Shorten nccl comm timeout and enable flight recorder dumping (pytorch#103)
    
    Timeout
    -------
    
    It's convenient whether during iterative debugging or long running
    training to find out asap about a failure. The default timeout is way
    too long and leads to wasted cluster time or developer frustration.
      
    Timeout can be adjusted via cmdline or in .toml if it needs to be larger
    for a particular model.
    
    Another useful pattern can be to set a large timeout for initialization
    and then tighten it after iteration 1. We can add this later if desired.
    
    Ideally we could pass the timeout to the device mesh ctor, but it's not
    ready yet. Also, we can change timeouts of the existing PGs after
    creating them, but that's more LOC and not necessary unless we want to
    change the timeouts at runtime.
    
    Dumps
    -----
    
    Dumping on timeout should be a safe default for everyone. It has the
    side-effect of requiring a dump path which defaults to ~/pgnccl_dump but
    can be overridden via DUMP_PATH env.
    
    The raw content of the dump is a pickle that is intended to be consumed
    through scripts/tools which are under development, so it may not be easy
    to know how to use these for now. As the tooling matures, we should
    provide reference docs and probably print out pointers in the logs when
    we perform the dump.
    
    
    Test plan:
    tested locally by adding a rank0 sleep for 10sec inside the training
    loop, validating all 8 ranks dumped a trace.
    wconstab authored Mar 15, 2024
    Commit: 3b3362b
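    A sketch of the two knobs described above. Passing a timedelta to
    init_process_group is standard torch.distributed API; the TORCH_NCCL_* env vars
    shown for dump-on-timeout are an assumption about recent PyTorch flight-recorder
    knobs, and the PR itself wires the dump path and timeout through DUMP_PATH and
    the job config instead.

    ```
    import os
    from datetime import timedelta

    import torch.distributed as dist

    # flight-recorder knobs (treat the exact names as an assumption here)
    os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000")  # record recent collectives
    os.environ.setdefault("TORCH_NCCL_DUMP_ON_TIMEOUT", "1")       # dump them on watchdog timeout

    def init_distributed(timeout_seconds: int = 300) -> None:
        # much shorter than the default NCCL timeout, so hangs surface quickly
        dist.init_process_group(backend="nccl", timeout=timedelta(seconds=timeout_seconds))
    ```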
  2. fix up gpu memory monitoring and logging

    ghstack-source-id: 2f79d081c7724dbc34f357913671e8aefdf437b1
    Pull Request resolved: pytorch#147
    tianyu-l committed Mar 15, 2024
    Commit: 9f5a56d
  3. Separate timeout during init and training (pytorch#149)

    Allow a tighter timeout during training than during init.
    
    Init includes the first train step, as well as any loading and setup. It
    can be slower and less predictable due to various factors including lazy
    initialization or jit compilation.
    
    After the first train step, we expect more predictable runtime and
    benefit from a tighter timeout to give quick feedback on a hang.
    
    Tested by pasting this code in 2 places
    ```
    if dp_mesh.get_local_rank() == 0 and train_state.step == 1:
       import time
       time.sleep(10)
    ```
    
    (a) before calling set_pg_timeout, which did not cause a timeout (b)
    after calling set_pg_timeout, which timed out
    wconstab authored Mar 15, 2024
    Commit: 9eb6a21

Commits on Mar 20, 2024

  1. Commit: 6485be9
  2. Refactor to clean up parallelisms/__init__.py

    (second attempt, didn't land correctly)
    
    ghstack-source-id: 3dfec3ed134105cc5a951f8db160c8c2a9b3349b
    Pull Request resolved: pytorch#154
    wconstab committed Mar 20, 2024
    Commit: fd4c75b
  3. enable gc control scheduling to help avoid stragglers (pytorch#148)

    This PR adds control over Python garbage collection to help avoid
    stragglers during large scale training.
    updates - this feature is now exposed as a controllable option
    gc_schedule, with a default of 50.
    0 = not enabled.
    int = schedules gc at every int iters during training loop. 
    <img width="1078" alt="Screenshot 2024-03-15 at 12 39 26 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/1ee387c5-f0a6-4366-936c-a1e281dad88f">
    
    Effectively we disable the gc, run one collection to ensure a good
    starting point, and then at the start of each gc_schedule iter, we call
    gc to free up things.
    
    By enforcing a fixed schedule for collection, it helps all ranks stay
    more in synch.
    Point of reference - on 512 GPU FSDP, adding this (gc_schedule=1) gave a
    perf boost of ~1.5% per iter just by virtue of better synch.
    
    (this was originally developed during dist compiler to resolve
    stragglers, I believe @fegin came up with this solution).
    lessw2020 authored Mar 20, 2024
    Commit: 93c2b7d
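    A sketch of the gc_schedule mechanism described above (class and option names
    are illustrative): automatic collection is disabled, one collection runs up
    front, and collection then happens on a fixed step schedule so every rank pays
    the GC cost at the same time.

    ```
    import gc

    class GCScheduler:
        def __init__(self, gc_freq: int = 50):
            self.gc_freq = gc_freq  # 0 = feature disabled
            if gc_freq > 0:
                gc.disable()     # no more unpredictable automatic collections
                gc.collect(1)    # one collection up front for a clean starting point

        def run(self, step: int) -> None:
            if self.gc_freq > 0 and step % self.gc_freq == 0:
                gc.collect(1)

    scheduler = GCScheduler(gc_freq=50)
    for step in range(1, 201):
        scheduler.run(step)
        # ... training iteration would go here ...
    ```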
  4. Commit: 9e7920f
  5. add MFU to metrics

    ghstack-source-id: 995efd6f460f3fe83ecf8d72c2178554f325485b
    Pull Request resolved: pytorch#151
    tianyu-l committed Mar 20, 2024
    Commit: e5d1b89
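    A back-of-the-envelope sketch of how an MFU metric can be computed, assuming
    the common ~6 FLOPs per parameter per token estimate for a forward+backward
    pass; torchtrain's exact formula also counts attention FLOPs, so its numbers
    come out slightly higher.

    ```
    def estimate_mfu(num_params: int, tokens_per_second: float, peak_flops: float) -> float:
        achieved_flops = 6 * num_params * tokens_per_second  # fwd + bwd approximation
        return achieved_flops / peak_flops

    # the 18,089,216-parameter debug model at ~20,066 wps against an H100's
    # ~989e12 peak bf16 FLOPS gives ~0.22%, the same ballpark as the 0.25%
    # reported in the integration-test logs later in this PR
    print(f"mfu: {100 * estimate_mfu(18_089_216, 20_066, 989e12):.2f}%")
    ```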

Commits on Mar 21, 2024

  1. disable buffer reuse for compile for now (pytorch#156)

    disable buffer reuse for compile to have close numerics to eager mode,
    as suggested by @Chillee
    
    This is probably only a temporary change until the buffer reuse fix lands in inductor
    wanchaol authored Mar 21, 2024
    Commit: ceebd53

Commits on Mar 22, 2024

  1. refactor config manager and support cmd overrides (pytorch#157)

    This PR supports explicit cmd overrides, to allow infra layers to
    override certain options (the most important one is dump_folder)
    wanchaol authored Mar 22, 2024
    Commit: 32aa083

Commits on Mar 24, 2024

  1. Commit: a21645e

Commits on Mar 25, 2024

  1. rename sequence_parallel to tensor_parallel (pytorch#162)

    This PR renames sequence_parallel to tensor_parallel. As sequence
    parallel is only applied to rmsnorm layers, a broader name should be
    tensor_parallel, maybe with sequence_parallel enabled.
    
    ghstack broken :( so using direct branch push instead
    wanchaol authored Mar 25, 2024
    Commit: e28832e

Commits on Mar 27, 2024

  1. add basic AC configs for 13B and 70B (pytorch#169)

    as titled, currently 13B uses selective op and 70B uses selective layer;
    we can do some more experiments and adjust the configs later
    wanchaol authored Mar 27, 2024
    Commit: 6722657
  2. [TorchTrain][Checkpoint] Update train state to include global_avg_losses and global_max_losses (pytorch#167)
    
    Based on discussion with @tianyu-l, we decided to only checkpoint
    `global_avg_losses` and `global_max_losses` per log frequency iteration
    to avoid all_reduce and device sync every iteration.
    `TrainState.current_loss` and `TrainState.losses` are removed from
    TrainState `state_dict()` and `load_state_dict()` call.
    
    
    Tested by saving/loading at 30 steps with log_frequency = 10, then
    loading at 40 steps to resume training. The numerics in the
    global_avg_losses and global_max_losses lists align with
    what is expected.
    
    ```
    Step 30 save:
    [rank0]:before save: 
    self.states['train_state']=TrainState(step=30, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555], log_steps=[1, 11, 21])
    
    
    Step 30 load:
    [rank0]:after load:
    self.states['train_state']=TrainState(step=30, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555], log_steps=[1, 11, 21])
    
    
    Step 40 load and resume training:
    [rank0]:before save: 
    self.states['train_state']=TrainState(step=40, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945, 5.596909999847412], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555, 5.6796345710754395], log_steps=[1, 11, 21, 31])
    ```
    wz337 authored Mar 27, 2024
    Commit: c49cc9e
  3. Basic integration test infra (pytorch#170)

    Summary:
    This PR adds an option `use_for_integration_test`. When set to `True`, this
    adds the config to the integration test suite. A test runner picks up all
    the configs marked for integration test and runs them.
    
    Test Plan:
    ```
    =====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh=====
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0
    + CONFIG_FILE=./train_configs/debug_model.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
    W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757]
    W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
    W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
    [rank0]:2024-03-27 09:46:32,214 - root - INFO - Starting job: LLaMA debug training
    [rank0]:2024-03-27 09:46:32,372 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank0]:2024-03-27 09:46:32,375 - root - INFO - Building 1-D device mesh with ['dp'], [4]
    [rank0]:2024-03-27 09:46:32,377 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-03-27 09:46:32,384 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank0]:2024-03-27 09:46:32,384 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank0]:2024-03-27 09:46:34,015 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-03-27 09:46:34,024 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-03-27 09:46:34,025 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
    [rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied selective activation checkpointing to the model
    [rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied FSDP to the model
    [rank0]:2024-03-27 09:46:34,171 - root - INFO - Model fully initialized via reset_parameters
    [rank0]:2024-03-27 09:46:34,171 - root - INFO - Gradient scaling not enabled
    [rank0]:2024-03-27 09:46:34,171 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-0946
    [rank0]:2024-03-27 09:46:34,809 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
    [rank0]:  warnings.warn(
    [rank0]:2024-03-27 09:46:35,627 - root - INFO - step:  1  loss: 10.9486  memory:  9.42GiB(9.91%)  wps: 20,066  mfu: 0.25%
    [rank0]:2024-03-27 09:46:35,627 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
    [rank0]:2024-03-27 09:46:35,705 - root - INFO - step:  2  loss: 10.8786  memory: 11.38GiB(11.97%)  wps: 212,046  mfu: 2.60%
    [rank0]:2024-03-27 09:46:35,786 - root - INFO - step:  3  loss: 10.7362  memory: 11.38GiB(11.97%)  wps: 204,441  mfu: 2.50%
    [rank0]:2024-03-27 09:46:35,863 - root - INFO - step:  4  loss: 10.5094  memory: 11.38GiB(11.97%)  wps: 216,800  mfu: 2.66%
    [rank0]:2024-03-27 09:46:35,939 - root - INFO - step:  5  loss: 10.2755  memory: 11.38GiB(11.97%)  wps: 216,527  mfu: 2.65%
    [rank0]:2024-03-27 09:46:36,016 - root - INFO - step:  6  loss: 10.0318  memory: 11.38GiB(11.97%)  wps: 214,117  mfu: 2.62%
    [rank0]:2024-03-27 09:46:36,093 - root - INFO - step:  7  loss:  9.7929  memory: 11.38GiB(11.97%)  wps: 216,509  mfu: 2.65%
    [rank0]:2024-03-27 09:46:36,192 - root - INFO - step:  8  loss:  9.5539  memory: 11.38GiB(11.97%)  wps: 166,639  mfu: 2.04%
    [rank0]:2024-03-27 09:46:36,329 - root - INFO - step:  9  loss:  9.3909  memory: 11.38GiB(11.97%)  wps: 120,381  mfu: 1.47%
    [rank0]:[rank0]:[W327 09:46:36.744143018 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:2024-03-27 09:46:36,409 - root - INFO - �[36mstep: 10  �[32mloss:  9.2749  �[33mmemory: 11.38GiB(11.97%)  �[34mwps: 207,613  �[35mmfu: 2.54%�[39m
    [rank0]:NCCL version 2.20.5+cuda12.0
    
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    
    ---------
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Mar 27, 2024
    2b017fd
  4. Add 2D integration test (FSDP + TP) (pytorch#171)

    Summary:
    Add a 2D test to integration test suite
    
    Test Plan:
    
    ```
    
    =====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh=====
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0
    + CONFIG_FILE=./train_configs/debug_model.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
    W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757]
    W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] *****************************************
    W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] *****************************************
    [rank0]:2024-03-27 14:29:49,466 - root - INFO - Starting job: LLaMA debug training
    [rank0]:2024-03-27 14:29:49,615 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank0]:2024-03-27 14:29:49,621 - root - INFO - Building 1-D device mesh with ['dp'], [4]
    [rank0]:2024-03-27 14:29:49,623 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-03-27 14:29:49,630 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank0]:2024-03-27 14:29:49,630 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank0]:2024-03-27 14:29:51,114 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-03-27 14:29:51,124 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-03-27 14:29:51,124 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
    [rank0]:2024-03-27 14:29:51,259 - root - INFO - Applied selective activation checkpointing to the model
    [rank0]:2024-03-27 14:29:51,259 - root - INFO - Applied FSDP to the model
    [rank0]:2024-03-27 14:29:51,284 - root - INFO - Model fully initialized via reset_parameters
    [rank0]:2024-03-27 14:29:51,284 - root - INFO - Gradient scaling not enabled
    [rank0]:2024-03-27 14:29:51,285 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-1429
    [rank0]:2024-03-27 14:29:52,056 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
    [rank0]:  warnings.warn(
    [rank0]:2024-03-27 14:29:52,825 - root - INFO - step:  1  loss: 10.7425  memory:  9.42GiB(9.91%)  wps: 21,337  mfu: 0.26%
    [rank0]:2024-03-27 14:29:52,825 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
    [rank0]:2024-03-27 14:29:52,905 - root - INFO - step:  2  loss: 10.6722  memory: 11.38GiB(11.97%)  wps: 208,060  mfu: 2.55%
    [rank0]:2024-03-27 14:29:52,982 - root - INFO - step:  3  loss: 10.5435  memory: 11.38GiB(11.97%)  wps: 213,622  mfu: 2.62%
    [rank0]:2024-03-27 14:29:53,060 - root - INFO - step:  4  loss: 10.3359  memory: 11.38GiB(11.97%)  wps: 212,856  mfu: 2.61%
    [rank0]:2024-03-27 14:29:53,139 - root - INFO - step:  5  loss: 10.0965  memory: 11.38GiB(11.97%)  wps: 209,326  mfu: 2.56%
    [rank0]:2024-03-27 14:29:53,215 - root - INFO - step:  6  loss:  9.8806  memory: 11.38GiB(11.97%)  wps: 216,808  mfu: 2.66%
    [rank0]:2024-03-27 14:29:53,292 - root - INFO - step:  7  loss:  9.6442  memory: 11.38GiB(11.97%)  wps: 214,874  mfu: 2.63%
    [rank0]:2024-03-27 14:29:53,367 - root - INFO - step:  8  loss:  9.4349  memory: 11.38GiB(11.97%)  wps: 220,877  mfu: 2.70%
    [rank0]:2024-03-27 14:29:53,500 - root - INFO - step:  9  loss:  9.2674  memory: 11.38GiB(11.97%)  wps: 123,924  mfu: 1.52%
    [rank0]:[rank0]:[W327 14:29:53.248291822 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:2024-03-27 14:29:53,577 - root - INFO - step: 10  loss:  9.1404  memory: 11.38GiB(11.97%)  wps: 214,910  mfu: 2.63%
    [rank0]:NCCL version 2.20.5+cuda12.0
    
    =====Integration test: CONFIG_FILE=./train_configs/debug_model_2d.toml NGPU=4 ./run_llama_train.sh=====
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0
    + CONFIG_FILE=./train_configs/debug_model_2d.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_2d.toml
    W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757]
    W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] *****************************************
    W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] *****************************************
    [rank0]:2024-03-27 14:30:00,872 - root - INFO - Starting job: LLaMA debug training
    [rank0]:2024-03-27 14:30:01,177 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank0]:2024-03-27 14:30:01,182 - root - INFO - Building 2-D device mesh with ['dp', 'tp'], [2, 2]
    [rank0]:2024-03-27 14:30:01,185 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-03-27 14:30:01,194 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank0]:2024-03-27 14:30:01,195 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank0]:2024-03-27 14:30:02,807 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-03-27 14:30:02,818 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-03-27 14:30:02,819 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
    [rank0]:2024-03-27 14:30:02,830 - root - INFO - Applied Sequence Parallelism to the model
    [rank0]:2024-03-27 14:30:02,975 - root - INFO - Applied selective activation checkpointing to the model
    [rank0]:2024-03-27 14:30:02,975 - root - INFO - Applied FSDP to the model
    [rank0]:2024-03-27 14:30:03,004 - root - INFO - Model fully initialized via reset_parameters
    [rank0]:2024-03-27 14:30:03,004 - root - INFO - Gradient scaling not enabled
    [rank0]:2024-03-27 14:30:03,005 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-1430
    [rank0]:2024-03-27 14:30:03,642 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
    [rank0]:  warnings.warn(
    [rank0]:2024-03-27 14:30:04,528 - root - INFO - step:  1  loss: 10.8502  memory:  5.71GiB(6.01%)  wps: 9,259  mfu: 0.11%
    [rank0]:2024-03-27 14:30:04,528 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
    [rank0]:2024-03-27 14:30:04,679 - root - INFO - step:  2  loss: 10.7671  memory:  6.69GiB(7.04%)  wps: 54,430  mfu: 0.67%
    [rank0]:2024-03-27 14:30:04,773 - root - INFO - step:  3  loss: 10.6390  memory:  6.69GiB(7.04%)  wps: 88,457  mfu: 1.08%
    [rank0]:2024-03-27 14:30:04,864 - root - INFO - step:  4  loss: 10.4210  memory:  6.69GiB(7.04%)  wps: 90,384  mfu: 1.11%
    [rank0]:2024-03-27 14:30:04,954 - root - INFO - step:  5  loss: 10.1648  memory:  6.69GiB(7.04%)  wps: 93,058  mfu: 1.14%
    [rank0]:2024-03-27 14:30:05,067 - root - INFO - step:  6  loss:  9.9451  memory:  6.69GiB(7.04%)  wps: 72,642  mfu: 0.89%
    [rank0]:2024-03-27 14:30:05,165 - root - INFO - step:  7  loss:  9.7004  memory:  6.69GiB(7.04%)  wps: 85,096  mfu: 1.04%
    [rank0]:2024-03-27 14:30:05,251 - root - INFO - step:  8  loss:  9.4422  memory:  6.69GiB(7.04%)  wps: 95,860  mfu: 1.17%
    [rank0]:2024-03-27 14:30:05,399 - root - INFO - step:  9  loss:  9.2144  memory:  6.69GiB(7.04%)  wps: 55,837  mfu: 0.68%
    [rank0]:[rank0]:[W327 14:30:05.148473462 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:2024-03-27 14:30:05,496 - root - INFO - step: 10  loss:  9.1710  memory:  6.69GiB(7.04%)  wps: 86,136  mfu: 1.05%
    [rank0]:NCCL version 2.20.5+cuda12.0
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Mar 27, 2024
    ab5d918

Commits on Mar 28, 2024

  1. Used per-parameter FSDP (pytorch#165)

    **Numeric Parity**
    1D FSDP
    - Eager: 1k steps of minipile on 8 H100 GPUs, local batch size 8,
    sequence length 2048, AC/SAC, bf16 mixed precision, fp32 reduce-scatter
    - FSDP1 (AC): 24.81% peak active, 33.82% peak reserved, 6100-6200 WPS
    - FSDP1 (SAC): 52.98% peak active, 67.23% peak reserved, 6500-6700 WPS
    - FSDP2 (AC): 23.92% peak active, 32.64% peak reserved, 6100-6300 WPS
    - FSDP2 (SAC): 52.13% peak active, 62.51% peak reserved, 6600-6800 WPS
        - Loss curves match between FSDP1 and FSDP2
    - Memory numbers reported as percentage since that is how they are
    logged; can convert against 95.0396 GiB GPU memory
    - Compile: same setup as eager
    - FSDP2 (AC), buffer reuse disabled: 28.72 GiB (30.22%) peak reserved,
    7200-7500 WPS, 33% MFU
    - FSDP2 (AC), buffer reuse enabled: 28.90 GiB (30.40%) peak reserved,
    7200-7500 WPS, 33% MFU
    - FSDP2 (SAC), buffer reuse enabled: 53.83 GiB (56.64%) peak reserved,
    8100-8400 WPS, 36% MFU
        - Loss curves slightly better than eager
        - For fun -- how much can we push MFU?
    - If we use FSDP2 (SAC) with 16 local batch size (doubled), we get 88.23
    GiB (92.84%) peak reserved, 8600 WPS, 38% MFU.
    - If we use FSDP2 (no AC) with 8 local batch size, we get 90.28 GiB
    (94.99%) peak reserved, 9100-9300 WPS, 40% MFU.
    - Why is FSDP2 faster? (1) fp32 reduce-scatter only uses one div kernel
    instead of two, and (2) `reshard_after_forward=False` for the last
    transformer block (see the sketch after the loss curves below)
    
    2D FSDP
    - Eager (2-way SP, 4-way FSDP): 1k steps of minipile on 8 H100 GPUs,
    local batch size 16 (to preserve global batch size), sequence length
    2048, bf16 mixed precision, fp32 reduce-scatter
    - FSDP2 (AC): 50.12% peak active, 60.97% peak reserved, 5800-5900 WPS
    - FSDP2 (SAC): 76.49% peak active, 90.14% peak reserved, 6100-6300 WPS
    - Loss curves match 8-way FSDP
    - FSDP1 + SP has incorrect numerics due to the `FSDP.clip_grad_norm_`
    not all-reducing over TP mesh dimension
    
    <details>
    <summary> Loss curves </summary>
    
    <img width="732" alt="Screenshot 2024-03-26 at 3 31 19 PM"
    src="https://github.com/pytorch/torchtrain/assets/31054793/59ec71cc-ad0a-4dd1-b5c6-a8cbf9ab5e85">
    
    </details>
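
    As a concrete illustration of the `reshard_after_forward` point above, here is a minimal sketch of applying per-parameter FSDP (FSDP2) block by block while leaving the last block unsharded between forward and backward. It assumes the `fully_shard`/`MixedPrecisionPolicy` API from `torch.distributed._composable.fsdp` and a model exposing a `.layers` ModuleList; it is not the exact torchtitan wiring.

    ```python
    # Hedged sketch (assumptions: FSDP2's fully_shard API, a model with a .layers
    # ModuleList). Not the exact torchtitan parallelize code.
    import torch
    import torch.nn as nn
    from torch.distributed._composable.fsdp import MixedPrecisionPolicy, fully_shard

    def apply_fsdp2(model: nn.Module) -> nn.Module:
        mp_policy = MixedPrecisionPolicy(
            param_dtype=torch.bfloat16,  # bf16 compute
            reduce_dtype=torch.float32,  # fp32 reduce-scatter (one div kernel in FSDP2)
        )
        layers = list(model.layers)
        for i, block in enumerate(layers):
            # Skip resharding after forward for the last block: its backward runs
            # immediately after forward, so freeing and re-all-gathering the
            # parameters would be wasted work.
            reshard = i < len(layers) - 1
            fully_shard(block, mp_policy=mp_policy, reshard_after_forward=reshard)
        # Root wrap for the remaining parameters (embedding, final norm, output).
        fully_shard(model, mp_policy=mp_policy)
        return model
    ```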
    
    
    **Meta-Device Initialization**
    - The PyTorch Core guideline is for `module.reset_parameters()` to only
    initialize parameters/buffers immediately owned by `module` (i.e.
    `module.parameters(recurse=False)` and `module.buffers(recurse=False)`).
    - This makes it challenging to specify custom initializations for core
    modules like `nn.Linear` and `nn.Embedding`. For example, in
    @lessw2020's depth-wise truncated normal initialization, the
    `trunc_normal_` standard deviation depends on the layer ID, which is a
    property of the `TransformerBlock` but affects the child `nn.Linear`s.
    - To disambiguate, I suggest avoiding the name `reset_parameters()` in
    the case that we violate the PyTorch Core guideline and instead use a
    different name (e.g. `init_weights`).
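
    A hedged sketch of the `init_weights` idea (not the actual torchtitan code): a block-level initializer can depth-scale the truncated-normal std of its child linears, something a per-module `reset_parameters()` on the children alone cannot express. The module and attribute names here are illustrative.

    ```python
    # Sketch only: layer_id-dependent init lives on the block, not on nn.Linear.
    import torch.nn as nn

    class TransformerBlockSketch(nn.Module):
        def __init__(self, dim: int, layer_id: int):
            super().__init__()
            self.wq = nn.Linear(dim, dim, bias=False)
            self.wo = nn.Linear(dim, dim, bias=False)
            # Depth-wise std: a property of the block that affects its children.
            self.weight_init_std = 0.02 / (2 * (layer_id + 1)) ** 0.5

        def init_weights(self) -> None:
            # Deliberately not named reset_parameters(): it initializes the
            # *children's* parameters, which the Core guideline reserves for
            # each child module itself.
            for linear in (self.wq, self.wo):
                nn.init.trunc_normal_(linear.weight, mean=0.0, std=self.weight_init_std)
    ```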
    
    **DCP & Save/Load**
    - Tested 1D and 2D by specifying `checkpoint_folder =
    "/tmp/checkpoint_andgu"` in the `.toml`, training until saving a
    checkpoint, terminating the run, and restarting the training to load the
    checkpoint -- the loss after loading looks reasonable
    awgu authored Mar 28, 2024
    83c879f
  2. plot losses in loaded TrainState to TensorBoard

    ghstack-source-id: f13612ce1f739219c31aa2b9222259f9f586126b
    Pull Request resolved: pytorch#173
    tianyu-l committed Mar 28, 2024
    f6d9de7

Commits on Mar 29, 2024

  1. Removed setting global flag for swap_tensors since not needed anymore

    ghstack-source-id: 484237b30ba8bf8bb9e7a9cf2c97180d9fb21295
    Pull Request resolved: pytorch#178
    awgu committed Mar 29, 2024
    1150944

Commits on Apr 2, 2024

  1. Add integration test with compile enabled (pytorch#183)

    Summary:
    same as title
    
    Test Plan:
    ```
    
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0,1
    + CONFIG_FILE=./train_configs/debug_model_compile.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_compile.toml
    W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757]
    W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] *****************************************
    W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] *****************************************
    [rank0]:2024-04-01 17:54:35,779 - root - INFO - Starting job: LLaMA debug training
    [rank1]:2024-04-01 17:54:35,797 - root - INFO - Starting job: LLaMA debug training
    [rank0]:2024-04-01 17:54:36,063 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank0]:2024-04-01 17:54:36,069 - root - INFO - Building 1-D device mesh with ['dp'], [4]
    [rank0]:2024-04-01 17:54:36,071 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-04-01 17:54:36,078 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank0]:2024-04-01 17:54:36,078 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank1]:2024-04-01 17:54:36,449 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank1]:2024-04-01 17:54:36,454 - root - INFO - Building 1-D device mesh with ['dp'], [4]
    [rank1]:2024-04-01 17:54:36,456 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank1]:2024-04-01 17:54:36,463 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank1]:2024-04-01 17:54:36,463 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank0]:2024-04-01 17:54:37,631 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-04-01 17:54:37,643 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-04-01 17:54:37,644 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
    [rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied selective activation checkpointing to the model
    [rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied FSDP to the model
    [rank1]:2024-04-01 17:54:38,310 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank1]:2024-04-01 17:54:38,324 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank1]:2024-04-01 17:54:38,325 - root - INFO - GPU capacity: NVIDIA H100 (1) with 95.04GiB memory
    [rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied selective activation checkpointing to the model
    [rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied FSDP to the model
    [rank1]:2024-04-01 17:54:38,699 - root - INFO - Gradient scaling not enabled
    [rank1]:2024-04-01 17:54:38,699 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754
    [rank1]:2024-04-01 17:54:38,701 - root - INFO - Compiling model with torch.compile
    [rank0]:2024-04-01 17:54:38,692 - root - INFO - Gradient scaling not enabled
    [rank0]:2024-04-01 17:54:38,693 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754
    [rank0]:2024-04-01 17:54:38,694 - root - INFO - Compiling model with torch.compile
    [rank0]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank1]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank1]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
    [rank1]:  warnings.warn(
    [rank0]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
    [rank0]:  warnings.warn(
    [rank1]:2024-04-01 17:54:40,498 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:40,493 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:41,992 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:41,985 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:42,180 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:42,187 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:43,947 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:43,963 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:43,971 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:43,920 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:43,951 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:43,974 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:44,029 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:44,033 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:45,907 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:45,933 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:47,561 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:47,667 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:47,649 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:47,706 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:49,084 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:49,108 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:49,110 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:49,086 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:49,114 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:49,131 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:50,546 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:50,638 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:51,901 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:52,025 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:52,734 - root - INFO - step:  1  loss: 10.9746  memory:  9.53GiB(10.03%)  wps: 1,228  mfu: 0.02%
    [rank1]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
    [rank1]:2024-04-01 17:54:52,813 - root - INFO - step:  2  loss: 10.9091  memory:  9.54GiB(10.03%)  wps: 208,739  mfu: 2.56%
    [rank0]:2024-04-01 17:54:52,734 - root - INFO - step:  1  loss: 10.9746  memory:  9.53GiB(10.03%)  wps: 1,228  mfu: 0.02%
    [rank0]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
    [rank0]:2024-04-01 17:54:52,813 - root - INFO - step:  2  loss: 10.9091  memory:  9.54GiB(10.03%)  wps: 208,501  mfu: 2.55%
    [rank1]:2024-04-01 17:54:52,889 - root - INFO - step:  3  loss: 10.7722  memory:  9.54GiB(10.03%)  wps: 219,416  mfu: 2.69%
    [rank0]:2024-04-01 17:54:52,889 - root - INFO - step:  3  loss: 10.7722  memory:  9.54GiB(10.03%)  wps: 219,182  mfu: 2.68%
    [rank1]:2024-04-01 17:54:52,965 - root - INFO - step:  4  loss: 10.5428  memory:  9.54GiB(10.03%)  wps: 218,226  mfu: 2.67%
    [rank0]:2024-04-01 17:54:52,965 - root - INFO - step:  4  loss: 10.5428  memory:  9.54GiB(10.03%)  wps: 218,015  mfu: 2.67%
    [rank1]:2024-04-01 17:54:53,045 - root - INFO - step:  5  loss: 10.3063  memory:  9.54GiB(10.03%)  wps: 207,094  mfu: 2.54%
    [rank0]:2024-04-01 17:54:53,045 - root - INFO - step:  5  loss: 10.3063  memory:  9.54GiB(10.03%)  wps: 207,220  mfu: 2.54%
    [rank1]:2024-04-01 17:54:53,123 - root - INFO - step:  6  loss: 10.0707  memory:  9.54GiB(10.03%)  wps: 210,814  mfu: 2.58%
    [rank1]:2024-04-01 17:54:53,202 - root - INFO - step:  7  loss:  9.8302  memory:  9.54GiB(10.03%)  wps: 209,649  mfu: 2.57%
    [rank0]:2024-04-01 17:54:53,123 - root - INFO - step:  6  loss: 10.0707  memory:  9.54GiB(10.03%)  wps: 210,849  mfu: 2.58%
    [rank0]:2024-04-01 17:54:53,202 - root - INFO - step:  7  loss:  9.8302  memory:  9.54GiB(10.03%)  wps: 209,542  mfu: 2.57%
    [rank0]:2024-04-01 17:54:53,281 - root - INFO - step:  8  loss:  9.5918  memory:  9.54GiB(10.03%)  wps: 211,690  mfu: 2.59%
    [rank1]:2024-04-01 17:54:53,281 - root - INFO - step:  8  loss:  9.5918  memory:  9.54GiB(10.03%)  wps: 211,786  mfu: 2.59%
    [rank1]:2024-04-01 17:54:53,412 - root - INFO - step:  9  loss:  9.4299  memory:  9.54GiB(10.03%)  wps: 125,833  mfu: 1.54%
    [rank1]:[rank1]:[W401 17:54:53.242673953 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:2024-04-01 17:54:53,412 - root - INFO - step:  9  loss:  9.4299  memory:  9.54GiB(10.03%)  wps: 125,765  mfu: 1.54%
    [rank0]:[rank0]:[W401 17:54:53.240925776 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank1]:2024-04-01 17:54:53,492 - root - INFO - step: 10  loss:  9.2955  memory:  9.54GiB(10.03%)  wps: 207,661  mfu: 2.54%
    [rank0]:2024-04-01 17:54:53,492 - root - INFO - step: 10  loss:  9.2955  memory:  9.54GiB(10.03%)  wps: 207,426  mfu: 2.54%
    [rank0]:NCCL version 2.20.5+cuda12.0
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    
    ---------
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Apr 2, 2024
    25ee32f

Commits on Apr 3, 2024

  1. remove folding and unfolding of sequence dim in model.py

    ghstack-source-id: 5d299adcd766baad6a36e63be4acc01fb2fd36db
    Pull Request resolved: pytorch#190
    tianyu-l committed Apr 3, 2024
    25f9bff

Commits on Apr 4, 2024

  1. bump comm.train_timeout_seconds (pytorch#189)

    This PR bumps this default config to a larger value: profiling is a
    pretty heavy step, so a default of 5 seconds would likely trigger the
    watchdog unintentionally.
    wanchaol authored Apr 4, 2024
    c233ecd

Commits on Apr 5, 2024

  1. fix checkpoint parser

    ghstack-source-id: 47ee7b5e2228705e5215195ac9ff13e1b168f93e
    Pull Request resolved: pytorch#197
    wz337 committed Apr 5, 2024
    bb3919d
  2. support sequence of tests and add checkpoint test

    address comments
    
    ghstack-source-id: 7d6c51a5ef68dea06ba7d64741a554165c79f1d3
    Pull Request resolved: pytorch#198
    wz337 committed Apr 5, 2024
    4d593d4
  3. Make freqs_cis a persistent buffer for pp init

    Currently, the plan is to use a 'seed checkpoint' to initialize the
    pipeline parallel model chunks after moving them from meta device to
    cuda/empty.

    Non-persistent buffers are incompatible with this approach, as they are
    missing from the checkpoint and thus require manual init.

    An alternative is to manually run the initializer for just the
    non-persistent buffers after loading a seed checkpoint, but making the
    buffer persistent is nearly equivalent and requires fewer code changes.
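
    A minimal sketch of the change described above (the surrounding module is simplified; only the buffer registration matters here):

    ```python
    # Sketch only: register freqs_cis as a *persistent* buffer so it lands in
    # state_dict() and is restored from the seed checkpoint, instead of needing
    # manual re-initialization after loading.
    import torch
    import torch.nn as nn

    class RotarySketch(nn.Module):
        def __init__(self, freqs_cis: torch.Tensor):
            super().__init__()
            # persistent=True (the default) includes the buffer in checkpoints;
            # persistent=False would leave it out, breaking seed-checkpoint init.
            self.register_buffer("freqs_cis", freqs_cis, persistent=True)
    ```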
    
    ghstack-source-id: b48228488d4c3924fffef4237f4106383c14a934
    Pull Request resolved: pytorch#201
    wconstab committed Apr 5, 2024
    5a0995a
  4. Delete grad scaler, which is unsupported/unused

    The grad scaler currently doesn't work with FSDP2, and isn't enabled anyway
    because bf16 training is the norm and doesn't require it.

    Remove it for simplicity. It will be easier to enable pipeline
    parallelism with a simpler loss function setup, but if desired, it's
    still possible to support pipeline parallelism with the scaler added
    back in.
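
    For context, a hedged sketch of why the scaler can go: with bf16 mixed precision the loss is backpropagated directly, whereas fp16 training would have wrapped the backward and optimizer step in `torch.amp.GradScaler`. This is not the torchtitan train loop, just the shape of the simplification.

    ```python
    # Sketch: bf16 step without a GradScaler (no scaler.scale(loss).backward(),
    # no scaler.step(optimizer), no scaler.update()).
    import torch

    def train_step(model, inputs, labels, loss_fn, optimizer):
        optimizer.zero_grad()
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            pred = model(inputs)
            loss = loss_fn(pred, labels)
        loss.backward()
        optimizer.step()
        return loss
    ```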
    
    ghstack-source-id: 82b0e4324eac88ee62723a6d832182d4e6c76e0f
    Pull Request resolved: pytorch#202
    wconstab committed Apr 5, 2024
    db204f9
  5. Factor out loss_fn to share code with pipeline par

    PP requires feeding a loss_fn into the schedule's step so that loss can
    be computed per microbatch as part of the forward/backward scheduling.
    
    As such, it is nice to define the loss once and use it both in the non-PP
    code that manually calls forward/loss/backward and in the PP step().
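
    A sketch of the kind of shared loss function this refers to (the exact signature in the repo may differ); the same callable can be invoked directly in the non-PP path or handed to a pipeline schedule as its loss_fn:

    ```python
    # Hedged sketch of a loss_fn shared between the non-PP and PP code paths.
    import torch
    import torch.nn.functional as F

    def loss_fn(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # pred: [batch, seq, vocab], labels: [batch, seq]
        return F.cross_entropy(pred.flatten(0, 1).float(), labels.flatten(0, 1))

    # Non-PP path: loss = loss_fn(model(input_ids), labels); loss.backward()
    # PP path (assumed schedule API): schedule = Schedule(stage, n_microbatches, loss_fn=loss_fn)
    # so each microbatch's loss is computed inside step().
    ```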
    
    ghstack-source-id: 9bedd5103e23627d5e268c287d49f0759442ba12
    Pull Request resolved: pytorch#203
    wconstab committed Apr 5, 2024
    859963d
  6. [TorchTrain] Minor fix for pytorch#197 (pytorch#204)

    The changes made in the GitHub editor didn't go in when doing ghstack land.
    wz337 authored Apr 5, 2024
    5d2c148
  7. Add FusedRMSNorm (Triton kernel, +15% eager), Add NPLayerNorm, Enable…

    … config selectable Norm Type (pytorch#181)
    
    This PR has multiple aspects:
    1 - Adds a new Triton-based Fused RMSNorm I wrote. I've verified its
    numerical accuracy on both forward and backward with a unit test.
    It improves MFU by +15% with FSDP2 7B in eager mode, and slightly (+1.2%) when compiled:
    <img width="545" alt="Screenshot 2024-03-29 at 5 18 14 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/8f16fae9-947b-4720-a370-b954779c33a7">
    
    2 - Adds norms.py to house all 4 norm types, and standardizes to
    [layernorm / np_layernorm / rmsnorm / fused_rmsnorm]. Norms.py has a
    create_norms function that creates the appropriate norm (see the sketch at the end of this message).
    
    3 - Adds np_layernorm, which is layernorm with no affine transformation.
    
    4 - Updates model.py to now support plug and play of any supported norm.
    
    Thus instead of this type of if/then logic in the model class:
    <img width="928" alt="Screenshot 2024-03-30 at 1 52 07 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/ba7cb976-580f-4471-a79b-a584f7d20693">
    
    We simply have this:
    <img width="1129" alt="Screenshot 2024-03-30 at 1 52 23 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/aba48b4d-1620-4059-840d-e620468f00f2">
    
    This then allows for easy plug and play of any norm type with no
    fiddling around in the model code.
    
    5 - updates run_llama_train.sh to randomly select a port vs previous
    fixed port number. (thanks @yifuwang for this tip!)
    
    
    6 - Now users can quickly select the norm of their choice via the config
    file:
    <img width="774" alt="Screenshot 2024-03-30 at 3 01 43 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/3238b375-dc21-4ee2-a5fa-f6571da79edb">
    
    7 - adds a NotImplementedError if users try to run TP + fused_rmsnorm to avoid
    any confusion (per @tianyu-l feedback):
    ~~~
    NotImplementedError: fused_rmsnorm not yet compatible with TP. Please
    use rmsnorm.
    ~~~
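
    A hedged sketch of what a config-selectable norm factory along these lines can look like (the function name and accepted strings are illustrative and may not match norms.py exactly; the Triton fused kernel is omitted):

    ```python
    # Illustrative norm factory in the spirit of norms.py; not the exact torchtitan code.
    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            rms = torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + self.eps)
            return (x.float() * rms).type_as(x) * self.weight

    def create_norm(norm_type: str, dim: int, eps: float = 1e-6) -> nn.Module:
        norm_type = norm_type.lower()
        if norm_type == "layernorm":
            return nn.LayerNorm(dim, eps=eps)
        if norm_type == "np_layernorm":  # layernorm with no affine transformation
            return nn.LayerNorm(dim, eps=eps, elementwise_affine=False)
        if norm_type == "rmsnorm":
            return RMSNorm(dim, eps=eps)
        if norm_type == "fused_rmsnorm":
            raise NotImplementedError("Triton fused_rmsnorm kernel omitted from this sketch")
        raise ValueError(f"Unknown norm_type: {norm_type}")
    ```

    The model then only needs to call the factory with the configured string, which is what makes the norm type plug-and-play.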
    lessw2020 authored Apr 5, 2024
    3471165
  8. remove .item() per iter

    ghstack-source-id: ab29c214604fd76cefdfe70149ecf07a2e03103e
    Pull Request resolved: pytorch#206
    tianyu-l committed Apr 5, 2024
    5b2bb52

Commits on Apr 10, 2024

  1. Removed cache_k and cache_v comments

    ghstack-source-id: 8bc66c683a801189b152b0ef4301579ec1ec17e7
    Pull Request resolved: pytorch#213
    awgu committed Apr 10, 2024
    7146841
  2. Some more cleanups

    ghstack-source-id: a53cbbecc35eac2a62d8ebc241462ac418666336
    Pull Request resolved: pytorch#212
    awgu committed Apr 10, 2024
    c18d760
  3. avoid record streams and make color printing a config

    ghstack-source-id: 1c7cb2710330ec3fb2384793b5ad77c65b107cbc
    Pull Request resolved: pytorch#195
    tianyu-l committed Apr 10, 2024
    e62573d
  4. fix SAC to use the correct reduce_scatter op (pytorch#215)

    as titled, we migrated to the native functional collective so the SAC
    should capture this instead of the old one
    wanchaol authored Apr 10, 2024
    7419d71
  5. Test runner raises exception on failures (pytorch#216)

    Summary: Test runner should raise an exception on failures.
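
    A minimal sketch of the behavior being added (helper and variable names here are hypothetical; the real test runner's structure may differ): run each flavor's command and raise if the subprocess exits non-zero, instead of silently moving on.

    ```python
    # Hedged sketch of "raise on failure" for an integration test runner.
    import subprocess

    def run_test(flavor: str, command: str) -> None:
        print(f"=====Integration test, flavor : {flavor}, command : {command}=====")
        result = subprocess.run(command, shell=True)
        if result.returncode != 0:
            raise Exception(
                f"Integration test failed, flavor : {flavor}, command : {command}"
            )
    ```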
    
    Test Plan: 
    
    ```
    =====Integration test, flavor : , command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh  =====
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0
    + CONFIG_FILE=./train_configs/debug_model.toml
    + overrides=
    + '[' 0 -ne 0 ']'
    
    =====Integration test, flavor : 1D compile, command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh --training.compile=====
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=--training.compile
    + NGPU=4
    + LOG_RANK=0
    + CONFIG_FILE=./train_configs/debug_model.toml
    + overrides=
    + '[' 1 -ne 0 ']'
    + overrides=--training.compile
    + torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.compile
    W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757]
    W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] *****************************************
    W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] *****************************************
    [rank0]:2024-04-10 13:32:45,243 - root - INFO - Starting job: LLaMA debug training
    [rank0]:2024-04-10 13:32:45,676 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank0]:2024-04-10 13:32:46,028 - root - INFO - Building 1-D device mesh with ['dp'], [4]
    [rank0]:2024-04-10 13:32:46,030 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-04-10 13:32:46,038 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank0]:2024-04-10 13:32:46,038 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank0]:2024-04-10 13:32:47,813 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='fused_rmsnorm')
    [rank0]:2024-04-10 13:32:47,826 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-04-10 13:32:47,826 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
    [rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied selective activation checkpointing to the model
    [rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied FSDP to the model
    [rank0]:2024-04-10 13:32:48,582 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%)
    [rank0]:2024-04-10 13:32:48,582 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1332
    [rank0]:2024-04-10 13:32:48,584 - root - INFO - Compiling model with torch.compile
    [rank0]:2024-04-10 13:32:49,384 - root - INFO - Training starts at step 1
    [rank0]:2024-04-10 13:32:49,385 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank0]:[rank0]:W0410 13:32:49.487000 139672077292544 torch/_logging/_internal.py:1016] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
    [rank0]:[rank0]: Traceback (most recent call last):
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/torchtitan/train.py", line 394, in <module>
    [rank0]:[rank0]:     main(config)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    [rank0]:[rank0]:     return f(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/torchtitan/train.py", line 287, in main
    [rank0]:[rank0]:     pred = model(input_ids)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    [rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1541, in _call_impl
    [rank0]:[rank0]:     return forward_call(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/eval_frame.py", line 410, in _fn
    [rank0]:[rank0]:     return fn(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    [rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1582, in _call_impl
    [rank0]:[rank0]:     result = forward_call(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 966, in catch_errors
    [rank0]:[rank0]:     return callback(frame, cache_entry, hooks, frame_state, skip=1)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 809, in _convert_frame
    [rank0]:[rank0]:     result = inner_convert(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 404, in _convert_frame_assert
    [rank0]:[rank0]:     return _compile(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_utils_internal.py", line 70, in wrapper_function
    [rank0]:[rank0]:     return function(*args, **kwargs)
    [rank0]:[rank0]:   File "/home/gnadathur/local/a/pytorch-env/lib/python3.10/contextlib.py", line 79, in inner
    [rank0]:[rank0]:     return func(*args, **kwds)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 691, in _compile
    [rank0]:[rank0]:     guarded_code = compile_inner(code, one_graph, hooks, transform)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    [rank0]:[rank0]:     r = func(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 546, in compile_inner
    [rank0]:[rank0]:     out_code = transform_code_object(code, transform)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1103, in transform_code_object
    [rank0]:[rank0]:     transformations(instructions, code_options)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 168, in _fn
    [rank0]:[rank0]:     return fn(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 508, in transform
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2193, in run
    [rank0]:[rank0]:     super().run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
    [rank0]:[rank0]:     self.call_function(fn, args, {})
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/nn_module.py", line 733, in call_function
    [rank0]:[rank0]:     return variables.UserFunctionVariable(fn, source=source).call_function(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    [rank0]:[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
    [rank0]:[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
    [rank0]:[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
    [rank0]:[rank0]:     self.call_function(fn, args, {})
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/user_defined.py", line 719, in call_function
    [rank0]:[rank0]:     return func_var.call_function(tx, [obj_var] + args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    [rank0]:[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
    [rank0]:[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
    [rank0]:[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
    [rank0]:[rank0]:     self.call_function(fn, args, {})
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 339, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    [rank0]:[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
    [rank0]:[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
    [rank0]:[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
    [rank0]:[rank0]:     self.call_function(fn, args, {})
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 339, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    [rank0]:[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
    [rank0]:[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
    [rank0]:[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
    [rank0]:[rank0]:     self.call_function(fn, args, {})
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    [rank0]:[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
    [rank0]:[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
    [rank0]:[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1274, in CALL_FUNCTION_EX
    [rank0]:[rank0]:     self.call_function(fn, argsvars.items, kwargsvars)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    [rank0]:[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
    [rank0]:[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
    [rank0]:[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
    [rank0]:[rank0]:     self.call_function(fn, args, {})
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/misc.py", line 592, in call_function
    [rank0]:[rank0]:     return self.obj.call_method(tx, self.name, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/tensor.py", line 461, in call_method
    [rank0]:[rank0]:     return wrap_fx_proxy(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/builder.py", line 1367, in wrap_fx_proxy
    [rank0]:[rank0]:     return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/builder.py", line 1452, in wrap_fx_proxy_cls
    [rank0]:[rank0]:     example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1780, in get_fake_value
    [rank0]:[rank0]:     raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1712, in get_fake_value
    [rank0]:[rank0]:     ret_val = wrap_fake_exception(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1227, in wrap_fake_exception
    [rank0]:[rank0]:     return fn()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1713, in <lambda>
    [rank0]:[rank0]:     lambda: run_node(tx.output, node, args, kwargs, nnmodule)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1848, in run_node
    [rank0]:[rank0]:     raise RuntimeError(make_error_message(e)).with_traceback(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1832, in run_node
    [rank0]:[rank0]:     return getattr(args[0], node.target)(*args[1:], **kwargs)
    [rank0]:[rank0]: torch._dynamo.exc.TorchRuntimeError: Failed running call_method wait(*(FakeTensor(..., device='cuda:0', size=(852480,), dtype=torch.bfloat16),), **{}):
    [rank0]:[rank0]: 'FakeTensor' object has no attribute 'wait'
    [rank0]:
    [rank0]:[rank0]: from user code:
    [rank0]:[rank0]:    File "/data/users/gnadathur/a/torchtitan/torchtrain/models/llama/model.py", line 446, in forward
    [rank0]:[rank0]:     h = layer(h, freqs_cis)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1561, in _call_impl
    [rank0]:[rank0]:     args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 161, in _pre_forward
    [rank0]:[rank0]:     args, kwargs = self._fsdp_param_group.pre_forward(module, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 280, in pre_forward
    [rank0]:[rank0]:     self.wait_for_unshard()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 243, in wait_for_unshard
    [rank0]:[rank0]:     foreach_all_gather_copy_out(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
    [rank0]:[rank0]:     return func(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_collectives.py", line 82, in foreach_all_gather_copy_out
    [rank0]:[rank0]:     all_gather_work.wait()
    [rank0]:
    [rank0]:[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
    [rank0]:
    [rank0]:
    [rank0]:[rank0]: You can suppress this exception and fall back to eager by setting:
    [rank0]:[rank0]:     import torch._dynamo
    [rank0]:[rank0]:     torch._dynamo.config.suppress_errors = True
    [rank0]:
    E0410 13:32:53.256000 139839630783488 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1554760) of binary: /home/gnadathur/local/a/pytorch-env/bin/python
    E0410 13:32:53.261000 139839630783488 torch/distributed/elastic/multiprocessing/errors/error_handler.py:136] no error file defined for parent, to copy child error file (/tmp/torchelastic_kyjkblcf/none_kiu1mb22/attempt_0/0/error.json)
    [rank0]:NCCL version 2.20.5+cuda12.0
    Traceback (most recent call last):
      File "/home/gnadathur/local/a/pytorch-env/bin/torchrun", line 33, in <module>
        sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
      File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
        return f(*args, **kwargs)
      File "/data/users/gnadathur/a/pytorch/torch/distributed/run.py", line 879, in main
        run(args)
      File "/data/users/gnadathur/a/pytorch/torch/distributed/run.py", line 870, in run
        elastic_launch(
      File "/data/users/gnadathur/a/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/data/users/gnadathur/a/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    train.py FAILED
    ------------------------------------------------------------
    Failures:
    [1]:
      time      : 2024-04-10_13:32:49
      host      : devvm4378.nao0.facebook.com
      rank      : 1 (local_rank: 1)
      exitcode  : 1 (pid: 1554762)
      error_file: /tmp/torchelastic_kyjkblcf/none_kiu1mb22/attempt_0/1/error.json
      traceback : Traceback (most recent call last):
        File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
          return f(*args, **kwargs)
        File "/data/users/gnadathur/a/torchtitan/train.py", line 287, in main
          pred = model(input_ids)
        File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1541, in _call_impl
          return forward_call(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/eval_frame.py", line 410, in _fn
          return fn(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1582, in _call_impl
          result = forward_call(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 966, in catch_errors
          return callback(frame, cache_entry, hooks, frame_state, skip=1)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 809, in _convert_frame
          result = inner_convert(
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 404, in _convert_frame_assert
          return _compile(
        File "/data/users/gnadathur/a/pytorch/torch/_utils_internal.py", line 70, in wrapper_function
          return function(*args, **kwargs)
        File "/home/gnadathur/local/a/pytorch-env/lib/python3.10/contextlib.py", line 79, in inner
          return func(*args, **kwds)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 691, in _compile
          guarded_code = compile_inner(code, one_graph, hooks, transform)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
          r = func(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 546, in compile_inner
          out_code = transform_code_object(code, transform)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1103, in transform_code_object
          transformations(instructions, code_options)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 168, in _fn
          return fn(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 508, in transform
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2193, in run
          super().run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
          self.call_function(fn, args, {})
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/nn_module.py", line 733, in call_function
          return variables.UserFunctionVariable(fn, source=source).call_function(
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
          return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
          return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
          return cls.inline_call_(parent, func, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
          self.call_function(fn, args, {})
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/user_defined.py", line 719, in call_function
          return func_var.call_function(tx, [obj_var] + args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
          return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
          return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
          return cls.inline_call_(parent, func, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
          self.call_function(fn, args, {})
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 339, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
          return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
          return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
          return cls.inline_call_(parent, func, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
          self.call_function(fn, args, {})
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 339, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
          return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
          return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
          return cls.inline_call_(parent, func, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
          self.call_function(fn, args, {})
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
          return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
          return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
          return cls.inline_call_(parent, func, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1274, in CALL_FUNCTION_EX
          self.call_function(fn, argsvars.items, kwargsvars)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
          return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
          return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
          return cls.inline_call_(parent, func, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
          self.call_function(fn, args, {})
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/misc.py", line 592, in call_function
          return self.obj.call_method(tx, self.name, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/tensor.py", line 461, in call_method
          return wrap_fx_proxy(
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/builder.py", line 1367, in wrap_fx_proxy
          return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/builder.py", line 1452, in wrap_fx_proxy_cls
          example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1780, in get_fake_value
          raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1712, in get_fake_value
          ret_val = wrap_fake_exception(
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1227, in wrap_fake_exception
          return fn()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1713, in <lambda>
          lambda: run_node(tx.output, node, args, kwargs, nnmodule)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1848, in run_node
          raise RuntimeError(make_error_message(e)).with_traceback(
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1832, in run_node
          return getattr(args[0], node.target)(*args[1:], **kwargs)
      torch._dynamo.exc.TorchRuntimeError: Failed running call_method wait(*(FakeTensor(..., device='cuda:1', size=(852480,), dtype=torch.bfloat16),), **{}):
      'FakeTensor' object has no attribute 'wait'
    
      from user code:
         File "/data/users/gnadathur/a/torchtitan/torchtrain/models/llama/model.py", line 446, in forward
          h = layer(h, freqs_cis)
        File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1561, in _call_impl
          args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
        File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 161, in _pre_forward
          args, kwargs = self._fsdp_param_group.pre_forward(module, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 280, in pre_forward
          self.wait_for_unshard()
        File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 243, in wait_for_unshard
          foreach_all_gather_copy_out(
        File "/data/users/gnadathur/a/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
          return func(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_collectives.py", line 82, in foreach_all_gather_copy_out
          all_gather_work.wait()
    
      Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
    
    
      You can suppress this exception and fall back to eager by setting:
          import torch._dynamo
          torch._dynamo.config.suppress_errors = True
    
    
    [2]:
      time      : 2024-04-10_13:32:49
      host      : devvm4378.nao0.facebook.com
      rank      : 2 (local_rank: 2)
      exitcode  : 1 (pid: 1554763)
      error_file: /tmp/torchelastic_kyjkblcf/none_kiu1mb22/attempt_0/2/error.json
      traceback : torch._dynamo.exc.TorchRuntimeError: Failed running call_method wait(*(FakeTensor(..., device='cuda:2', size=(852480,), dtype=torch.bfloat16),), **{}): 'FakeTensor' object has no attribute 'wait' (identical traceback to [1])
    [3]:
      time      : 2024-04-10_13:32:49
      host      : devvm4378.nao0.facebook.com
      rank      : 3 (local_rank: 3)
      exitcode  : 1 (pid: 1554764)
      error_file: /tmp/torchelastic_kyjkblcf/none_kiu1mb22/attempt_0/3/error.json
      traceback : identical traceback to [1] and [2]
    
    ```
    gnadathur authored Apr 10, 2024
    Full SHA: cfdd4af
  6. Revert "Separate TransformerEmbedding layer (pytorch#33)"

    Avoid diverging the model structure (FQNs and checkpoint
    interoperability) from similar models.
    
    This reverts commit f30202c.
    
    ghstack-source-id: 9811f5fa99fdde387efe6018aa00afd28e7e923b
    Pull Request resolved: pytorch#214
    wconstab committed Apr 10, 2024
    Full SHA: 144b229
  7. Fix 2DParallel test (pytorch#219)

    Use `rmsnorm` instead of the fused version, since 2D parallelism does not
    support the fused version yet.
    
    Test:
    
    ```
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=--training.tensor_parallel_degree
    + NGPU=4
    + LOG_RANK=0
    + CONFIG_FILE=./train_configs/debug_model.toml
    + overrides=
    + '[' 3 -ne 0 ']'
    + overrides='--training.tensor_parallel_degree 2 --model.norm_type=rmsnorm'
    + torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.tensor_parallel_degree 2 --model.norm_type=rmsnorm
    W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] 
    W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] *****************************************
    W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] *****************************************
    [rank0]:2024-04-10 15:50:37,794 - root - INFO - Starting job: LLaMA debug training
    [rank0]:2024-04-10 15:50:37,986 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank0]:2024-04-10 15:50:38,464 - root - INFO - Building 2-D device mesh with ['dp', 'tp'], [2, 2]
    [rank0]:2024-04-10 15:50:38,467 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-04-10 15:50:38,474 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank0]:2024-04-10 15:50:38,474 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank0]:2024-04-10 15:50:40,306 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='rmsnorm')
    [rank0]:2024-04-10 15:50:40,318 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-04-10 15:50:40,319 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
    [rank0]:2024-04-10 15:50:40,331 - root - INFO - Applied Tensor Parallelism to the model
    [rank0]:2024-04-10 15:50:40,337 - root - INFO - Applied selective activation checkpointing to the model
    [rank0]:2024-04-10 15:50:40,337 - root - INFO - Applied FSDP to the model
    [rank0]:2024-04-10 15:50:40,558 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%)
    [rank0]:2024-04-10 15:50:40,558 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1550
    [rank0]:2024-04-10 15:50:40,562 - root - INFO - Training starts at step 1
    [rank0]:2024-04-10 15:50:40,562 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank0]:2024-04-10 15:50:41,474 - root - INFO - step:  1  loss: 10.8403  memory:  5.76GiB(6.06%)  wps: 8,988  mfu: 0.11%
    [rank0]:2024-04-10 15:50:41,475 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
    [rank0]:2024-04-10 15:50:41,652 - root - INFO - step:  2  loss: 10.7703  memory:  6.74GiB(7.09%)  wps: 46,364  mfu: 0.57%
    [rank0]:2024-04-10 15:50:41,744 - root - INFO - step:  3  loss: 10.6447  memory:  6.74GiB(7.09%)  wps: 89,916  mfu: 1.10%
    [rank0]:2024-04-10 15:50:41,847 - root - INFO - step:  4  loss: 10.4428  memory:  6.74GiB(7.09%)  wps: 80,467  mfu: 0.99%
    [rank0]:2024-04-10 15:50:41,946 - root - INFO - step:  5  loss: 10.1726  memory:  6.74GiB(7.09%)  wps: 83,747  mfu: 1.03%
    [rank0]:2024-04-10 15:50:42,038 - root - INFO - step:  6  loss:  9.9676  memory:  6.74GiB(7.09%)  wps: 89,380  mfu: 1.09%
    [rank0]:2024-04-10 15:50:42,135 - root - INFO - step:  7  loss:  9.7356  memory:  6.74GiB(7.09%)  wps: 85,526  mfu: 1.05%
    [rank0]:2024-04-10 15:50:42,232 - root - INFO - step:  8  loss:  9.4619  memory:  6.74GiB(7.09%)  wps: 85,349  mfu: 1.05%
    [rank0]:2024-04-10 15:50:42,396 - root - INFO - step:  9  loss:  9.2633  memory:  6.74GiB(7.09%)  wps: 50,402  mfu: 0.62%
    [rank0]:[rank0]:[W410 15:50:42.021475256 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:2024-04-10 15:50:42,511 - root - INFO - step: 10  loss:  9.2156  memory:  6.74GiB(7.09%)  wps: 71,449  mfu: 0.88%
    [rank0]:NCCL version 2.20.5+cuda12.0
    ```
    gnadathur authored Apr 10, 2024
    Full SHA: 05c181d
  8. Added initial FSDP readme

    ghstack-source-id: a9204c68f2e315c878677be86c509fc8d6290ffd
    Pull Request resolved: pytorch#218
    awgu committed Apr 10, 2024
    Full SHA: b6414aa

Commits on Apr 11, 2024

  1. [TorchTrain][Checkpoint] Add model_weights_only option to train_config (pytorch#220)
    
    With `model_weights_only` set to True, we checkpoint model weights
    only at the end of the training.
    We only consider saving model weights at the end of training, so this
    won't affect preemption and training resume.
    
    With `model_weights_only = True`, the checkpoint is about 1/3 the size
    of a full checkpoint (74M at step 10 when training completes vs.
    212M at step 5). With this, the converted checkpoint (DCP -> torch.save)
    can be loaded with `torch.load(..., weights_only=True)`.
    
    ```
    (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ python -m torch.distributed.checkpoint.format_utils dcp_to_torch step-10 step-10-model-weights-only.pt 
    Converting checkpoint from step-10 to step-10-model-weights-only.pt using method: 'dcp_to_torch'
    (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ ls
    step-10  step-10-model-weights-only.pt  step-5
    (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ ls -h 
    step-10  step-10-model-weights-only.pt  step-5
    (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ du -h
    212M    ./step-5
    74M     ./step-10
    358M    .
    (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$  du -h step-10-model-weights-only.pt
    74M     step-10-model-weights-only.pt
    (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ python3
    Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import torch
    >>> torch.load('step-10-model-weights-only.pt', weights_only=True)
    {'model': {'embeddings.freqs_cis': tensor([[ 1.0000+0.0000e+00j,  1.0000+0.0000e+00j,  1.0000+0.0000e+00j,
              ...,  1.0000+0.0000e+00j,  1.0000+0.0000e+00j,
              1.0000+0.0000e+00j],
    ```
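    
    For reference, a minimal Python sketch of the same convert-then-load flow (paths are illustrative, and it assumes the `dcp_to_torch_save` helper in `torch.distributed.checkpoint.format_utils`, the programmatic counterpart of the CLI used above):
    
    ```
    import torch
    from torch.distributed.checkpoint.format_utils import dcp_to_torch_save
    
    # Convert a DCP checkpoint directory into a single torch.save file
    # (same idea as `python -m torch.distributed.checkpoint.format_utils dcp_to_torch ...`).
    dcp_to_torch_save("step-10", "step-10-model-weights-only.pt")
    
    # A model-weights-only checkpoint can then be loaded with weights_only=True.
    state = torch.load("step-10-model-weights-only.pt", weights_only=True)
    print(state["model"].keys())
    ```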
    
    One additional change:
    enable logging on all ranks in `test_runner.py`.
    wz337 authored Apr 11, 2024
    Full SHA: 07a3ec8
  2. Rename to torchtitan (pytorch#221)

    Trying out a full renaming pass from torchtrain -> torchtitan,
    including:
    1. directory structure
    2. all names inside the repo itself.
    wanchaol authored Apr 11, 2024
    Full SHA: c22d1a8

Commits on Apr 12, 2024

  1. Full SHA: 55a0187
  2. Add 1 sec delay to rank 0 cleanup (pytorch#224)

    Add the delay as a short-term workaround for the TCPStore cleanup sync issue
    (pytorch/pytorch#123969)
    Test:
    Ran `TORCH_NCCL_ABORT_IN_DESTROY_PG=1
    CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 LOG_RANK=0,1,2,3
    ./run_llama_train.sh --checkpoint.folder
    ./test_runner_checkpoint_full_checkpoint` 10 times w/o failure.
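    
    A minimal sketch of the workaround's shape (illustrative only, assuming a typical torch.distributed teardown rather than the actual train.py code):
    
    ```
    import time
    
    import torch.distributed as dist
    
    def cleanup(delay_s: float = 1.0) -> None:
        # Sync all ranks, then let rank 0 linger briefly before tearing down the
        # process group, so the other ranks can finish their own shutdown before
        # rank 0 (which typically hosts the TCPStore) exits.
        dist.barrier()
        if dist.get_rank() == 0:
            time.sleep(delay_s)
        dist.destroy_process_group()
    ```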
    gnadathur authored Apr 12, 2024
    Full SHA: 2373509
  3. [Torchtrain][Checkpoint] Add support to allow dtype conversion (pytorch#222)
    
    Adds a checkpoint.export_dtype field: we allow dtype conversion only
    when checkpointing model weights only and the current dtype is not
    the same as the export dtype at the end of the training.
    
    Also adds a change to get rid of the `freqs_cis` buffer when exporting.
    
    
    We can see that with export_dtype=bf16, the model weights are about half
    the size compared to export_dtype=fp32.
    ```
    # model_weights_only=false
    (pytorch-3.10) [[email protected] ~/local/torchtrain (add_export_dtype)]$ du -h test_runner_checkpoint_full_checkpoint
    212M    test_runner_checkpoint_full_checkpoint/step-5
    212M    test_runner_checkpoint_full_checkpoint/step-10
    212M    test_runner_checkpoint_full_checkpoint/step-15
    212M    test_runner_checkpoint_full_checkpoint/step-20
    846M    test_runner_checkpoint_full_checkpoint
    
    # model_weights_only=true and export_dtype = fp32
    (pytorch-3.10) [[email protected] ~/local/torchtrain (add_export_dtype)]$ du -h test_runner_checkpoint_model_weights_only
    212M    test_runner_checkpoint_model_weights_only/step-5
    70M     test_runner_checkpoint_model_weights_only/step-10
    281M    test_runner_checkpoint_model_weights_only
    
    # model_weights_only=true and export_dtype = bf16
    (pytorch-3.10) [[email protected] ~/local/torchtrain (add_export_dtype)]$ du -h test_runner_checkpoint_model_weights_only_bf16
    212M    test_runner_checkpoint_model_weights_only_bf16/step-5
    35M     test_runner_checkpoint_model_weights_only_bf16/step-10
    247M    test_runner_checkpoint_model_weights_only_bf16
    ```
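    
    A rough sketch of the export path described above (a hypothetical helper, not the actual torchtitan code): cast floating-point weights to the export dtype and drop the `freqs_cis` buffer before saving.
    
    ```
    import torch
    
    def export_model_weights(model: torch.nn.Module, export_dtype: torch.dtype = torch.bfloat16) -> dict:
        # Keep model weights only, cast floating-point tensors to the export
        # dtype, and skip the freqs_cis buffer entirely.
        weights = {
            name: tensor.to(export_dtype) if tensor.is_floating_point() else tensor
            for name, tensor in model.state_dict().items()
            if "freqs_cis" not in name
        }
        return {"model": weights}
    ```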
    wz337 authored Apr 12, 2024
    Full SHA: fd5ad5a
  4. Full SHA: 009b14f

Commits on Apr 15, 2024

  1. codebase cleanup

    ghstack-source-id: 33295ce9c9038163e903867cd81799e8848cc749
    Pull Request resolved: pytorch#228
    tianyu-l committed Apr 15, 2024
    Full SHA: c7d5865

Commits on Apr 16, 2024

  1. Update README to reflect positioning (pytorch#229)

    as titled, update README to reflect our positioning for the repo
    wanchaol authored Apr 16, 2024
    Full SHA: f86bfb2
  2. First release readme (pytorch#227)

    Reworked the README to highlight the first release and feature set.
    Q: use our logo? (I think it adds some spark.)
    
    Visual preview:
    <img width="898" alt="Screenshot 2024-04-14 at 7 02 39 PM"
    src="https://github.com/pytorch/torchtitan/assets/46302957/60b4b6a8-c4f3-41a8-8d8d-27b924f8de15">
    lessw2020 authored Apr 16, 2024
    Full SHA: a10262a
  3. Full SHA: a0a7ff7
  4. use permalink for logo image (pytorch#232)

    Update the logo to a permalink to ensure it is viewable by all.
    lessw2020 authored Apr 16, 2024
    Full SHA: d8b7c7f
  5. [TorchTitan][Checkpoint] Move checkpoint folder under dump_folder and a few config updates (pytorch#230)
    
    Let CheckpointManager take the entire job_config as an arg so we can keep
    train.py a little bit cleaner (see the sketch after the list below).
    
    Discussed with @tianyu-l and made a few additional changes, including:
    1. Rename "run_profiler" to "enable_profiling"
    2. Add an "enable_checkpoint" flag so it is consistent to
    "enable_profiling" or "enable_tensorboard". We feel like this is a
    little bit more explicit.
    3. Change the default checkpoint folder to be ".outputs/checkpoint" when
    checkpoint is enabled.
    4. Rename "folder" in [checkpiont]" to be "checkpoint_folder"
    5. Change save_traces_folder to be "./outputs/profile_trace" from
    ".outputs/profiling/traces".
    wz337 authored Apr 16, 2024
    Full SHA: 6596219
  6. use combo of html and local file src for logo (pytorch#234)

    It seems the permalink for the logo is not fully working as expected,
    thus switching to a combo of HTML plus a local file reference for src.
    lessw2020 authored Apr 16, 2024
    Configuration menu
    Copy the full SHA
    1601d35 View commit details
    Browse the repository at this point in the history
  7. add performance -- infra metrics and loss curves (pytorch#237) (pytor…

    …ch#238)
    
    Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
    bottom):
    * __->__ pytorch#237
    
    
    WPS / MFU numbers, and loss curves jobs can be found from this tracking
    [spreadsheet](https://docs.google.com/spreadsheets/d/11kcula5ybuABSZkm2OlFng5NQ9_rnVB-KRyeQq6P7fo/edit#gid=0).
    
    Co-authored-by: tianyu-l <[email protected]>
    lessw2020 and tianyu-l authored Apr 16, 2024
    Configuration menu
    Copy the full SHA
    63d752b View commit details
    Browse the repository at this point in the history
  8. Configuration menu
    Copy the full SHA
    10b572d View commit details
    Browse the repository at this point in the history
  9. Configuration menu
    Copy the full SHA
    7781fd7 View commit details
    Browse the repository at this point in the history
  10. Configuration menu
    Copy the full SHA
    441b33f View commit details
    Browse the repository at this point in the history
  11. Update README (pytorch#242)

    wanchaol authored Apr 16, 2024
    Configuration menu
    Copy the full SHA
    53dc5eb View commit details
    Browse the repository at this point in the history
  12. Add torchtune checkpoint link, modify product position statement loca…

    …tion (pytorch#241)
    
    This PR:
    1 - adds a feature note and link to the checkpoint doc on supporting
    torchtitan weights being saved and loaded into torchtune for fine
    tuning.
    2 - moves the product position info from top of page to bottom.
    lessw2020 authored Apr 16, 2024
    Configuration menu
    Copy the full SHA
    16701c3 View commit details
    Browse the repository at this point in the history
  13. Configuration menu
    Copy the full SHA
    b889f3d View commit details
    Browse the repository at this point in the history
  14. minor doc updates - remove asynch checkpt ref, grammar on prod positi…

    …on, update checkpointing from 5 to 500 (pytorch#243)
    
    3 minor readme / doc updates. 
    1 - remove : and please note from product position statement.
    2 - remove (asynch checkpointing) from current feature listing of dist
    checkpointing (it's noted as pending feature).
    3 - update default checkpoint interval from 5 to 500
    lessw2020 authored Apr 16, 2024
    Configuration menu
    Copy the full SHA
    b60c6bd View commit details
    Browse the repository at this point in the history
  15. Fix multi-line string usage (pytorch#244)

    Summary: use `"""` for multi-line strings instead of tuple syntax, which
    breaks argparse.
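    For reference, a minimal repro of the failure mode (illustrative snippet, not the fixed torchtitan code): parenthesized, comma-separated string literals form a tuple rather than one string, so argparse would receive a tuple for its help text.

    ```python
    # Trailing commas make this a 2-element tuple, not a single string.
    bad_help = (
        "The first line of help text,",
        "and the second line.",
    )
    # A triple-quoted string stays a single string.
    good_help = """The first line of help text,
    and the second line."""

    assert isinstance(bad_help, tuple)
    assert isinstance(good_help, str)
    ```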
    
    Test Plan: ```
    ============================= test session starts
    ============================== platform linux -- Python 3.10.14,
    pytest-8.1.1, pluggy-1.4.0 --
    /home/gnadathur/local/a/pytorch-env/bin/python cachedir: .pytest_cache
    hypothesis profile 'default' ->
    database=DirectoryBasedExampleDatabase(PosixPath('/data/users/gnadathur/a/torchtitan/.hypothesis/examples'))
    benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False
    min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10
    warmup=False warmup_iterations=100000) rootdir:
    /data/users/gnadathur/a/torchtitan
    configfile: pyproject.toml
    plugins: hypothesis-6.100.1, benchmark-4.0.0, typeguard-4.2.1,
    cov-5.0.0, hydra-core-1.3.2 collecting ... collected 6 items
    
    test/test_job_config.py::TestJobConfig::test_command_line_args PASSED [
    16%]
    test/test_job_config.py::TestJobConfig::test_job_config_file PASSED [
    33%]
    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    PASSED [ 50%]
    test/test_job_config.py::TestJobConfig::test_empty_config_file PASSED [
    66%]
    
    test/test_job_config.py::TestJobConfig::test_job_config_file_cmd_overrides
    PASSED [ 83%]
    test/test_job_config.py::TestJobConfig::test_print_help PASSED [100%]
    
    ---------- coverage: platform linux, python 3.10.14-final-0 ----------
    Coverage XML written to file coverage.xml
    
    
    ============================= slowest 20 durations
    =============================
    0.00s call     test/test_job_config.py::TestJobConfig::test_print_help
    0.00s call
    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    0.00s call test/test_job_config.py::TestJobConfig::test_job_config_file
    0.00s call
    test/test_job_config.py::TestJobConfig::test_job_config_file_cmd_overrides
    0.00s call
    test/test_job_config.py::TestJobConfig::test_empty_config_file
    0.00s call
    test/test_job_config.py::TestJobConfig::test_command_line_args
    0.00s setup
    test/test_job_config.py::TestJobConfig::test_command_line_args
    0.00s teardown
    test/test_job_config.py::TestJobConfig::test_command_line_args
    0.00s teardown
    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    0.00s teardown
    test/test_job_config.py::TestJobConfig::test_job_config_file
    0.00s setup
    test/test_job_config.py::TestJobConfig::test_job_config_file_cmd_overrides
    0.00s setup test/test_job_config.py::TestJobConfig::test_job_config_file
    0.00s teardown test/test_job_config.py::TestJobConfig::test_print_help
    0.00s setup
    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    0.00s setup
    test/test_job_config.py::TestJobConfig::test_empty_config_file
    0.00s setup    test/test_job_config.py::TestJobConfig::test_print_help
    0.00s teardown
    test/test_job_config.py::TestJobConfig::test_job_config_file_cmd_overrides
    0.00s teardown
    test/test_job_config.py::TestJobConfig::test_empty_config_file
    ============================== 6 passed in 0.19s
    ===============================
    ```
    gnadathur authored Apr 16, 2024
    Configuration menu
    Copy the full SHA
    09d0047 View commit details
    Browse the repository at this point in the history
  16. polish toml files

    ghstack-source-id: 287d31e9a14861244f1292f61604a296fb7d4e11
    Pull Request resolved: pytorch#245
    tianyu-l committed Apr 16, 2024
    Configuration menu
    Copy the full SHA
    c9454d3 View commit details
    Browse the repository at this point in the history
  17. Configuration menu
    Copy the full SHA
    9537825 View commit details
    Browse the repository at this point in the history

Commits on Apr 17, 2024

  1. fix default max_seq_len for freq_cis init (pytorch#248)

    as titled, it looks like the llama2 default is 2048 instead of the current
    number (source
    https://github.com/meta-llama/llama/blob/main/llama/model.py#L31)
    wanchaol authored Apr 17, 2024
    Configuration menu
    Copy the full SHA
    7af51cf View commit details
    Browse the repository at this point in the history
  2. set max_seq_len before training to make it align with input data (pyt…

    …orch#249)
    
    as titled, we need to set this to get the accurate seq_length from
    the dataloader config. This ensures max_seq_len is always correct,
    so that the rope init is always correct.
    
    <img width="946" alt="Screenshot 2024-04-17 at 1 00 29 PM"
    src="https://github.com/pytorch/torchtitan/assets/9443650/39942187-cf37-4cef-b380-644a1a9b9d35">
    wanchaol authored Apr 17, 2024
    Configuration menu
    Copy the full SHA
    0c655b8 View commit details
    Browse the repository at this point in the history
  3. fix pypi docs

    ghstack-source-id: e7f7f4d6f1685072ded6da899bac3ed1ba22dffa
    Pull Request resolved: pytorch#247
    tianyu-l committed Apr 17, 2024
    Configuration menu
    Copy the full SHA
    9949284 View commit details
    Browse the repository at this point in the history

Commits on Apr 18, 2024

  1. update dataset to use c4

    ghstack-source-id: 7c390da9d746a75a8c93811c21fb92fb418ae08b
    Pull Request resolved: pytorch#252
    tianyu-l committed Apr 18, 2024
    Configuration menu
    Copy the full SHA
    bfe9998 View commit details
    Browse the repository at this point in the history
  2. Add c4_mini, a local 45K dataset (subset of c4) (pytorch#253)

    This PR adds a 45K (and thus just under the github 100MB limit) local
    dataset.
    This enables:
    a - a ready-to-run dataset for users to run the debug model with
    b - a local dataset for CI
    c - a dataset that does not rely on a HuggingFace connection (recall when
    HF went down and everything came to a halt).
    
    <img width="1275" alt="Screenshot 2024-04-17 at 8 09 13 PM"
    src="https://github.com/pytorch/torchtitan/assets/46302957/89df4ea8-37f4-4705-a6ed-4ca9415409f3">
    lessw2020 authored Apr 18, 2024
    Configuration menu
    Copy the full SHA
    f80223b View commit details
    Browse the repository at this point in the history
  3. remove logo, update pre-release date to 4/18 (pytorch#254)

    as per title - remove logo until we have marketing approval and update
    readme pre-release date from 4/16 to 4/18.
    lessw2020 authored Apr 18, 2024
    Configuration menu
    Copy the full SHA
    6926922 View commit details
    Browse the repository at this point in the history
  4. add intro video (pytorch#233)

    Testing embedding a video into the readme.
    Note that embedded videos are not supported, so the best we can do here
    is mimic one with a thumbnail and play button, which then jumps you to YT
    to play the video.
    lessw2020 authored Apr 18, 2024
    Configuration menu
    Copy the full SHA
    d6f72e2 View commit details
    Browse the repository at this point in the history
  5. add performance file to show convergence with 64 a100s (pytorch#255)

    add performance.md to show the convergence curves (file is from
    @tianyu-l ).
    lessw2020 authored Apr 18, 2024
    Configuration menu
    Copy the full SHA
    395a526 View commit details
    Browse the repository at this point in the history

Commits on Apr 20, 2024

  1. Support Llama3 8b/70b (pytorch#256)

    This PR adds support for Llama3 8b/70b, mainly it:
    - add tiktoken tokenizer, add instructions to download the tokenizer
    - add options for the llama model to support Llama3
    - add Llama3 8b/70b configs
    wanchaol authored Apr 20, 2024
    Configuration menu
    Copy the full SHA
    df2dcc7 View commit details
    Browse the repository at this point in the history

Commits on Apr 22, 2024

  1. polish llama 3 setup

    ghstack-source-id: 4dd1cdb033e840e00cacd98339780424231b595b
    Pull Request resolved: pytorch#257
    tianyu-l committed Apr 22, 2024
    Configuration menu
    Copy the full SHA
    2db26cf View commit details
    Browse the repository at this point in the history

Commits on Apr 23, 2024

  1. reenable integration tests with a test tokenizer (pytorch#259)

    as titled, the test tokenizer borrowed from torchtune
    https://github.com/pytorch/torchtune/blob/main/tests/assets/tiktoken_small.model,
    where this small test model is offline generated from
    https://gist.github.com/ebsmothers/54b133dd87db6679b14318545aaa2de4 so
    it should have no correlation with any specific model/data
    wanchaol authored Apr 23, 2024
    Configuration menu
    Copy the full SHA
    4b60829 View commit details
    Browse the repository at this point in the history

Commits on Apr 24, 2024

  1. Configuration menu
    Copy the full SHA
    b2ee158 View commit details
    Browse the repository at this point in the history
  2. De-dup repeated freqs_cis computation code

    ghstack-source-id: b4fe7f63f15bab367cf00b5d408eb43c640541c2
    Pull Request resolved: pytorch#262
    awgu committed Apr 24, 2024
    Configuration menu
    Copy the full SHA
    3b51460 View commit details
    Browse the repository at this point in the history
  3. update readme.md and performance.md

    ghstack-source-id: a9bd1d33bf7bc9f5055a645c9639bcbe628afbfb
    Pull Request resolved: pytorch#258
    tianyu-l committed Apr 24, 2024
    Configuration menu
    Copy the full SHA
    1ea476e View commit details
    Browse the repository at this point in the history
  4. followup changes to allow unsupported datasets

    ghstack-source-id: 34b380d251e0a80ac5328fdaeb33a1e488f9c735
    Pull Request resolved: pytorch#261
    tianyu-l committed Apr 24, 2024
    Configuration menu
    Copy the full SHA
    f8863bd View commit details
    Browse the repository at this point in the history
  5. fix ac 'checkpointing' spelling, minor spacing tweaks (pytorch#265)

    This PR is mainly to fix the spelling where activation checkpointing is
    missing an n... (**checkpoiting**).
    Not sure how I missed it earlier but it's glaring when you see the
    charts in visual form (vs text).
    
    <img width="578" alt="Screenshot 2024-04-24 at 2 45 25 PM"
    src="https://github.com/pytorch/torchtitan/assets/46302957/a81727b2-07b1-4d69-a0c1-743d74d2aa5a">
    
    fixed:
    <img width="592" alt="Screenshot 2024-04-24 at 3 10 30 PM"
    src="https://github.com/pytorch/torchtitan/assets/46302957/769e51db-4aa6-4dbd-99d8-7e691658e280">
    
    
    Also add a couple line breaks to help with layout, and one or two minor
    grammar updates.
    lessw2020 authored Apr 24, 2024
    Configuration menu
    Copy the full SHA
    157a12c View commit details
    Browse the repository at this point in the history

Commits on Apr 25, 2024

  1. Update legal terms (pytorch#269)

    Update to final legal license terms requested by Meta legal for release.
    lessw2020 authored Apr 25, 2024
    Configuration menu
    Copy the full SHA
    0891fa3 View commit details
    Browse the repository at this point in the history
  2. apply less heavy profiling

    ghstack-source-id: 2b74fe48dbeae0367a41214c6d0e8b1fcd608db8
    Pull Request resolved: pytorch#270
    tianyu-l committed Apr 25, 2024
    Configuration menu
    Copy the full SHA
    aea510d View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    e6d0d08 View commit details
    Browse the repository at this point in the history
  4. Doc Fixes (pytorch#273)

    * Image was very blurry
    * Markdown formatting was off
    * Simplified some sentences
    msaroufim authored Apr 25, 2024
    Configuration menu
    Copy the full SHA
    15057dd View commit details
    Browse the repository at this point in the history

Commits on Apr 26, 2024

  1. fix lr scheduling by checkpointing scheduler

    ghstack-source-id: 606aee2c4815173958b30ca34a3dbf8e90aed8de
    Pull Request resolved: pytorch#275
    tianyu-l committed Apr 26, 2024
    Configuration menu
    Copy the full SHA
    fd01061 View commit details
    Browse the repository at this point in the history
  2. insert barrier to profiler to resolve collectives timeout

    ghstack-source-id: cc29739b147fe1f52bfc5b791330fd7cf1659652
    Pull Request resolved: pytorch#271
    tianyu-l committed Apr 26, 2024
    Configuration menu
    Copy the full SHA
    4333aca View commit details
    Browse the repository at this point in the history
  3. some misc changes (pytorch#278)

    1. update readme
    2. small refactor to loss_parallel part
    wanchaol authored Apr 26, 2024
    Configuration menu
    Copy the full SHA
    a3b529a View commit details
    Browse the repository at this point in the history
  4. inherit stateful protocol where appropriate

    ghstack-source-id: d410f30ec715bfb4206459becb95abeed5a4ae02
    Pull Request resolved: pytorch#281
    tianyu-l committed Apr 26, 2024
    Configuration menu
    Copy the full SHA
    b898545 View commit details
    Browse the repository at this point in the history

Commits on Apr 29, 2024

  1. Fixed docs on HSDP sharding/replication dims

    ghstack-source-id: 77f650e8281dae12f2a7ccdb415be88f9abd88cc
    Pull Request resolved: pytorch#283
    awgu committed Apr 29, 2024
    Configuration menu
    Copy the full SHA
    935b572 View commit details
    Browse the repository at this point in the history
  2. Add more Float8 description (pytorch#284)

    # Summary
    
    Add more the possible options in the configs and add a note on how to
    get the dependency at the top of the file.
    drisspg authored Apr 29, 2024
    Configuration menu
    Copy the full SHA
    f61e0ba View commit details
    Browse the repository at this point in the history
  3. Remove unneeded torchvision/audio deps

    ghstack-source-id: dbd201ad2976537487123fa583c86ddab06a7387
    Pull Request resolved: pytorch#250
    wconstab committed Apr 29, 2024
    Configuration menu
    Copy the full SHA
    8697234 View commit details
    Browse the repository at this point in the history

Commits on Apr 30, 2024

  1. fix 3d mesh order (pytorch#288)

    as titled, fixes pytorch#286
    wanchaol authored Apr 30, 2024
    Configuration menu
    Copy the full SHA
    a6d2625 View commit details
    Browse the repository at this point in the history
  2. unify data loading from HF and from disk

    ghstack-source-id: 932e7cce828a15c788b34f07c264e119068777fe
    Pull Request resolved: pytorch#287
    tianyu-l committed Apr 30, 2024
    Configuration menu
    Copy the full SHA
    258f608 View commit details
    Browse the repository at this point in the history

Commits on May 1, 2024

  1. Add periodic integration test with signal (pytorch#289)

    Runs the integration test hourly and updates signal badge. Tested on
    existing integration test. I will update the badge with periodic test
    signal once workflow has landed in this PR.
    <img width="516" alt="Screenshot 2024-04-30 at 6 12 00 PM"
    src="https://github.com/pytorch/torchtitan/assets/1779702/8adaab3d-df18-483d-a39f-5af316b7edbc">
    gnadathur authored May 1, 2024
    Configuration menu
    Copy the full SHA
    10ef7a6 View commit details
    Browse the repository at this point in the history

Commits on May 2, 2024

  1. exclude embedding in MFU computation

    ghstack-source-id: 9daa99020c76fdfe429b6a9ee6d44fd1dd319fc3
    Pull Request resolved: pytorch#280
    tianyu-l committed May 2, 2024
    Configuration menu
    Copy the full SHA
    0c6ca90 View commit details
    Browse the repository at this point in the history
  2. Add support for seed checkpoint creation for meta-init flow

    Adds a new command, ./create_seed_checkpoint.sh, which largely
    reuses code inside train.py to create the model and then save its
    initial state as a step-0 checkpoint for use with the meta-initialization
    loading flow.
    
    ghstack-source-id: 3e1aa9eab847c1f1341f22772ca8ae3688883454
    Pull Request resolved: pytorch#172
    wconstab committed May 2, 2024
    Configuration menu
    Copy the full SHA
    e34d2ac View commit details
    Browse the repository at this point in the history
  3. remove unnecessary install of torchtitan

    ghstack-source-id: fa9aaf337b5489d88945f15b65a8ba8cc544ded6
    Pull Request resolved: pytorch#295
    tianyu-l committed May 2, 2024
    Configuration menu
    Copy the full SHA
    1480766 View commit details
    Browse the repository at this point in the history
  4. Remove unnecessary .to() inside model forward

    This appears to be a holdover from a previous way the initialization
    worked.
    
    freqs_cis should already be on gpu device after initialization.
    
    ghstack-source-id: 7159320d4ecfb436bd2193277a88c04d136e9ad0
    Pull Request resolved: pytorch#298
    wconstab committed May 2, 2024
    Configuration menu
    Copy the full SHA
    add0261 View commit details
    Browse the repository at this point in the history

Commits on May 3, 2024

  1. Fix the incorrect step log for profiler after resuming from a checkpo…

    …int (pytorch#293)
    
    Summary:
    The profiler currently maintains a counter locally and that counter is
    not synchronized with the checkpointed train step. This PR fixes the
    issue.
    fegin authored May 3, 2024
    Configuration menu
    Copy the full SHA
    3e2fa85 View commit details
    Browse the repository at this point in the history
  2. turn off dynamic shape for torch.compile (pytorch#297)

    as titled. This should make 1-D and 2-D work with the latest main
    build. Thanks @bdhirsh for all the fixes!
    
    We should figure out why dynamic shape gets turned on as a follow up
    wanchaol authored May 3, 2024
    Configuration menu
    Copy the full SHA
    5e84866 View commit details
    Browse the repository at this point in the history
  3. Renamed bsz to bs for consistency; removed dead code

    ghstack-source-id: bbedad3819ab9ef90b233209c34dd1dbc846b06a
    Pull Request resolved: pytorch#299
    awgu committed May 3, 2024
    Configuration menu
    Copy the full SHA
    8996249 View commit details
    Browse the repository at this point in the history

Commits on May 7, 2024

  1. Implement async_checkpoint

    Summary:
    This PR implements 2 different async checkpoint approaches. The first uses
    DCP.async_save; the other uses pinned memory + a separate process
    to avoid GIL issues.
    
    ghstack-source-id: 87fb6c28d7bc3e514c0bee7646be5188f1f66bbd
    Pull Request resolved: pytorch#313
    fegin committed May 7, 2024
    Configuration menu
    Copy the full SHA
    5d63fff View commit details
    Browse the repository at this point in the history

Commits on May 8, 2024

  1. simplify embedding + first transformer block TP (pytorch#314)

    as titled, we can directly specify the rowwise parallel embedding output
    layouts to be sharded on the sequence dim, so that we don't need the
    prepare-input step on the first layer.
    
    Switching to output_layouts = Shard(1) would also trigger reduce_scatter
    instead of allreduce for embedding layer, which could give some small
    perf wins
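    A sketch of the resulting TP plan entry, assuming the `tok_embeddings` module name used in this codebase and recent DTensor import paths (the exact plan may differ):

    ```python
    from torch.distributed._tensor import Replicate, Shard
    from torch.distributed.tensor.parallel import RowwiseParallel

    # Rowwise-parallel embedding whose output is sharded on the sequence dim
    # (Shard(1)), so the collective becomes a reduce_scatter instead of an
    # all_reduce and downstream layers consume sequence-sharded activations.
    embedding_plan = {
        "tok_embeddings": RowwiseParallel(
            input_layouts=Replicate(),
            output_layouts=Shard(1),
        ),
    }
    ```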
    wanchaol authored May 8, 2024
    Configuration menu
    Copy the full SHA
    26ff44f View commit details
    Browse the repository at this point in the history

Commits on May 10, 2024

  1. Only include checkpoints that have .metadata written (pytorch#315)

    .metadata may be missing in some checkpoints if some ranks did not
    checkpoint properly. This PR filters out checkpoints that do not have
    .metadata in them.
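    A minimal sketch of the filtering rule, assuming the step-N checkpoint folder layout (hypothetical helper, not the exact code):

    ```python
    import os
    import re

    def discover_complete_checkpoints(checkpoint_dir: str) -> list[str]:
        # A step-N folder only counts as a usable checkpoint if DCP finished
        # writing it, i.e. the .metadata file exists inside it.
        complete = []
        for name in sorted(os.listdir(checkpoint_dir)):
            path = os.path.join(checkpoint_dir, name)
            if re.fullmatch(r"step-\d+", name) and os.path.isfile(
                os.path.join(path, ".metadata")
            ):
                complete.append(path)
        return complete
    ```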
    liangluofb authored May 10, 2024
    Configuration menu
    Copy the full SHA
    ad46097 View commit details
    Browse the repository at this point in the history

Commits on May 13, 2024

  1. Refactor freqs_cis slice to be safer for PP

    Unchanged: we precompute freqs_cis for max_seqlen, >> seqlen for a given
    batch.
    
    Changed: instead of slicing self.freqs_cis down to seqlen at top level
    transformer based on the input token shape, we slice it down to seqlen
    inside a transformer layer after we have re-expanded to the full seqlen
    in cases where TP has sharded across seqlen.
    
    In the PP case, stage 1's input may be seqlen/TP instead of seqlen, but
    we do not generally know this.  That makes it hard for stage1 to slice
    freqs_cis correctly.  It's easy to do the slicing deeper inside, since
    at that point we do know the full seqlen unambiguously.
    
    Note: the full self.freqs_cis is stored in memory either way, and the
    thing passed into every layer is just a view. This change should not be
    material for memory usage or otherwise.
    
    ghstack-source-id: 20ef05e0734e53260366878dfe0fac5e1ab48f1d
    Pull Request resolved: pytorch#321
    wconstab committed May 13, 2024
    Configuration menu
    Copy the full SHA
    99729e9 View commit details
    Browse the repository at this point in the history
  2. Make Transformer tolerate missing layers for PP

    A few small changes here let the manual PP frontend 'reconfigure' a whole
    transformer model to a stage's portion simply by setting undesired
    layers to None (in cases of top level layers) or deleting them from the
    ModuleDict (for 'layers.*').
    
    These changes don't impact the FQNs of the remaining layers, which is
    critical for checkpoint load/save compatibility.
    
    ghstack-source-id: 48a7aafc89d86c3168f905560a4cd6bf4b5b9a12
    Pull Request resolved: pytorch#322
    wconstab committed May 13, 2024
    Configuration menu
    Copy the full SHA
    14d422f View commit details
    Browse the repository at this point in the history

Commits on May 15, 2024

  1. Use torch generic workflow for CI

    ghstack-source-id: b1fa8d8c1778ecb532ed71792ead9f4dbb067cf4
    Pull Request resolved: pytorch#325
    wconstab committed May 15, 2024
    Configuration menu
    Copy the full SHA
    ac94484 View commit details
    Browse the repository at this point in the history
  2. [checkpointing] import async checkpoint with pinned memory only when …

    …needed
    
    ghstack-source-id: e460a8d6458f191f7f589fc908974f896b514690
    Pull Request resolved: pytorch#333
    tianyu-l committed May 15, 2024
    Configuration menu
    Copy the full SHA
    41d69d2 View commit details
    Browse the repository at this point in the history

Commits on May 16, 2024

  1. Add a workflow to build torchtitan-ubuntu-20.04-clang12 Docker image …

    …for CI (pytorch#338)
    
    Adopt from PyTorch, this workflow will prepare the Docker image
    `torchtitan-ubuntu-20.04-clang12` for the CI.
    
    * Base on
    https://hub.docker.com/layers/nvidia/cuda/12.1.0-cudnn8-runtime-ubuntu20.04/images/sha256-35d5a8eb50ad37fe707a7611a4e20414c5bd2f168adca0cf1700fe2d58411759
    to include NVIDIA dependencies.
    * Install `dev-requirements.txt` and `requirements.txt`. I need to move
    these files from the top level to the `.ci/docker` directory and create
    softlinks for them because the docker build process will only look at
    `.ci/docker`. This is why PyTorch keeps its CI requirements
    files there.
    * Install clang or gcc
    * Install conda (with python 3.11)
    
    `torchtitan-ubuntu-20.04-clang12` can then be used as the input for
    `docker-image`.
    huydhn authored May 16, 2024
    Configuration menu
    Copy the full SHA
    6ed5237 View commit details
    Browse the repository at this point in the history

Commits on May 17, 2024

  1. Make pip install torch quiet

    ghstack-source-id: 55302fd52dd6ee452c795e89170d0b1299218c87
    Pull Request resolved: pytorch#342
    wconstab committed May 17, 2024
    Configuration menu
    Copy the full SHA
    2dca85e View commit details
    Browse the repository at this point in the history
  2. Make test_runner.py warn on non-empty output dir

    also wrap logic into functions and clean up global vars
    
    ghstack-source-id: 815c582011611a71005cc22bbd14310900465377
    Pull Request resolved: pytorch#343
    wconstab committed May 17, 2024
    Configuration menu
    Copy the full SHA
    3baba7b View commit details
    Browse the repository at this point in the history

Commits on May 21, 2024

  1. Expose mixed_precision dtype arguments

    add training.mixed_precision_param and .mixed_precision_reduce options
    
    refactor a util to map strings to torch dtypes
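    A hedged sketch of such a string-to-dtype util (illustrative names; the actual helper may accept more aliases):

    ```python
    import torch

    # Map config strings to torch dtypes in one central place.
    _STR_TO_DTYPE = {
        "float32": torch.float32,
        "bfloat16": torch.bfloat16,
        "float16": torch.float16,
    }

    def string_to_dtype(name: str) -> torch.dtype:
        return _STR_TO_DTYPE[name.lower()]
    ```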
    
    ghstack-source-id: 387e1ca13ad23e859d21d7760f858ee6e269a796
    Pull Request resolved: pytorch#348
    wconstab committed May 21, 2024
    Configuration menu
    Copy the full SHA
    5c69c02 View commit details
    Browse the repository at this point in the history
  2. Use stateful dataloader to checkpoint data iteration order and token …

    …buffer (pytorch#279)
    
    Summary: 
    
    Use the stateful_dataloader from torchdata
    (https://github.com/pytorch/data/tree/main/torchdata/stateful_dataloader)
    for storing the token buffer and iteration data order. It requires a
    dependency on the nightly build of torchdata >= 20240426.
    
    Also make sure the dataloader state has a different key per rank.
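    A minimal sketch of the per-rank keying, assuming torchdata's StatefulDataLoader from the nightly noted above (hypothetical helper name):

    ```python
    from torchdata.stateful_dataloader import StatefulDataLoader

    def dataloader_state_for_checkpoint(dl: StatefulDataLoader, rank: int) -> dict:
        # Store each rank's dataloader state (token buffer + iteration order)
        # under a rank-specific key so states are not collapsed across ranks.
        return {f"dataloader_rank_{rank}": dl.state_dict()}
    ```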
    
    Test Plan:
    
    Tested locally by first running 30 steps (checkpointing every 5 steps)
    and capturing all the loss values. Then deleting the last 3 checkpoints
    and then re-run the training and the loss values from step 16-30 match
    with what we had earlier in the first run. Note that this requires
    changes in the train.py to enable a deterministic run.
    
    Reviewers: @tianyu-l 
    
    Subscribers: @andrewkho 
    
    Tasks:
    
    Tags:
    gokulavasan authored May 21, 2024
    Configuration menu
    Copy the full SHA
    8cc0b38 View commit details
    Browse the repository at this point in the history
  3. Add Pipeline Parallel (and 2D PP+FSDP) support

    runs PP+DP and PP+TP without issue,
    runs PP+TP+DP with decreasing loss, but fails DCP save
    
    Supports only simple schedules currently, gpipe and 1f1b.
    
    Adds a cmdline/toml arg for specifying split points, in a unified
    way between the tracer and manual frontends.

      e.g. the user can specify "layers.2,layers.4" as split points.

    Currently uses the manual frontend by default, but allows specifying the
    tracer frontend. The tracer frontend requires working around additional
    compatibility limitations, indicated by raising assertions, and is
    not ready for wider use yet.
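    A minimal sketch of parsing that unified value (hypothetical helper, not the actual frontend code):

    ```python
    def parse_split_points(arg: str) -> list[str]:
        # "layers.2,layers.4" -> ["layers.2", "layers.4"]
        return [fqn.strip() for fqn in arg.split(",") if fqn.strip()]

    assert parse_split_points("layers.2,layers.4") == ["layers.2", "layers.4"]
    ```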
    
    ghstack-source-id: d7e0a1342bc97d6f1bba9e647234d90688ad708f
    Pull Request resolved: pytorch#318
    wconstab committed May 21, 2024
    Configuration menu
    Copy the full SHA
    aafe0e8 View commit details
    Browse the repository at this point in the history

Commits on May 22, 2024

  1. fix periodic integration test and add helper message on torchdata i…

    …mport failure
    
    ghstack-source-id: 4db9ec111c83f7873253f19f0c95a997800e0f6b
    Pull Request resolved: pytorch#353
    tianyu-l committed May 22, 2024
    Configuration menu
    Copy the full SHA
    60f58b9 View commit details
    Browse the repository at this point in the history
  2. torch.compile each TransformerBlock instead of the whole model (pytor…

    …ch#268)
    
    This way we could temporarily enable 2-D parallel compile, and it might
    make sense to do transformer block compile in the future with PP (which
    we'll see).
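    A minimal sketch of the per-block compile, assuming the model keeps its TransformerBlocks in `model.layers` (a ModuleDict here); this is illustrative, not the exact parallelize code:

    ```python
    import torch

    def compile_each_block(model: torch.nn.Module) -> None:
        # Compile each TransformerBlock individually instead of wrapping the
        # whole model in torch.compile; dynamic shapes are disabled to match
        # the current 2-D parallel limitations.
        for name, block in model.layers.named_children():
            model.layers.register_module(name, torch.compile(block, dynamic=False))
    ```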
    
    We should figure out:
    1. dynamic shape issue when turning on 2D parallel
    2. full model compile issue for 2D parallel compile
    3. cache reusing currently does not work, enable it later
    wanchaol authored May 22, 2024
    Configuration menu
    Copy the full SHA
    9954e19 View commit details
    Browse the repository at this point in the history
  3. Make test_runner use separate logger with default INFO

    previous change to use logging from torchtitan caused stdout not
    to show up.
    
    ghstack-source-id: 30a77c59ba68043ffa844be0443d5351d9584fab
    Pull Request resolved: pytorch#352
    wconstab committed May 22, 2024
    Configuration menu
    Copy the full SHA
    f47f442 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    93a8053 View commit details
    Browse the repository at this point in the history
  5. Fix bug in PP output layer shape

    Mostly harmless bug: since the output shape of the last layer is not used
    for send/recv purposes, the runtime value overrides it no matter what value
    you configured it with.
    
    However, since adding in/out shape validation to pipeline lib in torch,
    this raises an error and has to be fixed.
    
    ghstack-source-id: 950e41529b7b506085ab280d8a492e345eaefd24
    Pull Request resolved: pytorch#354
    wconstab committed May 22, 2024
    Configuration menu
    Copy the full SHA
    0afb276 View commit details
    Browse the repository at this point in the history

Commits on May 23, 2024

  1. Update pipelining import after change on pytorch

    APIs conform to the pytorch rules.  This PR should be able to land
    safely after tonight's nightly pytorch build which includes the above
    PR.
    
    ghstack-source-id: c575bc7835472128c09798544caa38bf1908e5ca
    Pull Request resolved: pytorch#356
    wconstab committed May 23, 2024
    Configuration menu
    Copy the full SHA
    c73a59d View commit details
    Browse the repository at this point in the history

Commits on May 24, 2024

  1. update .gitignore to screen out slew of new temp files (pytorch#359)

    After updating today, I found a whole slew of various new temp files
    clogging up my source tab.
    This PR screens these out so that they don't accidentally get added in a
    PR and keeps your source tab change count correct.
    
    Example of issue without this PR:
    <img width="780" alt="Screenshot 2024-05-23 at 9 21 55 PM"
    src="https://github.com/pytorch/torchtitan/assets/46302957/41b7061a-41a0-4a95-938b-3fd9292a2f38">
    
    vs with this PR:
    <img width="661" alt="Screenshot 2024-05-23 at 10 07 16 PM"
    src="https://github.com/pytorch/torchtitan/assets/46302957/cccf8c5f-368d-40a8-b10f-f11ca37df2bc">
    lessw2020 authored May 24, 2024
    Configuration menu
    Copy the full SHA
    c161119 View commit details
    Browse the repository at this point in the history
  2. Add test for PP tracer frontend

    - switch to using public PipelineStage API
    - clean up some asserts in tracer codepath
    
    ghstack-source-id: 2d069b7d45c4f3c788dec8fc85d8a7e83e463fcd
    Pull Request resolved: pytorch#357
    wconstab committed May 24, 2024
    Configuration menu
    Copy the full SHA
    e593e7d View commit details
    Browse the repository at this point in the history

Commits on May 29, 2024

  1. only produce tensorboard logs on rank 0 by default

    ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243
    Pull Request resolved: pytorch#339
    tianyu-l committed May 29, 2024
    Configuration menu
    Copy the full SHA
    0779207 View commit details
    Browse the repository at this point in the history
  2. replace old torch dependency in requirements.txt

    ghstack-source-id: 8cbd62b97816ae8185b8a7e1aa9a7505f2780525
    Pull Request resolved: pytorch#372
    tianyu-l committed May 29, 2024
    Configuration menu
    Copy the full SHA
    f6ea139 View commit details
    Browse the repository at this point in the history

Commits on May 30, 2024

  1. Add --test option to specify test to run (pytorch#368)

    Usage:
    `--test <test_id>`
    
    Acceptable values: `test_id` in `build_test_list` (default: all)
    
    Example:
    ```
    rm -rf outputs && python test_runner.py outputs --test pp_gpipe
    ```
    kwen2501 authored May 30, 2024
    Configuration menu
    Copy the full SHA
    0fff2d2 View commit details
    Browse the repository at this point in the history
  2. use integration test as the badge shown on the homepage

    ghstack-source-id: 775591945ff5427cb7e5e9fc7592952b4c746341
    Pull Request resolved: pytorch#373
    tianyu-l committed May 30, 2024
    Configuration menu
    Copy the full SHA
    1877738 View commit details
    Browse the repository at this point in the history

Commits on May 31, 2024

  1. keep only latest k checkpoints (pytorch#366)

    Adds a config that purges old checkpoints. Useful for pretraining with
    frequent checkpointing and large step counts.
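    A minimal sketch of such a purge, assuming the step-N folder layout (hypothetical helper, not the actual implementation):

    ```python
    import os
    import re
    import shutil

    def purge_old_checkpoints(checkpoint_dir: str, keep_latest_k: int) -> None:
        # Delete all but the newest k step-N checkpoint folders; in this
        # sketch, keep_latest_k <= 0 means keep everything.
        if keep_latest_k <= 0:
            return
        steps = sorted(
            int(name.split("-")[1])
            for name in os.listdir(checkpoint_dir)
            if re.fullmatch(r"step-\d+", name)
        )
        for step in steps[:-keep_latest_k]:
            shutil.rmtree(os.path.join(checkpoint_dir, f"step-{step}"))
    ```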
    liangluofb authored May 31, 2024
    Configuration menu
    Copy the full SHA
    c48ae39 View commit details
    Browse the repository at this point in the history

Commits on Jun 3, 2024

  1. Make seed checkpoint creation work on CPU

    ghstack-source-id: 4eb7a6e10812a11c5fd8589e2ff86e5bdb36f968
    Pull Request resolved: pytorch#377
    wconstab committed Jun 3, 2024
    Configuration menu
    Copy the full SHA
    3227d50 View commit details
    Browse the repository at this point in the history
  2. Fix start/stop layer parsing

    ghstack-source-id: 9d52af302c797e9ac81f1113506f3bab261bf312
    Pull Request resolved: pytorch#380
    wconstab committed Jun 3, 2024
    Configuration menu
    Copy the full SHA
    fbc4aa0 View commit details
    Browse the repository at this point in the history
  3. Use general way to access and update submodules

    ghstack-source-id: ba1d77e5825a26632fe9b7509a88b44509cac45f
    Pull Request resolved: pytorch#381
    kwen2501 committed Jun 3, 2024
    Configuration menu
    Copy the full SHA
    ff3c6e2 View commit details
    Browse the repository at this point in the history

Commits on Jun 4, 2024

  1. Make metrics logging work for pipeline parallelism

    Avoid complicating the ux and leave the status quo of 2 user-selectable
    behaviors:
     - log from rank 0 (the default)
     - log from all ranks (not the default)
    
    Modify the meaning of 'log from rank 0' to log from rank 0 in
    non-pipeline parallel runs, and log from the local rank 0 within the
    last pipeline-parallel stage group if pp is enabled.  (note: earlier
    pipeline stages still produce some metrics like mfu/memory, but do not
    compute loss.)
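    A minimal sketch of that rank-selection rule (argument names are illustrative):

    ```python
    def is_metrics_rank(dp_rank: int, tp_rank: int, pp_rank: int, pp_degree: int) -> bool:
        # Without PP, log from global rank 0; with PP, log from the local
        # rank 0 of the last pipeline stage group, where loss is computed.
        on_last_stage = (pp_degree <= 1) or (pp_rank == pp_degree - 1)
        return on_last_stage and dp_rank == 0 and tp_rank == 0
    ```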
    
    ghstack-source-id: 7f60d1045f240327ae41ade3a353aff19d2f289a
    Pull Request resolved: pytorch#383
    wconstab committed Jun 4, 2024
    Configuration menu
    Copy the full SHA
    a1f9edb View commit details
    Browse the repository at this point in the history

Commits on Jun 5, 2024

  1. [RFC] Allow ModelWrapper and OptimizerWrapper to accept multiple models

    and optimizers
    
    ghstack-source-id: 190220813ece188728a3c776e6839a323009f719
    Pull Request resolved: pytorch#360
    fegin authored and wconstab committed Jun 5, 2024
    Configuration menu
    Copy the full SHA
    9d25778 View commit details
    Browse the repository at this point in the history
  2. Add 3D support

    Enables PP+DP+TP and adds CI test case that runs on 8-gpu CI runner.
    
    ghstack-source-id: 7e2d6879d39e78fc7e6d46fd775bb6dfe08ff708
    Pull Request resolved: pytorch#344
    wconstab committed Jun 5, 2024
    Configuration menu
    Copy the full SHA
    4eb4bfc View commit details
    Browse the repository at this point in the history

Commits on Jun 6, 2024

  1. [torchtitan][optim] Add fused as an option in train config (pytorch#355)

    With these three PRs landed, we can now support the option fused=True in
    torchtitan for Adam and AdamW optimizer.
    
    pytorch/pytorch#125369
    pytorch/pytorch#126423
    pytorch/pytorch#126750
    
    Run performance evaluation on 8 A100 DevGPU: 1000 steps on 1D DP default
    [llama_8b.toml](https://github.com/pytorch/torchtitan/blob/main/train_configs/llama3_8b.toml).
    
    Observation: 
    For `fused = True` and `fused = False`, we observed similar loss curve
    and memory usage.
    wps is + ~100 and mfu is + 1.5-2% when fused = True. 
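    A minimal sketch of wiring the flag through (illustrative, not the exact torchtitan optimizer builder):

    ```python
    import torch

    def build_optimizer(model: torch.nn.Module, name: str, lr: float, fused: bool):
        # Both torch.optim.Adam and torch.optim.AdamW accept fused=True,
        # which uses the fused CUDA implementation of the step.
        optimizer_cls = {"Adam": torch.optim.Adam, "AdamW": torch.optim.AdamW}[name]
        return optimizer_cls(model.parameters(), lr=lr, fused=fused)
    ```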
    
    Below are the logs for the last 100 steps for both.
    ```
    **Fused = False**
    [rank0]:2024-06-05 12:45:06,227 - root - INFO - Finished dumping traces in 0.37 seconds
    [rank0]:2024-06-05 12:45:37,677 - root - INFO - step: 910  loss:  4.6039  memory: 59.48GiB(75.15%)  wps: 2,217  mfu: 41.16%
    [rank0]:2024-06-05 12:46:08,843 - root - INFO - step: 920  loss:  4.6427  memory: 59.48GiB(75.15%)  wps: 2,632  mfu: 48.85%
    [rank0]:2024-06-05 12:46:40,052 - root - INFO - step: 930  loss:  4.6339  memory: 59.48GiB(75.15%)  wps: 2,628  mfu: 48.78%
    [rank0]:2024-06-05 12:47:11,243 - root - INFO - step: 940  loss:  4.5964  memory: 59.48GiB(75.15%)  wps: 2,631  mfu: 48.84%
    [rank0]:2024-06-05 12:47:42,655 - root - INFO - step: 950  loss:  4.6477  memory: 59.48GiB(75.15%)  wps: 2,611  mfu: 48.47%
    [rank0]:2024-06-05 12:48:13,890 - root - INFO - step: 960  loss:  4.8137  memory: 59.48GiB(75.15%)  wps: 2,626  mfu: 48.75%
    [rank0]:2024-06-05 12:48:45,110 - root - INFO - step: 970  loss:  4.5962  memory: 59.48GiB(75.15%)  wps: 2,628  mfu: 48.78%
    [rank0]:2024-06-05 12:49:16,333 - root - INFO - step: 980  loss:  4.5450  memory: 59.48GiB(75.15%)  wps: 2,627  mfu: 48.76%
    [rank0]:2024-06-05 12:49:47,561 - root - INFO - step: 990  loss:  4.5840  memory: 59.48GiB(75.15%)  wps: 2,627  mfu: 48.76%
    [rank0]:2024-06-05 12:50:18,933 - root - INFO - step: 1000  loss:  4.5351  memory: 59.48GiB(75.15%)  wps: 2,615  mfu: 48.53%
    [rank0]:2024-06-05 12:50:23,692 - root - INFO - Dumping traces at step 1000
    [rank0]:2024-06-05 12:50:24,041 - root - INFO - Finished dumping traces in 0.35 seconds
    [rank0]:2024-06-05 12:50:24,422 - root - INFO - Sleeping 2 seconds for other ranks to complete
    [rank0]:2024-06-05 12:50:26,424 - root - INFO - Training completed
    
    **Fused = True**
    [rank0]:2024-06-05 14:55:42,894 - root - INFO - Finished dumping traces in 0.30 seconds
    [rank0]:2024-06-05 14:56:13,582 - root - INFO - step: 910  loss:  4.6091  memory: 59.48GiB(75.15%)  wps: 2,341  mfu: 43.46%
    [rank0]:2024-06-05 14:56:43,765 - root - INFO - step: 920  loss:  4.6468  memory: 59.48GiB(75.15%)  wps: 2,718  mfu: 50.45%
    [rank0]:2024-06-05 14:57:13,971 - root - INFO - step: 930  loss:  4.6365  memory: 59.48GiB(75.15%)  wps: 2,715  mfu: 50.40%
    [rank0]:2024-06-05 14:57:44,172 - root - INFO - step: 940  loss:  4.6021  memory: 59.48GiB(75.15%)  wps: 2,716  mfu: 50.41%
    [rank0]:2024-06-05 14:58:14,353 - root - INFO - step: 950  loss:  4.6522  memory: 59.48GiB(75.15%)  wps: 2,718  mfu: 50.45%
    [rank0]:2024-06-05 14:58:44,536 - root - INFO - step: 960  loss:  4.8163  memory: 59.48GiB(75.15%)  wps: 2,717  mfu: 50.44%
    [rank0]:2024-06-05 14:59:14,683 - root - INFO - step: 970  loss:  4.6026  memory: 59.48GiB(75.15%)  wps: 2,721  mfu: 50.51%
    [rank0]:2024-06-05 14:59:44,840 - root - INFO - step: 980  loss:  4.5491  memory: 59.48GiB(75.15%)  wps: 2,720  mfu: 50.49%
    [rank0]:2024-06-05 15:00:15,009 - root - INFO - step: 990  loss:  4.5859  memory: 59.48GiB(75.15%)  wps: 2,719  mfu: 50.47%
    [rank0]:2024-06-05 15:00:45,228 - root - INFO - step: 1000  loss:  4.5396  memory: 59.48GiB(75.15%)  wps: 2,714  mfu: 50.38%
    [rank0]:2024-06-05 15:00:49,455 - root - INFO - Dumping traces at step 1000
    [rank0]:2024-06-05 15:00:49,756 - root - INFO - Finished dumping traces in 0.30 seconds
    [rank0]:2024-06-05 15:00:50,336 - root - INFO - Sleeping 2 seconds for other ranks to complete
    [rank0]:2024-06-05 15:00:52,339 - root - INFO - Training completed
    ```
    wz337 authored Jun 6, 2024
    Configuration menu
    Copy the full SHA
    40f8fd0 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    3bbe3d9 View commit details
    Browse the repository at this point in the history

Commits on Jun 7, 2024

  1. Abstract out out optimizer params and update foreach calling conventi…

    …on (pytorch#386)
    
    # Summary
    Updates the behavior to call foreach when we are not using fused for the
    optimizer
    drisspg authored Jun 7, 2024
    Configuration menu
    Copy the full SHA
    d953107 View commit details
    Browse the repository at this point in the history

Commits on Jun 9, 2024

  1. DeviceMesh BC fix (pytorch#387)

    fix BC issues
    
    There's another pipeline bc issue :(
    wanchaol authored Jun 9, 2024
    Configuration menu
    Copy the full SHA
    cf37b61 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    9acdc6f View commit details
    Browse the repository at this point in the history

Commits on Jun 10, 2024

  1. fix missing tb logs

    ghstack-source-id: ac3501485faa093c8b9daacca9917805e2a987b7
    Pull Request resolved: pytorch#389
    tianyu-l committed Jun 10, 2024
    Configuration menu
    Copy the full SHA
    3e5c0aa View commit details
    Browse the repository at this point in the history
  2. add the 8-gpu test badge and use correct links for the integration te…

    …st badges
    
    ghstack-source-id: f198ee40b0d7ee9409feb8fb9539a73b822d756c
    Pull Request resolved: pytorch#390
    tianyu-l committed Jun 10, 2024
    Configuration menu
    Copy the full SHA
    032b9d1 View commit details
    Browse the repository at this point in the history

Commits on Jun 11, 2024

  1. Fix 1D PP tracer test

    forgot to enable tracer for tracer test in the last PR
    
    ghstack-source-id: 1cb137911f88daa47b57757346dad55aa736429e
    Pull Request resolved: pytorch#362
    kwen2501 authored and wconstab committed Jun 11, 2024
    Configuration menu
    Copy the full SHA
    91937ef View commit details
    Browse the repository at this point in the history

Commits on Jun 12, 2024

  1. del logits=(bs, seq_len, vocab_size) to save 3.9G memory (pytorch#391)

    logits has shape (bs, seq_len, vocab_size); call `del logits` to free it
    before backward.
    
    <img width="1607" alt="Screenshot 2024-06-12 at 11 10 36 AM"
    src="https://github.com/pytorch/torchtitan/assets/134637289/82db2792-59a3-40c4-9591-842be3dd9284">
    weifengpy authored Jun 12, 2024
    Configuration menu
    Copy the full SHA
    e29b6b4 View commit details
    Browse the repository at this point in the history
  2. Update contributing.md (pytorch#385)

    small update for contributing.md to include what packages to install and
    how to lint.
    H-Huang authored Jun 12, 2024
    Configuration menu
    Copy the full SHA
    d0b4092 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    000d43f View commit details
    Browse the repository at this point in the history

Commits on Jun 13, 2024

  1. enable TP fp8 allgather with PrepareFloat8ModuleInput (pytorch#393)

    This PR is a follow up PR to enable fp8 allgather in TP after these PR
    landed:
    * pytorch/pytorch#128431
    * pytorch-labs/float8_experimental#275
    
    One needs to update their pytorch/float8_experimental to include those
    changes in order to train with fp8.
    
    Since fp8 is not enabled as part of our integration tests yet, there
    should be no issues on CI or training runs that do not use fp8
    wanchaol authored Jun 13, 2024
    Configuration menu
    Copy the full SHA
    7fcf70d View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    a6b585f View commit details
    Browse the repository at this point in the history
  3. Fix SAC BC breaking and renaming to ac_freq (pytorch#397)

    as titled, SAC moved to a different public API, move to the new API to
    avoid CI breaking
    wanchaol authored Jun 13, 2024
    Configuration menu
    Copy the full SHA
    0bf344c View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    230300b View commit details
    Browse the repository at this point in the history

Commits on Jun 14, 2024

  1. enable TritonFusedRMSNorm with local_map annotation (pytorch#404)

    Summary:
    This PR enables the use of TritonFusedRMSNorm with Tensor Parallel with
    7%-8% performance gain compared to RMSNorm with TP. pytorch#364
    XilunWu authored Jun 14, 2024
    Configuration menu
    Copy the full SHA
    38496a3 View commit details
    Browse the repository at this point in the history
  2. Cosmetic changes to train.py

    ghstack-source-id: ce4a5b0b6b785ce595487c9d565a8af030c9d07b
    Pull Request resolved: pytorch#398
    kwen2501 committed Jun 14, 2024
    Configuration menu
    Copy the full SHA
    e99f237 View commit details
    Browse the repository at this point in the history
  3. Break down parallelize_llama for inference cases

    ghstack-source-id: fc8e221b5047337f59dea31f2c51d6168fe4fe88
    Pull Request resolved: pytorch#402
    kwen2501 committed Jun 14, 2024
    Configuration menu
    Copy the full SHA
    a96fb82 View commit details
    Browse the repository at this point in the history

Commits on Jun 17, 2024

  1. Change debugmodel to have 8 layers

    - make it possible to choose flavor per-test from test_runner.py
    
    This is useful for PP when more layers == more possibilities for
    schedules/num_stages, but we don't care about having a large model in
    terms of #parameters
    
    ghstack-source-id: fd3076ad591b4f51dd195a78bab5dbe2e4276b18
    Pull Request resolved: pytorch#403
    wconstab committed Jun 17, 2024
    Configuration menu
    Copy the full SHA
    ae3d2a9 View commit details
    Browse the repository at this point in the history

Commits on Jun 18, 2024

  1. Prepare train.py for model chunks for pipelining

    When using pipeline parallelism, a common technique  for reducing bubble
    size is to use schedules that specify more than one model chunk per
    physical rank.  e.g. pp degree 4 could have 8 pipeline stages, and rank
    0 could have stage 0 and stage 4.
    
    To generalize this concept without forking too much code in train.py, I
    make 'model_parts' a new container that either contains one model for
    non-PP or simple PP cases, and contains multiple model parts for complex
    PP cases.
    
    In general, this is tractable because we treat each model part the same:
    we create one optimizer per model part, and one lr scheduler per
    optimizer.  We apply spmd and compile individually to each model part.
    The general pattern is to loop over the model parts and perform an
    action on each part, which also works fine if the list size is 1.
    
    The rest of train.py and optimizer/lr_scheduler changes add syntax sugar
    to simplify calling a method on each model part or optimizer part.
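    A minimal sketch of the one-optimizer-per-part pattern (illustrative, not the exact code):

    ```python
    import torch

    def build_optimizers(model_parts: list[torch.nn.Module], lr: float):
        # One optimizer per model part; the same loop works unchanged when the
        # list holds a single model in the non-PP case.
        return [torch.optim.AdamW(part.parameters(), lr=lr) for part in model_parts]
    ```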
    
    ghstack-source-id: fd2982baae0cbeb5dcb695ef6509b7eec3299d6b
    Pull Request resolved: pytorch#406
    wconstab committed Jun 18, 2024
    Configuration menu
    Copy the full SHA
    f8e17f1 View commit details
    Browse the repository at this point in the history

Commits on Jun 19, 2024

  1. dump memory snapshot to analyze OOMs (pytorch#395)

    when setting `enable_memory_snapshot = true` in `.toml`
    * dump memory snapshots in case of OOMs. output folder is
    `memory_snapshot/iteration_x_exit`
    * dump regularly according to `profile_freq`. output folder is
    `memory_snapshot/iteration_x`
    * existing `.toml` works since `enable_memory_snapshot=False` by default
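    A hedged sketch of the underlying PyTorch memory-snapshot hooks this option builds on (output paths and frequency handling in torchtitan may differ; requires a CUDA device):

    ```python
    import torch

    # Start recording allocator events, run some iterations (or catch an OOM),
    # dump a snapshot viewable at pytorch.org/memory_viz, then stop recording.
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    # ... training iterations ...
    torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
    torch.cuda.memory._record_memory_history(enabled=None)
    ```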
    
    snapshot is an example of the dump when OOM happens
    
    <img width="1640" alt="Screenshot 2024-06-12 at 9 26 53 PM"
    src="https://github.com/pytorch/torchtitan/assets/134637289/6420799c-ae68-4b35-b8bb-f5b6ab3dd053">
    weifengpy authored Jun 19, 2024
    Configuration menu
    Copy the full SHA
    71b70b5 View commit details
    Browse the repository at this point in the history

Commits on Jun 20, 2024

  1. whole_model for fp8 (pytorch#414)

    train.py renamed `model` to `whole_model`
    pytorch#406
    
    fp8 still uses `model` and thus reports an error on `model not defined`;
    this PR fixes it:
    
    `build_fp8_linear(whole_model, job_config)`
    weifengpy authored Jun 20, 2024
    Configuration menu
    Copy the full SHA
    6117759 View commit details
    Browse the repository at this point in the history

Commits on Jun 21, 2024

  1. Add train loop support for looped PP schedules

    - refactor some per-model logic into helper functions
    
    ghstack-source-id: a2376627e2864deeb9e4fbf49cecd0990bc434ea
    Pull Request resolved: pytorch#358
    wconstab committed Jun 21, 2024
    Configuration menu
    Copy the full SHA
    04661a6 View commit details
    Browse the repository at this point in the history

Commits on Jun 25, 2024

  1. Set record_shapes=True for profiler

    ghstack-source-id: 6f1ed49d15ce311f1bf118820965cdb5309a8030
    Pull Request resolved: pytorch#419
    awgu committed Jun 25, 2024
    Configuration menu
    Copy the full SHA
    b1340a1 View commit details
    Browse the repository at this point in the history
  2. Improved repeat_kv eager perf

    ghstack-source-id: 39e484954814e61cdfb2ba661f0a98c83bc0ce60
    Pull Request resolved: pytorch#418
    awgu committed Jun 25, 2024
    Configuration menu
    Copy the full SHA
    be126a6 View commit details
    Browse the repository at this point in the history
  3. Adding FSDP Memory Tracking and Estimation

    ghstack-source-id: c8ed20fc585957bd164dd963307616a53991615d
    Pull Request resolved: pytorch#425
    sanketpurandare committed Jun 25, 2024
    Configuration menu
    Copy the full SHA
    342a07e View commit details
    Browse the repository at this point in the history
  4. Adding integration test for FSDP Memory Tracking and Estimation

    ghstack-source-id: cc224db8951ec7a133fd769845a4765cbedc6454
    Pull Request resolved: pytorch#426
    sanketpurandare committed Jun 25, 2024
    Configuration menu
    Copy the full SHA
    134addd View commit details
    Browse the repository at this point in the history

Commits on Jun 26, 2024

  1. by default disable heavy memory profiling

    ghstack-source-id: cad7b3c41fd60ec19c0e6e7d058e8aa00602a187
    Pull Request resolved: pytorch#430
    tianyu-l committed Jun 26, 2024
    Configuration menu
    Copy the full SHA
    f5171cb View commit details
    Browse the repository at this point in the history

Commits on Jun 27, 2024

  1. Add the option to turn on async-TP

    ghstack-source-id: 0a03379eeb3a63b2d1ad4dff84d0e61ca82b1bbf
    Pull Request resolved: pytorch#429
    yifuwang committed Jun 27, 2024
    Configuration menu
    Copy the full SHA
    1ec2ece View commit details
    Browse the repository at this point in the history

Commits on Jul 1, 2024

  1. Modifying memory estimation options and minor changes

    ghstack-source-id: 5f09824cddaed6585cc094095e1e95dd070d76f4
    Pull Request resolved: pytorch#435
    sanketpurandare committed Jul 1, 2024
    Configuration menu
    Copy the full SHA
    64d47fd View commit details
    Browse the repository at this point in the history

Commits on Jul 8, 2024

  1. add comment pointing to Sequence Parallel optimization example

    ghstack-source-id: 6fa0dcd4bca876e10a6a8349283fb940a59ad234
    Pull Request resolved: pytorch#438
    tianyu-l committed Jul 8, 2024
    Configuration menu
    Copy the full SHA
    6655204 View commit details
    Browse the repository at this point in the history
  2. switch float8 logic from Float8DynamicLinear to Float8Linear (pytorch…

    …#436)
    
    Summary:
    
    After pytorch-labs/float8_experimental#300,
    `Float8Linear` with default settings is equivalent to
    `Float8DynamicLinear`. This PR changes `torchtitan` to use
    `Float8Linear`.
    
    To support the new UX of `float8_experimental` better, I also switched
    the `fp8_linear` configuration to be a boolean on whether to swap the
    linears or not. In the future we can add new options on how to configure
    each linear (scaling type, scaling granularity, etc) - saving that for a
    future PR.
    
    Test Plan:
    
    ```
    // run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs,
    // verify performance and loss values do not change meaningfully between
    // baseline and this PR
    
    // baseline (before this PR)
    // 1. compile, bf16
    // 2. compile, float8
    // 3. compile, float8, fdsp_fp8_allgather=True
    // 4. compile, float8, fdsp_fp8_allgather=True, tp=2
    // logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce
    
    // experiment (this PR): repeat all of the above, but with Float8Linear
    // logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    vkuzo authored Jul 8, 2024
    Configuration menu
    Copy the full SHA
    8a1aa06 View commit details
    Browse the repository at this point in the history

Commits on Jul 10, 2024

  1. Removed _experimental_support_context_fn_in_torch_utils_checkpoint

    ghstack-source-id: 50b2d0c2b4c22e2f045cafd8630c16f3a8c6d35f
    Pull Request resolved: pytorch#444
    awgu committed Jul 10, 2024
    Configuration menu
    Copy the full SHA
    28762c8 View commit details
    Browse the repository at this point in the history
  2. Reordered TP parallel plan to follow execution order

    ghstack-source-id: b4924952adeb5f16d08b60faa54690762841c422
    Pull Request resolved: pytorch#445
    awgu committed Jul 10, 2024
    Configuration menu
    Copy the full SHA
    064730a View commit details
    Browse the repository at this point in the history
  3. Made some stylistic changes to apply_dp

    ghstack-source-id: fb78e9eb8aa406ba87d6ad6cf2229c1027dae42f
    Pull Request resolved: pytorch#446
    awgu committed Jul 10, 2024
    Configuration menu
    Copy the full SHA
    3e3a913 View commit details
    Browse the repository at this point in the history
  4. Refactored activation checkpointing

    ghstack-source-id: 785c7e47651cda97ea22d0147d14b8d061ce042d
    Pull Request resolved: pytorch#447
    awgu committed Jul 10, 2024
    347ddc0 View commit details
  5. compiled RMSNorm

    ghstack-source-id: c4efb81ec6acc5442955908cc376df3e6d889af3
    Pull Request resolved: pytorch#442
    tianyu-l committed Jul 10, 2024
    3ff7fbb View commit details

Commits on Jul 11, 2024

  1. Renamed parallel styles for transformer block weights

    ghstack-source-id: 5fb0bf3d08cacf27242ec0f85d5dd3cdc03b739e
    Pull Request resolved: pytorch#448
    awgu committed Jul 11, 2024
    562d7e2 View commit details
  2. Added type annotations and more stylistic changes

    ghstack-source-id: 1bd5b9d5abc8644785132f8eb2baaf8b1cfc5fb5
    Pull Request resolved: pytorch#449
    awgu committed Jul 11, 2024
    0ddf49b View commit details

Commits on Jul 15, 2024

  1. [Cleanup] Remove libuv from run_llama_train.sh

    libuv is now enabled by default.
    
    we can probably do without the educational blurb there, and don't need
    the env either since the default has landed.
    
    ghstack-source-id: 68c8d2abe7eb0777e2add8df7634367c31b7ec06
    Pull Request resolved: pytorch#453
    wconstab committed Jul 15, 2024
    535acf6 View commit details
  2. [Cleanup] Organize run_llama_train.sh options

    Just a little code motion but it looks cleaner to me this way
    
    ghstack-source-id: 055fbd557cd9cf189e6b9bd6a7048f1204e1dc5c
    Pull Request resolved: pytorch#454
    wconstab committed Jul 15, 2024
    ac72078 View commit details
  3. [Cleanup] Split run_llama_train.sh and run_memory_estimation.sh

    Make each script simpler to read
    
    ghstack-source-id: ba3aa65feb6e304736c73daf5bc8ab5fb254f196
    Pull Request resolved: pytorch#455
    wconstab committed Jul 15, 2024
    4b6cdc1 View commit details
  4. [Cleanup] Remove unused TRAINER_DIR

    This argument seems to be left over from older times; it is not used
    anywhere in the codebase.
    
    ghstack-source-id: abbcf82ed4d1b8fbb71c6a6b48acbc1296dbec64
    Pull Request resolved: pytorch#456
    wconstab committed Jul 15, 2024
    8fa11f0 View commit details
  5. Add educational code pointers to top level README

    ghstack-source-id: 522aa2fa0bf1679f55d9f3a8a38fdcd319d5e3df
    Pull Request resolved: pytorch#457
    wconstab committed Jul 15, 2024
    174c44a View commit details

Commits on Jul 16, 2024

  1. enable FSDP2 + fp8 all-gather and fix TP fp8 all-gather (pytorch#413)

    we have landed fp8 all-gather optimizations in float8_experimental
    pytorch-labs/float8_experimental#266
    
    this PR proposes torchtitan changes. also include fp8 in CI
    ```
    from float8_experimental.fsdp_utils import precompute_float8_dynamic_scale_for_fsdp
    # inside the training loop
    model(input).sum().backward()
    optim.step()
    precompute_float8_dynamic_scale_for_fsdp(model)
    ```
    
    FSDP2 fp8 all-gather is added to CI
    ```
    CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear
    CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather
    CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp
    ```
    
    TP fp8 all-gather is locally tested. Will add it to CI after
    uploading a new tokenizer with vocab size 2560 (divisible by 16)
    ```
    CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 1 --training.tensor_parallel_degree 4
    CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 2 --training.tensor_parallel_degree 2
    ```
    
    precompute scales after optimizer.step
    <img width="319" alt="Screenshot 2024-07-12 at 5 11 14 PM"
    src="https://github.com/user-attachments/assets/1c55bd89-9183-42ca-9445-23f3b95e0817">
    
    FSDP2 pre-all-gather does not have any small all-reduces
    <img width="794" alt="Screenshot 2024-07-12 at 5 13 04 PM"
    src="https://github.com/user-attachments/assets/1a00dc70-a8ca-4ce1-a93c-316f22efdb08">
    
    TODO
    * upload tokenizer with vocab size 2560 to enable CI on TP fp8
    all-gather
    * torch.compile complains about fp8
    * add delayed scaling and brainstorm about best config option to express
    fp8
    * compare perf between delayed scaling and dynamic scaling
    https://github.com/pytorch-labs/float8_experimental/pull/312/files
    weifengpy authored Jul 16, 2024
    a4b2ee3 View commit details

Commits on Jul 17, 2024

  1. import float8_experimental only when fp8 is enabled and install it in…

    … CI (pytorch#464)
    
    make sure to only import float8_experimental when fp8 is enabled
    
    for 4 gpu CI, make sure we can import float8_experimental correctly in
    CI
    
    `python -m pip install
    git+https://github.com/pytorch-labs/float8_experimental.git`
    weifengpy authored Jul 17, 2024
    ae8181b View commit details
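
    A minimal sketch of the lazy-import pattern this commit describes; the
    helper name and the `job_config.training.enable_fp8_linear` attribute are
    assumptions based on the flags shown elsewhere in this log, not a copy of
    torchtitan's code.

    ```python
    def maybe_enable_fp8(model, job_config):
        """Only touch float8_experimental when fp8 is actually requested."""
        if not job_config.training.enable_fp8_linear:
            return model  # bf16-only runs never import the package
        try:
            import float8_experimental  # noqa: F401  (deferred, optional dependency)
        except ImportError as exc:
            raise ImportError(
                "enable_fp8_linear is set but float8_experimental is missing; "
                "install it with: pip install "
                "git+https://github.com/pytorch-labs/float8_experimental.git"
            ) from exc
        # ...hand the model to float8_experimental's swap utilities here...
        return model
    ```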
  2. skip fp8 CI on non-H100 GPUs (pytorch#465)

    skip fp8 tests on non-H100 GPUs by checking
    `torch.cuda.get_device_capability() >= (9, 0)`
    
    this makes 4 GPU CI healthy again
    weifengpy authored Jul 17, 2024
    3760bcf View commit details
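
    The capability check mentioned above, sketched as a pytest guard (the
    test name is illustrative); `torch.cuda.get_device_capability()` returns
    a `(major, minor)` tuple, and float8 needs SM90, i.e. H100-class GPUs.

    ```python
    import pytest
    import torch

    def is_sm90_or_newer() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)

    @pytest.mark.skipif(not is_sm90_or_newer(), reason="float8 requires H100 (SM90+)")
    def test_float8_linear_smoke():
        ...  # the fp8 integration test body goes here
    ```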
  3. clean up float8 configs in torchtitan (pytorch#466)

    Summary:
    
    1. standardizes on `float8` instead of `fp8` for config names
    2. removes usage of non-public objects such as `Float8Linear`
    
    Test Plan:
    
    ```
    with-proxy NGPU=1 CUDA_VISIBLE_DEVICES=7 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.compile --training.enable_float8_linear
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    vkuzo authored Jul 17, 2024
    69fe8de View commit details

Commits on Jul 18, 2024

  1. Add support for DDP and experimental CompiledAutograd

    Summary:
    Address the comments in pytorch#319 and resubmit the PR to fit the current code base.
    
    Test Plan:
    ```
    CONFIG_FILE=./train_configs/debug_model.toml ./run_llama_train.sh --comm.train_timeout_seconds=3600   --training.tensor_parallel_degree=1 --training.data_parallel_degree=8 --experimental.data_parallel_type=ddp --training.steps=1000 --metrics.log_freq=10 --profiling.profile_freq=1000
    ```
    
    ghstack-source-id: 81dc85d42df13df4ed727bebd825681879af936b
    Pull Request resolved: pytorch#432
    fegin committed Jul 18, 2024
    2f989b9 View commit details

Commits on Jul 19, 2024

  1. add torch.compile + FSDP2 float8 all-gather in CI (pytorch#468)

    Fixed my bug in float8_experimental. Now we can torch.compile
    transformer blocks with FSDP float8 all-gather
    pytorch-labs/float8_experimental#321
    
    local test: `CONFIG_FILE="./train_configs/debug_model.toml"
    ./run_llama_train.sh --training.enable_float8_linear
    --training.enable_fsdp_float8_all_gather
    --training.precompute_float8_dynamic_scale_for_fsdp --training.compile`
    
    profiler traces: I can see the compiled region in the cpu thread and float8
    matmul `sm90_xmma_gemm_e4m3bf16...` in the cuda stream
    <img width="1468" alt="Screenshot 2024-07-18 at 4 22 17 PM"
    src="https://github.com/user-attachments/assets/0cf58dee-aae1-4582-a3f1-b8aa48b45129">
    weifengpy authored Jul 19, 2024
    71b8eae View commit details
  2. [float8] keep model.output as nn.Linear (high precision, not fp8) (p…

    …ytorch#469)
    
    **keep model.output as nn.Linear**: it's a common practice to NOT apply
    fp8 on final output layer
    * specify `skip_fqn_list` in swapping
    * when applying TP to model.output, use plain `ColwiseParallel` instead
    of `Float8ColwiseParallel`
    
    credit to @awgu, we do not need tokenizer vocab size to be divisible by
    16 pytorch#461
    
    1D TP + float8 all-gather, eager mode:
    `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
    ./run_llama_train.sh --training.enable_float8_linear
    --training.data_parallel_degree 1 --training.tensor_parallel_degree 4`
    
    1D TP + float8 all-gather, compile mode:
    `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
    ./run_llama_train.sh --training.enable_float8_linear
    --training.data_parallel_degree 1 --training.tensor_parallel_degree 4
    --training.compile`
    
    2D FSDP2 + TP + float8 all-gather, eager mode:
    `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
    ./run_llama_train.sh --training.enable_float8_linear
    --training.enable_fsdp_float8_all_gather
    --training.precompute_float8_dynamic_scale_for_fsdp
    --training.tensor_parallel_degree 2`
    
    2D FSDP2 + TP + float8 all-gather, compile mode:
    `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
    ./run_llama_train.sh --training.enable_float8_linear
    --training.enable_fsdp_float8_all_gather
    --training.precompute_float8_dynamic_scale_for_fsdp
    --training.tensor_parallel_degree 2 --training.compile`
    
    1D TP + float8 all-gather trace: see float8 and all-gather in the trace
    <img width="1611" alt="Screenshot 2024-07-19 at 1 16 59 PM"
    src="https://github.com/user-attachments/assets/9a95dfd9-40e0-4133-b2bb-e22ddf5b8472">
    
    2D + float8 all-gather trace: see float8 and FSDP collectives and TP
    collectives
    <img width="1038" alt="Screenshot 2024-07-19 at 1 29 59 PM"
    src="https://github.com/user-attachments/assets/6a34bcaa-bcae-402b-9994-cc892554fec7">
    weifengpy authored Jul 19, 2024
    0c6f9a2 View commit details
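
    A small sketch of the `skip_fqn_list` idea above: select every
    `nn.Linear` for float8 swapping except the final output projection, which
    stays in high precision. The helper name is illustrative.

    ```python
    import torch.nn as nn

    def linears_to_swap(model: nn.Module, skip_fqns: set[str]) -> list[str]:
        """FQNs of nn.Linear modules to swap to float8, excluding e.g. the LM head."""
        return [
            fqn
            for fqn, mod in model.named_modules()
            if isinstance(mod, nn.Linear) and fqn not in skip_fqns
        ]

    # usage sketch: keep model.output (the final projection) as plain nn.Linear
    # to_swap = linears_to_swap(model, skip_fqns={"output"})
    ```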

Commits on Jul 20, 2024

  1. remove CI for FSDP2 + fp8 all-gather (pytorch#470)

    per discussion from
    pytorch#469 (comment)
    
    we are planning BC-breaking changes in float8_experimental. Remove CI
    for FSDP2 + fp8 all-gather for now. When public APIs are finalized, we
    can discuss bringing it back.
    weifengpy authored Jul 20, 2024
    0a17c26 View commit details

Commits on Jul 21, 2024

  1. dynamically update torch.compile cache config to ensure async tp supp…

    …ort, enhance async tp UX (pytorch#471)
    
    This PR adds some enhancements for supporting async tp:
    
    1 - if async tp is active, auto updates the torch.dynamo cache limit to
    10K. If this is not updated, async tp will not be activated on larger
    models as it will quietly stop compilation due to 'cache limit reached'
    with no info for the user.
    This config update is logged. 
    
    2 - if async tp is enabled, verifies that torch.compile is set to true
    for this job config. If not, it warns and then activates torch.compile
    to ensure user gets working async tp. (see WARNING in below screenshot)
    
    <img width="1345" alt="Screenshot 2024-07-20 at 4 33 04 PM"
    src="https://github.com/user-attachments/assets/26e5a48e-4bb8-4f33-b1b5-8939c1517c1d">
    
    3 - Updates the 'Applied Tensor Parallel' log message to 'Applied
    Async Tensor Parallel' when async tp is active, to make it clear in the
    logs which TP is active. (see above screenshot)
    lessw2020 authored Jul 21, 2024
    0ee573c View commit details
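
    A sketch of the two guards described above. The
    `experimental.enable_async_tensor_parallel` and `training.compile`
    attribute names are assumptions, and the helper itself is illustrative;
    `torch._dynamo.config.cache_size_limit` is the knob that otherwise
    silently caps recompilation on larger models.

    ```python
    import logging

    import torch._dynamo

    logger = logging.getLogger(__name__)

    def apply_async_tp_guards(job_config) -> None:
        """Raise the dynamo cache limit and force compile on when async TP is requested."""
        if not job_config.experimental.enable_async_tensor_parallel:
            return
        torch._dynamo.config.cache_size_limit = 10000
        logger.info("Set torch._dynamo.config.cache_size_limit = 10000 for async TP")
        if not job_config.training.compile:
            logger.warning("Async TP requires torch.compile; enabling training.compile")
            job_config.training.compile = True
    ```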

Commits on Jul 26, 2024

  1. Fix 8gpu PP failure due to 2D DCP disablement

    DCP recently added safeties to avoid using it for 2D/3D since strided
    sharding (a feature needed for safe 2D/3D resharding) is not ready yet.
    
    PP uses DCP to load a seed checkpoint.  Disabling the safety mechanism
    is enough to make 3D/PP still work (for the case where we train from the
    beginning or do not re-shard).
    
    (Resharding refers to saving a checkpoint from one world
    size/parallelism config and loading/resuming under a different one).
    
    ghstack-source-id: c069d2186c79517c72f5b3c99485cebdc15df08f
    Pull Request resolved: pytorch#460
    wconstab committed Jul 26, 2024
    69c9bb2 View commit details
  2. update float8 integration after UX changes (pytorch#484)

    Summary:
    
    float8_experimental landed various BC-breaking UX changes last week.
    This PR updates torchtitan to work with the version of
    float8_experimental after
    pytorch-labs/float8_experimental#332 and
    pytorch-labs/float8_experimental#337
    
    Test Plan:
    
    ```
    with-proxy CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.enable_float8_linear --training.compile
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    vkuzo authored Jul 26, 2024
    90e2070 View commit details
  3. Re-enable FSDP2 Mem Tracker integration tests

    ghstack-source-id: 8344603f7a5596cb2909c9bf04dd1b9e4730c9b8
    Pull Request resolved: pytorch#485
    Sanket Jayant Purandare committed Jul 26, 2024
    42f4ff5 View commit details

Commits on Jul 29, 2024

  1. Used partial instead of global vars for LR scheduling

    ghstack-source-id: 12c4418b0574d93e1441f4ca3d1de79c8aad7a40
    Pull Request resolved: pytorch#487
    awgu committed Jul 29, 2024
    a48de09 View commit details
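
    A sketch of the pattern this commit adopts: bind the schedule's
    hyperparameters into the `lr_lambda` with `functools.partial` instead of
    reading them from module-level globals. The warmup/decay shape here is a
    simplified stand-in, not torchtitan's exact schedule.

    ```python
    from functools import partial

    import torch

    def warmup_then_linear_decay(warmup_steps: int, decay_steps: int, step: int) -> float:
        if step < warmup_steps:
            return (step + 1) / warmup_steps               # linear warmup
        remaining = max(decay_steps - (step - warmup_steps), 0)
        return remaining / decay_steps                     # linear decay to 0

    def build_lr_scheduler(optimizer, warmup_steps: int, decay_steps: int):
        lr_lambda = partial(warmup_then_linear_decay, warmup_steps, decay_steps)
        return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
    ```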

Commits on Jul 30, 2024

  1. [EZ] Add logs for some basic training params so that we can verify in… (

    pytorch#491)
    
    As title, while testing the 405B model, I found that we need logs for
    some basic training params, so I added some here. Tested
    locally and the logging is shown as in the screenshot:
    
    
    <img width="900" alt="image"
    src="https://github.com/user-attachments/assets/b94e34f5-3e88-4c5f-94ed-75f50dde9786">
    fduwjj authored Jul 30, 2024
    b63e209 View commit details
  2. make float8 scaling type configurable (pytorch#489)

    Summary:
    
    Adds config options to configure float8 scaling type for input, weight,
    grad_output.
    
    Performance is not ideal yet, but that's because we have not optimized
    it.
    
    Test Plan:
    
    ```
    // repeat for input, weight, grad_out
    with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.enable_float8_linear --training.float8_scaling_type_weight delayed --training.compile
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    vkuzo authored Jul 30, 2024
    91f075a View commit details
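
    The new options sketched as a config structure; the field names mirror
    the `--training.float8_scaling_type_*` flags in the test plan above, but
    the dataclass itself is illustrative rather than torchtitan's actual
    config schema.

    ```python
    from dataclasses import dataclass
    from typing import Literal

    ScalingType = Literal["dynamic", "delayed"]

    @dataclass
    class Float8Config:
        enable_float8_linear: bool = False
        # per-tensor scaling strategy, one of "dynamic" (default) or "delayed"
        float8_scaling_type_input: ScalingType = "dynamic"
        float8_scaling_type_weight: ScalingType = "dynamic"
        float8_scaling_type_grad_output: ScalingType = "dynamic"
    ```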
  3. [PP] add flexible interleaved 1f1b schedule pytorch#490 (pytorch#493)

    This was approved in pytorch#490, but
    merged into the wrong branch, merging this into main
    H-Huang authored Jul 30, 2024
    9358d70 View commit details
  4. move float8 callsites to torchao.float8 (pytorch#492)

    Summary:
    
    The `float8_experimental` repository moved to `torchao.float8` in
    pytorch/ao#551
    
    This PR updates `torchtitan` to use float8 from the new location.
    
    Test Plan:
    
    ```
    with-proxy CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_float8_linear --training.compile
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    vkuzo authored Jul 30, 2024
    239d56f View commit details
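
    During a move like this, call sites only need the import path to change.
    A hedged sketch of tolerating both locations during the transition window
    (a finished migration would simply use the new path):

    ```python
    try:
        import torchao.float8 as float8  # new home after pytorch/ao#551
    except ImportError:
        import float8_experimental as float8  # pre-migration fallback
    ```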

Commits on Aug 1, 2024

  1. [BE][1/n] simplify train.py

    ghstack-source-id: 3879e764e7b33afde5d778810c71d1d2a8f82f6d
    Pull Request resolved: pytorch#494
    tianyu-l committed Aug 1, 2024
    3c77e9f View commit details
  2. [BE][2/n] use proper method signatures in parallelize_llama

    ghstack-source-id: 17a1ee9f03f13423a30183c5c8d7ad30f8c8dbfc
    Pull Request resolved: pytorch#495
    tianyu-l committed Aug 1, 2024
    bf90710 View commit details
  3. [BE][3/n] wrap fp8 logic using Float8Handler

    ghstack-source-id: e94c7f6f4fad87c5432262c54beabd02de5541b8
    Pull Request resolved: pytorch#496
    tianyu-l committed Aug 1, 2024
    40f79d7 View commit details
  4. Bring LLaMa 3.1 405B to TorchTitan family (pytorch#481)

    With the official launch of the LLaMa 3.1 model, we want to add its config
    to TorchTitan. Of course, there is more work to be done, but we want to
    take an incremental approach, so more PRs will follow.

    For now, we tried it on 128 GPUs with the current config (TP=8, FSDP=16).
    The perf numbers are wps: 109, mfu: 29%.
    
    Loss curve for 3000 steps with 600 warmup (lr = 0.8e-4).
    <img width="1037" alt="image"
    src="https://github.com/user-attachments/assets/f57dd3fa-07d8-4ef4-8f68-8f7a08e9652e">
    
    
    Loss curve for 3000 steps with 600 warmup (lr = 1.1e-4).
    
    ![image](https://github.com/user-attachments/assets/429b9738-94cb-4b37-90ef-049a5587ddd0)
    fduwjj authored Aug 1, 2024
    4871358 View commit details

Commits on Aug 2, 2024

  1. [TP] Infer local n_heads instead of ad-hoc model changes

    ghstack-source-id: 587e3d6e5270714ca734b8031ce41a962e6394ea
    Pull Request resolved: pytorch#498
    kwen2501 committed Aug 2, 2024
    d41d604 View commit details
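
    The inference this commit relies on, as arithmetic: with attention heads
    sharded over the TP mesh, each rank only sees `n_heads // tp_degree` (and
    `n_kv_heads // tp_degree`) heads locally, so no ad-hoc edits to the model
    definition are needed. A minimal sketch:

    ```python
    def local_head_counts(n_heads: int, n_kv_heads: int, tp_degree: int) -> tuple[int, int]:
        assert n_heads % tp_degree == 0 and n_kv_heads % tp_degree == 0, (
            "head counts must be divisible by the tensor-parallel degree"
        )
        return n_heads // tp_degree, n_kv_heads // tp_degree

    # e.g. Llama3-8B attention (32 heads, 8 KV heads) with TP=4 -> 8 and 2 local heads
    assert local_head_counts(32, 8, 4) == (8, 2)
    ```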

Commits on Aug 3, 2024

  1. some compile-related updates

    ghstack-source-id: 63af8025c184fd5ad34f2f57bf78a37dda2cd33d
    Pull Request resolved: pytorch#443
    tianyu-l committed Aug 3, 2024
    24aef32 View commit details

Commits on Aug 5, 2024

  1. [EZ][405B] Use scientific notation for 405B model lr (pytorch#504)

    As title, use `8e-5` rather than `0.8e-4`.
    fduwjj authored Aug 5, 2024
    c44cca0 View commit details
  2. [BE][4/n] split pipeline_llama into a separate file

    ghstack-source-id: 5ebb4adf3152f413fa33a923c272c9aa3ce1f775
    Pull Request resolved: pytorch#499
    tianyu-l committed Aug 5, 2024
    8849580 View commit details
  3. [fix] float8 should be applied on all model_parts

    ghstack-source-id: 52ed6836de39e82c4c5824a40ecfc1d9ec7ed2bd
    Pull Request resolved: pytorch#500
    tianyu-l committed Aug 5, 2024
    a4d88d1 View commit details

Commits on Aug 6, 2024

  1. Add warning to compile rmsnorm (pytorch#505)

    As titled, add a warning for compiling rmsnorm as it's not fully ready yet,
    i.e. issue pytorch#497
    
    We can remove this warning once we fix the issue
    wanchaol authored Aug 6, 2024
    1a303b3 View commit details

Commits on Aug 7, 2024

  1. add float8 to README (pytorch#509)

    add float8 link in README so we can redirect people from dev-discuss
    post to torchtitan repo
    
    
    README looks like this after rendering
    <img width="518" alt="Screenshot 2024-08-06 at 5 42 10 PM"
    src="https://github.com/user-attachments/assets/50af99d7-93be-459a-89d7-8c08b8fb95d4">
    
    float8.md looks like this
    <img width="563" alt="Screenshot 2024-08-06 at 5 04 17 PM"
    src="https://github.com/user-attachments/assets/06d30aad-4133-4cec-9037-cfcf155b45c4">
    
    I tried the command locally and traces are looking good
    <img width="726" alt="Screenshot 2024-08-06 at 5 00 00 PM"
    src="https://github.com/user-attachments/assets/bdfa3d7e-efe1-4009-92a1-0f5c310013fb">
    weifengpy authored Aug 7, 2024
    b99bc5e View commit details
  2. address TODOs as 2D recompiles is fixed

    ghstack-source-id: 2927f0a8082171da3e9f59a5d04f8325cbdf3653
    Pull Request resolved: pytorch#508
    tianyu-l committed Aug 7, 2024
    fa8cdd4 View commit details

Commits on Aug 8, 2024

  1. [BE][5/n] simplify pp vs. non-pp setup

    ghstack-source-id: 003bfbfbcf1511ddbd18e15d031b39f597d8e7db
    Pull Request resolved: pytorch#510
    tianyu-l committed Aug 8, 2024
    d6e3f77 View commit details
  2. [BE][6/n] replace large c4_mini datasets by c4_test with the first 2K…

    … entries
    
    ghstack-source-id: 319f4961b092778703101b98937803073132afa1
    Pull Request resolved: pytorch#512
    tianyu-l committed Aug 8, 2024
    34fa017 View commit details

Commits on Aug 9, 2024

  1. Create composability.md (pytorch#511)

    Explain the rationale and challenges behind certain changes we made to
    llama model to support 3D parallelism.
    
    ---------
    
    Co-authored-by: tianyu-l <[email protected]>
    wconstab and tianyu-l authored Aug 9, 2024
    9de54a5 View commit details
  2. depend on torchdata 0.8.0 instead of nightly

    ghstack-source-id: 1965d3122885fed3c28e2e058c55581187e7816c
    Pull Request resolved: pytorch#513
    tianyu-l committed Aug 9, 2024
    b41b41b View commit details

Commits on Aug 12, 2024

  1. [PP] Bypass seed checkpoint by init-ing model parts separately (pytor…

    …ch#516)
    
    Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
    bottom):
    * pytorch#473
    * pytorch#517
    * __->__ pytorch#516
    
    Allows PP to be used without a seed checkpoint by calling `init_weight`
    on each model part. This is the solution in step 1 of
    pytorch#514 proposed by @wconstab
    H-Huang authored Aug 12, 2024
    a4bc948 View commit details
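
    A sketch of the step-1 approach referenced above: materialize each
    pipeline-stage model part from the meta device and initialize its weights
    locally, so no seed checkpoint is needed. It assumes each part exposes an
    `init_weights()`-style method; the helper name is illustrative.

    ```python
    import torch

    def init_pp_model_parts(model_parts, device: torch.device) -> None:
        """Initialize pipeline-stage model parts without loading a seed checkpoint."""
        for part in model_parts:
            part.to_empty(device=device)   # allocate real storage for meta-device params
            with torch.no_grad():
                part.init_weights()        # assumed per-part init method (see note above)
    ```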
  2. [small] format composability.md (pytorch#517)

    Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
    bottom):
    * pytorch#473
    * __->__ pytorch#517
    * pytorch#516
    
    Ran `pre-commit run --all-files`
    H-Huang authored Aug 12, 2024
    a47a5a9 View commit details

Commits on Aug 13, 2024

  1. Throw warning if users are using old pytorch version that not includi…

    …ng DTensor strided sharding (pytorch#507)
    
    Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
    bottom):
    * __->__ pytorch#507
    
    **Summary**
    1. check if users are using new nightly-build pytorch that includes
    DTensor strided sharding
    (pytorch/pytorch#130760) when 2D/3D is used.
    Print warning if not.
    2. remove temporary re-enablement added in pytorch#460 .
    
    **Test**
    Command: `python test_runner.py outputs --test pp_dp_tp --ngpu 8`
    GPUs: A100
    Output:
    - without strided sharding:
    ```
    [rank7]:2024-08-06 03:21:26,706 - root - INFO - step:  2  loss:  8.1652  memory:  0.51GiB(0.64%)  wps: 8,250  mfu: 0.25%
    [rank7]:2024-08-06 03:21:27,013 - root - INFO - step:  3  loss:  8.0951  memory:  0.51GiB(0.64%)  wps: 13,358  mfu: 0.41%
    [rank7]:2024-08-06 03:21:27,309 - root - INFO - step:  4  loss:  7.9748  memory:  0.51GiB(0.64%)  wps: 13,865  mfu: 0.42%
    [rank7]:2024-08-06 03:21:27,582 - root - INFO - step:  5  loss:  7.8025  memory:  0.51GiB(0.64%)  wps: 15,057  mfu: 0.46%
    [rank7]:2024-08-06 03:21:28,076 - root - INFO - step:  6  loss:  7.5612  memory:  0.51GiB(0.64%)  wps: 8,300  mfu: 0.25%
    [rank7]:2024-08-06 03:21:28,608 - root - INFO - step:  7  loss:  7.3649  memory:  0.51GiB(0.64%)  wps: 7,705  mfu: 0.23%
    [rank7]:2024-08-06 03:21:28,927 - root - INFO - step:  8  loss:  7.2946  memory:  0.51GiB(0.64%)  wps: 12,832  mfu: 0.39%
    [rank7]:2024-08-06 03:21:29,251 - root - INFO - step:  9  loss:  7.1311  memory:  0.51GiB(0.64%)  wps: 12,669  mfu: 0.38%
    [rank7]:2024-08-06 03:21:29,627 - root - INFO - step: 10  loss:  7.0540  memory:  0.51GiB(0.64%)  wps: 10,918  mfu: 0.33%
    >>>>>>>>>>>>>>>>>Checkpoint save & load<<<<<<<<<<<<<<<<<<<
    [rank7]:2024-08-06 03:21:59,723 - root - INFO - step: 11  loss:  7.0822  memory:  0.51GiB(0.64%)  wps: 1,139  mfu: 0.03%
    [rank7]:2024-08-06 03:22:00,054 - root - INFO - step: 12  loss:  7.0508  memory:  0.51GiB(0.64%)  wps: 12,366  mfu: 0.38%
    [rank7]:2024-08-06 03:22:00,340 - root - INFO - step: 13  loss:  6.9182  memory:  0.51GiB(0.64%)  wps: 14,370  mfu: 0.44%
    [rank7]:2024-08-06 03:22:00,624 - root - INFO - step: 14  loss:  6.8948  memory:  0.51GiB(0.64%)  wps: 14,442  mfu: 0.44%
    [rank7]:2024-08-06 03:22:00,907 - root - INFO - step: 15  loss:  6.8358  memory:  0.51GiB(0.64%)  wps: 14,514  mfu: 0.44%
    [rank7]:2024-08-06 03:22:01,574 - root - INFO - step: 16  loss:  6.7653  memory:  0.51GiB(0.64%)  wps: 6,144  mfu: 0.19%
    [rank7]:2024-08-06 03:22:02,209 - root - INFO - step: 17  loss:  6.7340  memory:  0.51GiB(0.64%)  wps: 6,453  mfu: 0.20%
    [rank7]:2024-08-06 03:22:02,532 - root - INFO - step: 18  loss:  6.6874  memory:  0.51GiB(0.64%)  wps: 12,695  mfu: 0.39%
    [rank7]:2024-08-06 03:22:02,863 - root - INFO - step: 19  loss:  6.6566  memory:  0.51GiB(0.64%)  wps: 12,406  mfu: 0.38%
    [rank7]:2024-08-06 03:22:03,257 - root - INFO - step: 20  loss:  6.6629  memory:  0.51GiB(0.64%)  wps: 10,392  mfu: 0.32%
    ```
    - with strided sharding
    ```
    [rank7]:2024-08-06 03:26:18,288 - root - INFO - step:  1  loss:  8.2069  memory:  0.50GiB(0.63%)  wps: 915  mfu: 0.03%
    [rank7]:2024-08-06 03:26:19,084 - root - INFO - step:  2  loss:  8.1913  memory:  0.51GiB(0.64%)  wps: 5,144  mfu: 0.16%
    [rank7]:2024-08-06 03:26:19,365 - root - INFO - step:  3  loss:  8.1148  memory:  0.51GiB(0.64%)  wps: 14,593  mfu: 0.44%
    [rank7]:2024-08-06 03:26:19,698 - root - INFO - step:  4  loss:  7.9982  memory:  0.51GiB(0.64%)  wps: 12,328  mfu: 0.37%
    [rank7]:2024-08-06 03:26:20,011 - root - INFO - step:  5  loss:  7.8382  memory:  0.51GiB(0.64%)  wps: 13,100  mfu: 0.40%
    [rank7]:2024-08-06 03:26:20,498 - root - INFO - step:  6  loss:  7.6293  memory:  0.51GiB(0.64%)  wps: 8,423  mfu: 0.26%
    [rank7]:2024-08-06 03:26:21,126 - root - INFO - step:  7  loss:  7.4454  memory:  0.51GiB(0.64%)  wps: 6,530  mfu: 0.20%
    [rank7]:2024-08-06 03:26:21,472 - root - INFO - step:  8  loss:  7.3337  memory:  0.51GiB(0.64%)  wps: 11,843  mfu: 0.36%
    [rank7]:2024-08-06 03:26:21,849 - root - INFO - step:  9  loss:  7.1960  memory:  0.51GiB(0.64%)  wps: 10,892  mfu: 0.33%
    [rank7]:2024-08-06 03:26:22,229 - root - INFO - step: 10  loss:  7.1208  memory:  0.51GiB(0.64%)  wps: 10,798  mfu: 0.33%
    >>>>>>>>>>>>>>>>>Checkpoint save & load<<<<<<<<<<<<<<<<<<<
    [rank7]:2024-08-06 03:26:50,306 - root - INFO - step: 11  loss:  7.1222  memory:  0.51GiB(0.64%)  wps: 866  mfu: 0.03%
    [rank7]:2024-08-06 03:26:50,632 - root - INFO - step: 12  loss:  7.1189  memory:  0.51GiB(0.64%)  wps: 12,589  mfu: 0.38%
    [rank7]:2024-08-06 03:26:50,917 - root - INFO - step: 13  loss:  6.9646  memory:  0.51GiB(0.64%)  wps: 14,417  mfu: 0.44%
    [rank7]:2024-08-06 03:26:51,217 - root - INFO - step: 14  loss:  6.9626  memory:  0.51GiB(0.64%)  wps: 13,680  mfu: 0.42%
    [rank7]:2024-08-06 03:26:51,514 - root - INFO - step: 15  loss:  6.8694  memory:  0.51GiB(0.64%)  wps: 13,799  mfu: 0.42%
    [rank7]:2024-08-06 03:26:52,207 - root - INFO - step: 16  loss:  6.7994  memory:  0.51GiB(0.64%)  wps: 5,910  mfu: 0.18%
    [rank7]:2024-08-06 03:26:53,053 - root - INFO - step: 17  loss:  6.7634  memory:  0.51GiB(0.64%)  wps: 4,847  mfu: 0.15%
    [rank7]:2024-08-06 03:26:53,370 - root - INFO - step: 18  loss:  6.7233  memory:  0.51GiB(0.64%)  wps: 12,915  mfu: 0.39%
    [rank7]:2024-08-06 03:26:53,686 - root - INFO - step: 19  loss:  6.7054  memory:  0.51GiB(0.64%)  wps: 12,995  mfu: 0.39%
    [rank7]:2024-08-06 03:26:54,059 - root - INFO - step: 20  loss:  6.7130  memory:  0.51GiB(0.64%)  wps: 10,991  mfu: 0.33%
    ```
    XilunWu authored Aug 13, 2024
    36a0057 View commit details
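
    A sketch of the warning described above; the minimum version string is an
    assumption here (strided sharding landed in a 2024 nightly), so treat the
    cutover value as a placeholder.

    ```python
    import logging

    import torch

    logger = logging.getLogger(__name__)

    def warn_if_no_strided_sharding(using_2d_or_3d: bool) -> None:
        if not using_2d_or_3d:
            return
        # torch.__version__ supports version-aware comparison against strings
        if torch.__version__ < "2.5.0.dev20240801":  # assumed cutover, placeholder only
            logger.warning(
                "This PyTorch build may predate DTensor strided sharding; "
                "resharding 2D/3D checkpoints may be unsafe. Please update to a "
                "recent nightly."
            )
    ```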

Commits on Aug 14, 2024

  1. Update fsdp.md (pytorch#519)

    `torch.nn.Module.to_empty` takes a keyword-only arg "device", according
    to
    https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.to_empty
    crcrpar authored Aug 14, 2024
    1c96a01 View commit details
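
    The doc fix above in runnable form, assuming the common meta-device init
    flow:

    ```python
    import torch
    import torch.nn as nn

    with torch.device("meta"):
        layer = nn.Linear(8, 8)        # parameters allocated on the meta device

    layer.to_empty(device="cpu")       # correct: `device` is keyword-only
    # layer.to_empty("cpu")            # would raise TypeError
    ```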

Commits on Aug 15, 2024

  1. remove old torch dependency in requirements.txt

    ghstack-source-id: 7e1c7071f8126072ab0e25194b75f280bf4277ec
    Pull Request resolved: pytorch#523
    tianyu-l committed Aug 15, 2024
    6c16807 View commit details

Commits on Aug 16, 2024

  1. f339363 View commit details
  2. uniformly use skip for both (map-style) Dataset and IterableDataset

    ghstack-source-id: c8f611742ffbb4859988b97e706b9e0d1b4ad6f1
    Pull Request resolved: pytorch#521
    tianyu-l committed Aug 16, 2024
    81c555f View commit details
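
    A sketch of one uniform "skip the first N samples" path for both dataset
    flavors (the helper name is illustrative, not torchtitan's API):

    ```python
    from itertools import islice

    from torch.utils.data import IterableDataset

    def iter_with_skip(dataset, num_skip: int):
        """Yield samples after skipping the first `num_skip`, for either dataset type."""
        if isinstance(dataset, IterableDataset):
            yield from islice(dataset, num_skip, None)
        else:  # map-style Dataset: index past the skipped prefix
            for idx in range(num_skip, len(dataset)):
                yield dataset[idx]
    ```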

Commits on Aug 20, 2024

  1. 57c3400 View commit details