Sync with torchtitan #2
Commits on Feb 24, 2024
SHA: 3d1e9ea
move config folder to root and adjust options (pytorch#83)
As titled: move the config files to the root folder, decoupling them from the torchtrain package build and allowing easier navigation.
SHA: 98a0f79
Commits on Feb 26, 2024
add iter time tracking via cuda events, add data loading times, add columnar display to show both, show avg iter & data loading times at end of training (pytorch#87)

This PR adds basic perf timing and display for per-iter and final-iter-average stats (in part based on Andrew's comment about having to open the trace to compare iter timing).

1. The tracking list is housed in TrainState, but it is not saved as part of the state dict, as I view this as useful but not saveable info.
2. Iter times are tracked after data loading is done each iter and after the optimizer step. The idea is to make this timing cover expressly the model training iter (not data loading or post-iter metrics calcs).
3. 'time' is now displayed at each iter along with the usual loss and lr.
4. At the end of training, assuming more than 3 iters were run, the average iter time is calculated by ignoring the first three iters (consider these warmup, especially as the cudaCacheAllocator gets warmed up) and displayed.
5. Based on @tianyu-l's feedback, data loading times are tracked as well, using the same timeit.default_timer() from timeit for consistency (CPU side, so no syncs needed).
6. After fiddling with printf width formatting options, added an aligned columnar display for the per-iter updates.

Now: <img width="1282" alt="Screenshot 2024-02-26 at 9 39 25 AM" src="https://github.com/pytorch/torchtrain/assets/46302957/9ee2ea7b-5c28-4d41-ba91-d4176c64fc66">

Before: <img width="1282" alt="Screenshot 2024-02-26 at 8 39 46 AM" src="https://github.com/pytorch/torchtrain/assets/46302957/37cbfa20-7f1d-4d94-be94-3505ef4498c0">
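The timing scheme described above (track iter time separately from data-load time, then average while skipping the first few warmup iters) can be sketched in plain Python. This is a minimal illustration, not the torchtrain code; the class and method names are hypothetical:

```python
from timeit import default_timer  # the same cheap CPU-side timer the PR mentions

class IterTimer:
    """Hypothetical sketch of per-iteration perf tracking."""

    def __init__(self):
        self.data_load_times = []  # seconds spent fetching each batch
        self.iter_times = []       # seconds per training step, data loading excluded

    def record_data_load(self, seconds: float) -> None:
        self.data_load_times.append(seconds)

    def record_iter(self, seconds: float) -> None:
        self.iter_times.append(seconds)

    def averages(self, warmup: int = 3):
        # Skip the first `warmup` iters (allocator/cache warmup) when averaging,
        # mirroring item 4 above. Returns None if too few iters were run.
        if len(self.iter_times) <= warmup:
            return None
        tail = self.iter_times[warmup:]
        avg_iter = sum(tail) / len(tail)
        avg_data = sum(self.data_load_times) / len(self.data_load_times)
        return avg_iter, avg_data
```

In a training loop one would call `default_timer()` before and after the data fetch and the optimizer step, and feed the deltas into the two `record_*` methods.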
SHA: 629652b
Fill missing options in toml file with argparse defaults (pytorch#91)
Summary: Follow-up on config unification: options not available in the config file are picked up from the command-line defaults.

Test Plan:
```
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-8.0.1, pluggy-1.4.0 -- /home/gnadathur/local/a/pytorch-env/bin/python
cachedir: .pytest_cache
rootdir: /data/users/gnadathur/a/torchtrain
configfile: pyproject.toml
plugins: cov-4.1.0
collecting ... collected 3 items

test/test_job_config.py::TestJobConfig::test_command_line_args PASSED    [ 33%]
test/test_job_config.py::TestJobConfig::test_job_config_file PASSED      [ 66%]
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist PASSED [100%]

---------- coverage: platform linux, python 3.10.13-final-0 ----------
Coverage XML written to file coverage.xml

============================= slowest 20 durations =============================
0.00s call test/test_job_config.py::TestJobConfig::test_job_config_file
0.00s call test/test_job_config.py::TestJobConfig::test_command_line_args
0.00s call test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s setup test/test_job_config.py::TestJobConfig::test_command_line_args
0.00s teardown test/test_job_config.py::TestJobConfig::test_command_line_args
0.00s setup test/test_job_config.py::TestJobConfig::test_job_config_file
0.00s setup test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s teardown test/test_job_config.py::TestJobConfig::test_job_config_file
0.00s teardown test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
============================== 3 passed in 0.06s ===============================
```

Co-authored-by: gnadathur <[email protected]>
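The merge described in this commit (file options win, argparse defaults fill the gaps) can be sketched as below. This is an illustrative sketch, not the torchtrain implementation; the option names and the `build_config` helper are hypothetical, and a dict stands in for the parsed toml file:

```python
import argparse

def build_config(file_options: dict) -> argparse.Namespace:
    """Merge options parsed from a config file over argparse defaults.

    Any key missing from the file falls back to the command-line default,
    which is the behavior this commit describes.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--training.steps", dest="steps", type=int, default=100)
    parser.add_argument("--training.batch_size", dest="batch_size", type=int, default=8)

    args = parser.parse_args([])          # start from pure defaults
    for key, value in file_options.items():
        if hasattr(args, key):
            setattr(args, key, value)     # file value overrides the default
    return args
```

With `{"steps": 10}` as the "file", `steps` comes from the file while `batch_size` keeps its argparse default.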
SHA: c866a64
Commits on Feb 27, 2024
support infinite loop over alpaca dataset
ghstack-source-id: 38cbc277e2a177bc0baf35450a661835b97a7f22 Pull Request resolved: pytorch#92
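The infinite-loop behavior this commit adds (restart the dataset once exhausted instead of stopping) can be sketched as a small generator. A minimal sketch with hypothetical names, not the torchtrain dataloader:

```python
def infinite_batches(dataset, loop=True):
    """Yield samples forever by restarting the dataset when it is exhausted.

    With loop=False this degrades to a single plain pass, and an empty
    dataset stops immediately instead of spinning forever.
    """
    while True:
        yielded = False
        for sample in dataset:
            yielded = True
            yield sample
        if not loop or not yielded:
            break  # finite mode, or nothing to re-loop over
```

A trainer pulling batches from such a generator never hits StopIteration mid-run, which is the "out of data after 40 iters" failure mode this addresses.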
SHA: 78a1643
Add color to console output if local logging, auto avoid color logging on slurm (pytorch#93)

This PR adds the ability to do colored console output in order to highlight the training data outputs. It also adds a check to not use this color formatting on slurm, where it would otherwise add 33= codes instead of the colors. Note that I've just added some color to highlight the main training data; users that fork/clone can use it to enhance their outputs as desired.

<img width="1372" alt="Screenshot 2024-02-26 at 10 20 15 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/44849821-1677-40bf-896c-39344cd661d6">

Note that on slurm it remains plain:

<img width="847" alt="Screenshot 2024-02-26 at 10 46 24 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/172eaa58-4f5c-48f5-8ec1-bc349e3e82f2">

If you don't check this, it would otherwise look like this (this does not happen with this PR, just showing what happens without the check; credit to Yifu for noting this would be an issue):

<img width="847" alt="Screenshot 2024-02-26 at 10 39 23 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/4a87fb9a-dd3a-417c-a29e-286ded069358">
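The "color locally, plain on slurm" decision described above can be sketched as an environment check. This is an illustrative sketch, not the torchtrain code; `color_enabled` and `highlight` are hypothetical names, and detecting slurm via `SLURM_JOB_ID` is an assumption about how such a check could be written:

```python
import os
import sys

# ANSI escape codes for coloring; printed raw they are exactly the kind of
# garbage a slurm log would show, hence the guard below.
GREEN, RESET = "\x1b[32m", "\x1b[0m"

def color_enabled(stream=sys.stdout) -> bool:
    """Color only for local interactive runs, never under slurm."""
    if "SLURM_JOB_ID" in os.environ:  # running under slurm: stay plain
        return False
    return hasattr(stream, "isatty") and stream.isatty()

def highlight(text: str) -> str:
    """Wrap text in green if coloring is safe, else return it unchanged."""
    return f"{GREEN}{text}{RESET}" if color_enabled() else text
```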
SHA: 6d9e4e6
update GPU metrics logging to GiB (gibibytes) (pytorch#95)
This PR updates the GPU metrics labeling to GiB - we were calculating GiB but calling it GB (credit to @awgu for flagging this - issue pytorch#94). Function names and member vars in metrics.py have been updated to _gib instead of _gb for clarity, and the logging output now labels as GiB:

<img width="851" alt="Screenshot 2024-02-27 at 11 28 23 AM" src="https://github.com/pytorch/torchtrain/assets/46302957/85eb260a-77e9-4c49-be8a-b1aaa10dc3e2">
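The distinction being fixed here is binary vs decimal units: a GiB is 2^30 bytes while a GB is 10^9 bytes, about a 7% difference. A minimal sketch (function names hypothetical, not the metrics.py code):

```python
def bytes_to_gib(num_bytes: float) -> float:
    """Gibibytes: 2**30 bytes per GiB (what the code was actually computing)."""
    return num_bytes / (1024 ** 3)

def bytes_to_gb(num_bytes: float) -> float:
    """Gigabytes: 10**9 bytes per GB (what the old label wrongly implied)."""
    return num_bytes / 1_000_000_000
```

For the H100 in the screenshot, 95.0396 GiB is roughly 102 GB, so mixing the labels up meaningfully misstates capacity.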
SHA: e987ac3
improve TensorBoard instructions in README
ghstack-source-id: 7dc4a80cf9c32f4dca3d00bcef019d256bdf58f7 Pull Request resolved: pytorch#96
SHA: 62ff09d
Commits on Feb 28, 2024
Enable libUV for torchtrain (pytorch#98)
Enable libUV for torchtrain.

Test:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
[rank0]:2024-02-28 09:12:04,581 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank1]:2024-02-28 09:12:04,708 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-28 09:12:05,647 - root - INFO - Building llama
[rank0]:2024-02-28 09:12:05,655 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-28 09:12:05,655 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-02-28 09:12:07,299 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-02-28 09:12:07,299 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-02-28 09:12:07,565 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-02-28 09:12:07,566 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-02-28 09:12:07,566 - root - INFO - Model llama debugmodel  size: 18,089,216 total parameters
[rank0]:2024-02-28 09:12:07,567 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-02-28 09:12:08,769 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-28 09:12:08,770 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-28 09:12:08,770 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240228-0912.
[rank0]:2024-02-28 09:12:08,977 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-02-28 09:12:10,956 - root - INFO - step:  1  loss: 10.9229  iter:  1.9386  data: 0.0368  lr: 0.00026667
[rank0]:2024-02-28 09:12:11,045 - root - INFO - step:  2  loss: 10.8673  iter:  0.0562  data: 0.0316  lr: 0.00053333
[rank0]:2024-02-28 09:12:11,130 - root - INFO - step:  3  loss: 10.7145  iter:  0.0523  data: 0.0322  lr: 0.0008
[rank0]:2024-02-28 09:12:11,219 - root - INFO - step:  4  loss: 10.5038  iter:  0.0559  data: 0.0319  lr: 0.0007
[rank0]:2024-02-28 09:12:11,304 - root - INFO - step:  5  loss: 10.2228  iter:  0.0537  data: 0.031  lr: 0.0006
[rank0]:2024-02-28 09:12:11,391 - root - INFO - step:  6  loss: 9.9677  iter:  0.0562  data: 0.0302  lr: 0.0005
[rank0]:2024-02-28 09:12:11,478 - root - INFO - step:  7  loss: 9.7762  iter:  0.0544  data: 0.0317  lr: 0.0004
[rank0]:2024-02-28 09:12:11,676 - root - INFO - step:  8  loss: 9.4359  iter:  0.0509  data: 0.0322  lr: 0.0003
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-28 09:12:11,813 - root - INFO - step:  9  loss: 9.2326  iter:  0.1007  data: 0.0321  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-28 09:12:12,195 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-02-28 09:12:12,207 - root - INFO - step: 10  loss: 9.1641  iter:  0.0971  data: 0.031  lr: 0.0001
[rank0]:2024-02-28 09:12:12,207 - root - INFO - Average iter time: 0.0670 seconds
[rank0]:2024-02-28 09:12:12,207 - root - INFO - Average data load time: 0.0314 seconds
[rank0]:2024-02-28 09:12:12,208 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Co-authored-by: gnadathur <[email protected]>
SHA: 60f6b0d
Commits on Feb 29, 2024
use warmup steps for lr scheduler, ban steps == -1 (pytorch#99)
As titled: we don't want to allow the steps == -1 case, as it would blow up the lr scheduler.
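The reason steps == -1 must be banned is that a warmup-then-decay schedule needs a known total-step horizon to compute the decay slope. A minimal sketch of such a schedule multiplier (the function and its exact shape are illustrative, not the torchtrain scheduler):

```python
def warmup_linear_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """LR multiplier: linear ramp over `warmup_steps`, then linear decay to 0.

    total_steps == -1 (i.e. "run forever") is rejected up front, because the
    decay term below would be meaningless without a finite horizon.
    """
    if total_steps == -1:
        raise ValueError("steps == -1 is not supported with the lr scheduler")
    if step < warmup_steps:
        return (step + 1) / warmup_steps                    # warmup ramp
    remaining = max(total_steps - warmup_steps, 1)
    return max(0.0, (total_steps - step) / remaining)       # decay to zero
```

The multiplier would typically be fed to something like an optimizer's per-step LR lambda; here it is a plain function so the shape is easy to inspect.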
SHA: 7acab70
Add llama 7B config (pytorch#100)
Add the 7B config and adjust options to be more realistic. Didn't add this to the train scripts as the default since it's expensive to init; whoever uses it can adjust it accordingly.
SHA: d5c27a9
add selective activation checkpointing
ghstack-source-id: f7ee3c867bfcdcae5dbb490982920606191e6f40 Pull Request resolved: pytorch#97
SHA: 2c8cec2
Commits on Mar 1, 2024
Add job description field in toml (pytorch#101)
Summary: Adding a description field, useful for integration tests to describe the test.

Test Plan:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
[rank1]:2024-02-29 17:05:04,269 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:04,510 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Starting job: debug training
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Building llama
[rank0]:2024-02-29 17:05:05,335 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-29 17:05:05,335 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-02-29 17:05:06,782 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-02-29 17:05:06,782 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-02-29 17:05:07,347 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model llama debugmodel  size: 18,089,216 total parameters
[rank0]:2024-02-29 17:05:07,349 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-02-29 17:05:08,375 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240229-1705.
[rank0]:2024-02-29 17:05:08,610 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-02-29 17:05:10,570 - root - INFO - step:  1  loss: 10.9183  iter:  1.9258  data: 0.0303  lr: 0.00026667
[rank0]:2024-02-29 17:05:10,653 - root - INFO - step:  2  loss: 10.8347  iter:  0.0487  data: 0.0336  lr: 0.00053333
[rank0]:2024-02-29 17:05:10,733 - root - INFO - step:  3  loss: 10.6861  iter:  0.045  data: 0.0334  lr: 0.0008
[rank0]:2024-02-29 17:05:10,812 - root - INFO - step:  4  loss: 10.4672  iter:  0.0453  data: 0.0336  lr: 0.0007
[rank0]:2024-02-29 17:05:10,893 - root - INFO - step:  5  loss: 10.2154  iter:  0.0466  data: 0.033  lr: 0.0006
[rank0]:2024-02-29 17:05:10,975 - root - INFO - step:  6  loss: 9.9573  iter:  0.0496  data: 0.0314  lr: 0.0005
[rank0]:2024-02-29 17:05:11,056 - root - INFO - step:  7  loss: 9.7627  iter:  0.0486  data: 0.0321  lr: 0.0004
[rank0]:2024-02-29 17:05:11,201 - root - INFO - step:  8  loss: 9.437  iter:  0.0457  data: 0.0333  lr: 0.0003
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-29 17:05:11,317 - root - INFO - step:  9  loss: 9.2446  iter:  0.0794  data: 0.0324  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-29 17:05:11,748 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-02-29 17:05:11,762 - root - INFO - step: 10  loss: 9.1772  iter:  0.0893  data: 0.0324  lr: 0.0001
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average iter time: 0.0578 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average data load time: 0.0326 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Co-authored-by: gnadathur <[email protected]>
SHA: 452baee
Commits on Mar 2, 2024
fix 2D parallel crash caused by all-reduce on 2D world_mesh
ghstack-source-id: 1c5bf790d7473f6a24124051fcfa1fd2585a56f9 Pull Request resolved: pytorch#105
SHA: eb3fdd0
Commits on Mar 5, 2024
Load missing keys default from argparse (pytorch#111)
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
[rank0]:2024-03-04 17:01:28,834 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank1]:2024-03-04 17:01:28,857 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Starting job: debug training
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Building llama
[rank0]:2024-03-04 17:01:29,719 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-04 17:01:29,719 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-03-04 17:01:31,187 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-03-04 17:01:31,188 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-04 17:01:31,347 - root - INFO - Model llama debugmodel  size: 18,089,216 total parameters
[rank0]:2024-03-04 17:01:31,347 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-03-04 17:01:32,502 - root - INFO - Applied FSDP to the model...
[rank0]:2024-03-04 17:01:32,503 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-03-04 17:01:32,504 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240304-1701.
[rank0]:2024-03-04 17:01:32,901 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-03-04 17:01:34,806 - root - INFO - step:  1  loss: 10.8424  iter:  1.8688  data: 0.0316  lr: 0.00026667
[rank0]:2024-03-04 17:01:34,891 - root - INFO - step:  2  loss: 10.7581  iter:  0.0476  data: 0.0357  lr: 0.00053333
[rank0]:2024-03-04 17:01:34,970 - root - INFO - step:  3  loss: 10.6239  iter:  0.045  data: 0.0333  lr: 0.0008
[rank0]:2024-03-04 17:01:35,048 - root - INFO - step:  4  loss: 10.4163  iter:  0.0455  data: 0.0323  lr: 0.0007
[rank0]:2024-03-04 17:01:35,127 - root - INFO - step:  5  loss: 10.1529  iter:  0.0459  data: 0.032  lr: 0.0006
[rank0]:2024-03-04 17:01:35,206 - root - INFO - step:  6  loss: 9.8899  iter:  0.0468  data: 0.0311  lr: 0.0005
[rank0]:2024-03-04 17:01:35,284 - root - INFO - step:  7  loss: 9.7204  iter:  0.0461  data: 0.0312  lr: 0.0004
[rank0]:2024-03-04 17:01:35,425 - root - INFO - step:  8  loss: 9.3757  iter:  0.0457  data: 0.0319  lr: 0.0003
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-03-04 17:01:35,537 - root - INFO - step:  9  loss: 9.1883  iter:  0.0762  data: 0.0318  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-03-04 17:01:35,958 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-03-04 17:01:35,971 - root - INFO - step: 10  loss: 9.1212  iter:  0.0808  data: 0.0319  lr: 0.0001
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average iter time: 0.0553 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average data load time: 0.0317 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Co-authored-by: gnadathur <[email protected]>
SHA: 2682144
Add meta_init, enable it as default init process (pytorch#84)
This PR enables meta_init functionality to avoid OOMing on CPU for larger models. The core functionality is in meta_init.py, with a few changes in parallelization and train.py.

Key items:
1. This is largely the same as the earlier PR I had for meta_init, but done fresh because that was faster than reworking it with all the interim changes.
2. To address feedback in the previous PR:
a. Why do we need meta_init.py; can't we just do:
~~~
with torch.device("meta"):
    model = Model.from_args(...)
~~~
Unfortunately this does not work, because the rope embeddings are treated differently (as a buffer), and thus the simple lambda call from param_init_fn in FSDP (lambda module: module.to_device('cuda')) will not move the rope embeddings, and the model will fail on the first forward. This issue relates to the nn.embeddings not being moved, and to the device being referenced in the forward pass of the current rope class. Opened pytorch#110 to track and investigate this, rather than holding up the working meta_init from landing.
b. Per earlier feedback, meta_init is now not optional but simply the default. This should ensure all models leverage it, and that we aren't missing things for future meta_init aspects.
3. Misc change: switched model_params to count all params instead of 'unique params', because the latter does not mesh with what people perceive model size as.

Testing: tested both debugmodel and the 26B model with and without meta_init to confirm the same loss curves.

Note for future reference: if you get a bad init (meta_init failure) you will simply not train (loss is the same every iter). If you fail to call reset_params after FSDP, you will train (because we default to torch.randn_like), but your starting loss will be 5x+ higher, telling you the model has not been properly initialized.
SHA: afbf62a
Fix feedback from PR 111 (pytorch#113)
Co-authored-by: gnadathur <[email protected]>
SHA: f91f97a
Commits on Mar 6, 2024
ghstack-source-id: 5133a8d97ad209b569e0fc528e58daafdd31d80d Pull Request resolved: pytorch#114
SHA: 1a180ee
ghstack-source-id: a0c8b4454f75ad1cd9824ac89a1df0182f6a7d8c Pull Request resolved: pytorch#112
SHA: ed04380
SHA: 41f5172
Commits on Mar 7, 2024
add miniPile dataset for pretraining, 1M entries (solves the 'out of data' at 40 iters issue) (pytorch#88)

This PR adds the minipile (1M entries, 6GB) dataset as an option for pretraining with torchtrain. It resolves the issue where we run out of data after 40 iterations with the default alpaca dataset.

Per @tianyu-l's excellent suggestion, this has been refactored into a single hf_datasets.py file that supports both minipile and alpaca, since it turned out there was no need for a different tokenizer, etc. Also cleaned up the datasets package so that create_tokenizer is exposed directly, and thus all public APIs can be used directly from torchtrain.datasets.

Lastly, added a warning if/when a dataset is being re-looped, so users don't get burned by overfitting:
<img width="1294" alt="Screenshot 2024-03-06 at 5 11 09 AM" src="https://github.com/pytorch/torchtrain/assets/46302957/82480b6f-c677-4794-80c5-5c10b037732a">

Adds a color highlight to showcase which dataloader was built:
<img width="1360" alt="Screenshot 2024-03-05 at 9 19 10 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/4717ec6a-14bb-4283-a3ae-fa40c27deee0">
and
<img width="1360" alt="Screenshot 2024-03-05 at 9 22 01 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/dbf32d51-2dd4-4526-8855-9b33b627559e">

Usage: just add "minipile" or "alpaca" as the dataset in the training config toml file.
<img width="439" alt="Screenshot 2024-02-25 at 12 35 26 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/1afbaed1-07f8-4e37-b8cc-80190db7fb27">

Testing: verified training loss is improving, and ran for 100 iters to verify there is no longer an out-of-data issue with minipile. Reran with alpaca and saw the expected out-of-data at 40 iters without the infinite loop option; runs to 100 with infinite.

Notes: I did not make this the default dataset, since for debugmodel mostly running 10 iters is fine and there's 6GB to pull down.
<img width="869" alt="Screenshot 2024-02-25 at 12 30 29 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/1070a80a-ad20-4f0f-a860-e13caa3120a0">
(commit 680f1aa)
add data loading option to load from local file system
ghstack-source-id: 3c930054d3b04faf3866048740a2ef887d066dd6 Pull Request resolved: pytorch#117
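A minimal sketch of the idea: decide whether a dataset spec names a local path or a hub dataset, then hand it to the appropriate loader. The helper below is hypothetical — torchtrain's actual option names and code paths may differ:

```python
import os

def resolve_dataset_source(name_or_path):
    """Classify a dataset spec as a local file-system path or a hub id."""
    if os.path.exists(name_or_path):
        return ("local", os.path.abspath(name_or_path))
    return ("hub", name_or_path)

# With HuggingFace datasets, a local source could then be loaded via e.g.
# load_dataset("json", data_files=path), while a hub id goes through
# load_dataset(name) as before.
```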
(commit 85263f7)
Commits on Mar 9, 2024
-
ghstack-source-id: 733bf85716cda3a5b9af780eba79c9b5dd66abad Pull Request resolved: pytorch#121
(commit 3c51744)
ghstack-source-id: d7cd26d84aa2442ac45223992e1766397e52c8d8 Pull Request resolved: pytorch#122
(commit 649cf0b)
set betas and weight decay for optimizers
according to suggestions in pytorch#118 (comment) ghstack-source-id: 357f0872cd1c9bad2c4c256d47adbd3f716a7651 Pull Request resolved: pytorch#123
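The linked discussion concerns non-default Adam/AdamW hyperparameters; an optimizer builder might look like the sketch below. The concrete betas/weight_decay values here are illustrative placeholders, not necessarily what was committed — check the training config for the real values:

```python
import torch

def build_optimizer(model, name="AdamW", lr=8e-4):
    # betas/weight_decay are illustrative; see the job config for actual values.
    if name == "Adam":
        return torch.optim.Adam(
            model.parameters(), lr=lr, betas=(0.9, 0.95), weight_decay=0.1
        )
    elif name == "AdamW":
        return torch.optim.AdamW(
            model.parameters(), lr=lr, betas=(0.9, 0.95), weight_decay=0.1
        )
    raise NotImplementedError(f"optimizer {name} not supported")
```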
(commit ab05f66)
Add c4 dataset (177M, streaming), update multi-node support for lates…
…t job configs (pytorch#124) This PR: 1 - adds the English-language portion of the c4 dataset, which has 177M entries (https://huggingface.co/datasets/allenai/c4). Due to the size, streaming is enabled as the default. This uses allen-ai/c4, as the original c4 is apparently being deprecated and HF advises using allen-ai now. For comparison, per @tianyu-l's request, average times over 40 iterations: alpaca cached loading: average data load time 0.0279 seconds; c4 streaming loading: average data load time 0.0290 seconds. There is a longer initial delay during 'preparing c4' vs alpaca (i.e. 45 seconds vs 10 seconds), but after that speed is similar. Dataset sample (not displayed in training, just an excerpt I pulled to double-check the data flow): <img width="1233" alt="Screenshot 2024-03-08 at 5 31 06 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/94915f80-da70-48d1-8c43-43f874fef121"> 2 - I also updated the multi-node slurm file to account for the new job config. Test: verified no looping with 100 iterations; sampled streamed data to verify.
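As with the earlier datasets, selection stays config-driven; a hypothetical toml snippet (exact key names may differ from the real train config files):

```toml
[training]
dataset = "c4"  # 177M-entry English C4; streamed by default due to its size
```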
(commit 66c196b)
Commits on Mar 12, 2024
-
Add openwebtext dataset for larger scale training without shuffling (p…
…ytorch#130) This PR adds the openwebtext 1M dataset. This is a homogeneous dataset, so we are able to train successfully without any shuffling in our dataset loader. 1 - adds the dataset to hf_datasets 2 - makes openwebtext the default dataset for the 13B and 70B configs, since it is the preferred choice for larger-scale training. Testing - ran 5K iters (9 nodes) to verify no spiking issues: <img width="787" alt="Screenshot 2024-03-12 at 9 50 57 AM" src="https://github.com/pytorch/torchtrain/assets/46302957/420fa1fc-50f8-47bc-9b07-02c8fa132e7c">
(commit 10229d6)
[TorchTrain][Checkpoint] Fix TrainState state_dict to unblock loading (…
…pytorch#131) This fix temporarily unblocks loading, so we won't run into the issue of: ``` [rank0]:[rank0]: train_state.losses.append(train_state.current_loss) [rank0]:[rank0]: AttributeError: 'float' object has no attribute 'append' ``` However, current_loss and losses are still not correct, since with the current setup they would differ across ranks. Also, we don't know the size of losses, because it depends on the number of steps. So loading still works, but the values of current_loss and losses are not loaded correctly. I will follow up with further fixes.
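The type bug can be seen in miniature below: if `state_dict` round-trips `losses` as anything other than a list, the next `append` blows up. This is a stand-alone illustration, not the torchtrain class (which checkpoints through DCP):

```python
from dataclasses import dataclass, field

@dataclass
class TrainState:
    step: int = 0
    current_loss: float = -1.0
    losses: list = field(default_factory=list)

    def state_dict(self):
        # Keep `losses` a list on save so that, after loading, code like
        # `train_state.losses.append(...)` still sees a list, not a float.
        return {
            "step": self.step,
            "current_loss": self.current_loss,
            "losses": list(self.losses),
        }

    def load_state_dict(self, sd):
        self.step = sd["step"]
        self.current_loss = sd["current_loss"]
        self.losses = list(sd["losses"])
```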
(commit 7fee3cf)
Commits on Mar 13, 2024
-
ghstack-source-id: de61ec093b43a2ccbf1156c76ba81ecd698a6a8a Pull Request resolved: pytorch#132
(commit 7cd2725)
use SequenceParallel style in tp/sp (pytorch#133)
simplify things given we already have SequenceParallel style landed in main
(commit 3161ffb)
Commits on Mar 14, 2024
-
ghstack-source-id: c13ebb8de8e8e9203624b5dd710a046d17311b0f Pull Request resolved: pytorch#137
(commit e39ee7e)
disable verbose print from profiling
ghstack-source-id: ca6eb8f42bf3c2a59d8e6389e7fe94ed85103099 Pull Request resolved: pytorch#136
(commit 5d18bf0)
add Selective layer activation checkpointing, single control for turn…
…ing AC on or off. (pytorch#125) This PR: 1 - adds selective layer checkpointing - this lets the user checkpoint every x-th layer (i.e. 2 = every other layer is checkpointed). The config spec was updated by Wanchao, so we now have this layout for AC, which is hopefully self-explanatory (it covers none, full, selective op, or selective layer, plus the layer filtering policy): <img width="941" alt="Screenshot 2024-03-13 at 6 09 52 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/4b992286-1fbd-4a14-957a-4325f81a9ab4"> Thus, it lets the user toggle from traditional 'all layers' checkpointing down to more and more fine-grained checkpointing. Note that I implemented this for IBM last summer, and in their llama testing every 2nd layer was the best bang for the buck, so I have made that the default. 2 - the config file has been updated with a new [activation_checkpointing] section, making it easier to modify vs being dumped into the training section. Testing and results: I tested all the AC options to ensure they work, and that we fail if both types are set to true in the config: <img width="608" alt="Screenshot 2024-03-09 at 3 43 52 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/e3c20fbf-73e2-492d-9fb9-f32e772e239e">
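Selective layer checkpointing reduces to wrapping every x-th block. A simplified sketch using PyTorch's (private) checkpoint_wrapper utility — the actual config plumbing is richer, and the model/attribute names here are illustrative:

```python
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
)

def apply_selective_layer_ac(model: nn.Module, every_x_layer: int = 2) -> nn.Module:
    """Checkpoint every `every_x_layer`-th transformer block
    (2 = every other layer, the default discussed above)."""
    for layer_id, (name, block) in enumerate(model.layers.named_children()):
        if every_x_layer > 0 and layer_id % every_x_layer == 0:
            # register_module replaces the child in place with the wrapped block
            model.layers.register_module(name, checkpoint_wrapper(block))
    return model
```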
(commit 0d415d7)
ghstack-source-id: 581c9115e89d3de57e558175b527c12c06a6808c Pull Request resolved: pytorch#134
(commit cc2061a)
Commits on Mar 15, 2024
-
Shorten nccl comm timeout and enable flight recorder dumping (pytorch…
…#103) Timeout ------- Whether during iterative debugging or long-running training, it's convenient to find out about a failure as soon as possible. The default timeout is way too long and leads to wasted cluster time or developer frustration. The timeout can be adjusted via the cmdline or in the .toml if it needs to be larger for a particular model. Another useful pattern can be to set a large timeout for initialization and then tighten it after iteration 1; we can add this later if desired. Ideally we could pass the timeout to the device mesh ctor, but it's not ready yet. Also, we could change the timeouts of existing PGs after creating them, but that's more LOC and not necessary unless we want to change timeouts at runtime. Dumps ----- Dumping on timeout should be a safe default for everyone. It has the side effect of requiring a dump path, which defaults to ~/pgnccl_dump but can be overridden via the DUMP_PATH env var. The raw content of the dump is a pickle that is intended to be consumed through scripts/tools which are under development, so it may not be easy to know how to use these for now. As the tooling matures, we should provide reference docs and probably print out pointers in the logs when we perform the dump. Test plan: tested locally by adding a rank0 sleep for 10 sec inside the training loop, validating that all 8 ranks dumped a trace.
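One way to sketch the setup: a short NCCL timeout plus flight-recorder dumping configured through environment variables. The variable names are from recent PyTorch and may differ by version; the dump path mirrors the ~/pgnccl_dump default mentioned above:

```python
import os
from datetime import timedelta

# Enable the NCCL flight recorder and dump-on-timeout (env var names per
# recent PyTorch; verify against your torch version).
os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "20000")
os.environ.setdefault("TORCH_NCCL_DUMP_ON_TIMEOUT", "1")
os.environ.setdefault(
    "TORCH_NCCL_DEBUG_INFO_TEMP_FILE", os.path.expanduser("~/pgnccl_dump")
)

# A much tighter timeout than the default, passed at PG creation, e.g.:
# torch.distributed.init_process_group("nccl", timeout=NCCL_TIMEOUT)
NCCL_TIMEOUT = timedelta(seconds=300)
```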
(commit 3b3362b)
fix up gpu memory monitoring and logging
ghstack-source-id: 2f79d081c7724dbc34f357913671e8aefdf437b1 Pull Request resolved: pytorch#147
(commit 9f5a56d)
Separate timeout during init and training (pytorch#149)
Allow a tighter timeout during training than during init. Init includes the first train step, as well as any loading and setup. It can be slower and less predictable due to various factors including lazy initialization or jit compilation. After the first train step, we expect more predictable runtime and benefit from a tighter timeout to give quick feedback on a hang. Tested by pasting this code in 2 places ``` if dp_mesh.get_local_rank() == 0 and train_state.step == 1: import time time.sleep(10) ``` (a) before calling set_pg_timeout, which did not cause a timeout (b) after calling set_pg_timeout, which timed out
(commit 9eb6a21)
Commits on Mar 20, 2024
-
(commit 6485be9)
Refactor to clean up parallelisms/__init__.py
(second attempt, didn't land correctly) ghstack-source-id: 3dfec3ed134105cc5a951f8db160c8c2a9b3349b Pull Request resolved: pytorch#154
(commit fd4c75b)
enable gc control scheduling to help avoid stragglers (pytorch#148)
This PR adds control over Python garbage collection to help avoid stragglers during large-scale training. Update: this feature is now exposed as a controllable option, gc_schedule, with a default of 50. 0 = not enabled; an int schedules gc every that many iters during the training loop. <img width="1078" alt="Screenshot 2024-03-15 at 12 39 26 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/1ee387c5-f0a6-4366-936c-a1e281dad88f"> Effectively we disable automatic gc, run one collection to ensure a good starting point, and then at the start of every gc_schedule-th iter we call gc to free things up. Enforcing a fixed schedule for collection helps all ranks stay more in sync. Point of reference: on 512-GPU FSDP, adding this (gc_schedule=1) gave a perf boost of ~1.5% per iter just by virtue of better sync. (This was originally developed during dist compiler work to resolve stragglers; I believe @fegin came up with this solution.)
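The mechanism fits in a few lines of standard-library code; a stand-alone sketch (torchtrain's helper may be named and wired differently):

```python
import gc

class GarbageCollection:
    """Disable automatic GC and collect on a fixed schedule, so all ranks
    pause for collection at the same step instead of at random times."""

    def __init__(self, gc_schedule: int = 50):
        self.gc_schedule = gc_schedule  # 0 = leave automatic GC enabled
        if gc_schedule > 0:
            gc.disable()
            gc.collect(1)  # one collection up front for a clean starting point

    def run(self, step: int) -> None:
        if self.gc_schedule > 0 and step % self.gc_schedule == 0:
            gc.collect(1)
```

In a training loop this would be constructed once and then `run(step)` called at the top of each iteration.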
(commit 93c2b7d)
(commit 9e7920f)
ghstack-source-id: 995efd6f460f3fe83ecf8d72c2178554f325485b Pull Request resolved: pytorch#151
(commit e5d1b89)
Commits on Mar 21, 2024
-
disable buffer reuse for compile for now (pytorch#156)
disable buffer reuse for compile to keep numerics close to eager mode, as suggested by @Chillee. This is probably only a temporary change until the buffer-reuse fix lands in inductor.
(commit ceebd53)
Commits on Mar 22, 2024
-
refactor config manager and support cmd overrides (pytorch#157)
This PR supports explicit cmd overrides, to allow infra layers to override certain options (the most important one being dump_folder)
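The override mechanism amounts to layering argparse on top of the TOML defaults. A toy sketch, not the actual JobConfig code:

```python
import argparse

def config_with_overrides(defaults, argv=()):
    """Build a config dict from TOML-style 'section.option' defaults,
    letting the command line override any key."""
    parser = argparse.ArgumentParser()
    for key, default in defaults.items():
        parser.add_argument(f"--{key}", type=type(default), default=default)
    return vars(parser.parse_args(list(argv)))

# An infra layer can now redirect the dump folder without editing the toml:
cfg = config_with_overrides(
    {"job.dump_folder": "./outputs", "training.steps": 10},
    ["--job.dump_folder", "/tmp/run1"],
)
```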
(commit 32aa083)
Commits on Mar 24, 2024
-
(commit a21645e)
Commits on Mar 25, 2024
-
rename sequence_parallel to tensor_parallel (pytorch#162)
This PR renames sequence_parallel to tensor_parallel. As sequence parallel is only applied to the rmsnorm layers, the broader name tensor_parallel fits better, perhaps with sequence_parallel as an option within it. ghstack is broken :( so using a direct branch push instead
(commit e28832e)
Commits on Mar 27, 2024
-
add basic AC configs for 13B and 70B (pytorch#169)
as titled; currently 13B uses selective op and 70B uses selective layer. We can do some more experiments and adjust the configs later
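For reference, the per-model choice lives in each train config's AC section; a hypothetical snippet (key names are illustrative, not copied from the repo):

```toml
[activation_checkpointing]
mode = "selective"          # "none" | "full" | "selective"
selective_ac_option = "op"  # 13B: op-level; 70B: a layer stride such as "2"
```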
(commit 6722657)
[TorchTrain][Checkpoint] Update train state to include global_avg_los…
…ses and global_max_losses (pytorch#167) Based on discussion with @tianyu-l, we decided to only checkpoint `global_avg_losses` and `global_max_losses` per log-frequency iteration, to avoid an all_reduce and device sync every iteration. `TrainState.current_loss` and `TrainState.losses` are removed from the TrainState `state_dict()` and `load_state_dict()` calls. Tested by saving/loading at 30 steps with log_frequency = 10, then loading at 40 steps to resume training. The numerics in the global_avg_losses and global_max_losses lists align with expectations. ```
Step 30 save: [rank0]:before save: self.states['train_state']=TrainState(step=30, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555], log_steps=[1, 11, 21])
Step 30 load: [rank0]:after load: self.states['train_state']=TrainState(step=30, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555], log_steps=[1, 11, 21])
Step 40 load and resume training: [rank0]:before save: self.states['train_state']=TrainState(step=40, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945, 5.596909999847412], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555, 5.6796345710754395], log_steps=[1, 11, 21, 31])
```
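The intended bookkeeping, sketched: buffer per-step losses locally, and only at log frequency reduce them across ranks and record them into TrainState. Function and field names are illustrative and the real code may differ:

```python
import torch
import torch.distributed as dist

def log_and_record(train_state, step, losses_since_last_log, log_freq=10):
    """Every `log_freq` steps, all-reduce the local average/max loss and
    append the global values, so no per-step collective or sync is needed."""
    if step % log_freq != 0:
        return
    local = torch.tensor(
        [
            sum(losses_since_last_log) / len(losses_since_last_log),
            max(losses_since_last_log),
        ]
    )
    if dist.is_initialized():
        dist.all_reduce(local[0:1], op=dist.ReduceOp.AVG)  # global average
        dist.all_reduce(local[1:2], op=dist.ReduceOp.MAX)  # global max
    train_state.global_avg_losses.append(local[0].item())
    train_state.global_max_losses.append(local[1].item())
    train_state.log_steps.append(step)
```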
(commit c49cc9e)
Basic integration test infra (pytorch#170)
Summary: PR adds an option `use_for_integration_test`. when set to `True`, this adds the config to the integration test suite. A test runner picks all the configs marked for integration test and run them. Test Plan: ``` =====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh===== + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0 + CONFIG_FILE=./train_configs/debug_model.toml + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] ***************************************** W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] ***************************************** [rank0]:2024-03-27 09:46:32,214 - root - INFO - Starting job: LLaMA debug training [rank0]:2024-03-27 09:46:32,372 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:2024-03-27 09:46:32,375 - root - INFO - Building 1-D device mesh with ['dp'], [4] [rank0]:2024-03-27 09:46:32,377 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-03-27 09:46:32,384 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank0]:2024-03-27 09:46:32,384 - root - INFO - Preparing alpaca dataset from HuggingFace [rank0]:2024-03-27 09:46:34,015 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank0]:2024-03-27 09:46:34,024 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-03-27 09:46:34,025 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory [rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied selective activation checkpointing to the model [rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied FSDP to the model [rank0]:2024-03-27 09:46:34,171 - root - INFO - Model fully initialized via reset_parameters [rank0]:2024-03-27 09:46:34,171 - root - INFO - Gradient scaling not enabled [rank0]:2024-03-27 09:46:34,171 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-0946 [rank0]:2024-03-27 09:46:34,809 - root - INFO - Profiling active. 
Traces will be saved at ./outputs/profiling/traces [rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.) [rank0]: warnings.warn( [rank0]:2024-03-27 09:46:35,627 - root - INFO - �[36mstep: 1 �[32mloss: 10.9486 �[33mmemory: 9.42GiB(9.91%) �[34mwps: 20,066 �[35mmfu: 0.25%�[39m [rank0]:2024-03-27 09:46:35,627 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05 [rank0]:2024-03-27 09:46:35,705 - root - INFO - �[36mstep: 2 �[32mloss: 10.8786 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 212,046 �[35mmfu: 2.60%�[39m [rank0]:2024-03-27 09:46:35,786 - root - INFO - �[36mstep: 3 �[32mloss: 10.7362 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 204,441 �[35mmfu: 2.50%�[39m [rank0]:2024-03-27 09:46:35,863 - root - INFO - �[36mstep: 4 �[32mloss: 10.5094 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 216,800 �[35mmfu: 2.66%�[39m [rank0]:2024-03-27 09:46:35,939 - root - INFO - �[36mstep: 5 �[32mloss: 10.2755 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 216,527 �[35mmfu: 2.65%�[39m [rank0]:2024-03-27 09:46:36,016 - root - INFO - �[36mstep: 6 �[32mloss: 10.0318 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 214,117 �[35mmfu: 2.62%�[39m [rank0]:2024-03-27 09:46:36,093 - root - INFO - �[36mstep: 7 �[32mloss: 9.7929 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 216,509 �[35mmfu: 2.65%�[39m [rank0]:2024-03-27 09:46:36,192 - root - INFO - �[36mstep: 8 �[32mloss: 9.5539 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 166,639 �[35mmfu: 2.04%�[39m [rank0]:2024-03-27 09:46:36,329 - root - INFO - �[36mstep: 9 �[32mloss: 9.3909 
�[33mmemory: 11.38GiB(11.97%) �[34mwps: 120,381 �[35mmfu: 1.47%�[39m [rank0]:[rank0]:[W327 09:46:36.744143018 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:2024-03-27 09:46:36,409 - root - INFO - �[36mstep: 10 �[32mloss: 9.2749 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 207,613 �[35mmfu: 2.54%�[39m [rank0]:NCCL version 2.20.5+cuda12.0 ``` Reviewers: Subscribers: Tasks: Tags: --------- Co-authored-by: gnadathur <[email protected]>
(commit 2b017fd)
Add 2D integration test (FSDP + TP) (pytorch#171)
Summary: Add a 2D test to integration test suite Test Plan: ``` =====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh===== + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0 + CONFIG_FILE=./train_configs/debug_model.toml + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] ***************************************** W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] ***************************************** [rank0]:2024-03-27 14:29:49,466 - root - INFO - Starting job: LLaMA debug training [rank0]:2024-03-27 14:29:49,615 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:2024-03-27 14:29:49,621 - root - INFO - Building 1-D device mesh with ['dp'], [4] [rank0]:2024-03-27 14:29:49,623 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-03-27 14:29:49,630 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank0]:2024-03-27 14:29:49,630 - root - INFO - Preparing alpaca dataset from HuggingFace [rank0]:2024-03-27 14:29:51,114 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank0]:2024-03-27 14:29:51,124 
- root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-03-27 14:29:51,124 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory [rank0]:2024-03-27 14:29:51,259 - root - INFO - Applied selective activation checkpointing to the model [rank0]:2024-03-27 14:29:51,259 - root - INFO - Applied FSDP to the model [rank0]:2024-03-27 14:29:51,284 - root - INFO - Model fully initialized via reset_parameters [rank0]:2024-03-27 14:29:51,284 - root - INFO - Gradient scaling not enabled [rank0]:2024-03-27 14:29:51,285 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-1429 [rank0]:2024-03-27 14:29:52,056 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.) 
[rank0]: warnings.warn( [rank0]:2024-03-27 14:29:52,825 - root - INFO - �[36mstep: 1 �[32mloss: 10.7425 �[33mmemory: 9.42GiB(9.91%) �[34mwps: 21,337 �[35mmfu: 0.26%�[39m [rank0]:2024-03-27 14:29:52,825 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05 [rank0]:2024-03-27 14:29:52,905 - root - INFO - �[36mstep: 2 �[32mloss: 10.6722 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 208,060 �[35mmfu: 2.55%�[39m [rank0]:2024-03-27 14:29:52,982 - root - INFO - �[36mstep: 3 �[32mloss: 10.5435 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 213,622 �[35mmfu: 2.62%�[39m [rank0]:2024-03-27 14:29:53,060 - root - INFO - �[36mstep: 4 �[32mloss: 10.3359 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 212,856 �[35mmfu: 2.61%�[39m [rank0]:2024-03-27 14:29:53,139 - root - INFO - �[36mstep: 5 �[32mloss: 10.0965 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 209,326 �[35mmfu: 2.56%�[39m [rank0]:2024-03-27 14:29:53,215 - root - INFO - �[36mstep: 6 �[32mloss: 9.8806 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 216,808 �[35mmfu: 2.66%�[39m [rank0]:2024-03-27 14:29:53,292 - root - INFO - �[36mstep: 7 �[32mloss: 9.6442 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 214,874 �[35mmfu: 2.63%�[39m [rank0]:2024-03-27 14:29:53,367 - root - INFO - �[36mstep: 8 �[32mloss: 9.4349 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 220,877 �[35mmfu: 2.70%�[39m [rank0]:2024-03-27 14:29:53,500 - root - INFO - �[36mstep: 9 �[32mloss: 9.2674 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 123,924 �[35mmfu: 1.52%�[39m [rank0]:[rank0]:[W327 14:29:53.248291822 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:2024-03-27 14:29:53,577 - root - INFO - �[36mstep: 10 �[32mloss: 9.1404 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 214,910 �[35mmfu: 2.63%�[39m [rank0]:NCCL version 2.20.5+cuda12.0 =====Integration test: CONFIG_FILE=./train_configs/debug_model_2d.toml NGPU=4 ./run_llama_train.sh===== + export USE_LIBUV=1 + USE_LIBUV=1 + 
TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0 + CONFIG_FILE=./train_configs/debug_model_2d.toml + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_2d.toml W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] ***************************************** W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] ***************************************** [rank0]:2024-03-27 14:30:00,872 - root - INFO - Starting job: LLaMA debug training [rank0]:2024-03-27 14:30:01,177 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:2024-03-27 14:30:01,182 - root - INFO - Building 2-D device mesh with ['dp', 'tp'], [2, 2] [rank0]:2024-03-27 14:30:01,185 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-03-27 14:30:01,194 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank0]:2024-03-27 14:30:01,195 - root - INFO - Preparing alpaca dataset from HuggingFace [rank0]:2024-03-27 14:30:02,807 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank0]:2024-03-27 14:30:02,818 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-03-27 14:30:02,819 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory 
[rank0]:2024-03-27 14:30:02,830 - root - INFO - Applied Sequence Parallelism to the model [rank0]:2024-03-27 14:30:02,975 - root - INFO - Applied selective activation checkpointing to the model [rank0]:2024-03-27 14:30:02,975 - root - INFO - Applied FSDP to the model [rank0]:2024-03-27 14:30:03,004 - root - INFO - Model fully initialized via reset_parameters [rank0]:2024-03-27 14:30:03,004 - root - INFO - Gradient scaling not enabled [rank0]:2024-03-27 14:30:03,005 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-1430 [rank0]:2024-03-27 14:30:03,642 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.) 
[rank0]: warnings.warn( [rank0]:2024-03-27 14:30:04,528 - root - INFO - �[36mstep: 1 �[32mloss: 10.8502 �[33mmemory: 5.71GiB(6.01%) �[34mwps: 9,259 �[35mmfu: 0.11%�[39m [rank0]:2024-03-27 14:30:04,528 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05 [rank0]:2024-03-27 14:30:04,679 - root - INFO - �[36mstep: 2 �[32mloss: 10.7671 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 54,430 �[35mmfu: 0.67%�[39m [rank0]:2024-03-27 14:30:04,773 - root - INFO - �[36mstep: 3 �[32mloss: 10.6390 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 88,457 �[35mmfu: 1.08%�[39m [rank0]:2024-03-27 14:30:04,864 - root - INFO - �[36mstep: 4 �[32mloss: 10.4210 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 90,384 �[35mmfu: 1.11%�[39m [rank0]:2024-03-27 14:30:04,954 - root - INFO - �[36mstep: 5 �[32mloss: 10.1648 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 93,058 �[35mmfu: 1.14%�[39m [rank0]:2024-03-27 14:30:05,067 - root - INFO - �[36mstep: 6 �[32mloss: 9.9451 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 72,642 �[35mmfu: 0.89%�[39m [rank0]:2024-03-27 14:30:05,165 - root - INFO - �[36mstep: 7 �[32mloss: 9.7004 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 85,096 �[35mmfu: 1.04%�[39m [rank0]:2024-03-27 14:30:05,251 - root - INFO - �[36mstep: 8 �[32mloss: 9.4422 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 95,860 �[35mmfu: 1.17%�[39m [rank0]:2024-03-27 14:30:05,399 - root - INFO - �[36mstep: 9 �[32mloss: 9.2144 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 55,837 �[35mmfu: 0.68%�[39m [rank0]:[rank0]:[W327 14:30:05.148473462 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:2024-03-27 14:30:05,496 - root - INFO - �[36mstep: 10 �[32mloss: 9.1710 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 86,136 �[35mmfu: 1.05%�[39m [rank0]:NCCL version 2.20.5+cuda12.0 ``` Reviewers: Subscribers: Tasks: Tags: Co-authored-by: gnadathur <[email protected]>
(commit ab5d918)
Commits on Mar 28, 2024
-
Used per-parameter FSDP (pytorch#165)
**Numeric Parity** 1D FSDP - Eager: 1k steps of minipile on 8 H100 GPUs, local batch size 8, sequence length 2048, AC/SAC, bf16 mixed precision, fp32 reduce-scatter - FSDP1 (AC): 24.81% peak active, 33.82% peak reserved, 6100-6200 WPS - FSDP1 (SAC): 52.98% peak active, 67.23% peak reserved, 6500-6700 WPS - FSDP2 (AC): 23.92% peak active, 32.64% peak reserved, 6100-6300 WPS - FSDP2 (SAC): 52.13% peak active, 62.51% peak reserved, 6600-6800 WPS - Loss curves match between FSDP1 and FSDP2 - Memory numbers reported as percentage since that is how they are logged; can convert against 95.0396 GiB GPU memory - Compile: same setup as eager - FSDP2 (AC), buffer reuse disabled: 28.72 GiB (30.22%) peak reserved, 7200-7500 WPS, 33% MFU - FSDP2 (AC), buffer reuse enabled: 28.90 GiB (30.40%) peak reserved, 7200-7500 WPS, 33% MFU - FSDP2 (SAC), buffer reuse enabled: 53.83 GiB (56.64%) peak reserved, 8100-8400 WPS, 36% MFU - Loss curves slightly better than eager - For fun -- how much can we push MFU? - If we use FSDP2 (SAC) with 16 local batch size (doubled), we get 88.23 GiB (92.84%) peak reserved, 8600 WPS, 38% MFU. - If we use FSDP2 (no AC) with 8 local batch size, we get 90.28 GiB (94.99%) peak reserved, 9100-9300 WPS, 40% MFU. - Why is FSDP2 faster? 
(1) fp32 reduce-scatter only uses one div kernel instead of two and (2), `reshard_after_forward=False` for the last transformer block 2D FSDP - Eager (2-way SP, 4-way FSDP): 1k steps of minipile on 8 H100 GPUs, local batch size 16 (to preserve global batch size), sequence length 2048, bf16 mixed precision, fp32 reduce-scatter - FSDP2 (AC): 50.12% peak active, 60.97% peak reserved, 5800-5900 WPS - FSDP2 (SAC): 76.49% peak active, 90.14% peak reserved, 6100-6300 WPS - Loss curves match 8-way FSDP - FSDP1 + SP has incorrect numerics due to the `FSDP.clip_grad_norm_` not all-reducing over TP mesh dimension <details> <summary> Loss curves </summary> <img width="732" alt="Screenshot 2024-03-26 at 3 31 19 PM" src="https://github.com/pytorch/torchtrain/assets/31054793/59ec71cc-ad0a-4dd1-b5c6-a8cbf9ab5e85"> </details> **Meta-Device Initialization** - The PyTorch Core guideline is for `module.reset_parameters()` to only initialize parameters/buffers immediately owned by `module` (i.e. `module.parameters(recurse=False)` and `module.buffers(recurse=False)`). - This makes it challenging to specify custom initializations for core modules like `nn.Linear` and `nn.Embedding`. For example, in @lessw2020's depth-wise truncated normal initialization, the `trunc_normal_` standard deviation depends on the layer ID, which is a property of the `TransformerBlock` but affects the child `nn.Linear`s. - To disambiguate, I suggest avoiding the name `reset_parameters()` in the case that we violate the PyTorch Core guideline and instead use a different name (e.g. `init_weights`). **DCP & Save/Load** - Tested 1D and 2D by specifying `checkpoint_folder = "/tmp/checkpoint_andgu` in the `.toml`, training until saving a checkpoint, terminating the run, and restarting the training to load the checkpoint -- the loss after loading looks reasonable
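The `reset_parameters()` vs `init_weights` distinction above can be sketched with a toy module; the class name, dimensions, and the depth-dependent standard-deviation formula below are illustrative assumptions, not torchtitan's actual code.

```python
import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    # The parent block owns layer_id, so it drives the depth-dependent
    # init of its child nn.Linear -- something the child's own
    # reset_parameters() cannot express. Hence a separate init_weights().
    def __init__(self, layer_id: int, dim: int = 32):
        super().__init__()
        self.layer_id = layer_id
        self.proj = nn.Linear(dim, dim, bias=False)

    def init_weights(self) -> None:
        # Hypothetical depth-wise schedule: deeper layers get a smaller std.
        std = 0.02 / (2 * (self.layer_id + 1)) ** 0.5
        nn.init.trunc_normal_(self.proj.weight, mean=0.0, std=std)

shallow = TransformerBlockSketch(layer_id=0)
deep = TransformerBlockSketch(layer_id=7)
shallow.init_weights()
deep.init_weights()
```

The point of the naming change is only that `init_weights` is free to recurse into children, while `reset_parameters()` is conventionally limited to `module.parameters(recurse=False)`.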
Commit: 83c879f
-
plot losses in loaded TrainState to TensorBoard
ghstack-source-id: f13612ce1f739219c31aa2b9222259f9f586126b Pull Request resolved: pytorch#173
Commit: f6d9de7
Commits on Mar 29, 2024
-
Removed setting global flag for `swap_tensors` since not needed anymore
ghstack-source-id: 484237b30ba8bf8bb9e7a9cf2c97180d9fb21295 Pull Request resolved: pytorch#178
Commit: 1150944
Commits on Apr 2, 2024
-
Add integration test with compile enabled (pytorch#183)
Summary: same as title Test Plan: ``` + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0,1 + CONFIG_FILE=./train_configs/debug_model_compile.toml + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_compile.toml W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] ***************************************** W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] ***************************************** [rank0]:2024-04-01 17:54:35,779 - root - INFO - Starting job: LLaMA debug training [rank1]:2024-04-01 17:54:35,797 - root - INFO - Starting job: LLaMA debug training [rank0]:2024-04-01 17:54:36,063 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:2024-04-01 17:54:36,069 - root - INFO - Building 1-D device mesh with ['dp'], [4] [rank0]:2024-04-01 17:54:36,071 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-04-01 17:54:36,078 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank0]:2024-04-01 17:54:36,078 - root - INFO - Preparing alpaca dataset from HuggingFace [rank1]:2024-04-01 17:54:36,449 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank1]:2024-04-01 17:54:36,454 - root - INFO - Building 1-D device mesh with ['dp'], [4] [rank1]:2024-04-01 17:54:36,456 - root - INFO - Building sentencepiece tokenizer locally from 
./torchtrain/datasets/tokenizer/tokenizer.model [rank1]:2024-04-01 17:54:36,463 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank1]:2024-04-01 17:54:36,463 - root - INFO - Preparing alpaca dataset from HuggingFace [rank0]:2024-04-01 17:54:37,631 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank0]:2024-04-01 17:54:37,643 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-04-01 17:54:37,644 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory [rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied selective activation checkpointing to the model [rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied FSDP to the model [rank1]:2024-04-01 17:54:38,310 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank1]:2024-04-01 17:54:38,324 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank1]:2024-04-01 17:54:38,325 - root - INFO - GPU capacity: NVIDIA H100 (1) with 95.04GiB memory [rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied selective activation checkpointing to the model [rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied FSDP to the model [rank1]:2024-04-01 17:54:38,699 - root - INFO - Gradient scaling not enabled [rank1]:2024-04-01 17:54:38,699 - root - INFO - Metrics logging active. 
Tensorboard logs will be saved at ./outputs/tb/20240401-1754 [rank1]:2024-04-01 17:54:38,701 - root - INFO - Compiling model with torch.compile [rank0]:2024-04-01 17:54:38,692 - root - INFO - Gradient scaling not enabled [rank0]:2024-04-01 17:54:38,693 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754 [rank0]:2024-04-01 17:54:38,694 - root - INFO - Compiling model with torch.compile [rank0]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank1]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank1]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager. [rank1]: warnings.warn( [rank0]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager. 
[rank0]: warnings.warn( [rank1]:2024-04-01 17:54:40,498 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:40,493 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:41,992 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:41,985 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:42,180 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:42,187 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:43,947 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:43,963 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:43,971 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:43,920 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:43,951 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:43,974 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:44,029 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:44,033 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:45,907 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:45,933 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:47,561 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:47,667 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:47,649 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:47,706 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:49,084 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:49,108 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:49,110 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:49,086 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:49,114 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:49,131 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:50,546 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:50,638 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:51,901 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:52,025 - root - INFO - running 
build_ext [rank1]:2024-04-01 17:54:52,734 - root - INFO - �[36mstep: 1 �[32mloss: 10.9746 �[33mmemory: 9.53GiB(10.03%) �[34mwps: 1,228 �[35mmfu: 0.02%�[39m [rank1]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05 [rank1]:2024-04-01 17:54:52,813 - root - INFO - �[36mstep: 2 �[32mloss: 10.9091 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 208,739 �[35mmfu: 2.56%�[39m [rank0]:2024-04-01 17:54:52,734 - root - INFO - �[36mstep: 1 �[32mloss: 10.9746 �[33mmemory: 9.53GiB(10.03%) �[34mwps: 1,228 �[35mmfu: 0.02%�[39m [rank0]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05 [rank0]:2024-04-01 17:54:52,813 - root - INFO - �[36mstep: 2 �[32mloss: 10.9091 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 208,501 �[35mmfu: 2.55%�[39m [rank1]:2024-04-01 17:54:52,889 - root - INFO - �[36mstep: 3 �[32mloss: 10.7722 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 219,416 �[35mmfu: 2.69%�[39m [rank0]:2024-04-01 17:54:52,889 - root - INFO - �[36mstep: 3 �[32mloss: 10.7722 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 219,182 �[35mmfu: 2.68%�[39m [rank1]:2024-04-01 17:54:52,965 - root - INFO - �[36mstep: 4 �[32mloss: 10.5428 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 218,226 �[35mmfu: 2.67%�[39m [rank0]:2024-04-01 17:54:52,965 - root - INFO - �[36mstep: 4 �[32mloss: 10.5428 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 218,015 �[35mmfu: 2.67%�[39m [rank1]:2024-04-01 17:54:53,045 - root - INFO - �[36mstep: 5 �[32mloss: 10.3063 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 207,094 �[35mmfu: 2.54%�[39m [rank0]:2024-04-01 17:54:53,045 - root - INFO - �[36mstep: 5 �[32mloss: 10.3063 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 207,220 �[35mmfu: 2.54%�[39m [rank1]:2024-04-01 17:54:53,123 - root - INFO - �[36mstep: 6 �[32mloss: 10.0707 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 210,814 �[35mmfu: 2.58%�[39m [rank1]:2024-04-01 17:54:53,202 - root - INFO - �[36mstep: 7 �[32mloss: 9.8302 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 209,649 
�[35mmfu: 2.57%�[39m [rank0]:2024-04-01 17:54:53,123 - root - INFO - �[36mstep: 6 �[32mloss: 10.0707 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 210,849 �[35mmfu: 2.58%�[39m [rank0]:2024-04-01 17:54:53,202 - root - INFO - �[36mstep: 7 �[32mloss: 9.8302 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 209,542 �[35mmfu: 2.57%�[39m [rank0]:2024-04-01 17:54:53,281 - root - INFO - �[36mstep: 8 �[32mloss: 9.5918 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 211,690 �[35mmfu: 2.59%�[39m [rank1]:2024-04-01 17:54:53,281 - root - INFO - �[36mstep: 8 �[32mloss: 9.5918 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 211,786 �[35mmfu: 2.59%�[39m [rank1]:2024-04-01 17:54:53,412 - root - INFO - �[36mstep: 9 �[32mloss: 9.4299 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 125,833 �[35mmfu: 1.54%�[39m [rank1]:[rank1]:[W401 17:54:53.242673953 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:2024-04-01 17:54:53,412 - root - INFO - �[36mstep: 9 �[32mloss: 9.4299 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 125,765 �[35mmfu: 1.54%�[39m [rank0]:[rank0]:[W401 17:54:53.240925776 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank1]:2024-04-01 17:54:53,492 - root - INFO - �[36mstep: 10 �[32mloss: 9.2955 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 207,661 �[35mmfu: 2.54%�[39m [rank0]:2024-04-01 17:54:53,492 - root - INFO - �[36mstep: 10 �[32mloss: 9.2955 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 207,426 �[35mmfu: 2.54%�[39m [rank0]:NCCL version 2.20.5+cuda12.0 ``` Reviewers: Subscribers: Tasks: Tags: --------- Co-authored-by: gnadathur <[email protected]>
Commit: 25ee32f
Commits on Apr 3, 2024
-
remove folding and unfolding of sequence dim in model.py
ghstack-source-id: 5d299adcd766baad6a36e63be4acc01fb2fd36db Pull Request resolved: pytorch#190
Commit: 25f9bff
Commits on Apr 4, 2024
-
bump comm.train_timeout_seconds (pytorch#189)
this PR bumps this default config to a larger value; profiling is a pretty heavy step, and a default of 5 seconds would likely trigger the watchdog unintentionally
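If the knob lives in the job `.toml` alongside the other comm options, the bump might look like the fragment below; the section layout and the value shown are illustrative assumptions, only the option name comes from the commit title.

```toml
[comm]
# Profiling is a heavy step; a 5-second watchdog timeout can fire
# spuriously, so use a larger value during training.
train_timeout_seconds = 100
```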
Commit: c233ecd
Commits on Apr 5, 2024
-
ghstack-source-id: 47ee7b5e2228705e5215195ac9ff13e1b168f93e Pull Request resolved: pytorch#197
Commit: bb3919d
-
support sequence of tests and add checkpoint test
address comments
ghstack-source-id: 7d6c51a5ef68dea06ba7d64741a554165c79f1d3 Pull Request resolved: pytorch#198
Commit: 4d593d4
-
Make freqs_cis a persistent buffer for pp init
Currently, planning to use a 'seed checkpoint' to initialize the pipeline parallel model chunks after moving them from meta device to cuda/empty. Non-persistent buffers are incompatible with this approach, as they are missing from the checkpoint and thus require manual init. An alternative is to manually run the initializer for just the non-persistent buffers after loading a seed checkpoint, but this approach is nearly equivalent and requires fewer code changes. ghstack-source-id: b48228488d4c3924fffef4237f4106383c14a934 Pull Request resolved: pytorch#201
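A minimal sketch of the difference, with a hypothetical module standing in for the real rotary-embedding setup: only a persistent buffer lands in the state dict, which is what lets a seed checkpoint restore `freqs_cis` after meta-device init.

```python
import torch
import torch.nn as nn

class RotarySketch(nn.Module):
    def __init__(self, persistent: bool):
        super().__init__()
        # Stand-in for the precomputed rotary frequencies. With
        # persistent=True the buffer is included in state_dict(), so
        # loading a seed checkpoint restores it without manual init.
        self.register_buffer("freqs_cis", torch.randn(16, 4),
                             persistent=persistent)

persistent_sd = RotarySketch(persistent=True).state_dict()
ephemeral_sd = RotarySketch(persistent=False).state_dict()
```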
Commit: 5a0995a
-
Delete grad scaler, which is unsupported/unused
The grad scaler currently doesn't work with FSDP2, and isn't enabled anyway because bf16 training is the norm and doesn't require it. Remove it for simplicity. It will be easier to enable pipeline parallelism with a simpler loss function setup, but if desired, it's still possible to support pipeline parallelism with the scaler added back in. ghstack-source-id: 82b0e4324eac88ee62723a6d832182d4e6c76e0f Pull Request resolved: pytorch#202
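The resulting loop shape can be sketched as below (CPU autocast is used here purely for illustration): with bf16 mixed precision there is no fp16-style gradient underflow, so `backward()` and `optimizer.step()` run without any `GradScaler`.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(2, 4)

optimizer.zero_grad()
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()   # no scaler.scale(loss).backward()
optimizer.step()  # no scaler.step(optimizer) / scaler.update()
```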
Commit: db204f9
-
Factor out loss_fn to share code with pipeline par
PP requires feeding a loss_fn into the schedule's step so that loss can be computed per microbatch as part of the forward/backward scheduling. As such, it is nice to define loss once and use it both in the non-pp code that manually calls f/loss/b and also use it in the pp step(). ghstack-source-id: 9bedd5103e23627d5e268c287d49f0759442ba12 Pull Request resolved: pytorch#203
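A sketch of such a shared loss function (the exact flattening convention is an assumption): defined once, it can be called directly in the manual forward/loss/backward path and also handed as the per-microbatch loss to a pipeline schedule's step().

```python
import torch
import torch.nn.functional as F

def loss_fn(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # pred: (batch, seq, vocab) logits; labels: (batch, seq) token ids.
    # Flattening batch and sequence dims lets the same definition serve
    # full batches and pipeline microbatches alike.
    return F.cross_entropy(pred.flatten(0, 1), labels.flatten(0, 1))

pred = torch.randn(2, 4, 10)
labels = torch.randint(0, 10, (2, 4))
loss = loss_fn(pred, labels)
```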
Commit: 859963d
-
[TorchTrain] Minor fix for pytorch#197 (pytorch#204)
The changes made in the GitHub editor didn't go in when doing ghstack land.
Commit: 5d2c148
-
Add FusedRMSNorm (Triton kernel, +15% eager), Add NPLayerNorm, Enable…
… config selectable Norm Type (pytorch#181) This PR has multiple aspects: 1 - Adds a new Triton-based Fused RMSNorm I wrote. I've verified its numerical accuracy on both forward and backward with a unit test. It improves MFU by +15% with FSDP2 7B in eager mode, and slightly, by +1.2%, when compiled: <img width="545" alt="Screenshot 2024-03-29 at 5 18 14 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/8f16fae9-947b-4720-a370-b954779c33a7"> 2 - Adds norms.py to house all 4 norm types, and standardizes to [layernorm / np_layernorm / rmsnorm / fused_rmsnorm]. Norms.py has a create_norms function that then creates the appropriate norm. 3 - Adds np_layernorm, which is layernorm with no affine transformation. 4 - Updates model.py to now support plug and play of any supported norm. Thus instead of this type of if/then logic in the model class: <img width="928" alt="Screenshot 2024-03-30 at 1 52 07 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/ba7cb976-580f-4471-a79b-a584f7d20693"> We simply have this: <img width="1129" alt="Screenshot 2024-03-30 at 1 52 23 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/aba48b4d-1620-4059-840d-e620468f00f2"> This then allows for easy plug and play of any norm type with no fiddling around in the model code. 5 - updates run_llama_train.sh to randomly select a port instead of the previous fixed port number. (thanks @yifuwang for this tip!) 6 - Now users can quickly select the norm of their choice via the config file: <img width="774" alt="Screenshot 2024-03-30 at 3 01 43 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/3238b375-dc21-4ee2-a5fa-f6571da79edb"> 7 - adds a NotImplementedError if users try to run TP + fused_rmsnorm to avoid any confusion (per @tianyu-l feedback): ~~~ NotImplementedError: fused_rmsnorm not yet compatible with TP. Please use rmsnorm. ~~~
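The config-selectable norm idea above can be sketched as a small factory; the function name `build_norm` and the plain-PyTorch `RMSNorm` implementation are illustrative assumptions (the Triton `fused_rmsnorm` variant is omitted), while the option names follow the commit message.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Plain-PyTorch RMSNorm: scale by the reciprocal root-mean-square
    # of the last dimension, then apply a learned gain.
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

def build_norm(norm_type: str, dim: int, eps: float = 1e-5) -> nn.Module:
    # Hypothetical factory: the model asks for a norm by name instead
    # of hard-coding if/then logic in the model class.
    if norm_type == "layernorm":
        return nn.LayerNorm(dim, eps=eps)
    if norm_type == "np_layernorm":  # layernorm with no affine params
        return nn.LayerNorm(dim, eps=eps, elementwise_affine=False)
    if norm_type == "rmsnorm":
        return RMSNorm(dim, eps=eps)
    raise NotImplementedError(f"norm_type {norm_type} not supported")
```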
Commit: 3471165
-
ghstack-source-id: ab29c214604fd76cefdfe70149ecf07a2e03103e Pull Request resolved: pytorch#206
Commit: 5b2bb52
Commits on Apr 10, 2024
-
Removed cache_k and cache_v comments
ghstack-source-id: 8bc66c683a801189b152b0ef4301579ec1ec17e7 Pull Request resolved: pytorch#213
Commit: 7146841
-
ghstack-source-id: a53cbbecc35eac2a62d8ebc241462ac418666336 Pull Request resolved: pytorch#212
Commit: c18d760
-
avoid record streams and make color printing a config
ghstack-source-id: 1c7cb2710330ec3fb2384793b5ad77c65b107cbc Pull Request resolved: pytorch#195
Commit: e62573d
-
fix SAC to use the correct reduce_scatter op (pytorch#215)
as titled, we migrated to the native functional collective so the SAC should capture this instead of the old one
Commit: 7419d71
-
Test runner raises exception on failures (pytorch#216)
Summary: Test runner should raise exception on failures. Test Plan: ``` =====Integration test, flavor : , command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh ===== + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0 + CONFIG_FILE=./train_configs/debug_model.toml + overrides= + '[' 0 -ne 0 ']' =====Integration test, flavor : 1D compile, command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh --training.compile===== + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=--training.compile + NGPU=4 + LOG_RANK=0 + CONFIG_FILE=./train_configs/debug_model.toml + overrides= + '[' 1 -ne 0 ']' + overrides=--training.compile + torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.compile W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] ***************************************** W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] ***************************************** [rank0]:2024-04-10 13:32:45,243 - root - INFO - Starting job: LLaMA debug training [rank0]:2024-04-10 13:32:45,676 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:2024-04-10 13:32:46,028 - root - INFO - Building 1-D device mesh with ['dp'], [4] [rank0]:2024-04-10 13:32:46,030 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-04-10 13:32:46,038 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank0]:2024-04-10 13:32:46,038 - root - INFO - Preparing alpaca dataset from HuggingFace [rank0]:2024-04-10 13:32:47,813 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='fused_rmsnorm') [rank0]:2024-04-10 13:32:47,826 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-04-10 13:32:47,826 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory [rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied selective activation checkpointing to the model [rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied FSDP to the model [rank0]:2024-04-10 13:32:48,582 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%) [rank0]:2024-04-10 13:32:48,582 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1332 [rank0]:2024-04-10 13:32:48,584 - root - INFO - Compiling model with torch.compile [rank0]:2024-04-10 13:32:49,384 - root - INFO - Training starts at step 1 [rank0]:2024-04-10 13:32:49,385 - root - INFO - Profiling active. 
Traces will be saved at ./outputs/profiling/traces [rank0]:[rank0]:W0410 13:32:49.487000 139672077292544 torch/_logging/_internal.py:1016] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored [rank0]:[rank0]: Traceback (most recent call last): [rank0]:[rank0]: File "/data/users/gnadathur/a/torchtitan/train.py", line 394, in <module> [rank0]:[rank0]: main(config) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper [rank0]:[rank0]: return f(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/torchtitan/train.py", line 287, in main [rank0]:[rank0]: pred = model(input_ids) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [rank0]:[rank0]: return self._call_impl(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1541, in _call_impl [rank0]:[rank0]: return forward_call(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/eval_frame.py", line 410, in _fn [rank0]:[rank0]: return fn(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [rank0]:[rank0]: return self._call_impl(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1582, in _call_impl [rank0]:[rank0]: result = forward_call(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 966, in catch_errors [rank0]:[rank0]: return callback(frame, cache_entry, hooks, frame_state, skip=1) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 809, in _convert_frame [rank0]:[rank0]: result = inner_convert( [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 404, in 
_convert_frame_assert [rank0]:[rank0]: return _compile( [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_utils_internal.py", line 70, in wrapper_function [rank0]:[rank0]: return function(*args, **kwargs) [rank0]:[rank0]: File "/home/gnadathur/local/a/pytorch-env/lib/python3.10/contextlib.py", line 79, in inner [rank0]:[rank0]: return func(*args, **kwds) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 691, in _compile [rank0]:[rank0]: guarded_code = compile_inner(code, one_graph, hooks, transform) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper [rank0]:[rank0]: r = func(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 546, in compile_inner [rank0]:[rank0]: out_code = transform_code_object(code, transform) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1103, in transform_code_object [rank0]:[rank0]: transformations(instructions, code_options) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 168, in _fn [rank0]:[rank0]: return fn(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 508, in transform [rank0]:[rank0]: tracer.run() [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2193, in run [rank0]:[rank0]: super().run() [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run [rank0]:[rank0]: while self.step(): [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step [rank0]:[rank0]: self.dispatch_table[inst.opcode](self, inst) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper [rank0]:[rank0]: return inner_fn(self, inst) 
[rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION [rank0]:[rank0]: self.call_function(fn, args, {}) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function [rank0]:[rank0]: self.push(fn.call_function(self, args, kwargs)) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/nn_module.py", line 733, in call_function [rank0]:[rank0]: return variables.UserFunctionVariable(fn, source=source).call_function( [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function [rank0]:[rank0]: return super().call_function(tx, args, kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function [rank0]:[rank0]: return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return [rank0]:[rank0]: return InliningInstructionTranslator.inline_call(self, fn, args, kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call [rank0]:[rank0]: return cls.inline_call_(parent, func, args, kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_ [rank0]:[rank0]: tracer.run() [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run [rank0]:[rank0]: while self.step(): [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step [rank0]:[rank0]: self.dispatch_table[inst.opcode](self, inst) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper [rank0]:[rank0]: return inner_fn(self, inst) 
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
[rank0]:     self.call_function(fn, args, {})
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
[rank0]:     self.push(fn.call_function(self, args, kwargs))
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/user_defined.py", line 719, in call_function
[rank0]:     return func_var.call_function(tx, [obj_var] + args, kwargs)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
[rank0]:     return super().call_function(tx, args, kwargs)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
[rank0]:     tracer.run()
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
[rank0]:     while self.step():
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
[rank0]:     self.dispatch_table[inst.opcode](self, inst)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
[rank0]:     return inner_fn(self, inst)
[rank0]:   [... the CALL_FUNCTION/inline_call frames above repeat while Dynamo inlines nested calls ...]
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1274, in CALL_FUNCTION_EX
[rank0]:     self.call_function(fn, argsvars.items, kwargsvars)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
[rank0]:     self.push(fn.call_function(self, args, kwargs))
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/misc.py", line 592, in call_function
[rank0]:     return self.obj.call_method(tx, self.name, args, kwargs)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/tensor.py", line 461, in call_method
[rank0]:     return wrap_fx_proxy(
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/builder.py", line 1367, in wrap_fx_proxy
[rank0]:     return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/builder.py", line 1452, in wrap_fx_proxy_cls
[rank0]:     example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1780, in get_fake_value
[rank0]:     raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1712, in get_fake_value
[rank0]:     ret_val = wrap_fake_exception(
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1227, in wrap_fake_exception
[rank0]:     return fn()
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1713, in <lambda>
[rank0]:     lambda: run_node(tx.output, node, args, kwargs, nnmodule)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1848, in run_node
[rank0]:     raise RuntimeError(make_error_message(e)).with_traceback(
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1832, in run_node
[rank0]:     return getattr(args[0], node.target)(*args[1:], **kwargs)
[rank0]: torch._dynamo.exc.TorchRuntimeError: Failed running call_method wait(*(FakeTensor(..., device='cuda:0', size=(852480,), dtype=torch.bfloat16),), **{}):
[rank0]: 'FakeTensor' object has no attribute 'wait'
[rank0]:
[rank0]: from user code:
[rank0]:   File "/data/users/gnadathur/a/torchtitan/torchtrain/models/llama/model.py", line 446, in forward
[rank0]:     h = layer(h, freqs_cis)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1561, in _call_impl
[rank0]:     args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 161, in _pre_forward
[rank0]:     args, kwargs = self._fsdp_param_group.pre_forward(module, args, kwargs)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 280, in pre_forward
[rank0]:     self.wait_for_unshard()
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 243, in wait_for_unshard
[rank0]:     foreach_all_gather_copy_out(
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_collectives.py", line 82, in foreach_all_gather_copy_out
[rank0]:     all_gather_work.wait()
[rank0]:
[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
[rank0]:
[rank0]: You can suppress this exception and fall back to eager by setting:
[rank0]:     import torch._dynamo
[rank0]:     torch._dynamo.config.suppress_errors = True

E0410 13:32:53.256000 139839630783488 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1554760) of binary: /home/gnadathur/local/a/pytorch-env/bin/python
E0410 13:32:53.261000 139839630783488 torch/distributed/elastic/multiprocessing/errors/error_handler.py:136] no error file defined for parent, to copy child error file (/tmp/torchelastic_kyjkblcf/none_kiu1mb22/attempt_0/0/error.json)
[rank0]:NCCL version 2.20.5+cuda12.0
Traceback (most recent call last):
  File "/home/gnadathur/local/a/pytorch-env/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/users/gnadathur/a/pytorch/torch/distributed/run.py", line 879, in main
    run(args)
  File "/data/users/gnadathur/a/pytorch/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/data/users/gnadathur/a/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/users/gnadathur/a/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-10_13:32:49
  host      : devvm4378.nao0.facebook.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1554762)
  error_file: /tmp/torchelastic_kyjkblcf/none_kiu1mb22/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
      return f(*args, **kwargs)
    File "/data/users/gnadathur/a/torchtitan/train.py", line 287, in main
      pred = model(input_ids)
    [... same Dynamo traceback as rank 0, ending in ...]
    torch._dynamo.exc.TorchRuntimeError: Failed running call_method wait(*(FakeTensor(..., device='cuda:1', size=(852480,), dtype=torch.bfloat16),), **{}):
    'FakeTensor' object has no attribute 'wait'
[2]:
  time      : 2024-04-10_13:32:49
  host      : devvm4378.nao0.facebook.com
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1554763)
  error_file: /tmp/torchelastic_kyjkblcf/none_kiu1mb22/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
      return f(*args, **kwargs)
    File "/data/users/gnadathur/a/torchtitan/train.py", line 287, in main
      pred = model(input_ids)
    [... same Dynamo traceback as ranks 0 and 1 ...]
    File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1274, in CALL_FUNCTION_EX
      self.call_function(fn, argsvars.items, kwargsvars)
    File
"/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function self.push(fn.call_function(self, args, kwargs)) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function return super().call_function(tx, args, kwargs) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return return InliningInstructionTranslator.inline_call(self, fn, args, kwargs) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call return cls.inline_call_(parent, func, args, kwargs) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_ tracer.run() File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run while self.step(): File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step self.dispatch_table[inst.opcode](self, inst) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper return inner_fn(self, inst) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION self.call_function(fn, args, {}) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function self.push(fn.call_function(self, args, kwargs)) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/misc.py", line 592, in call_function return self.obj.call_method(tx, self.name, args, kwargs) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/tensor.py", line 461, in call_method return wrap_fx_proxy( File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/builder.py", line 1367, in wrap_fx_proxy return 
wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/builder.py", line 1452, in wrap_fx_proxy_cls example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1780, in get_fake_value raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1712, in get_fake_value ret_val = wrap_fake_exception( File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1227, in wrap_fake_exception return fn() File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1713, in <lambda> lambda: run_node(tx.output, node, args, kwargs, nnmodule) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1848, in run_node raise RuntimeError(make_error_message(e)).with_traceback( File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1832, in run_node return getattr(args[0], node.target)(*args[1:], **kwargs) torch._dynamo.exc.TorchRuntimeError: Failed running call_method wait(*(FakeTensor(..., device='cuda:2', size=(852480,), dtype=torch.bfloat16),), **{}): 'FakeTensor' object has no attribute 'wait' from user code: File "/data/users/gnadathur/a/torchtitan/torchtrain/models/llama/model.py", line 446, in forward h = layer(h, freqs_cis) File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1561, in _call_impl args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc] File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 161, in _pre_forward args, kwargs = self._fsdp_param_group.pre_forward(module, args, kwargs) File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 280, in pre_forward self.wait_for_unshard() File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 243, in 
wait_for_unshard foreach_all_gather_copy_out( File "/data/users/gnadathur/a/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_collectives.py", line 82, in foreach_all_gather_copy_out all_gather_work.wait() Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True [3]: time : 2024-04-10_13:32:49 host : devvm4378.nao0.facebook.com rank : 3 (local_rank: 3) exitcode : 1 (pid: 1554764) error_file: /tmp/torchelastic_kyjkblcf/none_kiu1mb22/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/data/users/gnadathur/a/torchtitan/train.py", line 287, in main pred = model(input_ids) File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(*args, **kwargs) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/eval_frame.py", line 410, in _fn return fn(*args, **kwargs) File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1582, in _call_impl result = forward_call(*args, **kwargs) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 966, in catch_errors return callback(frame, cache_entry, hooks, frame_state, skip=1) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 809, in _convert_frame result = inner_convert( File 
"/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 404, in _convert_frame_assert return _compile( File "/data/users/gnadathur/a/pytorch/torch/_utils_internal.py", line 70, in wrapper_function return function(*args, **kwargs) File "/home/gnadathur/local/a/pytorch-env/lib/python3.10/contextlib.py", line 79, in inner return func(*args, **kwds) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 691, in _compile guarded_code = compile_inner(code, one_graph, hooks, transform) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper r = func(*args, **kwargs) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 546, in compile_inner out_code = transform_code_object(code, transform) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1103, in transform_code_object transformations(instructions, code_options) File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.p ```
Commit: cfdd4af
-
Revert "Separate TransformerEmbedding layer (pytorch#33)"
Avoid diverging the model structure (FQNs and checkpoint interoperability) from similar models. This reverts commit f30202c. ghstack-source-id: 9811f5fa99fdde387efe6018aa00afd28e7e923b Pull Request resolved: pytorch#214
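For context on why the revert matters: checkpoint keys (FQNs) are derived from module nesting, so an extra wrapper module changes every key and breaks interoperability with similar models. A minimal sketch of the idea, using a toy nested dict in place of a real module tree (the helper name is hypothetical):

```python
def state_dict_keys(module_tree, prefix=""):
    """Flatten a nested {name: subtree-or-param} dict into dotted FQNs."""
    keys = []
    for name, child in module_tree.items():
        fqn = f"{prefix}{name}"
        if isinstance(child, dict):
            keys.extend(state_dict_keys(child, prefix=f"{fqn}."))
        else:
            keys.append(fqn)
    return keys

# With a TransformerEmbedding-style wrapper, every key gains an extra segment:
nested = {"embeddings": {"tok_embeddings": {"weight": 0}}}
# After the revert, the embedding sits at the top level like similar models:
flat = {"tok_embeddings": {"weight": 0}}

print(state_dict_keys(nested))  # ['embeddings.tok_embeddings.weight']
print(state_dict_keys(flat))    # ['tok_embeddings.weight']
```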
Commit: 144b229
-
Fix 2DParallel test (pytorch#219)
Use `rmsnorm` instead of fused version since 2D does not support fused version yet. Test: ``` + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=--training.tensor_parallel_degree + NGPU=4 + LOG_RANK=0 + CONFIG_FILE=./train_configs/debug_model.toml + overrides= + '[' 3 -ne 0 ']' + overrides='--training.tensor_parallel_degree 2 --model.norm_type=rmsnorm' + torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.tensor_parallel_degree 2 --model.norm_type=rmsnorm W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] ***************************************** W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] ***************************************** [rank0]:2024-04-10 15:50:37,794 - root - INFO - Starting job: LLaMA debug training [rank0]:2024-04-10 15:50:37,986 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:2024-04-10 15:50:38,464 - root - INFO - Building 2-D device mesh with ['dp', 'tp'], [2, 2] [rank0]:2024-04-10 15:50:38,467 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-04-10 15:50:38,474 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank0]:2024-04-10 15:50:38,474 - root - INFO - Preparing alpaca dataset from HuggingFace [rank0]:2024-04-10 15:50:40,306 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='rmsnorm') [rank0]:2024-04-10 15:50:40,318 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-04-10 15:50:40,319 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory [rank0]:2024-04-10 15:50:40,331 - root - INFO - Applied Tensor Parallelism to the model [rank0]:2024-04-10 15:50:40,337 - root - INFO - Applied selective activation checkpointing to the model [rank0]:2024-04-10 15:50:40,337 - root - INFO - Applied FSDP to the model [rank0]:2024-04-10 15:50:40,558 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%) [rank0]:2024-04-10 15:50:40,558 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1550 [rank0]:2024-04-10 15:50:40,562 - root - INFO - Training starts at step 1 [rank0]:2024-04-10 15:50:40,562 - root - INFO - Profiling active. 
Traces will be saved at ./outputs/profiling/traces [rank0]:2024-04-10 15:50:41,474 - root - INFO - �[36mstep: 1 �[32mloss: 10.8403 �[33mmemory: 5.76GiB(6.06%) �[34mwps: 8,988 �[35mmfu: 0.11%�[39m [rank0]:2024-04-10 15:50:41,475 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:2024-04-10 15:50:41,652 - root - INFO - �[36mstep: 2 �[32mloss: 10.7703 �[33mmemory: 6.74GiB(7.09%) �[34mwps: 46,364 �[35mmfu: 0.57%�[39m [rank0]:2024-04-10 15:50:41,744 - root - INFO - �[36mstep: 3 �[32mloss: 10.6447 �[33mmemory: 6.74GiB(7.09%) �[34mwps: 89,916 �[35mmfu: 1.10%�[39m [rank0]:2024-04-10 15:50:41,847 - root - INFO - �[36mstep: 4 �[32mloss: 10.4428 �[33mmemory: 6.74GiB(7.09%) �[34mwps: 80,467 �[35mmfu: 0.99%�[39m [rank0]:2024-04-10 15:50:41,946 - root - INFO - �[36mstep: 5 �[32mloss: 10.1726 �[33mmemory: 6.74GiB(7.09%) �[34mwps: 83,747 �[35mmfu: 1.03%�[39m [rank0]:2024-04-10 15:50:42,038 - root - INFO - �[36mstep: 6 �[32mloss: 9.9676 �[33mmemory: 6.74GiB(7.09%) �[34mwps: 89,380 �[35mmfu: 1.09%�[39m [rank0]:2024-04-10 15:50:42,135 - root - INFO - �[36mstep: 7 �[32mloss: 9.7356 �[33mmemory: 6.74GiB(7.09%) �[34mwps: 85,526 �[35mmfu: 1.05%�[39m [rank0]:2024-04-10 15:50:42,232 - root - INFO - �[36mstep: 8 �[32mloss: 9.4619 �[33mmemory: 6.74GiB(7.09%) �[34mwps: 85,349 �[35mmfu: 1.05%�[39m [rank0]:2024-04-10 15:50:42,396 - root - INFO - �[36mstep: 9 �[32mloss: 9.2633 �[33mmemory: 6.74GiB(7.09%) �[34mwps: 50,402 �[35mmfu: 0.62%�[39m [rank0]:[rank0]:[W410 15:50:42.021475256 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:2024-04-10 15:50:42,511 - root - INFO - �[36mstep: 10 �[32mloss: 9.2156 �[33mmemory: 6.74GiB(7.09%) �[34mwps: 71,449 �[35mmfu: 0.88%�[39m [rank0]:NCCL version 2.20.5+cuda12.0 ```
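For reference, `rmsnorm` normalizes by the root-mean-square of the activation before applying a learned scale; a minimal pure-Python sketch of that formula (illustrative only, not the fused kernel or the torchtitan implementation):

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root-mean-square,
    then apply a learned per-element weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

out = rms_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
# with unit weights, the output's RMS is ~1 (up to eps)
```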
Commit: 05c181d
-
ghstack-source-id: a9204c68f2e315c878677be86c509fc8d6290ffd Pull Request resolved: pytorch#218
Commit: b6414aa
Commits on Apr 11, 2024
-
[TorchTrain][Checkpoint] Add model_weights_only option to train_config (
pytorch#220) With `model_weights_only` set to True, we would checkpoint model weights only at the end of the training. We only consider saving model weights at the end of the training so this won't affect preemption and training resume. With `model_weight_only = True`, we can see the size of checkpoint is 1/3 of a full checkpoint (74M at step 10 when training completes vs. 212M at step 5). With this, the converted checkpoint (DCP -> torch.save) can be loaded with `torch.load(..., weights_only=True)`. ``` (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ python -m torch.distributed.checkpoint.format_utils dcp_to_torch step-10 step-10-model-weights-only.pt Converting checkpoint from step-10 to step-10-model-weights-only.pt using method: 'dcp_to_torch' (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ ls step-10 step-10-model-weights-only.pt step-5 (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ ls -h step-10 step-10-model-weights-only.pt step-5 (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ du -h 212M ./step-5 74M ./step-10 358M . (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ du -h step-10-model-weights-only.pt 74M step-10-model-weights-only.pt (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ python3 Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. 
>>> import torch >>> torch.load('step-10-model-weights-only.pt', weights_only=True) {'model': {'embeddings.freqs_cis': tensor([[ 1.0000+0.0000e+00j, 1.0000+0.0000e+00j, 1.0000+0.0000e+00j, ..., 1.0000+0.0000e+00j, 1.0000+0.0000e+00j, 1.0000+0.0000e+00j], ``` One more additional change: logging to all ranks on `test_runner.py`.
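A minimal sketch of how such a flag can gate what goes into the checkpoint (hypothetical helper name and state layout, not the actual CheckpointManager code):

```python
def build_checkpoint_state(model_state, optimizer_state, train_state,
                           model_weights_only=False):
    """With model_weights_only, drop optimizer and training state so the
    exported checkpoint holds only model weights."""
    if model_weights_only:
        return {"model": model_state}
    return {
        "model": model_state,
        "optimizer": optimizer_state,
        "train_state": train_state,
    }

full = build_checkpoint_state({"w": [1.0]}, {"momentum": [0.0]}, {"step": 10})
weights_only = build_checkpoint_state({"w": [1.0]}, {"momentum": [0.0]},
                                      {"step": 10}, model_weights_only=True)
print(sorted(weights_only))  # ['model']
```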
Commit: 07a3ec8
-
Rename to torchtitan (pytorch#221)
Trying out a full renaming pass from torchtrain -> torchtitan, including: 1. the directory structure 2. all names inside the repo itself.
Commit: c22d1a8
Commits on Apr 12, 2024
-
Commit: 55a0187
-
Add 1 sec delay to rank 0 cleanup (pytorch#224)
Add the delay as a short-term workaround for the TCPStore cleanup sync issue (pytorch/pytorch#123969). Test: Ran `TORCH_NCCL_ABORT_IN_DESTROY_PG=1 CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 LOG_RANK=0,1,2,3 ./run_llama_train.sh --checkpoint.folder ./test_runner_checkpoint_full_checkpoint` 10 times w/o failure.
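The shape of the workaround, as a minimal sketch (hypothetical helper; the real change lives in the teardown path): rank 0 sleeps briefly before destroying the process group so the other ranks can finish their TCPStore accesses first.

```python
import time

def cleanup_with_delay(rank, destroy_fn, delay_secs=1.0):
    """Rank 0 waits briefly before tearing down, so non-zero ranks
    can complete their TCPStore reads before the store goes away."""
    if rank == 0:
        time.sleep(delay_secs)
    destroy_fn()

calls = []
cleanup_with_delay(1, lambda: calls.append("destroyed"), delay_secs=0.01)
print(calls)  # ['destroyed'] — non-zero ranks tear down immediately
```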
Commit: 2373509
-
[Torchtrain][Checkpoint] Add support to allow dtype conversion (pytor…
…ch#222) Adds a checkpoint.export_dtype field: dtype conversion is allowed only when checkpointing model weights only, and only when the current dtype differs from the export dtype at the end of training. Also gets rid of the `freqs_cis` buffer when exporting. With export_dtype=bf16, the model weights checkpoint is about half the size compared to export_dtype=fp32. ``` # model_weights_only=false (pytorch-3.10) [[email protected] ~/local/torchtrain (add_export_dtype)]$ du -h test_runner_checkpoint_full_checkpoint 212M test_runner_checkpoint_full_checkpoint/step-5 212M test_runner_checkpoint_full_checkpoint/step-10 212M test_runner_checkpoint_full_checkpoint/step-15 212M test_runner_checkpoint_full_checkpoint/step-20 846M test_runner_checkpoint_full_checkpoint # model_weights_only=true and export_dtype = fp32 (pytorch-3.10) [[email protected] ~/local/torchtrain (add_export_dtype)]$ du -h test_runner_checkpoint_model_weights_only 212M test_runner_checkpoint_model_weights_only/step-5 70M test_runner_checkpoint_model_weights_only/step-10 281M test_runner_checkpoint_model_weights_only # model_weights_only=true and export_dtype = bf16 (pytorch-3.10) [[email protected] ~/local/torchtrain (add_export_dtype)]$ du -h test_runner_checkpoint_model_weights_only_bf16 212M test_runner_checkpoint_model_weights_only_bf16/step-5 35M test_runner_checkpoint_model_weights_only_bf16/step-10 247M test_runner_checkpoint_model_weights_only_bf16 ```
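A minimal sketch of the export logic described above (hypothetical helper; tensors are modeled as `(dtype, values)` pairs rather than real torch tensors):

```python
def export_model_state(state_dict, export_dtype, model_weights_only=False):
    """When exporting model weights only, drop the recomputable
    'freqs_cis' buffer and convert to export_dtype only if the
    current dtype differs."""
    if not model_weights_only:
        return state_dict
    out = {}
    for name, (dtype, values) in state_dict.items():
        if name.endswith("freqs_cis"):
            continue  # rotary-embedding buffer is recomputable; skip it
        out[name] = (export_dtype, values) if dtype != export_dtype else (dtype, values)
    return out

sd = {"tok_embeddings.weight": ("fp32", [0.5]), "freqs_cis": ("fp32", [1.0])}
print(export_model_state(sd, "bf16", model_weights_only=True))
# {'tok_embeddings.weight': ('bf16', [0.5])}
```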
Commit: fd5ad5a
-
Commit: 009b14f
Commits on Apr 15, 2024
-
ghstack-source-id: 33295ce9c9038163e903867cd81799e8848cc749 Pull Request resolved: pytorch#228
Commit: c7d5865
Commits on Apr 16, 2024
-
Update README to reflect positioning (pytorch#229)
as titled, update README to reflect our positioning for the repo
Commit: f86bfb2
-
First release readme (pytorch#227)
Reworked the README to highlight the first release and feature set. Q: use our logo? (I think it adds some spark.) Visual preview: <img width="898" alt="Screenshot 2024-04-14 at 7 02 39 PM" src="https://github.com/pytorch/torchtitan/assets/46302957/60b4b6a8-c4f3-41a8-8d8d-27b924f8de15">
Commit: a10262a
-
Commit: a0a7ff7
-
use permalink for logo image (pytorch#232)
Update the logo to a permalink to ensure it is viewable by all.
Commit: d8b7c7f
-
[TorchTitan][Checkpoint] Move checkpoint folder under dump_folder and…
… a few config updates (pytorch#230) Let CheckpointManager take the entire job_config as an arg so we can keep train.py a little cleaner. Discussed with @tianyu-l and made a few additional changes, including: 1. Rename "run_profiler" to "enable_profiling". 2. Add an "enable_checkpoint" flag so it is consistent with "enable_profiling" and "enable_tensorboard"; we feel this is a little more explicit. 3. Change the default checkpoint folder to "./outputs/checkpoint" when checkpointing is enabled. 4. Rename "folder" in [checkpoint] to "checkpoint_folder". 5. Change save_traces_folder from "./outputs/profiling/traces" to "./outputs/profile_trace".
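A minimal sketch of the resulting layout (hypothetical accessor; the field names mirror the description above, not the actual JobConfig class):

```python
import os

def checkpoint_folder(job_config):
    """Checkpointing is gated by enable_checkpoint, and the checkpoint
    folder lives under the job-level dump_folder."""
    if not job_config["checkpoint"]["enable_checkpoint"]:
        return None
    return os.path.join(
        job_config["job"]["dump_folder"],
        job_config["checkpoint"]["checkpoint_folder"],
    )

cfg = {
    "job": {"dump_folder": "./outputs"},
    "checkpoint": {"enable_checkpoint": True, "checkpoint_folder": "checkpoint"},
}
print(checkpoint_folder(cfg))
```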
Commit: 6596219
-
use combo of html and local file src for logo (pytorch#234)
It seems the permalink for the logo is not fully working as expected, thus switching to a combo of HTML plus a local file reference for src.
Commit: 1601d35
-
add performance -- infra metrics and loss curves (pytorch#237) (pytor…
…ch#238) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ pytorch#237 WPS / MFU numbers, and loss curves jobs can be found from this tracking [spreadsheet](https://docs.google.com/spreadsheets/d/11kcula5ybuABSZkm2OlFng5NQ9_rnVB-KRyeQq6P7fo/edit#gid=0). Co-authored-by: tianyu-l <[email protected]>
Commit: 63d752b
-
Commit: 10b572d
-
Commit: 7781fd7
-
Commit: 441b33f
-
Commit: 53dc5eb
-
Add torchtune checkpoint link, modify product position statement loca…
…tion (pytorch#241) This PR: 1 - adds a feature note and link to the checkpoint doc on saving torchtitan weights and loading them into torchtune for fine-tuning. 2 - moves the product position info from the top of the page to the bottom.
Commit: 16701c3
-
Commit: b889f3d
-
minor doc updates - remove asynch checkpt ref, grammar on prod positi…
…on, update checkpointing from 5 to 500 (pytorch#243) Three minor readme / doc updates: 1 - remove the ":" and "please note" from the product position statement. 2 - remove "(asynch checkpointing)" from the current feature listing of distributed checkpointing (it's noted as a pending feature). 3 - update the default checkpoint interval from 5 to 500.
Commit: b60c6bd
-
Fix multi-line string usage (pytorch#244)
Summary: use `"""` for multi-line strings instead of tuple syntax which breaks arg parse. Test Plan: ``` ============================= test session starts ============================== platform linux -- Python 3.10.14, pytest-8.1.1, pluggy-1.4.0 -- /home/gnadathur/local/a/pytorch-env/bin/python cachedir: .pytest_cache hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/data/users/gnadathur/a/torchtitan/.hypothesis/examples')) benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000) rootdir: /data/users/gnadathur/a/torchtitan configfile: pyproject.toml plugins: hypothesis-6.100.1, benchmark-4.0.0, typeguard-4.2.1, cov-5.0.0, hydra-core-1.3.2 collecting ... collected 6 items test/test_job_config.py::TestJobConfig::test_command_line_args PASSED [ 16%] test/test_job_config.py::TestJobConfig::test_job_config_file PASSED [ 33%] test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist PASSED [ 50%] test/test_job_config.py::TestJobConfig::test_empty_config_file PASSED [ 66%] test/test_job_config.py::TestJobConfig::test_job_config_file_cmd_overrides PASSED [ 83%] test/test_job_config.py::TestJobConfig::test_print_help PASSED [100%] ---------- coverage: platform linux, python 3.10.14-final-0 ---------- Coverage XML written to file coverage.xml ============================= slowest 20 durations ============================= 0.00s call test/test_job_config.py::TestJobConfig::test_print_help 0.00s call test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist 0.00s call test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s call test/test_job_config.py::TestJobConfig::test_job_config_file_cmd_overrides 0.00s call test/test_job_config.py::TestJobConfig::test_empty_config_file 0.00s call test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s setup 
test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s teardown test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s teardown test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist 0.00s teardown test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s setup test/test_job_config.py::TestJobConfig::test_job_config_file_cmd_overrides 0.00s setup test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s teardown test/test_job_config.py::TestJobConfig::test_print_help 0.00s setup test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist 0.00s setup test/test_job_config.py::TestJobConfig::test_empty_config_file 0.00s setup test/test_job_config.py::TestJobConfig::test_print_help 0.00s teardown test/test_job_config.py::TestJobConfig::test_job_config_file_cmd_overrides 0.00s teardown test/test_job_config.py::TestJobConfig::test_empty_config_file ============================== 6 passed in 0.19s =============================== ```
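The failure mode is easy to reproduce: a parenthesized, comma-separated sequence of string literals is a tuple, not a string, so help text built that way breaks consumers (like argparse) expecting a str, while `"""` builds a single str. A minimal illustration with hypothetical help text:

```python
# Commas between the pieces turn the parentheses into a tuple of strings:
broken_help = (
    "size of the tensor-parallel group,",
    "must divide world size",
)

# A triple-quoted string is one str object:
fixed_help = """size of the tensor-parallel group,
must divide world size"""

print(type(broken_help).__name__)  # tuple
print(type(fixed_help).__name__)   # str
```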
Commit: 09d0047
-
ghstack-source-id: 287d31e9a14861244f1292f61604a296fb7d4e11 Pull Request resolved: pytorch#245
Commit: c9454d3
-
Commit: 9537825
Commits on Apr 17, 2024
-
fix default max_seq_len for freq_cis init (pytorch#248)
As titled: the llama2 default max_seq_len is 2048, not the current value (source: https://github.com/meta-llama/llama/blob/main/llama/model.py#L31).
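For reference, max_seq_len bounds the precomputed rotary-embedding table. A pure-Python sketch of the computation, following the linked llama reference but using `cmath` instead of torch (illustrative only, not the torchtitan implementation):

```python
import cmath

def precompute_freqs_cis(dim, end, theta=10000.0):
    """One complex rotation e^(i*t*freq) per (position t, frequency pair);
    'end' is the max sequence length the table covers."""
    freqs = [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]
    return [[cmath.exp(1j * t * f) for f in freqs] for t in range(end)]

table = precompute_freqs_cis(dim=8, end=2048)  # 2048 = llama2 default max_seq_len
print(len(table), len(table[0]))  # 2048 4
```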
Commit: 7af51cf
-
set max_seq_len before training to make it align with input data (pyt…
…orch#249) As titled, we need to set this to pick up the accurate seq_length from the dataloader config. This ensures max_seq_len is always correct, so that the RoPE init is always correct. <img width="946" alt="Screenshot 2024-04-17 at 1 00 29 PM" src="https://github.com/pytorch/torchtitan/assets/9443650/39942187-cf37-4cef-b380-644a1a9b9d35">
Commit: 0c655b8
-
ghstack-source-id: e7f7f4d6f1685072ded6da899bac3ed1ba22dffa Pull Request resolved: pytorch#247
Commit: 9949284
Commits on Apr 18, 2024
-
ghstack-source-id: 7c390da9d746a75a8c93811c21fb92fb418ae08b Pull Request resolved: pytorch#252
Commit: bfe9998
-
Add c4_mini, a local 45K dataset (subset of c4) (pytorch#253)
This PR adds a 45K (and thus just under the GitHub 100MB limit) local dataset. This enables: a - a ready-to-run dataset for users to run the debug model with; b - a local dataset for CI; c - a dataset that does not rely on a HuggingFace connection (recall when HF went down and everything came to a halt). <img width="1275" alt="Screenshot 2024-04-17 at 8 09 13 PM" src="https://github.com/pytorch/torchtitan/assets/46302957/89df4ea8-37f4-4705-a6ed-4ca9415409f3">
Commit: f80223b
-
remove logo, update pre-release date to 4/18 (pytorch#254)
as per title - remove logo until we have marketing approval and update readme pre-release date from 4/16 to 4/18.
Commit: 6926922
-
testing embedding a video in the readme. Note that embedded videos are not supported, so the best we can do here is mimic one with a thumbnail and play button that jumps to YouTube to play the video.
Commit: d6f72e2
-
add performance file to show convergence with 64 a100s (pytorch#255)
add performance.md to show the convergence curves (file is from @tianyu-l ).
Commit: 395a526
Commits on Apr 20, 2024
-
Support Llama3 8b/70b (pytorch#256)
This PR adds support for Llama3 8b/70b; mainly it: - adds the tiktoken tokenizer, with instructions to download it - adds options for the llama model to support Llama3 - adds Llama3 8b/70b configs
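As a rough sketch, a model config for the new flavor might look like the fragment below. Field names are illustrative only, patterned on the existing train_configs layout, not copied from the actual files:

```toml
# illustrative config fragment, not the shipped llama3_8b.toml
[model]
name = "llama3"
flavor = "8B"
tokenizer_path = "<path to the downloaded tiktoken tokenizer.model>"
```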
Commit: df2dcc7
Commits on Apr 22, 2024
-
ghstack-source-id: 4dd1cdb033e840e00cacd98339780424231b595b Pull Request resolved: pytorch#257
Commit: 2db26cf
Commits on Apr 23, 2024
-
reenable integration tests with a test tokenizer (pytorch#259)
as titled; the test tokenizer is borrowed from torchtune https://github.com/pytorch/torchtune/blob/main/tests/assets/tiktoken_small.model. This small test model is generated offline from https://gist.github.com/ebsmothers/54b133dd87db6679b14318545aaa2de4, so it has no correlation with any specific model/data.
Commit: 4b60829
Commits on Apr 24, 2024
-
Commit: b2ee158
-
De-dup repeated `freqs_cis` computation code
ghstack-source-id: b4fe7f63f15bab367cf00b5d408eb43c640541c2 Pull Request resolved: pytorch#262
Commit: 3b51460
-
update readme.md and performance.md
ghstack-source-id: a9bd1d33bf7bc9f5055a645c9639bcbe628afbfb Pull Request resolved: pytorch#258
Commit: 1ea476e
-
followup changes to allow unsupported datasets
ghstack-source-id: 34b380d251e0a80ac5328fdaeb33a1e488f9c735 Pull Request resolved: pytorch#261
Commit: f8863bd
-
fix ac 'checkpointing' spelling, minor spacing tweaks (pytorch#265)
This PR is mainly to fix the spelling where activation checkpointing is missing an n... (**checkpoiting**). Not sure how I missed it earlier but it's glaring when you see the charts in visual form (vs text). <img width="578" alt="Screenshot 2024-04-24 at 2 45 25 PM" src="https://github.com/pytorch/torchtitan/assets/46302957/a81727b2-07b1-4d69-a0c1-743d74d2aa5a"> fixed: <img width="592" alt="Screenshot 2024-04-24 at 3 10 30 PM" src="https://github.com/pytorch/torchtitan/assets/46302957/769e51db-4aa6-4dbd-99d8-7e691658e280"> Also add a couple line breaks to help with layout, and one or two minor grammar updates.
Commit: 157a12c
Commits on Apr 25, 2024
-
Update legal terms (pytorch#269)
Update to final legal license terms requested by Meta legal for release.
Commit: 0891fa3
-
ghstack-source-id: 2b74fe48dbeae0367a41214c6d0e8b1fcd608db8 Pull Request resolved: pytorch#270
Commit: aea510d
-
Commit: e6d0d08
-
* Image was very blurry * Markdown formatting was off * Simplified some sentences
Commit: 15057dd
Commits on Apr 26, 2024
-
fix lr scheduling by checkpointing scheduler
ghstack-source-id: 606aee2c4815173958b30ca34a3dbf8e90aed8de Pull Request resolved: pytorch#275
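The gist of the fix: the scheduler's progress has to round-trip through the checkpoint like any other stateful object, otherwise a resumed run restarts the warmup. A toy stdlib sketch of the idea (hypothetical class, not the real torch scheduler):

```python
class ToyWarmupScheduler:
    """Linear warmup; step_count is the state that must be checkpointed."""
    def __init__(self, base_lr, warmup_steps):
        self.base_lr, self.warmup_steps, self.step_count = base_lr, warmup_steps, 0
    def step(self):
        self.step_count += 1
    def get_lr(self):
        return self.base_lr * min(1.0, self.step_count / self.warmup_steps)
    def state_dict(self):
        return {"step_count": self.step_count}
    def load_state_dict(self, sd):
        self.step_count = sd["step_count"]

sched = ToyWarmupScheduler(base_lr=3e-4, warmup_steps=100)
for _ in range(10):
    sched.step()
ckpt = {"lr_scheduler": sched.state_dict()}        # saved alongside model/optimizer
resumed = ToyWarmupScheduler(base_lr=3e-4, warmup_steps=100)
resumed.load_state_dict(ckpt["lr_scheduler"])       # lr resumes where it left off
```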
Commit: fd01061
-
insert barrier to profiler to resolve collectives timeout
ghstack-source-id: cc29739b147fe1f52bfc5b791330fd7cf1659652 Pull Request resolved: pytorch#271
Commit: 4333aca
-
some misc changes (pytorch#278)
1. update readme 2. small refactor to loss_parallel part
Commit: a3b529a
-
inherit stateful protocol where appropriate
ghstack-source-id: d410f30ec715bfb4206459becb95abeed5a4ae02 Pull Request resolved: pytorch#281
Commit: b898545
Commits on Apr 29, 2024
-
Fixed docs on HSDP sharding/replication dims
ghstack-source-id: 77f650e8281dae12f2a7ccdb415be88f9abd88cc Pull Request resolved: pytorch#283
Commit: 935b572
-
Add more Float8 description (pytorch#284)
# Summary Add more of the possible options in the configs, and add a note at the top of the file on how to get the dependency.
Commit: f61e0ba
-
Remove unneeded torchvision/audio deps
ghstack-source-id: dbd201ad2976537487123fa583c86ddab06a7387 Pull Request resolved: pytorch#250
Commit: 8697234
Commits on Apr 30, 2024
-
Commit: a6d2625
-
unify data loading from HF and from disk
ghstack-source-id: 932e7cce828a15c788b34f07c264e119068777fe Pull Request resolved: pytorch#287
Commit: 258f608
Commits on May 1, 2024
-
Add periodic integration test with signal (pytorch#289)
Runs the integration test hourly and updates signal badge. Tested on existing integration test. I will update the badge with periodic test signal once workflow has landed in this PR. <img width="516" alt="Screenshot 2024-04-30 at 6 12 00 PM" src="https://github.com/pytorch/torchtitan/assets/1779702/8adaab3d-df18-483d-a39f-5af316b7edbc">
Commit: 10ef7a6
Commits on May 2, 2024
-
exclude embedding in MFU computation
ghstack-source-id: 9daa99020c76fdfe429b6a9ee6d44fd1dd319fc3 Pull Request resolved: pytorch#280
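The usual cost model behind this change: roughly 6 FLOPs per parameter per token for forward plus backward, where the embedding lookup contributes no matmul FLOPs and so should be left out. A simplified sketch (it ignores the extra attention-FLOPs term the full formula carries, and the numbers in the call are made up):

```python
def mfu_percent(tokens_per_sec, num_params, num_embedding_params, peak_flops):
    # ~6 FLOPs per non-embedding parameter per token (fwd + bwd);
    # the embedding is an index lookup, so it is excluded from the count
    flops_per_token = 6 * (num_params - num_embedding_params)
    return 100.0 * tokens_per_sec * flops_per_token / peak_flops

util = mfu_percent(tokens_per_sec=1000, num_params=1_000_000,
                   num_embedding_params=100_000, peak_flops=6e9)
```

Including the embedding table would inflate flops_per_token and overstate utilization, which is the motivation for this commit.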
Commit: 0c6ca90
-
Add support for seed checkpoint creation for meta-init flow
Adds new command ./create_seed_checkpoint.sh which largely reuses code inside train.py to create the model and then save its initial state as a step-0 checkpoint for use with meta-initialization loading flow. ghstack-source-id: 3e1aa9eab847c1f1341f22772ca8ae3688883454 Pull Request resolved: pytorch#172
Commit: e34d2ac
-
remove unnecessary install of torchtitan
ghstack-source-id: fa9aaf337b5489d88945f15b65a8ba8cc544ded6 Pull Request resolved: pytorch#295
Commit: 1480766
-
Remove unnecessary .to() inside model forward
This appears to be a holdover from a previous way the initialization worked. freqs_cis should already be on gpu device after initialization. ghstack-source-id: 7159320d4ecfb436bd2193277a88c04d136e9ad0 Pull Request resolved: pytorch#298
Commit: add0261
Commits on May 3, 2024
-
Fix the incorrect step log for profiler after resuming from a checkpo…
…int (pytorch#293) Summary: The profiler currently maintains a counter locally and that counter is not synchronized with the checkpointed train step. This PR fixes the issue.
Commit: 3e2fa85
-
turn off dynamic shape for torch.compile (pytorch#297)
as titled. This makes 1-D and 2-D parallelism work with the latest main build. Thanks @bdhirsh for all the fixes! As a follow-up, we should figure out why dynamic shapes get turned on.
Commit: 5e84866
-
Renamed `bsz` to `bs` for consistency; removed dead code
ghstack-source-id: bbedad3819ab9ef90b233209c34dd1dbc846b06a Pull Request resolved: pytorch#299
Commit: 8996249
Commits on May 7, 2024
-
Summary: This PR implements two different async checkpoint approaches. The first uses DCP.async_save; the other uses pinned memory plus a separate process to avoid GIL issues. ghstack-source-id: 87fb6c28d7bc3e514c0bee7646be5188f1f66bbd Pull Request resolved: pytorch#313
Commit: 5d63fff
Commits on May 8, 2024
-
simplify embedding + first transformer block TP (pytorch#314)
as titled, we can directly specify the rowwise parallel embedding output layouts be shard on sequence dim, so that we don't need the first layer prepare input. Switching to output_layouts = Shard(1) would also trigger reduce_scatter instead of allreduce for embedding layer, which could give some small perf wins
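Why Shard(1) helps: with a ring algorithm, a reduce-scatter moves roughly half the bytes of an all-reduce, and each rank ends up holding only its seq/TP slice of the output. A back-of-envelope cost model, a sketch only, assuming the standard ring communication costs and bf16 activations:

```python
def ring_comm_bytes(seq_len, dim, tp_degree, bytes_per_elem=2):
    # ring all-reduce moves ~2(n-1)/n of the tensor, reduce-scatter ~(n-1)/n
    n = tp_degree
    volume = seq_len * dim * bytes_per_elem
    return {"all_reduce": 2 * (n - 1) / n * volume,
            "reduce_scatter": (n - 1) / n * volume}

costs = ring_comm_bytes(seq_len=8192, dim=4096, tp_degree=8)
```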
Commit: 26ff44f
Commits on May 10, 2024
-
Only include checkpoints that have .metadata written (pytorch#315)
.metadata may be missing in some checkpoints if some ranks did not checkpoint properly. This PR filters out checkpoints that do not have .metadata in them.
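The filtering logic amounts to treating `.metadata` as the commit marker for a checkpoint folder. A self-contained sketch of that check (the helper name is hypothetical, not the torchtitan function):

```python
import os
import tempfile

def completed_checkpoints(ckpt_dir):
    # a step folder only counts if the save finished writing .metadata
    return sorted(
        name for name in os.listdir(ckpt_dir)
        if os.path.isdir(os.path.join(ckpt_dir, name))
        and os.path.exists(os.path.join(ckpt_dir, name, ".metadata"))
    )

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "step-100"))
    open(os.path.join(root, "step-100", ".metadata"), "w").close()
    os.makedirs(os.path.join(root, "step-200"))   # interrupted save, no .metadata
    good = completed_checkpoints(root)            # only step-100 qualifies
```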
Commit: ad46097
Commits on May 13, 2024
-
Refactor freqs_cis slice to be safer for PP
Unchanged: we precompute freqs_cis for max_seqlen, >> seqlen for a given batch. Changed: instead of slicing self.freqs_cis down to seqlen at top level transformer based on the input token shape, we slice it down to seqlen inside a transformer layer after we have re-expanded to the full seqlen in cases where TP has sharded across seqlen. In the PP case, stage 1's input may be seqlen/TP instead of seqlen, but we do not generally know this. That makes it hard for stage1 to slice freqs_cis correctly. It's easy to do the slicing deeper inside, since at that point we do know the full seqlen unambiguously. Note: the full self.freqs_cis is stored in memory either way, and the thing passed into every layer is just a view. This change should not be material for memory usage or otherwise. ghstack-source-id: 20ef05e0734e53260366878dfe0fac5e1ab48f1d Pull Request resolved: pytorch#321
Commit: 99729e9
-
Make Transformer tolerate missing layers for PP
A few small changes here lets manual PP frontend 'reconfigure' a whole transformer model to a stage's portion simply by setting undesired layers to None (in cases of top level layers) or deleting them from the ModuleDict (for 'layers.*'). These changes don't impact the FQNs of the remaining layers, which is critical for checkpoint load/save compatibility. ghstack-source-id: 48a7aafc89d86c3168f905560a4cd6bf4b5b9a12 Pull Request resolved: pytorch#322
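The trick is that layers live in a dict keyed by their original index, so deleting some entries never renames the rest, and top-level parts can simply be set to None. A toy illustration with strings standing in for modules (not the real Transformer class):

```python
class ToyTransformer:
    # None-able top-level parts plus a dict keyed by original layer index,
    # so deleting entries never renames the survivors (FQNs stay stable)
    def __init__(self, n_layers):
        self.tok_embeddings = "embed"
        self.layers = {str(i): f"block{i}" for i in range(n_layers)}
        self.output = "head"

    def forward_names(self):
        parts = [] if self.tok_embeddings is None else [self.tok_embeddings]
        parts += list(self.layers.values())
        if self.output is not None:
            parts.append(self.output)
        return parts

stage1 = ToyTransformer(4)
stage1.output = None           # a later pipeline stage owns the head
for i in ("2", "3"):
    del stage1.layers[i]       # keys "0" and "1" keep their original FQNs
```

Because the surviving keys are unchanged, a checkpoint saved from the full model still maps cleanly onto each stage's submodule.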
Commit: 14d422f
Commits on May 15, 2024
-
Use torch generic workflow for CI
ghstack-source-id: b1fa8d8c1778ecb532ed71792ead9f4dbb067cf4 Pull Request resolved: pytorch#325
Commit: ac94484
-
[checkpointing] import async checkpoint with pinned memory only when …
…needed ghstack-source-id: e460a8d6458f191f7f589fc908974f896b514690 Pull Request resolved: pytorch#333
Commit: 41d69d2
Commits on May 16, 2024
-
Add a workflow to build torchtitan-ubuntu-20.04-clang12 Docker image …
…for CI (pytorch#338) Adopted from PyTorch, this workflow prepares the Docker image `torchtitan-ubuntu-20.04-clang12` for the CI. * Based on https://hub.docker.com/layers/nvidia/cuda/12.1.0-cudnn8-runtime-ubuntu20.04/images/sha256-35d5a8eb50ad37fe707a7611a4e20414c5bd2f168adca0cf1700fe2d58411759 to include NVIDIA dependencies. * Install `dev-requirements.txt` and `requirements.txt`. These files had to move from the top level to the `.ci/docker` directory (with softlinks left in place) because the docker build process only looks at `.ci/docker`; this is why PyTorch keeps its CI requirements files there. * Install clang or gcc. * Install conda (with python 3.11). `torchtitan-ubuntu-20.04-clang12` can then be used as the input for `docker-image`.
Commit: 6ed5237
Commits on May 17, 2024
-
ghstack-source-id: 55302fd52dd6ee452c795e89170d0b1299218c87 Pull Request resolved: pytorch#342
Commit: 2dca85e
-
Make test_runner.py warn on non-empty output dir
also wrap logic into functions and clean up global vars ghstack-source-id: 815c582011611a71005cc22bbd14310900465377 Pull Request resolved: pytorch#343
Commit: 3baba7b
Commits on May 21, 2024
-
Expose mixed_precision dtype arguments
add training.mixed_precision_param and .mixed_precision_reduce options refactor a util to map strings to torch dtypes ghstack-source-id: 387e1ca13ad23e859d21d7760f858ee6e269a796 Pull Request resolved: pytorch#348
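The string-to-dtype util boils down to a small lookup with a clear error for unsupported names. A sketch using dtype names as string stand-ins, since the real util returns actual torch dtype objects:

```python
# illustrative subset; the real mapping returns torch.float32 etc., not strings
DTYPE_MAP = {
    "float32": "torch.float32",
    "bfloat16": "torch.bfloat16",
    "float16": "torch.float16",
}

def string_to_dtype(name):
    # fail loudly on an unsupported config value instead of silently defaulting
    try:
        return DTYPE_MAP[name]
    except KeyError:
        raise ValueError(f"unsupported mixed_precision dtype {name!r}") from None
```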
Commit: 5c69c02
-
Use stateful dataloader to checkpoint data iteration order and token …
…buffer (pytorch#279) Summary: Use the stateful_dataloader from torchdata (https://github.com/pytorch/data/tree/main/torchdata/stateful_dataloader) for storing the token buffer and iteration data order. It requires a dependency on the nightly build of torchdata >= 20240426. Also make sure the dataloader state has a different key per rank. Test Plan: Tested locally by first running 30 steps (checkpointing every 5 steps) and capturing all the loss values. Then deleting the last 3 checkpoints and then re-run the training and the loss values from step 16-30 match with what we had earlier in the first run. Note that this requires changes in the train.py to enable a deterministic run. Reviewers: @tianyu-l Subscribers: @andrewkho Tasks: Tags:
Commit: 8cc0b38
-
Add Pipeline Parallel (and 2D PP+FSDP) support
runs PP+DP and PP+TP without issue; runs PP+TP+DP with decreasing loss, but fails DCP save. Supports only simple schedules currently, gpipe and 1f1b. Adds a cmdline/toml arg for specifying split points, in a way unified between the tracer and manual frontends; e.g. the user can specify "layers.2,layers.4" as split points. Currently uses the manual frontend by default, but allows specifying the tracer frontend. The tracer frontend requires working around additional compatibility limitations, indicated by raising assertions, and is not ready for wider use yet. ghstack-source-id: d7e0a1342bc97d6f1bba9e647234d90688ad708f Pull Request resolved: pytorch#318
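The unified split-point spec is just a comma-separated list of layer FQNs. A sketch of parsing it into per-stage layer ranges; `stage_ranges` is a hypothetical helper for illustration, not the torchtitan API:

```python
def parse_split_points(arg):
    # "layers.2,layers.4" -> ["layers.2", "layers.4"]
    return [p.strip() for p in arg.split(",") if p.strip()]

def stage_ranges(n_layers, split_points):
    # each split point starts a new pipeline stage at that layer index
    cuts = [int(p.split(".")[1]) for p in split_points]
    bounds = [0] + cuts + [n_layers]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

stages = stage_ranges(8, parse_split_points("layers.2,layers.4"))
```

Two split points yield three stages, so the same string works whether the tracer or the manual frontend does the actual partitioning.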
Commit: aafe0e8
Commits on May 22, 2024
-
fix periodic integration test and add helper message on torchdata import failure
ghstack-source-id: 4db9ec111c83f7873253f19f0c95a997800e0f6b Pull Request resolved: pytorch#353
Commit: 60f58b9
-
torch.compile each TransformerBlock instead of the whole model (pytor…
…ch#268) This way we can temporarily enable 2-D parallel compile, and it might make sense to do transformer block compile in the future with PP (we'll see). We should figure out: 1. the dynamic shape issue when turning on 2-D parallel 2. the full model compile issue for 2-D parallel compile 3. cache reuse, which currently does not work; enable it later
Commit: 9954e19
-
Make test_runner use separate logger with default INFO
previous change to use logging from torchtitan caused stdout not to show up. ghstack-source-id: 30a77c59ba68043ffa844be0443d5351d9584fab Pull Request resolved: pytorch#352
Commit: f47f442
-
Commit: 93a8053
-
Fix bug in PP output layer shape
mostly harmless bug: since the output shape of the last layer is not used for send/recv purposes, the runtime value overrides whatever you configured. However, since in/out shape validation was added to the pipelining lib in torch, this now raises an error and has to be fixed. ghstack-source-id: 950e41529b7b506085ab280d8a492e345eaefd24 Pull Request resolved: pytorch#354
Commit: 0afb276
Commits on May 23, 2024
-
Update pipelining import after change on pytorch
APIs conform to the pytorch rules. This PR should be able to land safely after tonight's nightly pytorch build which includes the above PR. ghstack-source-id: c575bc7835472128c09798544caa38bf1908e5ca Pull Request resolved: pytorch#356
Commit: c73a59d
Commits on May 24, 2024
-
update .gitignore to screen out slew of new temp files (pytorch#359)
After updating today, I found a whole slew of new temp files clogging up my source tab. This PR screens them out so they don't accidentally get added in a PR, and keeps your source-tab change count correct. Example of the issue without this PR: <img width="780" alt="Screenshot 2024-05-23 at 9 21 55 PM" src="https://github.com/pytorch/torchtitan/assets/46302957/41b7061a-41a0-4a95-938b-3fd9292a2f38"> vs with this PR: <img width="661" alt="Screenshot 2024-05-23 at 10 07 16 PM" src="https://github.com/pytorch/torchtitan/assets/46302957/cccf8c5f-368d-40a8-b10f-f11ca37df2bc">
Commit: c161119
-
Add test for PP tracer frontend
- switch to using public PipelineStage API - clean up some asserts in tracer codepath ghstack-source-id: 2d069b7d45c4f3c788dec8fc85d8a7e83e463fcd Pull Request resolved: pytorch#357
Commit: e593e7d
Commits on May 29, 2024
-
only produce tensorboard logs on rank 0 by default
ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243 Pull Request resolved: pytorch#339
Commit: 0779207
-
replace old torch dependency in requirements.txt
ghstack-source-id: 8cbd62b97816ae8185b8a7e1aa9a7505f2780525 Pull Request resolved: pytorch#372
Commit: f6ea139
Commits on May 30, 2024
-
Add --test option to specify test to run (pytorch#368)
Usage: `--test <test_id>` Acceptable values: `test_id` in `build_test_list` (default: all) Example: ``` rm -rf outputs && python test_runner.py outputs --test pp_gpipe ```
Commit: 0fff2d2
-
use integration test as the badge shown on the homepage
ghstack-source-id: 775591945ff5427cb7e5e9fc7592952b4c746341 Pull Request resolved: pytorch#373
Commit: 1877738
Commits on May 31, 2024
-
keep only latest k checkpoints (pytorch#366)
Adds a config that purges old checkpoints. Useful for pretraining with frequent checkpointing and large step counts.
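The purge rule sorts step folders numerically and drops everything but the newest k. A minimal sketch, assuming the step-N folder naming used elsewhere in the log (the helper name is made up):

```python
import re

def checkpoints_to_purge(folders, keep_latest_k):
    # numeric sort, not lexicographic: step-1000 must come after step-500
    steps = sorted(folders, key=lambda f: int(re.match(r"step-(\d+)$", f).group(1)))
    return steps[:-keep_latest_k] if keep_latest_k > 0 else []

doomed = checkpoints_to_purge(["step-500", "step-1500", "step-1000"], keep_latest_k=2)
```

With keep_latest_k=2, only step-500 is purged; keep_latest_k=0 would mean "keep everything".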
Commit: c48ae39
Commits on Jun 3, 2024
-
Make seed checkpoint creation work on CPU
ghstack-source-id: 4eb7a6e10812a11c5fd8589e2ff86e5bdb36f968 Pull Request resolved: pytorch#377
Commit: 3227d50
-
ghstack-source-id: 9d52af302c797e9ac81f1113506f3bab261bf312 Pull Request resolved: pytorch#380
Commit: fbc4aa0
-
Use general way to access and update submodules
ghstack-source-id: ba1d77e5825a26632fe9b7509a88b44509cac45f Pull Request resolved: pytorch#381
Commit: ff3c6e2
Commits on Jun 4, 2024
-
Make metrics logging work for pipeline parallelism
Avoid complicating the ux and leave the status quo of 2 user-selectable behaviors: - log from rank 0 (the default) - log from all ranks (not the default) Modify the meaning of 'log from rank 0' to log from rank 0 in non-pipeline parallel runs, and log from the local rank 0 within the last pipeline-parallel stage group if pp is enabled. (note: earlier pipeline stages still produce some metrics like mfu/memory, but do not compute loss.) ghstack-source-id: 7f60d1045f240327ae41ade3a353aff19d2f289a Pull Request resolved: pytorch#383
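The selection rule above can be summarized as a small predicate. This is a toy restatement of the described behavior with simplified arguments, not the actual implementation:

```python
def should_log_metrics(global_rank, pp_enabled=False,
                       in_last_pp_stage=True, local_rank_in_stage=0,
                       log_all_ranks=False):
    # default: rank 0 logs; with PP, log from local rank 0 of the last
    # pipeline stage group, since only the last stage computes the loss
    if log_all_ranks:
        return True
    if pp_enabled:
        return in_last_pp_stage and local_rank_in_stage == 0
    return global_rank == 0
```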
Commit: a1f9edb
Commits on Jun 5, 2024
-
[RFC] Allow ModelWrapper and OptimizerWrapper to accept multiple models
and optimizers ghstack-source-id: 190220813ece188728a3c776e6839a323009f719 Pull Request resolved: pytorch#360
Commit: 9d25778
-
Enables PP+DP+TP and adds CI test case that runs on 8-gpu CI runner. ghstack-source-id: 7e2d6879d39e78fc7e6d46fd775bb6dfe08ff708 Pull Request resolved: pytorch#344
Commit: 4eb4bfc
Commits on Jun 6, 2024
-
[torchtitan][optim] Add fused as an option in train config (pytorch#355)
With these three PRs landed, we can now support the option fused=True in torchtitan for Adam and AdamW optimizer. pytorch/pytorch#125369 pytorch/pytorch#126423 pytorch/pytorch#126750 Run performance evaluation on 8 A100 DevGPU: 1000 steps on 1D DP default [llama_8b.toml](https://github.com/pytorch/torchtitan/blob/main/train_configs/llama3_8b.toml). Observation: For `fused = True` and `fused = False`, we observed similar loss curve and memory usage. wps is + ~100 and mfu is + 1.5-2% when fused = True. Below are the logs for the last 100 steps for both. ``` **Fused = False** [rank0]:2024-06-05 12:45:06,227 - root - INFO - Finished dumping traces in 0.37 seconds [rank0]:2024-06-05 12:45:37,677 - root - INFO - step: 910 loss: 4.6039 memory: 59.48GiB(75.15%) wps: 2,217 mfu: 41.16% [rank0]:2024-06-05 12:46:08,843 - root - INFO - step: 920 loss: 4.6427 memory: 59.48GiB(75.15%) wps: 2,632 mfu: 48.85% [rank0]:2024-06-05 12:46:40,052 - root - INFO - step: 930 loss: 4.6339 memory: 59.48GiB(75.15%) wps: 2,628 mfu: 48.78% [rank0]:2024-06-05 12:47:11,243 - root - INFO - step: 940 loss: 4.5964 memory: 59.48GiB(75.15%) wps: 2,631 mfu: 48.84% [rank0]:2024-06-05 12:47:42,655 - root - INFO - step: 950 loss: 4.6477 memory: 59.48GiB(75.15%) wps: 2,611 mfu: 48.47% [rank0]:2024-06-05 12:48:13,890 - root - INFO - step: 960 loss: 4.8137 memory: 59.48GiB(75.15%) wps: 2,626 mfu: 48.75% [rank0]:2024-06-05 12:48:45,110 - root - INFO - step: 970 loss: 4.5962 memory: 59.48GiB(75.15%) wps: 2,628 mfu: 48.78% [rank0]:2024-06-05 12:49:16,333 - root - INFO - step: 980 loss: 4.5450 memory: 59.48GiB(75.15%) wps: 2,627 mfu: 48.76% [rank0]:2024-06-05 12:49:47,561 - root - INFO - step: 990 loss: 4.5840 memory: 59.48GiB(75.15%) wps: 2,627 mfu: 48.76% [rank0]:2024-06-05 12:50:18,933 - root - INFO - step: 1000 loss: 4.5351 memory: 59.48GiB(75.15%) wps: 2,615 mfu: 48.53% [rank0]:2024-06-05 12:50:23,692 - root - INFO - Dumping traces at step 1000 [rank0]:2024-06-05 12:50:24,041 - root - INFO - Finished 
dumping traces in 0.35 seconds [rank0]:2024-06-05 12:50:24,422 - root - INFO - Sleeping 2 seconds for other ranks to complete [rank0]:2024-06-05 12:50:26,424 - root - INFO - Training completed **Fused = True** [rank0]:2024-06-05 14:55:42,894 - root - INFO - Finished dumping traces in 0.30 seconds [rank0]:2024-06-05 14:56:13,582 - root - INFO - step: 910 loss: 4.6091 memory: 59.48GiB(75.15%) wps: 2,341 mfu: 43.46% [rank0]:2024-06-05 14:56:43,765 - root - INFO - step: 920 loss: 4.6468 memory: 59.48GiB(75.15%) wps: 2,718 mfu: 50.45% [rank0]:2024-06-05 14:57:13,971 - root - INFO - step: 930 loss: 4.6365 memory: 59.48GiB(75.15%) wps: 2,715 mfu: 50.40% [rank0]:2024-06-05 14:57:44,172 - root - INFO - step: 940 loss: 4.6021 memory: 59.48GiB(75.15%) wps: 2,716 mfu: 50.41% [rank0]:2024-06-05 14:58:14,353 - root - INFO - step: 950 loss: 4.6522 memory: 59.48GiB(75.15%) wps: 2,718 mfu: 50.45% [rank0]:2024-06-05 14:58:44,536 - root - INFO - step: 960 loss: 4.8163 memory: 59.48GiB(75.15%) wps: 2,717 mfu: 50.44% [rank0]:2024-06-05 14:59:14,683 - root - INFO - step: 970 loss: 4.6026 memory: 59.48GiB(75.15%) wps: 2,721 mfu: 50.51% [rank0]:2024-06-05 14:59:44,840 - root - INFO - step: 980 loss: 4.5491 memory: 59.48GiB(75.15%) wps: 2,720 mfu: 50.49% [rank0]:2024-06-05 15:00:15,009 - root - INFO - step: 990 loss: 4.5859 memory: 59.48GiB(75.15%) wps: 2,719 mfu: 50.47% [rank0]:2024-06-05 15:00:45,228 - root - INFO - step: 1000 loss: 4.5396 memory: 59.48GiB(75.15%) wps: 2,714 mfu: 50.38% [rank0]:2024-06-05 15:00:49,455 - root - INFO - Dumping traces at step 1000 [rank0]:2024-06-05 15:00:49,756 - root - INFO - Finished dumping traces in 0.30 seconds [rank0]:2024-06-05 15:00:50,336 - root - INFO - Sleeping 2 seconds for other ranks to complete [rank0]:2024-06-05 15:00:52,339 - root - INFO - Training completed ```
Commit: 40f8fd0
-
Commit: 3bbe3d9
Commits on Jun 7, 2024
-
Abstract out out optimizer params and update foreach calling conventi…
…on (pytorch#386) # Summary Updates the behavior to call foreach when we are not using fused for the optimizer
Commit: d953107
Commits on Jun 9, 2024
-
DeviceMesh BC fix (pytorch#387)
fix BC issues. There's another pipeline BC issue :(
Commit: cf37b61
-
Commit: 9acdc6f
Commits on Jun 10, 2024
-
ghstack-source-id: ac3501485faa093c8b9daacca9917805e2a987b7 Pull Request resolved: pytorch#389
Commit: 3e5c0aa
-
add the 8-gpu test badge and use correct links for the integration te…
…st badges ghstack-source-id: f198ee40b0d7ee9409feb8fb9539a73b822d756c Pull Request resolved: pytorch#390
Commit: 032b9d1
Commits on Jun 11, 2024
-
forgot to enable tracer for tracer test in the last PR ghstack-source-id: 1cb137911f88daa47b57757346dad55aa736429e Pull Request resolved: pytorch#362
Commit: 91937ef
Commits on Jun 12, 2024
-
del logits=(bs, seq_len, vocab_size) to save 3.9G memory (pytorch#391)
logits = (bs, seq_len, vocab_size). Call `del logits` to free it before backward. (Screenshot of the memory savings omitted.)
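The trick above works because dropping the last reference to the large logits tensor lets the allocator reclaim it before the backward pass allocates its own buffers. A pure-Python sketch of the same lifetime idea, with a big list subclass standing in for the tensor (in the real training loop this is `loss = loss_fn(logits, labels); del logits; loss.backward()`):

```python
# Sketch (assumption: plain Python stand-ins, no torch) showing that `del`
# on the last strong reference frees the object before the next phase runs.
import gc
import weakref

class Logits(list):
    """Stand-in for the (bs, seq_len, vocab_size) activation tensor."""

logits = Logits(range(1_000_000))   # large activation
ref = weakref.ref(logits)           # observe its lifetime without keeping it alive
loss = sum(logits[:3])              # stand-in for the loss computation
del logits                          # drop the only strong reference
gc.collect()
assert ref() is None                # reclaimed before the "backward" phase
```

The same reasoning applies to CUDA tensors: once the Python reference count hits zero, the caching allocator can reuse that block for backward's activations.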
Commit: e29b6b4
Update contributing.md (pytorch#385)
small update for contributing.md to include what packages to install and how to lint.
Commit: d0b4092
Commit: 000d43f

Commits on Jun 13, 2024
enable TP fp8 allgather with PrepareFloat8ModuleInput (pytorch#393)
This PR is a follow-up to enable fp8 allgather in TP after these PRs landed: * pytorch/pytorch#128431 * pytorch-labs/float8_experimental#275 One needs to update their pytorch/float8_experimental to pick up those changes in order to train with fp8. Since fp8 is not yet enabled in our integration tests, there should be no issues on CI or training runs that do not use fp8.
Commit: 7fcf70d
Commit: a6b585f
Fix SAC BC breaking and renaming to ac_freq (pytorch#397)
as titled, SAC moved to a different public API, move to the new API to avoid CI breaking
Commit: 0bf344c
Commit: 230300b

Commits on Jun 14, 2024
enable TritonFusedRMSNorm with local_map annotation (pytorch#404)
Summary: This PR enables the use of TritonFusedRMSNorm with Tensor Parallel, with a 7%-8% performance gain compared to RMSNorm with TP. pytorch#364
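For context on what the fused kernel computes: RMSNorm scales each vector by the reciprocal of its root-mean-square. A minimal pure-Python reference of the math (the Triton kernel's win is fusing this into one pass over memory, not different math; `eps` default here is an assumption):

```python
# Reference RMSNorm: out_i = x_i / sqrt(mean(x^2) + eps) * weight_i
import math

def rms_norm(x, weight, eps=1e-6):
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

With unit weights, the output's mean square is 1 (up to `eps`), which is the normalization property the layer provides.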
Commit: 38496a3
ghstack-source-id: ce4a5b0b6b785ce595487c9d565a8af030c9d07b Pull Request resolved: pytorch#398
Commit: e99f237
Break down parallelize_llama for inference cases
ghstack-source-id: fc8e221b5047337f59dea31f2c51d6168fe4fe88 Pull Request resolved: pytorch#402
Commit: a96fb82

Commits on Jun 17, 2024
Change debugmodel to have 8 layers
- make it possible to choose flavor per-test from test_runner.py This is useful for PP when more layers == more possibilities for schedules/num_stages, but we don't care about having a large model in terms of #parameters ghstack-source-id: fd3076ad591b4f51dd195a78bab5dbe2e4276b18 Pull Request resolved: pytorch#403
Commit: ae3d2a9

Commits on Jun 18, 2024
Prepare train.py for model chunks for pipelining
When using pipeline parallelism, a common technique for reducing bubble size is to use schedules that specify more than one model chunk per physical rank. e.g. pp degree 4 could have 8 pipeline stages, and rank 0 could have stage 0 and stage 4. To generalize this concept without forking too much code in train.py, I make 'model_parts' a new container that either contains one model for non-PP or simple PP cases, and contains multiple model parts for complex PP cases. In general, this is tractable because we treat each model part the same: we create one optimizer per model part, and one lr scheduler per optimizer. We apply spmd and compile individually to each model part. The general pattern is to loop over the model parts and perform an action on each part, which also works fine if the list size is 1. The rest of train.py and optimizer/lr_scheduler changes add syntax sugar to simplify calling a method on each model part or optimizer part. ghstack-source-id: fd2982baae0cbeb5dcb695ef6509b7eec3299d6b Pull Request resolved: pytorch#406
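The 'model_parts' container pattern described above can be sketched as follows. All class and function names here are illustrative stand-ins, not torchtitan's actual types:

```python
# Sketch of the model_parts pattern: one list that holds a single model for
# non-PP runs, or several stage submodules for looped PP, with one optimizer
# per part. The generic loop works identically when len(model_parts) == 1.

class DummyPart:
    def __init__(self, name):
        self.name = name

class DummyOptimizer:
    def __init__(self, part):
        self.part = part
        self.steps = 0
    def step(self):
        self.steps += 1

def build_optimizers(model_parts):
    # one optimizer per model part
    return [DummyOptimizer(p) for p in model_parts]

# e.g. pp degree 4 with 8 stages: this rank holds stage 0 and stage 4
model_parts = [DummyPart("stage0"), DummyPart("stage4")]
optimizers = build_optimizers(model_parts)
for opt in optimizers:   # same loop whether the list has 1 or N parts
    opt.step()
```

The same loop-over-parts shape applies to lr schedulers, spmd application, and compile, which is what keeps train.py from forking.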
Commit: f8e17f1

Commits on Jun 19, 2024
dump memory snapshot to analyze OOMs (pytorch#395)
when setting `enable_memory_snapshot = true` in `.toml`: * dump memory snapshots in case of OOMs; output folder is `memory_snapshot/iteration_x_exit` * dump regularly according to `profile_freq`; output folder is `memory_snapshot/iteration_x` * existing `.toml` works since `enable_memory_snapshot=False` by default. (Screenshot of an example snapshot dump on OOM omitted.)
Commit: 71b70b5

Commits on Jun 20, 2024
whole_model for fp8 (pytorch#414)
train.py renamed `model` to `whole_model` in pytorch#406; fp8 still used `model`, thus reporting an error on `model` not defined. This PR fixes it: `build_fp8_linear(whole_model, job_config)`.
Commit: 6117759

Commits on Jun 21, 2024
Add train loop support for looped PP schedules
- refactor some per-model logic into helper functions ghstack-source-id: a2376627e2864deeb9e4fbf49cecd0990bc434ea Pull Request resolved: pytorch#358
Commit: 04661a6

Commits on Jun 25, 2024
Set `record_shapes=True` for profiler
ghstack-source-id: 6f1ed49d15ce311f1bf118820965cdb5309a8030 Pull Request resolved: pytorch#419
Commit: b1340a1
ghstack-source-id: 39e484954814e61cdfb2ba661f0a98c83bc0ce60 Pull Request resolved: pytorch#418
Commit: be126a6
Adding FSDP Memory Tracking and Estimation
ghstack-source-id: c8ed20fc585957bd164dd963307616a53991615d Pull Request resolved: pytorch#425
Commit: 342a07e
Adding integration test for FSDP Memory Tracking and Estimation
ghstack-source-id: cc224db8951ec7a133fd769845a4765cbedc6454 Pull Request resolved: pytorch#426
Commit: 134addd

Commits on Jun 26, 2024
by default disable heavy memory profiling
ghstack-source-id: cad7b3c41fd60ec19c0e6e7d058e8aa00602a187 Pull Request resolved: pytorch#430
Commit: f5171cb

Commits on Jun 27, 2024
Add the option to turn on async-TP
ghstack-source-id: 0a03379eeb3a63b2d1ad4dff84d0e61ca82b1bbf Pull Request resolved: pytorch#429
Commit: 1ec2ece

Commits on Jul 1, 2024
Modifying memory estimation options and minor changes
ghstack-source-id: 5f09824cddaed6585cc094095e1e95dd070d76f4 Pull Request resolved: pytorch#435
Commit: 64d47fd

Commits on Jul 8, 2024
add comment pointing to Sequence Parallel optimization example
ghstack-source-id: 6fa0dcd4bca876e10a6a8349283fb940a59ad234 Pull Request resolved: pytorch#438
Commit: 6655204
switch float8 logic from Float8DynamicLinear to Float8Linear (pytorch#436)
Summary: After pytorch-labs/float8_experimental#300, `Float8Linear` with default settings is equivalent to `Float8DynamicLinear`. This PR changes `torchtitan` to use `Float8Linear`. To support the new UX of `float8_experimental` better, I also switched the `fp8_linear` configuration to be a boolean on whether to swap the linears or not. In the future we can add new options on how to configure each linear (scaling type, scaling granularity, etc.); saving that for a future PR. Test Plan: ``` // run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs, // verify performance and loss values do not change meaningfully between // baseline and this PR // baseline (before this PR) // 1. compile, bf16 // 2. compile, float8 // 3. compile, float8, fdsp_fp8_allgather=True // 4. compile, float8, fdsp_fp8_allgather=True, tp=2 // logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce // experiment (this PR): repeat all of the above, but with Float8Linear // logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631 ```
Commit: 8a1aa06

Commits on Jul 10, 2024
Removed `_experimental_support_context_fn_in_torch_utils_checkpoint`
ghstack-source-id: 50b2d0c2b4c22e2f045cafd8630c16f3a8c6d35f Pull Request resolved: pytorch#444
Commit: 28762c8
Reordered TP parallel plan to follow execution order
ghstack-source-id: b4924952adeb5f16d08b60faa54690762841c422 Pull Request resolved: pytorch#445
Commit: 064730a
Made some stylistic changes to `apply_dp`
ghstack-source-id: fb78e9eb8aa406ba87d6ad6cf2229c1027dae42f Pull Request resolved: pytorch#446
Commit: 3e3a913
Refactored activation checkpointing
ghstack-source-id: 785c7e47651cda97ea22d0147d14b8d061ce042d Pull Request resolved: pytorch#447
Commit: 347ddc0
ghstack-source-id: c4efb81ec6acc5442955908cc376df3e6d889af3 Pull Request resolved: pytorch#442
Commit: 3ff7fbb

Commits on Jul 11, 2024
Renamed parallel styles for transformer block weights
ghstack-source-id: 5fb0bf3d08cacf27242ec0f85d5dd3cdc03b739e Pull Request resolved: pytorch#448
Commit: 562d7e2
Added type annotations and more stylistic changes
ghstack-source-id: 1bd5b9d5abc8644785132f8eb2baaf8b1cfc5fb5 Pull Request resolved: pytorch#449
Commit: 0ddf49b

Commits on Jul 15, 2024
[Cleanup] Remove libuv from run_llama_train.sh
libuv is now enabled by default. We can probably do without the educational blurb there, and don't need the env var either since the default has landed. ghstack-source-id: 68c8d2abe7eb0777e2add8df7634367c31b7ec06 Pull Request resolved: pytorch#453
Commit: 535acf6
[Cleanup] Organize run_llama_train.sh options
Just a little code motion but it looks cleaner to me this way ghstack-source-id: 055fbd557cd9cf189e6b9bd6a7048f1204e1dc5c Pull Request resolved: pytorch#454
Commit: ac72078
[Cleanup] Split run_llama_train.sh and run_memory_estimation.sh
Make each script simpler to read ghstack-source-id: ba3aa65feb6e304736c73daf5bc8ab5fb254f196 Pull Request resolved: pytorch#455
Commit: 4b6cdc1
[Cleanup] Remove unused TRAINER_DIR
This argument seems to be left over from older times- it is not used anywhere in the codebase. ghstack-source-id: abbcf82ed4d1b8fbb71c6a6b48acbc1296dbec64 Pull Request resolved: pytorch#456
Commit: 8fa11f0
Add educational code pointers to top level README
ghstack-source-id: 522aa2fa0bf1679f55d9f3a8a38fdcd319d5e3df Pull Request resolved: pytorch#457
Commit: 174c44a

Commits on Jul 16, 2024
enable FSDP2 + fp8 all-gather and fix TP fp8 all-gather (pytorch#413)
we have landed fp8 all-gather optimizations in float8_experimental pytorch-labs/float8_experimental#266; this PR proposes the torchtitan changes. also includes fp8 in CI ``` from float8_experimental.fsdp_utils import precompute_float8_dynamic_scale_for_fsdp # inside the training loop model(input).sum().backward() optim.step() precompute_float8_dynamic_scale_for_fsdp(model) ``` FSDP2 fp8 all-gather runs are added to CI ``` CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp ``` TP fp8 all-gather is locally tested; will add it to CI after uploading a new tokenizer with vocab size 2560 (divisible by 16) ``` CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 1 --training.tensor_parallel_degree 4 CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 2 --training.tensor_parallel_degree 2 ``` precompute scales after optimizer.step (screenshot omitted). FSDP2 pre-all-gather does not have any small all-reduces (screenshot omitted). TODO * upload tokenizer with vocab size 2560 to enable CI on TP fp8 all-gather * torch.compile complains about fp8 * add delayed scaling and brainstorm about best config option to express fp8 * compare perf between delayed scaling and dynamic scaling
https://github.com/pytorch-labs/float8_experimental/pull/312/files
Commit: a4b2ee3

Commits on Jul 17, 2024
import float8_experimental only when fp8 is enabled and install it in CI (pytorch#464)
make sure to only import float8_experimental when fp8 is enabled; for 4 gpu CI, make sure we can import float8_experimental correctly in CI: `python -m pip install git+https://github.com/pytorch-labs/float8_experimental.git`
Commit: ae8181b
skip fp8 CI on non-H100 GPUs (pytorch#465)
skip fp8 tests on non-H100 GPUs by checking `torch.cuda.get_device_capability() >= (9, 0)` this makes 4 GPU CI healthy again
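The gating described above checks the CUDA device's compute capability; float8 support here requires SM 9.0+ (H100-class). A minimal sketch of that check as a pure function (the real check calls `torch.cuda.get_device_capability()`; the function name is a hypothetical stand-in):

```python
# Sketch of the CI gate: skip fp8 tests when the device's
# (major, minor) compute capability is below (9, 0).

def supports_float8(capability: tuple) -> bool:
    """Return True when the SM version is at least 9.0 (e.g. H100)."""
    return capability >= (9, 0)   # Python tuples compare lexicographically
```

Tuple comparison makes this concise: `(8, 6) < (9, 0)` but `(9, 1) >= (9, 0)`.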
Commit: 3760bcf
clean up float8 configs in torchtitan (pytorch#466)
Summary: 1. standardizes on `float8` instead of `fp8` for config names 2. removes usage of non-public objects such as `Float8Linear` Test Plan: ``` with-proxy NGPU=1 CUDA_VISIBLE_DEVICES=7 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.compile --training.enable_float8_linear ``` Reviewers: Subscribers: Tasks: Tags:
Commit: 69fe8de

Commits on Jul 18, 2024
Add support of DDP and experimental CompiledAutograd
Summary: Address the comments in pytorch#319 and resubmit the PR to fit the current code base. Test Plan: ``` CONFIG_FILE=./train_configs/debug_model.toml ./run_llama_train.sh --comm.train_timeout_seconds=3600 --training.tensor_parallel_degree=1 --training.data_parallel_degree=8 --experimental.data_parallel_type=ddp --training.steps=1000 --metrics.log_freq=10 --profiling.profile_freq=1000 ``` ghstack-source-id: 81dc85d42df13df4ed727bebd825681879af936b Pull Request resolved: pytorch#432
Commit: 2f989b9

Commits on Jul 19, 2024
add torch.compile + FSDP2 float8 all-gather in CI (pytorch#468)
fixed my bug in float8_experimental. now we can torch.compile transformer blocks with FSDP float8 all-gather pytorch-labs/float8_experimental#321 local test: `CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_float8_linear --training.enable_fsdp_float8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp --training.compile` profiler traces: I can see the compiled region in the cpu thread and the float8 matmul `sm90_xmma_gemm_e4m3bf16...` in the cuda stream. (Screenshot omitted.)
Commit: 71b8eae
[float8] keep model.output as `nn.Linear` (high precision, not fp8) (pytorch#469)
**keep model.output as nn.Linear**: it's a common practice to NOT apply fp8 on the final output layer * specify `skip_fqn_list` in swapping * when applying TP to model.output, use plain `ColwiseParallel` instead of `Float8ColwiseParallel` credit to @awgu, we do not need tokenizer vocab size to be divisible by 16 pytorch#461 1D TP + float8 all-gather, eager mode: `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4 ./run_llama_train.sh --training.enable_float8_linear --training.data_parallel_degree 1 --training.tensor_parallel_degree 4` 1D TP + float8 all-gather, compile mode: `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4 ./run_llama_train.sh --training.enable_float8_linear --training.data_parallel_degree 1 --training.tensor_parallel_degree 4 --training.compile` 2D FSDP2 + TP + float8 all-gather, eager mode: `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4 ./run_llama_train.sh --training.enable_float8_linear --training.enable_fsdp_float8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp --training.tensor_parallel_degree 2` 2D FSDP2 + TP + float8 all-gather, compile mode: `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4 ./run_llama_train.sh --training.enable_float8_linear --training.enable_fsdp_float8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp --training.tensor_parallel_degree 2 --training.compile` 1D TP + float8 all-gather trace: see float8 and all-gather in the trace. 2D + float8 all-gather trace: see float8, FSDP collectives, and TP collectives. (Screenshots omitted.)
Commit: 0c6f9a2

Commits on Jul 20, 2024
remove CI for FSDP2 + fp8 all-gather (pytorch#470)
per discussion from pytorch#469 (comment) we are planning BC breaking changes in float8_experimental. remove CI for FSDP2 + fp8 all-gather for now. When public APIs are finalized, we can discuss bringing it back
Commit: 0a17c26

Commits on Jul 21, 2024
dynamically update torch.compile cache config to ensure async tp support, enhance async tp UX (pytorch#471)
This PR adds some enhancements for supporting async tp: 1 - if async tp is active, auto-update the torch.dynamo cache limit to 10K. If this is not updated, async tp will not be activated on larger models, as it will quietly stop compilation due to 'cache limit reached' with no info for the user. This config update is logged. 2 - if async tp is enabled, verify that torch.compile is set to true for this job config. If not, warn and then activate torch.compile to ensure the user gets working async tp. 3 - update 'Applied Tensor Parallel' (to the model) to 'Applied Async Tensor Parallel' when async tp is active, to make it clear in the logs which TP is active. (Screenshot omitted.)
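The two safeguards described above can be sketched as one validation pass over the job config. This is an illustrative pure-Python version over a plain dict; the field names and the 10K cache value mirror the description but are stand-ins, not the real JobConfig schema:

```python
# Sketch of the async-TP UX safeguards: bump the dynamo cache limit so large
# models don't silently stop compiling, and force compile on (with a warning)
# since async TP requires torch.compile.
import logging

ASYNC_TP_CACHE_SIZE_LIMIT = 10_000  # value taken from the PR description

def apply_async_tp_safeguards(config: dict) -> dict:
    if not config.get("enable_async_tensor_parallel"):
        return config
    # 1. raise the compile cache limit; otherwise compilation quietly stops
    #    with "cache limit reached" and async TP never activates
    config["dynamo_cache_size_limit"] = ASYNC_TP_CACHE_SIZE_LIMIT
    # 2. async TP requires torch.compile; warn and enable it if off
    if not config.get("compile"):
        logging.warning("async TP requires torch.compile; enabling it")
        config["compile"] = True
    return config
```

In the real code the cache bump would go to `torch._dynamo.config` rather than a dict, but the decision logic is the same.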
Commit: 0ee573c

Commits on Jul 26, 2024
Fix 8gpu PP failure due to 2D DCP disablement
DCP recently added safeties to avoid using it for 2D/3D since strided sharding (a feature needed for safe 2D/3D resharding) is not ready yet. PP uses DCP to load a seed checkpoint. Disabling the safety mechanism is enough to make 3D/PP still work (for the case where we train from the beginning or do not re-shard. (Resharding refers to saving a checkpoint from one world size/parallelism config and loading/resuming under a different one). ghstack-source-id: c069d2186c79517c72f5b3c99485cebdc15df08f Pull Request resolved: pytorch#460
Commit: 69c9bb2
update float8 integration after UX changes (pytorch#484)
Summary: float8_experimental landed various BC-breaking UX changes last week. This PR updates torchtitan to work with the version of float8_experimental after pytorch-labs/float8_experimental#332 and pytorch-labs/float8_experimental#337 Test Plan: ``` with-proxy CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.enable_float8_linear --training.compile ``` Reviewers: Subscribers: Tasks: Tags:
Commit: 90e2070
Re-enable FSDP2 Mem Tracker integration tests
ghstack-source-id: 8344603f7a5596cb2909c9bf04dd1b9e4730c9b8 Pull Request resolved: pytorch#485
Sanket Jayant Purandare committed Jul 26, 2024
Commit: 42f4ff5

Commits on Jul 29, 2024
Used `partial` instead of global vars for LR scheduling
ghstack-source-id: 12c4418b0574d93e1441f4ca3d1de79c8aad7a40 Pull Request resolved: pytorch#487
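The pattern this commit adopts can be sketched with `functools.partial`: bind the schedule's settings into the LR lambda at construction time instead of reading module-level globals. The linear-warmup schedule below is illustrative, not torchtitan's exact one:

```python
# Sketch: functools.partial binds warmup settings per job, so the lambda
# carries its own state instead of reading globals.
from functools import partial

def warmup_factor(step: int, warmup_steps: int) -> float:
    """Linear warmup to 1.0, then constant."""
    return min(1.0, (step + 1) / warmup_steps)

# bind per-job settings without any globals:
lr_lambda = partial(warmup_factor, warmup_steps=10)
```

A lambda built this way can be handed to something like `LambdaLR` once per optimizer, and two jobs with different warmup lengths never interfere.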
Commit: a48de09

Commits on Jul 30, 2024
[EZ] Add logs for some basic training params so that we can verify in… (pytorch#491)
As title; while testing on the 405B model, I found that we somehow need the logs for some training params, so I added some here. Tested locally; the logging is shown as in the screenshot (omitted).
Commit: b63e209
make float8 scaling type configurable (pytorch#489)
Summary: Adds config options to configure float8 scaling type for input, weight, grad_output. Performance is not ideal yet, but that's because we have not optimized it. Test Plan: ``` // repeat for input, weight, grad_out with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.enable_float8_linear --training.float8_scaling_type_weight delayed --training.compile ``` Reviewers: Subscribers: Tasks: Tags:
Commit: 91f075a
[PP] add flexible interleaved 1f1b schedule pytorch#490 (pytorch#493)
This was approved in pytorch#490, but merged into the wrong branch, merging this into main
Commit: 9358d70
move float8 callsites to torchao.float8 (pytorch#492)
Summary: The `float8_experimental` repository moved to `torchao.float8` in pytorch/ao#551 This PR updates `torchtitan` to use float8 from the new location. Test Plan: ``` with-proxy CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_float8_linear --training.compile ``` Reviewers: Subscribers: Tasks: Tags:
Commit: 239d56f

Commits on Aug 1, 2024
ghstack-source-id: 3879e764e7b33afde5d778810c71d1d2a8f82f6d Pull Request resolved: pytorch#494
Commit: 3c77e9f
[BE][2/n] use proper method signatures in parallelize_llama
ghstack-source-id: 17a1ee9f03f13423a30183c5c8d7ad30f8c8dbfc Pull Request resolved: pytorch#495
Commit: bf90710
[BE][3/n] wrap fp8 logic using Float8Handler
ghstack-source-id: e94c7f6f4fad87c5432262c54beabd02de5541b8 Pull Request resolved: pytorch#496
Commit: 40f79d7
Bring LLaMa 3.1 405B to TorchTitan family (pytorch#481)
With the official launch of the LLaMa 3.1 model, we want to add the config to TorchTitan. Of course, there is more work to be done, but we want to take an incremental approach, so more PRs will be needed. For now, we tried on 128 GPUs with the current config (TP=8, FSDP=16). The perf number is wps: 109, mfu: 29%. (Screenshots of the loss curves for 3000 steps with 600 warmup, at lr = 0.8e-4 and lr = 1.1e-4, omitted.)
Commit: 4871358

Commits on Aug 2, 2024
[TP] Infer local n_heads instead of ad-hoc model changes
ghstack-source-id: 587e3d6e5270714ca734b8031ce41a962e6394ea Pull Request resolved: pytorch#498
Commit: d41d604

Commits on Aug 3, 2024
ghstack-source-id: 63af8025c184fd5ad34f2f57bf78a37dda2cd33d Pull Request resolved: pytorch#443
Commit: 24aef32

Commits on Aug 5, 2024
[EZ][405B] Use scientific notation for 405B model lr (pytorch#504)
As title, use `8e-5` rather than `0.8e-4`.
Commit: c44cca0
[BE][4/n] split pipeline_llama into a separate file
ghstack-source-id: 5ebb4adf3152f413fa33a923c272c9aa3ce1f775 Pull Request resolved: pytorch#499
Commit: 8849580
[fix] float8 should be applied on all model_parts
ghstack-source-id: 52ed6836de39e82c4c5824a40ecfc1d9ec7ed2bd Pull Request resolved: pytorch#500
Commit: a4d88d1

Commits on Aug 6, 2024
Add warning to compile rmsnorm (pytorch#505)
as titled, add warning to compile rmsnorm as it's not fully ready yet, i.e. this issue pytorch#497 We can remove this warning once we fix the issue
Commit: 1a303b3

Commits on Aug 7, 2024
add float8 to README (pytorch#509)
add a float8 link in README so we can redirect people from the dev-discuss post to the torchtitan repo. I tried the command locally and the traces are looking good. (Screenshots of the rendered README, float8.md, and the traces omitted.)
Commit: b99bc5e
address TODOs as 2D recompiles is fixed
ghstack-source-id: 2927f0a8082171da3e9f59a5d04f8325cbdf3653 Pull Request resolved: pytorch#508
Commit: fa8cdd4

Commits on Aug 8, 2024
[BE][5/n] simplify pp vs. non-pp set up
ghstack-source-id: 003bfbfbcf1511ddbd18e15d031b39f597d8e7db Pull Request resolved: pytorch#510
Commit: d6e3f77
[BE][6/n] replace large c4_mini datasets by c4_test with the first 2K entries
ghstack-source-id: 319f4961b092778703101b98937803073132afa1 Pull Request resolved: pytorch#512
Full SHA: 34fa017
Commits on Aug 9, 2024
-
Create composability.md (pytorch#511)
Explain the rationale and challenges behind certain changes we made to the llama model to support 3D parallelism. --------- Co-authored-by: tianyu-l <[email protected]>
Full SHA: 9de54a5
-
depend on torchdata 0.8.0 instead of nightly
ghstack-source-id: 1965d3122885fed3c28e2e058c55581187e7816c Pull Request resolved: pytorch#513
Full SHA: b41b41b
Commits on Aug 12, 2024
-
[PP] Bypass seed checkpoint by init-ing model parts separately (pytorch#516)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * pytorch#473 * pytorch#517 * __->__ pytorch#516 Allows PP to be used without a seed checkpoint by calling `init_weight` on each model part. This is the solution in step 1 of pytorch#514 proposed by @wconstab
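The pattern can be sketched as follows. The `ModelPart` class and `init_pipeline_parts` helper are illustrative stand-ins, not torchtitan's actual classes; the point is that each pipeline stage initializes its own weights in place instead of all stages loading from a shared seed checkpoint:

```python
class ModelPart:
    """Stand-in for one pipeline stage's submodule."""

    def __init__(self, name: str):
        self.name = name
        self.initialized = False

    def init_weights(self) -> None:
        # In the real model this (re)initializes parameters in place,
        # removing the need to materialize them from a seed checkpoint.
        self.initialized = True


def init_pipeline_parts(parts):
    """Initialize each pipeline model part independently."""
    for part in parts:
        part.init_weights()
    return parts


parts = init_pipeline_parts([ModelPart("stage0"), ModelPart("stage1")])
```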
Full SHA: a4bc948
-
[small] format composability.md (pytorch#517)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * pytorch#473 * __->__ pytorch#517 * pytorch#516 Ran `pre-commit run --all-files`
Full SHA: a47a5a9
Commits on Aug 13, 2024
-
Throw warning if users are using an old pytorch version that does not include DTensor strided sharding (pytorch#507)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ pytorch#507

**Summary**
1. Check if users are using a new nightly-build pytorch that includes DTensor strided sharding (pytorch/pytorch#130760) when 2D/3D is used. Print a warning if not.
2. Remove the temporary re-enablement added in pytorch#460.

**Test**
Command: `python test_runner.py outputs --test pp_dp_tp --ngpu 8`
GPUs: A100
Output:
- without strided sharding:
```
[rank7]:2024-08-06 03:21:26,706 - root - INFO - step: 2 loss: 8.1652 memory: 0.51GiB(0.64%) wps: 8,250 mfu: 0.25%
[rank7]:2024-08-06 03:21:27,013 - root - INFO - step: 3 loss: 8.0951 memory: 0.51GiB(0.64%) wps: 13,358 mfu: 0.41%
[rank7]:2024-08-06 03:21:27,309 - root - INFO - step: 4 loss: 7.9748 memory: 0.51GiB(0.64%) wps: 13,865 mfu: 0.42%
[rank7]:2024-08-06 03:21:27,582 - root - INFO - step: 5 loss: 7.8025 memory: 0.51GiB(0.64%) wps: 15,057 mfu: 0.46%
[rank7]:2024-08-06 03:21:28,076 - root - INFO - step: 6 loss: 7.5612 memory: 0.51GiB(0.64%) wps: 8,300 mfu: 0.25%
[rank7]:2024-08-06 03:21:28,608 - root - INFO - step: 7 loss: 7.3649 memory: 0.51GiB(0.64%) wps: 7,705 mfu: 0.23%
[rank7]:2024-08-06 03:21:28,927 - root - INFO - step: 8 loss: 7.2946 memory: 0.51GiB(0.64%) wps: 12,832 mfu: 0.39%
[rank7]:2024-08-06 03:21:29,251 - root - INFO - step: 9 loss: 7.1311 memory: 0.51GiB(0.64%) wps: 12,669 mfu: 0.38%
[rank7]:2024-08-06 03:21:29,627 - root - INFO - step: 10 loss: 7.0540 memory: 0.51GiB(0.64%) wps: 10,918 mfu: 0.33%
>>>>>>>>>>>>>>>>>Checkpoint save & load<<<<<<<<<<<<<<<<<<<
[rank7]:2024-08-06 03:21:59,723 - root - INFO - step: 11 loss: 7.0822 memory: 0.51GiB(0.64%) wps: 1,139 mfu: 0.03%
[rank7]:2024-08-06 03:22:00,054 - root - INFO - step: 12 loss: 7.0508 memory: 0.51GiB(0.64%) wps: 12,366 mfu: 0.38%
[rank7]:2024-08-06 03:22:00,340 - root - INFO - step: 13 loss: 6.9182 memory: 0.51GiB(0.64%) wps: 14,370 mfu: 0.44%
[rank7]:2024-08-06 03:22:00,624 - root - INFO - step: 14 loss: 6.8948 memory: 0.51GiB(0.64%) wps: 14,442 mfu: 0.44%
[rank7]:2024-08-06 03:22:00,907 - root - INFO - step: 15 loss: 6.8358 memory: 0.51GiB(0.64%) wps: 14,514 mfu: 0.44%
[rank7]:2024-08-06 03:22:01,574 - root - INFO - step: 16 loss: 6.7653 memory: 0.51GiB(0.64%) wps: 6,144 mfu: 0.19%
[rank7]:2024-08-06 03:22:02,209 - root - INFO - step: 17 loss: 6.7340 memory: 0.51GiB(0.64%) wps: 6,453 mfu: 0.20%
[rank7]:2024-08-06 03:22:02,532 - root - INFO - step: 18 loss: 6.6874 memory: 0.51GiB(0.64%) wps: 12,695 mfu: 0.39%
[rank7]:2024-08-06 03:22:02,863 - root - INFO - step: 19 loss: 6.6566 memory: 0.51GiB(0.64%) wps: 12,406 mfu: 0.38%
[rank7]:2024-08-06 03:22:03,257 - root - INFO - step: 20 loss: 6.6629 memory: 0.51GiB(0.64%) wps: 10,392 mfu: 0.32%
```
- with strided sharding:
```
[rank7]:2024-08-06 03:26:18,288 - root - INFO - step: 1 loss: 8.2069 memory: 0.50GiB(0.63%) wps: 915 mfu: 0.03%
[rank7]:2024-08-06 03:26:19,084 - root - INFO - step: 2 loss: 8.1913 memory: 0.51GiB(0.64%) wps: 5,144 mfu: 0.16%
[rank7]:2024-08-06 03:26:19,365 - root - INFO - step: 3 loss: 8.1148 memory: 0.51GiB(0.64%) wps: 14,593 mfu: 0.44%
[rank7]:2024-08-06 03:26:19,698 - root - INFO - step: 4 loss: 7.9982 memory: 0.51GiB(0.64%) wps: 12,328 mfu: 0.37%
[rank7]:2024-08-06 03:26:20,011 - root - INFO - step: 5 loss: 7.8382 memory: 0.51GiB(0.64%) wps: 13,100 mfu: 0.40%
[rank7]:2024-08-06 03:26:20,498 - root - INFO - step: 6 loss: 7.6293 memory: 0.51GiB(0.64%) wps: 8,423 mfu: 0.26%
[rank7]:2024-08-06 03:26:21,126 - root - INFO - step: 7 loss: 7.4454 memory: 0.51GiB(0.64%) wps: 6,530 mfu: 0.20%
[rank7]:2024-08-06 03:26:21,472 - root - INFO - step: 8 loss: 7.3337 memory: 0.51GiB(0.64%) wps: 11,843 mfu: 0.36%
[rank7]:2024-08-06 03:26:21,849 - root - INFO - step: 9 loss: 7.1960 memory: 0.51GiB(0.64%) wps: 10,892 mfu: 0.33%
[rank7]:2024-08-06 03:26:22,229 - root - INFO - step: 10 loss: 7.1208 memory: 0.51GiB(0.64%) wps: 10,798 mfu: 0.33%
>>>>>>>>>>>>>>>>>Checkpoint save & load<<<<<<<<<<<<<<<<<<<
[rank7]:2024-08-06 03:26:50,306 - root - INFO - step: 11 loss: 7.1222 memory: 0.51GiB(0.64%) wps: 866 mfu: 0.03%
[rank7]:2024-08-06 03:26:50,632 - root - INFO - step: 12 loss: 7.1189 memory: 0.51GiB(0.64%) wps: 12,589 mfu: 0.38%
[rank7]:2024-08-06 03:26:50,917 - root - INFO - step: 13 loss: 6.9646 memory: 0.51GiB(0.64%) wps: 14,417 mfu: 0.44%
[rank7]:2024-08-06 03:26:51,217 - root - INFO - step: 14 loss: 6.9626 memory: 0.51GiB(0.64%) wps: 13,680 mfu: 0.42%
[rank7]:2024-08-06 03:26:51,514 - root - INFO - step: 15 loss: 6.8694 memory: 0.51GiB(0.64%) wps: 13,799 mfu: 0.42%
[rank7]:2024-08-06 03:26:52,207 - root - INFO - step: 16 loss: 6.7994 memory: 0.51GiB(0.64%) wps: 5,910 mfu: 0.18%
[rank7]:2024-08-06 03:26:53,053 - root - INFO - step: 17 loss: 6.7634 memory: 0.51GiB(0.64%) wps: 4,847 mfu: 0.15%
[rank7]:2024-08-06 03:26:53,370 - root - INFO - step: 18 loss: 6.7233 memory: 0.51GiB(0.64%) wps: 12,915 mfu: 0.39%
[rank7]:2024-08-06 03:26:53,686 - root - INFO - step: 19 loss: 6.7054 memory: 0.51GiB(0.64%) wps: 12,995 mfu: 0.39%
[rank7]:2024-08-06 03:26:54,059 - root - INFO - step: 20 loss: 6.7130 memory: 0.51GiB(0.64%) wps: 10,991 mfu: 0.33%
```
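The version gate itself can be sketched as a simple comparison against a minimum PyTorch version. The helper name and the `(2, 5)` threshold below are assumptions for illustration; the actual cutoff nightly for pytorch/pytorch#130760 may differ:

```python
def has_strided_sharding(torch_version: str, min_version=(2, 5)) -> bool:
    """Hypothetical check: True if `torch_version` is at least `min_version`.

    Handles nightly/dev strings like "2.5.0.dev20240813+cu121" by stripping
    the local suffix and any non-numeric trailing components.
    """
    numeric = torch_version.split("+")[0]
    parts = []
    for piece in numeric.split("."):
        if piece.isdigit():
            parts.append(int(piece))
        else:
            break  # stop at "dev20240813" etc.
    return tuple(parts) >= min_version


if not has_strided_sharding("2.3.1"):
    print(
        "WARNING: this PyTorch build predates DTensor strided sharding; "
        "2D/3D sharded checkpoints may be resharded incorrectly."
    )
```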
Full SHA: 36a0057
Commits on Aug 14, 2024
-
`torch.nn.Module.to_empty` takes a keyword-only argument `device`, according to https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.to_empty
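The fix is mechanical: call `module.to_empty(device=...)`, never `module.to_empty(...)` positionally. A self-contained illustration of why the positional form fails (the function below only mimics the keyword-only signature; it is not torch code):

```python
def to_empty(*, device):
    """Mimics the keyword-only `device` parameter of torch.nn.Module.to_empty."""
    return f"materialized on {device}"


print(to_empty(device="meta"))  # keyword form: works

try:
    to_empty("meta")  # positional form: raises TypeError
except TypeError as exc:
    print(f"TypeError: {exc}")
```

The bare `*` in the signature is what makes every following parameter keyword-only, so the positional call is rejected at call time rather than silently misinterpreted.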
Full SHA: 1c96a01
Commits on Aug 15, 2024
-
remove old torch dependency in requirements.txt
ghstack-source-id: 7e1c7071f8126072ab0e25194b75f280bf4277ec Pull Request resolved: pytorch#523
Full SHA: 6c16807
Commits on Aug 16, 2024
-
Fail when using tracer made without seed checkpoint (pytorch#522)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * pytorch#473 * __->__ pytorch#522
Full SHA: f339363
-
uniformly use skip for both (map-style) Dataset and IterableDataset
ghstack-source-id: c8f611742ffbb4859988b97e706b9e0d1b4ad6f1 Pull Request resolved: pytorch#521
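What "uniform skip" means for the two dataset flavors can be sketched as below. These helpers are illustrative, not torchtitan's actual dataloader code (which wraps Hugging Face datasets); the point is that both paths drop the same first n samples:

```python
from itertools import islice
from typing import Iterable, Iterator, Sequence


def skip_iterable(stream: Iterable, n: int) -> Iterator:
    """Skip the first n samples of an IterableDataset-style stream."""
    return islice(stream, n, None)


def skip_map_style(dataset: Sequence, n: int) -> Sequence:
    """Skip the first n samples of a map-style (indexable) dataset."""
    return dataset[n:]


# Both paths yield the same remaining samples for the same skip count:
data = list(range(10))
assert list(skip_iterable(iter(data), 3)) == skip_map_style(data, 3)
```

Using one skip semantic for both styles keeps checkpoint-resume behavior consistent regardless of which dataset type backs the loader.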
Full SHA: 81c555f
Commits on Aug 20, 2024
-
Full SHA: 57c3400