
Sync with torchtitan #2

Closed
wants to merge 263 commits into from

This pull request is big! We’re only showing the most recent 250 commits.

Commits on Feb 24, 2024

  1. update readme (pytorch#74)

    mostly testing if new repo works or not
    wanchaol authored Feb 24, 2024
    Commit: 3d1e9ea
  2. move config folder to root and adjust options (pytorch#83)

    as titled, move the config files to the root folder, which decouples them
    from the torchtrain package build and allows easier navigation
    wanchaol authored Feb 24, 2024
    Commit: 98a0f79

Commits on Feb 26, 2024

  1. add iter time tracking via cuda events, add data loading times, add columnar display to show both, show avg iter & data loading times at end of training (pytorch#87)
    
    This PR adds basic perf timing and display for 'per iter' and 'final
    iter average' timings (in part based on Andrew's comment about having
    to open the trace to compare iter timing).
    
    1. tracking list is housed in TrainState, but I do not save it as part
    of the state dict as I view this as useful but not saveable info.
    2. iter times are tracked after dataloading is done each iter and after
    optimizer step. The idea is to make this timing expressly the model
    training iter (not data loading or post iter other metrics calcs).
    
    3. 'time' is now displayed at each iter along with the usual loss and
    lr.
    
    4. at the end of training, assuming more than 3 iters were run, the
    average iter time is calculated by ignoring the first three iters
    (consider these warmup, especially as the CUDA caching allocator gets warmed up)
    and displayed.
    5. based on @tianyu-l's feedback: I have added data loading times as well.
    I used the same timeit.default_timer() from timeit to be consistent.
    (CPU side, so no syncs needed :)
    
    6 - after fiddling with printf width formatting options, added beautiful
    aligned columnar display for the per iter updates:
    Now: 
    <img width="1282" alt="Screenshot 2024-02-26 at 9 39 25 AM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/9ee2ea7b-5c28-4d41-ba91-d4176c64fc66">
    
    before: 
    <img width="1282" alt="Screenshot 2024-02-26 at 8 39 46 AM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/37cbfa20-7f1d-4d94-be94-3505ef4498c0">
    lessw2020 authored Feb 26, 2024
    Commit: 629652b
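    A minimal sketch of the timing approach described above (the helper names are
    illustrative, not the actual torchtrain code): data loading and the model step
    are timed separately with timeit.default_timer(), and the averages skip the
    first few warmup iterations.

    ```
    import time
    import timeit

    WARMUP_ITERS = 3  # skip these when averaging (allocator warmup, etc.)

    def fake_train_step(batch):
        """Stand-in for forward/backward/optimizer.step()."""
        time.sleep(0.01)

    iter_times, data_times = [], []
    data_iter = iter(range(10))  # stand-in for a real dataloader

    for step in range(10):
        t0 = timeit.default_timer()
        batch = next(data_iter)  # data loading time (CPU side, no sync needed)
        data_times.append(timeit.default_timer() - t0)

        t1 = timeit.default_timer()
        fake_train_step(batch)   # the model training iteration being measured
        iter_times.append(timeit.default_timer() - t1)

    if len(iter_times) > WARMUP_ITERS:
        n = len(iter_times) - WARMUP_ITERS
        print(f"Average iter time: {sum(iter_times[WARMUP_ITERS:]) / n:.4f} seconds")
        print(f"Average data load time: {sum(data_times[WARMUP_ITERS:]) / n:.4f} seconds")
    ```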
  2. Fill missing options in toml file with argparse defaults (pytorch#91)

    Summary:
    Follow up on config unification: options not available in the config file
    are picked from command-line defaults.
    
    Test Plan:
    ============================= test session starts ==============================
    platform linux -- Python 3.10.13, pytest-8.0.1, pluggy-1.4.0 -- /home/gnadathur/local/a/pytorch-env/bin/python
    cachedir: .pytest_cache
    rootdir: /data/users/gnadathur/a/torchtrain
    configfile: pyproject.toml
    plugins: cov-4.1.0
    collecting ... collected 3 items

    test/test_job_config.py::TestJobConfig::test_command_line_args PASSED [ 33%]
    test/test_job_config.py::TestJobConfig::test_job_config_file PASSED [ 66%]
    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist PASSED [100%]

    ---------- coverage: platform linux, python 3.10.13-final-0 ----------
    Coverage XML written to file coverage.xml

    ============================= slowest 20 durations =============================
    0.00s call     test/test_job_config.py::TestJobConfig::test_job_config_file
    0.00s call     test/test_job_config.py::TestJobConfig::test_command_line_args
    0.00s call     test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    0.00s setup    test/test_job_config.py::TestJobConfig::test_command_line_args
    0.00s teardown test/test_job_config.py::TestJobConfig::test_command_line_args
    0.00s setup    test/test_job_config.py::TestJobConfig::test_job_config_file
    0.00s setup    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    0.00s teardown test/test_job_config.py::TestJobConfig::test_job_config_file
    0.00s teardown test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    ============================== 3 passed in 0.06s ===============================
    
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Feb 26, 2024
    Commit: c866a64
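    A sketch of the fallback behavior described above, assuming a flat
    "section.option" key scheme (the real JobConfig plumbing differs): any option
    missing from the .toml file is filled in from the argparse default.

    ```
    import argparse
    import tomllib  # Python 3.11+; older versions can use the third-party `toml` package

    parser = argparse.ArgumentParser()
    parser.add_argument("--training.steps", type=int, default=100)
    parser.add_argument("--training.batch_size", type=int, default=8)
    defaults = vars(parser.parse_args([]))  # command-line defaults only

    def load_config(path: str) -> dict:
        with open(path, "rb") as f:
            file_cfg = tomllib.load(f)
        # flatten [section] tables into "section.option" keys (assumption about layout)
        flat = {f"{sec}.{k}": v for sec, kv in file_cfg.items() for k, v in kv.items()}
        # options absent from the file are picked up from the argparse defaults
        return {**defaults, **flat}

    # usage: cfg = load_config("./train_configs/debug_model.toml")
    ```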

Commits on Feb 27, 2024

  1. support infinite loop over alpaca dataset

    ghstack-source-id: 38cbc277e2a177bc0baf35450a661835b97a7f22
    Pull Request resolved: pytorch#92
    tianyu-l committed Feb 27, 2024
    Commit: 78a1643
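    A minimal sketch (not the exact torchtrain implementation) of the infinite-loop
    idea above: when the underlying samples are exhausted, start over instead of
    ending the epoch and killing a long training run.

    ```
    from torch.utils.data import DataLoader, IterableDataset

    class InfiniteDataset(IterableDataset):
        """Yields samples forever by restarting the underlying iterable."""

        def __init__(self, samples, infinite: bool = True):
            self.samples = samples
            self.infinite = infinite

        def __iter__(self):
            while True:
                yield from self.samples
                if not self.infinite:
                    break

    loader = DataLoader(InfiniteDataset(["a", "b", "c"]), batch_size=2)
    ```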
  2. Add color to console output if local logging, auto avoid color logging on slurm (pytorch#93)
    
    This PR adds the ability to do colored console outputs in order to
    highlight the training data outputs.
    It also adds a check to not use this color formatting on slurm, where it
    would print '33=' escape artifacts instead of the color if not avoided.
    
    Note that I've just added some color to highlight the main training
    data. Users that fork/clone can use it to enhance their outputs as
    desired.
    
    <img width="1372" alt="Screenshot 2024-02-26 at 10 20 15 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/44849821-1677-40bf-896c-39344cd661d6">
    
    
    Note that on slurm it remains plain:
    <img width="847" alt="Screenshot 2024-02-26 at 10 46 24 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/172eaa58-4f5c-48f5-8ec1-bc349e3e82f2">
    
    if you don't check this, it would otherwise look like this (this
    does not happen with this PR; just showing what happens without the check, and credit
    to Yifu for noting this would be an issue):
    <img width="847" alt="Screenshot 2024-02-26 at 10 39 23 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/4a87fb9a-dd3a-417c-a29e-286ded069358">
    lessw2020 authored Feb 27, 2024
    Commit: 6d9e4e6
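    A sketch of the color handling described above (the class names and the SLURM
    check are assumptions): use ANSI escape codes for local interactive runs, but
    print plain text under SLURM or when stdout is not a terminal, which is exactly
    the situation that otherwise produces stray escape artifacts in the logs.

    ```
    import os
    import sys

    class Color:
        blue, green, yellow, reset = "\x1b[34m", "\x1b[32m", "\x1b[33m", "\x1b[39m"

    class NoColor:
        blue = green = yellow = reset = ""

    def pick_color():
        on_slurm = "SLURM_JOB_ID" in os.environ
        return NoColor if on_slurm or not sys.stdout.isatty() else Color

    color = pick_color()
    print(f"{color.green}loss: 10.92{color.reset}  {color.yellow}lr: 0.0003{color.reset}")
    ```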
  3. update GPU metrics logging to GiB (gibibytes) (pytorch#95)

    this PR updates the GPU metrics to label them as GiB - we were
    calculating GiB but calling it GB.
    (credit to @awgu for flagging this - issue
    pytorch#94)
    
    function names and member vars in metrics.py have been updated to _gib
    instead of _gb for clarity, and the logging output now labels as GiB:
    <img width="851" alt="Screenshot 2024-02-27 at 11 28 23 AM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/85eb260a-77e9-4c49-be8a-b1aaa10dc3e2">
    lessw2020 authored Feb 27, 2024
    Commit: e987ac3
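    The relabeling above boils down to dividing byte counts by 1024**3 (binary
    gibibytes) and labeling the result GiB rather than GB; a small sketch:

    ```
    GIB = 1024 ** 3  # gibibyte

    def bytes_to_gib(num_bytes: int) -> float:
        return num_bytes / GIB

    # e.g. torch.cuda.get_device_properties(0).total_memory returns bytes;
    # a device with ~102.0e9 bytes of memory reports as ~95.04 GiB, not "102 GB"
    print(f"{bytes_to_gib(102_048_000_000):.2f} GiB")
    ```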
  4. improve TensorBoard instructions in README

    ghstack-source-id: 7dc4a80cf9c32f4dca3d00bcef019d256bdf58f7
    Pull Request resolved: pytorch#96
    tianyu-l committed Feb 27, 2024
    Commit: 62ff09d

Commits on Feb 28, 2024

  1. Enable libUV for torchtrain (pytorch#98)

    Enable libUV for torchtrain.
    
    Test:
    ```
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0,1
    + CONFIG_FILE=./train_configs/debug_model.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
    W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] 
    W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
    W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
    [rank0]:2024-02-28 09:12:04,581 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
    [rank1]:2024-02-28 09:12:04,708 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
    [rank0]:2024-02-28 09:12:05,647 - root - INFO - Building llama
    [rank0]:2024-02-28 09:12:05,655 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-02-28 09:12:05,655 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
    [rank1]:2024-02-28 09:12:07,299 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank1]:2024-02-28 09:12:07,299 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
    [rank0]:2024-02-28 09:12:07,565 - root - INFO - Model fully initialized via reset_params
    [rank0]:2024-02-28 09:12:07,566 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-02-28 09:12:07,566 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-02-28 09:12:07,567 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
    [rank0]:2024-02-28 09:12:08,769 - root - INFO - Applied FSDP to the model...
    [rank0]:2024-02-28 09:12:08,770 - root - INFO - Gradient scaling not enabled.
    [rank0]:2024-02-28 09:12:08,770 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240228-0912.
    [rank0]:2024-02-28 09:12:08,977 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
    [rank0]:2024-02-28 09:12:10,956 - root - INFO - step:  1  loss: 10.9229  iter:  1.9386  data: 0.0368  lr: 0.00026667
    [rank0]:2024-02-28 09:12:11,045 - root - INFO - step:  2  loss: 10.8673  iter:  0.0562  data: 0.0316  lr: 0.00053333
    [rank0]:2024-02-28 09:12:11,130 - root - INFO - step:  3  loss: 10.7145  iter:  0.0523  data: 0.0322  lr: 0.0008
    [rank0]:2024-02-28 09:12:11,219 - root - INFO - step:  4  loss: 10.5038  iter:  0.0559  data: 0.0319  lr: 0.0007
    [rank0]:2024-02-28 09:12:11,304 - root - INFO - step:  5  loss: 10.2228  iter:  0.0537  data: 0.031  lr: 0.0006
    [rank0]:2024-02-28 09:12:11,391 - root - INFO - step:  6  loss:  9.9677  iter:  0.0562  data: 0.0302  lr: 0.0005
    [rank0]:2024-02-28 09:12:11,478 - root - INFO - step:  7  loss:  9.7762  iter:  0.0544  data: 0.0317  lr: 0.0004
    [rank0]:2024-02-28 09:12:11,676 - root - INFO - step:  8  loss:  9.4359  iter:  0.0509  data: 0.0322  lr: 0.0003
    [rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
    [rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
    [rank0]:2024-02-28 09:12:11,813 - root - INFO - step:  9  loss:  9.2326  iter:  0.1007  data: 0.0321  lr: 0.0002
    [rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:320] Completed Stage: Collection
    [rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
    [rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:320] Completed Stage: Collection
    [rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
    [rank0]:2024-02-28 09:12:12,195 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
    [rank0]:2024-02-28 09:12:12,207 - root - INFO - step: 10  loss:  9.1641  iter:  0.0971  data: 0.031  lr: 0.0001
    [rank0]:2024-02-28 09:12:12,207 - root - INFO - Average iter time: 0.0670 seconds
    [rank0]:2024-02-28 09:12:12,207 - root - INFO - Average data load time: 0.0314 seconds
    [rank0]:2024-02-28 09:12:12,208 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
    [rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
    [rank0]:num retries: 0, num ooms: 0
    [rank0]:NCCL version 2.19.3+cuda12.0
    ```
    
    ---------
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Feb 28, 2024
    Commit: 60f6b0d

Commits on Feb 29, 2024

  1. use warmup steps for lr scheduler, ban steps == -1 (pytorch#99)

    as titled, we don't want to allow the steps == -1 case, as it would blow up
    the lr scheduler
    wanchaol authored Feb 29, 2024
    Commit: 7acab70
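    A sketch of a linear-warmup schedule of the kind referenced above (the shape
    after warmup is a placeholder; torchtrain's actual schedule may decay). It also
    shows why steps == -1 has to be banned: the scheduler needs a finite step count.

    ```
    import torch

    def build_lr_scheduler(optimizer, warmup_steps: int, total_steps: int):
        assert total_steps > 0, "steps == -1 (run forever) would blow up the lr scheduler"

        def lr_lambda(step: int) -> float:
            if step < warmup_steps:
                return (step + 1) / warmup_steps  # linear warmup
            return 1.0  # constant afterwards (assumption; the real schedule may decay)

        return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    model = torch.nn.Linear(8, 8)
    opt = torch.optim.AdamW(model.parameters(), lr=8e-4)
    sched = build_lr_scheduler(opt, warmup_steps=2, total_steps=10)
    for _ in range(10):
        opt.step()
        sched.step()
    ```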
  2. Add llama 7B config (pytorch#100)

    Add 7b config and adjust options to be more realistic
    
    didn't add this to the train scripts as the default since it's expensive to
    init; whoever uses it can adjust it accordingly
    wanchaol authored Feb 29, 2024
    Commit: d5c27a9
  3. add selective activation checkpointing

    ghstack-source-id: f7ee3c867bfcdcae5dbb490982920606191e6f40
    Pull Request resolved: pytorch#97
    tianyu-l committed Feb 29, 2024
    Commit: 2c8cec2

Commits on Mar 1, 2024

  1. Add job description field in toml (pytorch#101)

    Summary:
    Adding a description field, useful for integration tests to describe the
    test.
    
    Test Plan:
    ```
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0,1
    + CONFIG_FILE=./train_configs/debug_model.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
    W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] 
    W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
    W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
    [rank1]:2024-02-29 17:05:04,269 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
    [rank0]:2024-02-29 17:05:04,510 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
    [rank0]:2024-02-29 17:05:05,327 - root - INFO - Starting job: debug training
    [rank0]:2024-02-29 17:05:05,327 - root - INFO - Building llama
    [rank0]:2024-02-29 17:05:05,335 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-02-29 17:05:05,335 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
    [rank1]:2024-02-29 17:05:06,782 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank1]:2024-02-29 17:05:06,782 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
    [rank0]:2024-02-29 17:05:07,347 - root - INFO - Model fully initialized via reset_params
    [rank0]:2024-02-29 17:05:07,349 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-02-29 17:05:07,349 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-02-29 17:05:07,349 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
    [rank0]:2024-02-29 17:05:08,375 - root - INFO - Applied FSDP to the model...
    [rank0]:2024-02-29 17:05:08,376 - root - INFO - Gradient scaling not enabled.
    [rank0]:2024-02-29 17:05:08,376 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240229-1705.
    [rank0]:2024-02-29 17:05:08,610 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
    [rank0]:2024-02-29 17:05:10,570 - root - INFO - step:  1  loss: 10.9183  iter:  1.9258  data: 0.0303  lr: 0.00026667
    [rank0]:2024-02-29 17:05:10,653 - root - INFO - step:  2  loss: 10.8347  iter:  0.0487  data: 0.0336  lr: 0.00053333
    [rank0]:2024-02-29 17:05:10,733 - root - INFO - step:  3  loss: 10.6861  iter:   0.045  data: 0.0334  lr: 0.0008
    [rank0]:2024-02-29 17:05:10,812 - root - INFO - step:  4  loss: 10.4672  iter:  0.0453  data: 0.0336  lr: 0.0007
    [rank0]:2024-02-29 17:05:10,893 - root - INFO - step:  5  loss: 10.2154  iter:  0.0466  data: 0.033  lr: 0.0006
    [rank0]:2024-02-29 17:05:10,975 - root - INFO - step:  6  loss:  9.9573  iter:  0.0496  data: 0.0314  lr: 0.0005
    [rank0]:2024-02-29 17:05:11,056 - root - INFO - step:  7  loss:  9.7627  iter:  0.0486  data: 0.0321  lr: 0.0004
    [rank0]:2024-02-29 17:05:11,201 - root - INFO - step:  8  loss:   9.437  iter:  0.0457  data: 0.0333  lr: 0.0003
    [rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
    [rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
    [rank0]:2024-02-29 17:05:11,317 - root - INFO - step:  9  loss:  9.2446  iter:  0.0794  data: 0.0324  lr: 0.0002
    [rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:320] Completed Stage: Collection
    [rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
    [rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:320] Completed Stage: Collection
    [rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
    [rank0]:2024-02-29 17:05:11,748 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
    [rank0]:2024-02-29 17:05:11,762 - root - INFO - step: 10  loss:  9.1772  iter:  0.0893  data: 0.0324  lr: 0.0001
    [rank0]:2024-02-29 17:05:11,763 - root - INFO - Average iter time: 0.0578 seconds
    [rank0]:2024-02-29 17:05:11,763 - root - INFO - Average data load time: 0.0326 seconds
    [rank0]:2024-02-29 17:05:11,763 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
    [rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
    [rank0]:num retries: 0, num ooms: 0
    [rank0]:NCCL version 2.19.3+cuda12.0
    ```
    
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Mar 1, 2024
    Commit: 452baee

Commits on Mar 2, 2024

  1. fix 2D parallel crash caused by all-reduce on 2D world_mesh

    ghstack-source-id: 1c5bf790d7473f6a24124051fcfa1fd2585a56f9
    Pull Request resolved: pytorch#105
    tianyu-l committed Mar 2, 2024
    Commit: eb3fdd0

Commits on Mar 5, 2024

  1. Load missing keys default from argparse (pytorch#111)

    ```
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0,1
    + CONFIG_FILE=./train_configs/debug_model.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
    W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] 
    W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
    W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
    [rank0]:2024-03-04 17:01:28,834 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
    [rank1]:2024-03-04 17:01:28,857 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
    [rank0]:2024-03-04 17:01:29,712 - root - INFO - Starting job: debug training
    [rank0]:2024-03-04 17:01:29,712 - root - INFO - Building llama
    [rank0]:2024-03-04 17:01:29,719 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-03-04 17:01:29,719 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
    [rank1]:2024-03-04 17:01:31,187 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank1]:2024-03-04 17:01:31,188 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
    [rank0]:2024-03-04 17:01:31,346 - root - INFO - Model fully initialized via reset_params
    [rank0]:2024-03-04 17:01:31,346 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-03-04 17:01:31,347 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-03-04 17:01:31,347 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
    [rank0]:2024-03-04 17:01:32,502 - root - INFO - Applied FSDP to the model...
    [rank0]:2024-03-04 17:01:32,503 - root - INFO - Gradient scaling not enabled.
    [rank0]:2024-03-04 17:01:32,504 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240304-1701.
    [rank0]:2024-03-04 17:01:32,901 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
    [rank0]:2024-03-04 17:01:34,806 - root - INFO - step:  1  loss: 10.8424  iter:  1.8688  data: 0.0316  lr: 0.00026667
    [rank0]:2024-03-04 17:01:34,891 - root - INFO - step:  2  loss: 10.7581  iter:  0.0476  data: 0.0357  lr: 0.00053333
    [rank0]:2024-03-04 17:01:34,970 - root - INFO - step:  3  loss: 10.6239  iter:   0.045  data: 0.0333  lr: 0.0008
    [rank0]:2024-03-04 17:01:35,048 - root - INFO - step:  4  loss: 10.4163  iter:  0.0455  data: 0.0323  lr: 0.0007
    [rank0]:2024-03-04 17:01:35,127 - root - INFO - step:  5  loss: 10.1529  iter:  0.0459  data: 0.032  lr: 0.0006
    [rank0]:2024-03-04 17:01:35,206 - root - INFO - step:  6  loss:  9.8899  iter:  0.0468  data: 0.0311  lr: 0.0005
    [rank0]:2024-03-04 17:01:35,284 - root - INFO - step:  7  loss:  9.7204  iter:  0.0461  data: 0.0312  lr: 0.0004
    [rank0]:2024-03-04 17:01:35,425 - root - INFO - step:  8  loss:  9.3757  iter:  0.0457  data: 0.0319  lr: 0.0003
    [rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
    [rank0]:2024-03-04 17:01:35,537 - root - INFO - step:  9  loss:  9.1883  iter:  0.0762  data: 0.0318  lr: 0.0002
    [rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
    [rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:320] Completed Stage: Collection
    [rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
    [rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:320] Completed Stage: Collection
    [rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
    [rank0]:2024-03-04 17:01:35,958 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
    [rank0]:2024-03-04 17:01:35,971 - root - INFO - step: 10  loss:  9.1212  iter:  0.0808  data: 0.0319  lr: 0.0001
    [rank0]:2024-03-04 17:01:35,972 - root - INFO - Average iter time: 0.0553 seconds
    [rank0]:2024-03-04 17:01:35,972 - root - INFO - Average data load time: 0.0317 seconds
    [rank0]:2024-03-04 17:01:35,972 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
    [rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
    [rank0]:num retries: 0, num ooms: 0
    [rank0]:NCCL version 2.19.3+cuda12.0
    ```
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Mar 5, 2024
    Commit: 2682144
  2. Add meta_init, enable it as default init process (pytorch#84)

    This PR enables meta_init functionality to avoid OOM'ing on cpu for
    larger models.
    The core functionality is in meta_init.py, and a few changes in
    parallelization and train.py.
    Key items:
    1 - this is largely the same as the earlier PR I had for meta_init, but
    I did a new one b/c faster than reworking it with all the interim
    changes.
    2 - to address feedback in previous PR:
    a - why do we need meta_init.py, can't we just do:
    ~~~
    with torch.device("meta"):
        model = Model.from_args(...)
    ~~~
    Unfortunately this does not work b/c the rope embeddings are treated
    differently (buffer) and thus the simple lambda call from param_init_fn
    in FSDP (lambda module: module.to_device('cuda') ) will not invoke or
    move the rope embeddings and the model will fail on first forward.
    This issue relates to the nn.embeddings not being moved, and that the
    device is referenced in the forward pass for the current rope class.
    I have opened pytorch#110 to track
    this and investigate, while not holding up the working meta init from
    landing.
    
    b - per earlier feedback - meta init is now 'not optional' but simply
    the default. This should ensure all models leverage it and ensure we
    aren't missing things for future meta_init aspects.
    
    3 - misc change - I switched the model_params to just do the normal all
    params count instead of 'unique params' b/c it does not mesh with what
    people perceive model size as.
    
    Testing:
    tested both debugmodel and 26B model with and without meta init to
    confirm same loss curves.
    Note for future reference - if you get a bad init (meta init failure)
    you will simply not train (loss is same every iter).
    If you fail to call reset params after FSDP, then you will train (b/c we
    default to torch.randn_like) but your starting loss will be 5x+ higher
    (telling you that you have not properly init'ed the model).
    lessw2020 authored Mar 5, 2024
    Commit: afbf62a
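    A generic sketch of the meta-device pattern discussed above (this is not
    torchtrain's meta_init.py; the module here is a stand-in): parameters are
    created on the meta device so nothing is allocated, then materialized on the
    target device and explicitly re-initialized. Buffers such as the rope
    embeddings need the same treatment, which is the subtlety the PR calls out.

    ```
    import torch
    import torch.nn as nn

    def build_model_meta_then_materialize(device: str = "cpu") -> nn.Module:
        # construct on the meta device: shapes only, no real memory allocated
        with torch.device("meta"):
            model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024))
        # allocate real (uninitialized) storage on the target device
        model = model.to_empty(device=device)
        # parameters/buffers now hold garbage and must be re-initialized
        for module in model.modules():
            if hasattr(module, "reset_parameters"):
                module.reset_parameters()
        return model

    model = build_model_meta_then_materialize()
    ```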
  3. Fix feedback from PR 111 (pytorch#113)

    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Mar 5, 2024
    Commit: f91f97a

Commits on Mar 6, 2024

  1. fix SP minor issues

    ghstack-source-id: 5133a8d97ad209b569e0fc528e58daafdd31d80d
    Pull Request resolved: pytorch#114
    tianyu-l committed Mar 6, 2024
    Commit: 1a180ee
  2. enable loss parallel in SP

    ghstack-source-id: a0c8b4454f75ad1cd9824ac89a1df0182f6a7d8c
    Pull Request resolved: pytorch#112
    tianyu-l committed Mar 6, 2024
    Commit: ed04380
  3. Commit: 41f5172

Commits on Mar 7, 2024

  1. add miniPile dataset for pretraining, 1M entries (solves the 'out of data' at 40 iters issue) (pytorch#88)
    
    This PR adds the minipile (1M, 6GB) dataset as an option for pretraining
    with torchtrain.
    It resolves the issue where we run out of data after 40 iterations with
    the default alpaca dataset.
    Per @tianyu-l's excellent suggestion, have refactored to have a single
    hf_datasets.py file that supports both minipile and alpaca, since it
    turned out there is no need for a different tokenizer, etc.
    Also cleaned up the datasets package so that create_tokenizer is exposed
    directly, and thus all public apis can be used directly from
    torchtrain.datasets.
    Lastly - added warning if/when a dataset is being re-looped so users
    don't get burned by overfitting:
    <img width="1294" alt="Screenshot 2024-03-06 at 5 11 09 AM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/82480b6f-c677-4794-80c5-5c10b037732a">
    
    
    Adds a color highlight to showcase what dataloader was built:
    <img width="1360" alt="Screenshot 2024-03-05 at 9 19 10 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/4717ec6a-14bb-4283-a3ae-fa40c27deee0">
    and
    <img width="1360" alt="Screenshot 2024-03-05 at 9 22 01 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/dbf32d51-2dd4-4526-8855-9b33b627559e">
    
    
    Usage:
    just add "minipile" or "alpaca" as the dataset in the training config
    toml file.
    <img width="439" alt="Screenshot 2024-02-25 at 12 35 26 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/1afbaed1-07f8-4e37-b8cc-80190db7fb27">
    
    Testing:
    verified training loss is improving and ran for 100 iters to verify there is
    no longer an out-of-data issue with minipile.
    reran with alpaca and saw the expected out-of-data at 40 iters without the
    infinite loop option; it runs to 100 with infinite.
    
    Notes:
    I did not make this a default dataset since for debugmodel, mostly
    running 10 iters is fine and there's 6GB to pull down.
    <img width="869" alt="Screenshot 2024-02-25 at 12 30 29 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/1070a80a-ad20-4f0f-a860-e13caa3120a0">
    lessw2020 authored Mar 7, 2024
    Commit: 680f1aa
  2. add data loading option to load from local file system

    ghstack-source-id: 3c930054d3b04faf3866048740a2ef887d066dd6
    Pull Request resolved: pytorch#117
    tianyu-l committed Mar 7, 2024
    Commit: 85263f7

Commits on Mar 9, 2024

  1. add llama 13B configs

    ghstack-source-id: 733bf85716cda3a5b9af780eba79c9b5dd66abad
    Pull Request resolved: pytorch#121
    wanchaol committed Mar 9, 2024
    Commit: 3c51744
  2. add llama 70B toml

    ghstack-source-id: d7cd26d84aa2442ac45223992e1766397e52c8d8
    Pull Request resolved: pytorch#122
    wanchaol committed Mar 9, 2024
    Commit: 649cf0b
  3. set betas and weight decay for optimizers

    according to suggestions in pytorch#118 (comment)
    
    ghstack-source-id: 357f0872cd1c9bad2c4c256d47adbd3f716a7651
    Pull Request resolved: pytorch#123
    wanchaol committed Mar 9, 2024
    Commit: ab05f66
  4. Add c4 dataset (177M, streaming), update multi-node support for latest job configs (pytorch#124)
    
    This PR:
    1 - adds the english language portion of c4 dataset, which has 177M
    entries. (https://huggingface.co/datasets/allenai/c4)
    
    Due to the size, streaming is enabled as the default.  
    This is the allen-ai/c4, as apparently the original c4 is being
    deprecated and HF advises to use allen-ai now.
    
    For comparison per @tianyu-l request - 40 iterations avg time:
    alpaca cached loading: Average data load time: 0.0279 seconds
    c4 streaming loading: Average data load time: 0.0290 seconds
    
    There is a longer initial delay during the 'preparing c4' vs alpaca
    (i.e. 45 seconds vs 10 seconds), but after that speed is similar.
    
    Dataset sample (not displayed in training, just an excerpt I pulled to
    double check the data flow):
    <img width="1233" alt="Screenshot 2024-03-08 at 5 31 06 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/94915f80-da70-48d1-8c43-43f874fef121">
    
    2 - I also updated the multi-node slurm file to account for the new job
    config.
    
    Test:
    verified no looping with 100 iterations, 
    sampled data streamed to verify.
    lessw2020 authored Mar 9, 2024
    Commit: 66c196b
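    A sketch of streaming the allenai/c4 English config as described above
    (requires the Hugging Face `datasets` package and network access):

    ```
    from datasets import load_dataset

    # streaming=True avoids downloading the full 177M-entry dataset up front
    ds = load_dataset("allenai/c4", name="en", split="train", streaming=True)

    for i, sample in enumerate(ds):
        print(sample["text"][:80])
        if i == 2:
            break
    ```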

Commits on Mar 12, 2024

  1. Add openwebtext dataset for larger scale training without shuffling (pytorch#130)
    
    This PR adds the openwebtext 1M dataset.
    This is a homogeneous dataset, so we are able to train successfully while
    not having any shuffling in our dataset loader.

    1 - adds the dataset to hf_datasets
    2 - makes openwebtext the default dataset for 13b and 70b, since that
    is the preferred choice for larger scale training.
    
    Testing - ran 5K iters (9 nodes) to verify no spiking issues:
    
    <img width="787" alt="Screenshot 2024-03-12 at 9 50 57 AM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/420fa1fc-50f8-47bc-9b07-02c8fa132e7c">
    lessw2020 authored Mar 12, 2024
    Commit: 10229d6
  2. [TorchTrain][Checkpoint] Fix TrainState state_dict to unblock loading (pytorch#131)
    
    This fix would temporarily unblock loading. So we won't run into the
    issue of:
    
    ```
    [rank0]:[rank0]:     train_state.losses.append(train_state.current_loss)
    [rank0]:[rank0]: AttributeError: 'float' object has no attribute 'append'
    ```
    
    However, current_loss and losses are still not correct, since with the current
    setup, losses and current_loss would be different across different
    ranks. Also, we don't know the size of losses because it is based on
    the # of steps. So loading still works, but the values of current_loss and
    losses are not being loaded correctly.
    
    I will follow up with further fixes.
    wz337 authored Mar 12, 2024
    Commit: 7fee3cf

Commits on Mar 13, 2024

  1. improve logging

    ghstack-source-id: de61ec093b43a2ccbf1156c76ba81ecd698a6a8a
    Pull Request resolved: pytorch#132
    tianyu-l committed Mar 13, 2024
    Commit: 7cd2725
  2. use SequenceParallel style in tp/sp (pytorch#133)

    simplify things given we already have SequenceParallel style landed in
    main
    wanchaol authored Mar 13, 2024
    Commit: 3161ffb

Commits on Mar 14, 2024

  1. support TP-only parallelism

    ghstack-source-id: c13ebb8de8e8e9203624b5dd710a046d17311b0f
    Pull Request resolved: pytorch#137
    tianyu-l committed Mar 14, 2024
    Commit: e39ee7e
  2. disable verbose print from profiling

    ghstack-source-id: ca6eb8f42bf3c2a59d8e6389e7fe94ed85103099
    Pull Request resolved: pytorch#136
    tianyu-l committed Mar 14, 2024
    Commit: 5d18bf0
  3. add Selective layer activation checkpointing, single control for turning AC on or off. (pytorch#125)
    
    This PR:
    1 - adds selective layer checkpointing - this lets the user select every
    x layer to checkpoint:
    i.e. 2 = every other layer is checkpointed.
    
    spec for config was updated by Wanchao - so we now have this layout for
    AC, which is hopefully self-explanatory (covers None, full, Selective Op
    or Selective Layer, and the layer filtering policy):
    <img width="941" alt="Screenshot 2024-03-13 at 6 09 52 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/4b992286-1fbd-4a14-957a-4325f81a9ab4">
    
    
    Thus, it lets the user toggle between the traditional 'all layers' and more and
    more fine-grained checkpointing.
    Note that I implemented this for IBM last summer and in their llama
    testing, every 2nd layer was the best bang/buck so I have made that the
    default.
    
    2 - the config file has been updated to make a new
    [activation_checkpointing] section and make it easier to modify vs being
    dumped into the training section.
    
    Testing and results:
    I tested all the AC options to ensure all options are working, and that
    we fail if both types are set to true in config:
    <img width="608" alt="Screenshot 2024-03-09 at 3 43 52 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/e3c20fbf-73e2-492d-9fb9-f32e772e239e">
    lessw2020 authored Mar 14, 2024
    Commit: 0d415d7
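    A sketch of the every-x-layer policy described above, using PyTorch's
    checkpoint_wrapper utility (the torchtrain config plumbing and defaults
    differ): with ac_freq=2, every other block gets activation checkpointing.

    ```
    import torch.nn as nn
    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        checkpoint_wrapper,
    )

    def apply_selective_layer_ac(layers: nn.ModuleList, ac_freq: int = 2) -> nn.ModuleList:
        wrapped = nn.ModuleList()
        for idx, layer in enumerate(layers):
            if ac_freq > 0 and idx % ac_freq == 0:
                # recompute this block's activations during backward
                layer = checkpoint_wrapper(layer)
            wrapped.append(layer)
        return wrapped

    blocks = nn.ModuleList(nn.Linear(64, 64) for _ in range(4))
    blocks = apply_selective_layer_ac(blocks, ac_freq=2)  # wraps blocks 0 and 2
    ```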
  4. remove per iter syncronize

    ghstack-source-id: 581c9115e89d3de57e558175b527c12c06a6808c
    Pull Request resolved: pytorch#134
    tianyu-l committed Mar 14, 2024
    Commit: cc2061a

Commits on Mar 15, 2024

  1. Shorten nccl comm timeout and enable flight recorder dumping (pytorch#103)
    
    Timeout
    -------
    
    It's convenient whether during iterative debugging or long running
    training to find out asap about a failure. The default timeout is way
    too long and leads to wasted cluster time or developer frustration.
      
    Timeout can be adjusted via cmdline or in .toml if it needs to be larger
    for a particular model.
    
    Another useful pattern can be to set a large timeout for initialization
    and then tighten it after iteration 1. We can add this later if desired.
    
    Ideally we could pass the timeout to the device mesh ctor, but it's not
    ready yet. Also, we can change timeouts of the existing PGs after
    creating them, but that's more LOC and not necessary unless we want to
    change the timeouts at runtime.
    
    Dumps
    -----
    
    Dumping on timeout should be a safe default for everyone. It has the
    side-effect of requiring a dump path which defaults to ~/pgnccl_dump but
    can be overridden via DUMP_PATH env.
    
    The raw content of the dump is a pickle that is intended to be consumed
    through scripts/tools which are under development, so it may not be easy
    to know how to use these for now. As the tooling matures, we should
    provide reference docs and probably print out pointers in the logs when
    we perform the dump.
    
    
    Test plan:
    tested locally by adding a rank0 sleep for 10sec inside the training
    loop, validating all 8 ranks dumped a trace.
    wconstab authored Mar 15, 2024
    Commit: 3b3362b
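    A sketch of the two knobs described above. Passing a timedelta to
    init_process_group is standard torch.distributed API; the TORCH_NCCL_* env vars
    shown for dump-on-timeout are an assumption about recent PyTorch flight-recorder
    knobs, and the PR itself wires the dump path and timeout through DUMP_PATH and
    the job config instead.

    ```
    import os
    from datetime import timedelta

    import torch.distributed as dist

    # flight-recorder knobs (treat the exact names as an assumption here)
    os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000")  # record recent collectives
    os.environ.setdefault("TORCH_NCCL_DUMP_ON_TIMEOUT", "1")       # dump them on watchdog timeout

    def init_distributed(timeout_seconds: int = 300) -> None:
        # much shorter than the default NCCL timeout, so hangs surface quickly
        dist.init_process_group(backend="nccl", timeout=timedelta(seconds=timeout_seconds))
    ```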
  2. fix up gpu memory monitoring and logging

    ghstack-source-id: 2f79d081c7724dbc34f357913671e8aefdf437b1
    Pull Request resolved: pytorch#147
    tianyu-l committed Mar 15, 2024
    Commit: 9f5a56d
  3. Separate timeout during init and training (pytorch#149)

    Allow a tighter timeout during training than during init.
    
    Init includes the first train step, as well as any loading and setup. It
    can be slower and less predictable due to various factors including lazy
    initialization or jit compilation.
    
    After the first train step, we expect more predictable runtime and
    benefit from a tighter timeout to give quick feedback on a hang.
    
    Tested by pasting this code in 2 places
    ```
    if dp_mesh.get_local_rank() == 0 and train_state.step == 1:
       import time
       time.sleep(10)
    ```
    
    (a) before calling set_pg_timeout, which did not cause a timeout (b)
    after calling set_pg_timeout, which timed out
    wconstab authored Mar 15, 2024
    Commit: 9eb6a21

Commits on Mar 20, 2024

  1. Commit: 6485be9
  2. Refactor to clean up parallelisms/__init__.py

    (second attempt, didn't land correctly)
    
    ghstack-source-id: 3dfec3ed134105cc5a951f8db160c8c2a9b3349b
    Pull Request resolved: pytorch#154
    wconstab committed Mar 20, 2024
    Commit: fd4c75b
  3. enable gc control scheduling to help avoid stragglers (pytorch#148)

    This PR adds control over Python garbage collection to help avoid
    stragglers during large scale training.
    updates - this feature is now exposed as a controllable option
    gc_schedule, with a default of 50.
    0 = not enabled.
    int = schedules gc at every int iters during training loop. 
    <img width="1078" alt="Screenshot 2024-03-15 at 12 39 26 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/1ee387c5-f0a6-4366-936c-a1e281dad88f">
    
    Effectively we disable the gc, run one collection to ensure a good
    starting point, and then at the start of each gc_schedule iter, we call
    gc to free up things.
    
    By enforcing a fixed schedule for collection, it helps all ranks stay
    more in synch.
    Point of reference - on 512 GPU FSDP, adding this (gc_schedule=1) gave a
    perf boost of ~1.5% per iter just by virtue of better synch.
    
    (this was originally developed during dist compiler to resolve
    stragglers, I believe @fegin came up with this solution).
    lessw2020 authored Mar 20, 2024
    Commit: 93c2b7d
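    A sketch of the gc_schedule mechanism described above (class and option names
    are illustrative): automatic collection is disabled, one collection runs up
    front, and collection then happens on a fixed step schedule so every rank pays
    the GC cost at the same time.

    ```
    import gc

    class GCScheduler:
        def __init__(self, gc_freq: int = 50):
            self.gc_freq = gc_freq  # 0 = feature disabled
            if gc_freq > 0:
                gc.disable()     # no more unpredictable automatic collections
                gc.collect(1)    # one collection up front for a clean starting point

        def run(self, step: int) -> None:
            if self.gc_freq > 0 and step % self.gc_freq == 0:
                gc.collect(1)

    scheduler = GCScheduler(gc_freq=50)
    for step in range(1, 201):
        scheduler.run(step)
        # ... training iteration would go here ...
    ```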
  4. Commit: 9e7920f
  5. add MFU to metrics

    ghstack-source-id: 995efd6f460f3fe83ecf8d72c2178554f325485b
    Pull Request resolved: pytorch#151
    tianyu-l committed Mar 20, 2024
    Commit: e5d1b89
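    A back-of-the-envelope sketch of how an MFU metric can be computed, assuming
    the common ~6 FLOPs per parameter per token estimate for a forward+backward
    pass; torchtrain's exact formula also counts attention FLOPs, so its numbers
    come out slightly higher.

    ```
    def estimate_mfu(num_params: int, tokens_per_second: float, peak_flops: float) -> float:
        achieved_flops = 6 * num_params * tokens_per_second  # fwd + bwd approximation
        return achieved_flops / peak_flops

    # the 18,089,216-parameter debug model at ~20,066 wps against an H100's
    # ~989e12 peak bf16 FLOPS gives ~0.22%, the same ballpark as the 0.25%
    # reported in the integration-test logs later in this PR
    print(f"mfu: {100 * estimate_mfu(18_089_216, 20_066, 989e12):.2f}%")
    ```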

Commits on Mar 21, 2024

  1. disable buffer reuse for compile for now (pytorch#156)

    disable buffer reuse for compile to have close numerics to eager mode,
    as suggested by @Chillee
    
    This is probably only a temporary change until the buffer reuse fix lands in inductor
    wanchaol authored Mar 21, 2024
    Commit: ceebd53

Commits on Mar 22, 2024

  1. refactor config manager and support cmd overrides (pytorch#157)

    This PR supports explicit cmd overrides, to allow infra layers to
    override certain options (the most important one is dump_folder)
    wanchaol authored Mar 22, 2024
    Commit: 32aa083

Commits on Mar 24, 2024

  1. Commit: a21645e

Commits on Mar 25, 2024

  1. rename sequence_parallel to tensor_parallel (pytorch#162)

    This PR renames sequence_parallel to tensor_parallel. As sequence
    parallel is only applied to rmsnorm layers, a broader name should be
    tensor_parallel, maybe with sequence_parallel enabled.
    
    ghstack broken :( so using direct branch push instead
    wanchaol authored Mar 25, 2024
    Commit: e28832e

Commits on Mar 27, 2024

  1. add basic AC configs for 13B and 70B (pytorch#169)

    as titled, currently 13B uses selective op and 70B uses selective layer;
    we can do some more experiments and adjust the configs later
    wanchaol authored Mar 27, 2024
    Commit: 6722657
  2. [TorchTrain][Checkpoint] Update train state to include global_avg_losses and global_max_losses (pytorch#167)
    
    Based on discussion with @tianyu-l, we decided to only checkpoint
    `global_avg_losses` and `global_max_losses` per log frequency iteration
    to avoid all_reduce and device sync every iteration.
    `TrainState.current_loss` and `TrainState.losses` are removed from
    TrainState `state_dict()` and `load_state_dict()` call.
    
    
    Tested by saving/loading at 30 steps with log_frequency = 10, then
    loading at 40 steps to resume training. The numerics in the
    global_avg_losses and global_max_losses lists align with
    what is expected.
    
    ```
    Step 30 save:
    [rank0]:before save: 
    self.states['train_state']=TrainState(step=30, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555], log_steps=[1, 11, 21])
    
    
    Step 30 load:
    [rank0]:after load:
    self.states['train_state']=TrainState(step=30, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555], log_steps=[1, 11, 21])
    
    
    Step 40 load and resume training:
    [rank0]:before save: 
    self.states['train_state']=TrainState(step=40, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945, 5.596909999847412], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555, 5.6796345710754395], log_steps=[1, 11, 21, 31])
    ```
    wz337 authored Mar 27, 2024
    Commit: c49cc9e
  3. Basic integration test infra (pytorch#170)

    Summary:
    This PR adds an option `use_for_integration_test`. When set to `True`, this
    adds the config to the integration test suite. A test runner picks up all
    the configs marked for integration test and runs them.
    
    Test Plan:
    ```
    =====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh=====
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0
    + CONFIG_FILE=./train_configs/debug_model.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
    W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757]
    W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
    W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
    [rank0]:2024-03-27 09:46:32,214 - root - INFO - Starting job: LLaMA debug training
    [rank0]:2024-03-27 09:46:32,372 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank0]:2024-03-27 09:46:32,375 - root - INFO - Building 1-D device mesh with ['dp'], [4]
    [rank0]:2024-03-27 09:46:32,377 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-03-27 09:46:32,384 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank0]:2024-03-27 09:46:32,384 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank0]:2024-03-27 09:46:34,015 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-03-27 09:46:34,024 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-03-27 09:46:34,025 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
    [rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied selective activation checkpointing to the model
    [rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied FSDP to the model
    [rank0]:2024-03-27 09:46:34,171 - root - INFO - Model fully initialized via reset_parameters
    [rank0]:2024-03-27 09:46:34,171 - root - INFO - Gradient scaling not enabled
    [rank0]:2024-03-27 09:46:34,171 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-0946
    [rank0]:2024-03-27 09:46:34,809 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
    [rank0]:  warnings.warn(
    [rank0]:2024-03-27 09:46:35,627 - root - INFO - step:  1  loss: 10.9486  memory:  9.42GiB(9.91%)  wps: 20,066  mfu: 0.25%
    [rank0]:2024-03-27 09:46:35,627 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
    [rank0]:2024-03-27 09:46:35,705 - root - INFO - step:  2  loss: 10.8786  memory: 11.38GiB(11.97%)  wps: 212,046  mfu: 2.60%
    [rank0]:2024-03-27 09:46:35,786 - root - INFO - step:  3  loss: 10.7362  memory: 11.38GiB(11.97%)  wps: 204,441  mfu: 2.50%
    [rank0]:2024-03-27 09:46:35,863 - root - INFO - step:  4  loss: 10.5094  memory: 11.38GiB(11.97%)  wps: 216,800  mfu: 2.66%
    [rank0]:2024-03-27 09:46:35,939 - root - INFO - step:  5  loss: 10.2755  memory: 11.38GiB(11.97%)  wps: 216,527  mfu: 2.65%
    [rank0]:2024-03-27 09:46:36,016 - root - INFO - step:  6  loss: 10.0318  memory: 11.38GiB(11.97%)  wps: 214,117  mfu: 2.62%
    [rank0]:2024-03-27 09:46:36,093 - root - INFO - step:  7  loss:  9.7929  memory: 11.38GiB(11.97%)  wps: 216,509  mfu: 2.65%
    [rank0]:2024-03-27 09:46:36,192 - root - INFO - step:  8  loss:  9.5539  memory: 11.38GiB(11.97%)  wps: 166,639  mfu: 2.04%
    [rank0]:2024-03-27 09:46:36,329 - root - INFO - step:  9  loss:  9.3909  memory: 11.38GiB(11.97%)  wps: 120,381  mfu: 1.47%
    [rank0]:[rank0]:[W327 09:46:36.744143018 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:2024-03-27 09:46:36,409 - root - INFO - �[36mstep: 10  �[32mloss:  9.2749  �[33mmemory: 11.38GiB(11.97%)  �[34mwps: 207,613  �[35mmfu: 2.54%�[39m
    [rank0]:NCCL version 2.20.5+cuda12.0
    
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    
    ---------
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Mar 27, 2024
    2b017fd
  4. Add 2D integration test (FSDP + TP) (pytorch#171)

    Summary:
    Add a 2D test to integration test suite
    
    Test Plan:
    
    ```
    
    =====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh=====
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0
    + CONFIG_FILE=./train_configs/debug_model.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
    W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757]
    W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] *****************************************
    W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] *****************************************
    [rank0]:2024-03-27 14:29:49,466 - root - INFO - Starting job: LLaMA debug training
    [rank0]:2024-03-27 14:29:49,615 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank0]:2024-03-27 14:29:49,621 - root - INFO - Building 1-D device mesh with ['dp'], [4]
    [rank0]:2024-03-27 14:29:49,623 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-03-27 14:29:49,630 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank0]:2024-03-27 14:29:49,630 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank0]:2024-03-27 14:29:51,114 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-03-27 14:29:51,124 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-03-27 14:29:51,124 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
    [rank0]:2024-03-27 14:29:51,259 - root - INFO - Applied selective activation checkpointing to the model
    [rank0]:2024-03-27 14:29:51,259 - root - INFO - Applied FSDP to the model
    [rank0]:2024-03-27 14:29:51,284 - root - INFO - Model fully initialized via reset_parameters
    [rank0]:2024-03-27 14:29:51,284 - root - INFO - Gradient scaling not enabled
    [rank0]:2024-03-27 14:29:51,285 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-1429
    [rank0]:2024-03-27 14:29:52,056 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
    [rank0]:  warnings.warn(
    [rank0]:2024-03-27 14:29:52,825 - root - INFO - step:  1  loss: 10.7425  memory:  9.42GiB(9.91%)  wps: 21,337  mfu: 0.26%
    [rank0]:2024-03-27 14:29:52,825 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
    [rank0]:2024-03-27 14:29:52,905 - root - INFO - step:  2  loss: 10.6722  memory: 11.38GiB(11.97%)  wps: 208,060  mfu: 2.55%
    [rank0]:2024-03-27 14:29:52,982 - root - INFO - step:  3  loss: 10.5435  memory: 11.38GiB(11.97%)  wps: 213,622  mfu: 2.62%
    [rank0]:2024-03-27 14:29:53,060 - root - INFO - step:  4  loss: 10.3359  memory: 11.38GiB(11.97%)  wps: 212,856  mfu: 2.61%
    [rank0]:2024-03-27 14:29:53,139 - root - INFO - step:  5  loss: 10.0965  memory: 11.38GiB(11.97%)  wps: 209,326  mfu: 2.56%
    [rank0]:2024-03-27 14:29:53,215 - root - INFO - step:  6  loss:  9.8806  memory: 11.38GiB(11.97%)  wps: 216,808  mfu: 2.66%
    [rank0]:2024-03-27 14:29:53,292 - root - INFO - step:  7  loss:  9.6442  memory: 11.38GiB(11.97%)  wps: 214,874  mfu: 2.63%
    [rank0]:2024-03-27 14:29:53,367 - root - INFO - step:  8  loss:  9.4349  memory: 11.38GiB(11.97%)  wps: 220,877  mfu: 2.70%
    [rank0]:2024-03-27 14:29:53,500 - root - INFO - step:  9  loss:  9.2674  memory: 11.38GiB(11.97%)  wps: 123,924  mfu: 1.52%
    [rank0]:[rank0]:[W327 14:29:53.248291822 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:2024-03-27 14:29:53,577 - root - INFO - step: 10  loss:  9.1404  memory: 11.38GiB(11.97%)  wps: 214,910  mfu: 2.63%
    [rank0]:NCCL version 2.20.5+cuda12.0
    
    =====Integration test: CONFIG_FILE=./train_configs/debug_model_2d.toml NGPU=4 ./run_llama_train.sh=====
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0
    + CONFIG_FILE=./train_configs/debug_model_2d.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_2d.toml
    W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757]
    W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] *****************************************
    W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] *****************************************
    [rank0]:2024-03-27 14:30:00,872 - root - INFO - Starting job: LLaMA debug training
    [rank0]:2024-03-27 14:30:01,177 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank0]:2024-03-27 14:30:01,182 - root - INFO - Building 2-D device mesh with ['dp', 'tp'], [2, 2]
    [rank0]:2024-03-27 14:30:01,185 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-03-27 14:30:01,194 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank0]:2024-03-27 14:30:01,195 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank0]:2024-03-27 14:30:02,807 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-03-27 14:30:02,818 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-03-27 14:30:02,819 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
    [rank0]:2024-03-27 14:30:02,830 - root - INFO - Applied Sequence Parallelism to the model
    [rank0]:2024-03-27 14:30:02,975 - root - INFO - Applied selective activation checkpointing to the model
    [rank0]:2024-03-27 14:30:02,975 - root - INFO - Applied FSDP to the model
    [rank0]:2024-03-27 14:30:03,004 - root - INFO - Model fully initialized via reset_parameters
    [rank0]:2024-03-27 14:30:03,004 - root - INFO - Gradient scaling not enabled
    [rank0]:2024-03-27 14:30:03,005 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-1430
    [rank0]:2024-03-27 14:30:03,642 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
    [rank0]:  warnings.warn(
    [rank0]:2024-03-27 14:30:04,528 - root - INFO - step:  1  loss: 10.8502  memory:  5.71GiB(6.01%)  wps: 9,259  mfu: 0.11%
    [rank0]:2024-03-27 14:30:04,528 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
    [rank0]:2024-03-27 14:30:04,679 - root - INFO - step:  2  loss: 10.7671  memory:  6.69GiB(7.04%)  wps: 54,430  mfu: 0.67%
    [rank0]:2024-03-27 14:30:04,773 - root - INFO - step:  3  loss: 10.6390  memory:  6.69GiB(7.04%)  wps: 88,457  mfu: 1.08%
    [rank0]:2024-03-27 14:30:04,864 - root - INFO - step:  4  loss: 10.4210  memory:  6.69GiB(7.04%)  wps: 90,384  mfu: 1.11%
    [rank0]:2024-03-27 14:30:04,954 - root - INFO - step:  5  loss: 10.1648  memory:  6.69GiB(7.04%)  wps: 93,058  mfu: 1.14%
    [rank0]:2024-03-27 14:30:05,067 - root - INFO - step:  6  loss:  9.9451  memory:  6.69GiB(7.04%)  wps: 72,642  mfu: 0.89%
    [rank0]:2024-03-27 14:30:05,165 - root - INFO - step:  7  loss:  9.7004  memory:  6.69GiB(7.04%)  wps: 85,096  mfu: 1.04%
    [rank0]:2024-03-27 14:30:05,251 - root - INFO - step:  8  loss:  9.4422  memory:  6.69GiB(7.04%)  wps: 95,860  mfu: 1.17%
    [rank0]:2024-03-27 14:30:05,399 - root - INFO - step:  9  loss:  9.2144  memory:  6.69GiB(7.04%)  wps: 55,837  mfu: 0.68%
    [rank0]:[rank0]:[W327 14:30:05.148473462 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:2024-03-27 14:30:05,496 - root - INFO - step: 10  loss:  9.1710  memory:  6.69GiB(7.04%)  wps: 86,136  mfu: 1.05%
    [rank0]:NCCL version 2.20.5+cuda12.0
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Mar 27, 2024
    ab5d918

Commits on Mar 28, 2024

  1. Used per-parameter FSDP (pytorch#165)

    **Numeric Parity**
    1D FSDP
    - Eager: 1k steps of minipile on 8 H100 GPUs, local batch size 8,
    sequence length 2048, AC/SAC, bf16 mixed precision, fp32 reduce-scatter
    - FSDP1 (AC): 24.81% peak active, 33.82% peak reserved, 6100-6200 WPS
    - FSDP1 (SAC): 52.98% peak active, 67.23% peak reserved, 6500-6700 WPS
    - FSDP2 (AC): 23.92% peak active, 32.64% peak reserved, 6100-6300 WPS
    - FSDP2 (SAC): 52.13% peak active, 62.51% peak reserved, 6600-6800 WPS
        - Loss curves match between FSDP1 and FSDP2
    - Memory numbers reported as percentage since that is how they are
    logged; can convert against 95.0396 GiB GPU memory
    - Compile: same setup as eager
    - FSDP2 (AC), buffer reuse disabled: 28.72 GiB (30.22%) peak reserved,
    7200-7500 WPS, 33% MFU
    - FSDP2 (AC), buffer reuse enabled: 28.90 GiB (30.40%) peak reserved,
    7200-7500 WPS, 33% MFU
    - FSDP2 (SAC), buffer reuse enabled: 53.83 GiB (56.64%) peak reserved,
    8100-8400 WPS, 36% MFU
        - Loss curves slightly better than eager
        - For fun -- how much can we push MFU?
    - If we use FSDP2 (SAC) with 16 local batch size (doubled), we get 88.23
    GiB (92.84%) peak reserved, 8600 WPS, 38% MFU.
    - If we use FSDP2 (no AC) with 8 local batch size, we get 90.28 GiB
    (94.99%) peak reserved, 9100-9300 WPS, 40% MFU.
    - Why is FSDP2 faster? (1) fp32 reduce-scatter only uses one div kernel
    instead of two, and (2) `reshard_after_forward=False` for the last
    transformer block (see the sketch after the loss curves below)
    
    2D FSDP
    - Eager (2-way SP, 4-way FSDP): 1k steps of minipile on 8 H100 GPUs,
    local batch size 16 (to preserve global batch size), sequence length
    2048, bf16 mixed precision, fp32 reduce-scatter
    - FSDP2 (AC): 50.12% peak active, 60.97% peak reserved, 5800-5900 WPS
    - FSDP2 (SAC): 76.49% peak active, 90.14% peak reserved, 6100-6300 WPS
    - Loss curves match 8-way FSDP
    - FSDP1 + SP has incorrect numerics due to the `FSDP.clip_grad_norm_`
    not all-reducing over TP mesh dimension
    
    <details>
    <summary> Loss curves </summary>
    
    <img width="732" alt="Screenshot 2024-03-26 at 3 31 19 PM"
    src="https://github.com/pytorch/torchtrain/assets/31054793/59ec71cc-ad0a-4dd1-b5c6-a8cbf9ab5e85">
    
    </details>
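
    As a concrete illustration of the `reshard_after_forward` point above, here is a minimal sketch of applying per-parameter FSDP (FSDP2) block by block while leaving the last block unsharded between forward and backward. It assumes the `fully_shard`/`MixedPrecisionPolicy` API from `torch.distributed._composable.fsdp` and a model exposing a `.layers` ModuleList; it is not the exact torchtitan wiring.

    ```python
    # Hedged sketch (assumptions: FSDP2's fully_shard API, a model with a .layers
    # ModuleList). Not the exact torchtitan parallelize code.
    import torch
    import torch.nn as nn
    from torch.distributed._composable.fsdp import MixedPrecisionPolicy, fully_shard

    def apply_fsdp2(model: nn.Module) -> nn.Module:
        mp_policy = MixedPrecisionPolicy(
            param_dtype=torch.bfloat16,  # bf16 compute
            reduce_dtype=torch.float32,  # fp32 reduce-scatter (one div kernel in FSDP2)
        )
        layers = list(model.layers)
        for i, block in enumerate(layers):
            # Skip resharding after forward for the last block: its backward runs
            # immediately after forward, so freeing and re-all-gathering the
            # parameters would be wasted work.
            reshard = i < len(layers) - 1
            fully_shard(block, mp_policy=mp_policy, reshard_after_forward=reshard)
        # Root wrap for the remaining parameters (embedding, final norm, output).
        fully_shard(model, mp_policy=mp_policy)
        return model
    ```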
    
    
    **Meta-Device Initialization**
    - The PyTorch Core guideline is for `module.reset_parameters()` to only
    initialize parameters/buffers immediately owned by `module` (i.e.
    `module.parameters(recurse=False)` and `module.buffers(recurse=False)`).
    - This makes it challenging to specify custom initializations for core
    modules like `nn.Linear` and `nn.Embedding`. For example, in
    @lessw2020's depth-wise truncated normal initialization, the
    `trunc_normal_` standard deviation depends on the layer ID, which is a
    property of the `TransformerBlock` but affects the child `nn.Linear`s.
    - To disambiguate, I suggest avoiding the name `reset_parameters()` in
    the case that we violate the PyTorch Core guideline and instead use a
    different name (e.g. `init_weights`).
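
    A hedged sketch of the `init_weights` idea (not the actual torchtitan code): a block-level initializer can depth-scale the truncated-normal std of its child linears, something a per-module `reset_parameters()` on the children alone cannot express. The module and attribute names here are illustrative.

    ```python
    # Sketch only: layer_id-dependent init lives on the block, not on nn.Linear.
    import torch.nn as nn

    class TransformerBlockSketch(nn.Module):
        def __init__(self, dim: int, layer_id: int):
            super().__init__()
            self.wq = nn.Linear(dim, dim, bias=False)
            self.wo = nn.Linear(dim, dim, bias=False)
            # Depth-wise std: a property of the block that affects its children.
            self.weight_init_std = 0.02 / (2 * (layer_id + 1)) ** 0.5

        def init_weights(self) -> None:
            # Deliberately not named reset_parameters(): it initializes the
            # *children's* parameters, which the Core guideline reserves for
            # each child module itself.
            for linear in (self.wq, self.wo):
                nn.init.trunc_normal_(linear.weight, mean=0.0, std=self.weight_init_std)
    ```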
    
    **DCP & Save/Load**
    - Tested 1D and 2D by specifying `checkpoint_folder =
    "/tmp/checkpoint_andgu"` in the `.toml`, training until saving a
    checkpoint, terminating the run, and restarting the training to load the
    checkpoint -- the loss after loading looks reasonable
    awgu authored Mar 28, 2024
    83c879f
  2. plot losses in loaded TrainState to TensorBoard

    ghstack-source-id: f13612ce1f739219c31aa2b9222259f9f586126b
    Pull Request resolved: pytorch#173
    tianyu-l committed Mar 28, 2024
    f6d9de7

Commits on Mar 29, 2024

  1. Removed setting global flag for swap_tensors since not needed anymore

    ghstack-source-id: 484237b30ba8bf8bb9e7a9cf2c97180d9fb21295
    Pull Request resolved: pytorch#178
    awgu committed Mar 29, 2024
    1150944

Commits on Apr 2, 2024

  1. Add integration test with compile enabled (pytorch#183)

    Summary:
    same as title
    
    Test Plan:
    ```
    
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0,1
    + CONFIG_FILE=./train_configs/debug_model_compile.toml
    + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_compile.toml
    W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757]
    W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] *****************************************
    W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] *****************************************
    [rank0]:2024-04-01 17:54:35,779 - root - INFO - Starting job: LLaMA debug training
    [rank1]:2024-04-01 17:54:35,797 - root - INFO - Starting job: LLaMA debug training
    [rank0]:2024-04-01 17:54:36,063 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank0]:2024-04-01 17:54:36,069 - root - INFO - Building 1-D device mesh with ['dp'], [4]
    [rank0]:2024-04-01 17:54:36,071 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-04-01 17:54:36,078 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank0]:2024-04-01 17:54:36,078 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank1]:2024-04-01 17:54:36,449 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank1]:2024-04-01 17:54:36,454 - root - INFO - Building 1-D device mesh with ['dp'], [4]
    [rank1]:2024-04-01 17:54:36,456 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank1]:2024-04-01 17:54:36,463 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank1]:2024-04-01 17:54:36,463 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank0]:2024-04-01 17:54:37,631 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank0]:2024-04-01 17:54:37,643 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-04-01 17:54:37,644 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
    [rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied selective activation checkpointing to the model
    [rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied FSDP to the model
    [rank1]:2024-04-01 17:54:38,310 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
    [rank1]:2024-04-01 17:54:38,324 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank1]:2024-04-01 17:54:38,325 - root - INFO - GPU capacity: NVIDIA H100 (1) with 95.04GiB memory
    [rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied selective activation checkpointing to the model
    [rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied FSDP to the model
    [rank1]:2024-04-01 17:54:38,699 - root - INFO - Gradient scaling not enabled
    [rank1]:2024-04-01 17:54:38,699 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754
    [rank1]:2024-04-01 17:54:38,701 - root - INFO - Compiling model with torch.compile
    [rank0]:2024-04-01 17:54:38,692 - root - INFO - Gradient scaling not enabled
    [rank0]:2024-04-01 17:54:38,693 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754
    [rank0]:2024-04-01 17:54:38,694 - root - INFO - Compiling model with torch.compile
    [rank0]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank1]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank1]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
    [rank1]:  warnings.warn(
    [rank0]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
    [rank0]:  warnings.warn(
    [rank1]:2024-04-01 17:54:40,498 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:40,493 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:41,992 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:41,985 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:42,180 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:42,187 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:43,947 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:43,963 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:43,971 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:43,920 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:43,951 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:43,974 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:44,029 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:44,033 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:45,907 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:45,933 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:47,561 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:47,667 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:47,649 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:47,706 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:49,084 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:49,108 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:49,110 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:49,086 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:49,114 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:49,131 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:50,546 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:50,638 - root - INFO - running build_ext
    [rank0]:2024-04-01 17:54:51,901 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:52,025 - root - INFO - running build_ext
    [rank1]:2024-04-01 17:54:52,734 - root - INFO - step:  1  loss: 10.9746  memory:  9.53GiB(10.03%)  wps: 1,228  mfu: 0.02%
    [rank1]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
    [rank1]:2024-04-01 17:54:52,813 - root - INFO - step:  2  loss: 10.9091  memory:  9.54GiB(10.03%)  wps: 208,739  mfu: 2.56%
    [rank0]:2024-04-01 17:54:52,734 - root - INFO - step:  1  loss: 10.9746  memory:  9.53GiB(10.03%)  wps: 1,228  mfu: 0.02%
    [rank0]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
    [rank0]:2024-04-01 17:54:52,813 - root - INFO - step:  2  loss: 10.9091  memory:  9.54GiB(10.03%)  wps: 208,501  mfu: 2.55%
    [rank1]:2024-04-01 17:54:52,889 - root - INFO - step:  3  loss: 10.7722  memory:  9.54GiB(10.03%)  wps: 219,416  mfu: 2.69%
    [rank0]:2024-04-01 17:54:52,889 - root - INFO - step:  3  loss: 10.7722  memory:  9.54GiB(10.03%)  wps: 219,182  mfu: 2.68%
    [rank1]:2024-04-01 17:54:52,965 - root - INFO - step:  4  loss: 10.5428  memory:  9.54GiB(10.03%)  wps: 218,226  mfu: 2.67%
    [rank0]:2024-04-01 17:54:52,965 - root - INFO - step:  4  loss: 10.5428  memory:  9.54GiB(10.03%)  wps: 218,015  mfu: 2.67%
    [rank1]:2024-04-01 17:54:53,045 - root - INFO - step:  5  loss: 10.3063  memory:  9.54GiB(10.03%)  wps: 207,094  mfu: 2.54%
    [rank0]:2024-04-01 17:54:53,045 - root - INFO - step:  5  loss: 10.3063  memory:  9.54GiB(10.03%)  wps: 207,220  mfu: 2.54%
    [rank1]:2024-04-01 17:54:53,123 - root - INFO - step:  6  loss: 10.0707  memory:  9.54GiB(10.03%)  wps: 210,814  mfu: 2.58%
    [rank1]:2024-04-01 17:54:53,202 - root - INFO - step:  7  loss:  9.8302  memory:  9.54GiB(10.03%)  wps: 209,649  mfu: 2.57%
    [rank0]:2024-04-01 17:54:53,123 - root - INFO - step:  6  loss: 10.0707  memory:  9.54GiB(10.03%)  wps: 210,849  mfu: 2.58%
    [rank0]:2024-04-01 17:54:53,202 - root - INFO - step:  7  loss:  9.8302  memory:  9.54GiB(10.03%)  wps: 209,542  mfu: 2.57%
    [rank0]:2024-04-01 17:54:53,281 - root - INFO - step:  8  loss:  9.5918  memory:  9.54GiB(10.03%)  wps: 211,690  mfu: 2.59%
    [rank1]:2024-04-01 17:54:53,281 - root - INFO - step:  8  loss:  9.5918  memory:  9.54GiB(10.03%)  wps: 211,786  mfu: 2.59%
    [rank1]:2024-04-01 17:54:53,412 - root - INFO - step:  9  loss:  9.4299  memory:  9.54GiB(10.03%)  wps: 125,833  mfu: 1.54%
    [rank1]:[rank1]:[W401 17:54:53.242673953 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:2024-04-01 17:54:53,412 - root - INFO - step:  9  loss:  9.4299  memory:  9.54GiB(10.03%)  wps: 125,765  mfu: 1.54%
    [rank0]:[rank0]:[W401 17:54:53.240925776 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank1]:2024-04-01 17:54:53,492 - root - INFO - step: 10  loss:  9.2955  memory:  9.54GiB(10.03%)  wps: 207,661  mfu: 2.54%
    [rank0]:2024-04-01 17:54:53,492 - root - INFO - step: 10  loss:  9.2955  memory:  9.54GiB(10.03%)  wps: 207,426  mfu: 2.54%
    [rank0]:NCCL version 2.20.5+cuda12.0
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    
    ---------
    
    Co-authored-by: gnadathur <[email protected]>
    gnadathur and gnadathur authored Apr 2, 2024
    25ee32f

Commits on Apr 3, 2024

  1. remove folding and unfolding of sequence dim in model.py

    ghstack-source-id: 5d299adcd766baad6a36e63be4acc01fb2fd36db
    Pull Request resolved: pytorch#190
    tianyu-l committed Apr 3, 2024
    25f9bff

Commits on Apr 4, 2024

  1. bump comm.train_timeout_seconds (pytorch#189)

    This PR bumps this default config to a larger value: profiling is a
    pretty heavy step, so a default of 5 seconds would likely trigger the
    watchdog unintentionally.
    wanchaol authored Apr 4, 2024
    c233ecd

Commits on Apr 5, 2024

  1. fix checkpoint parser

    ghstack-source-id: 47ee7b5e2228705e5215195ac9ff13e1b168f93e
    Pull Request resolved: pytorch#197
    wz337 committed Apr 5, 2024
    bb3919d
  2. support sequence of tests and add checkpoint test

    address comments
    
    ghstack-source-id: 7d6c51a5ef68dea06ba7d64741a554165c79f1d3
    Pull Request resolved: pytorch#198
    wz337 committed Apr 5, 2024
    4d593d4
  3. Make freqs_cis a persistent buffer for pp init

    Currently, the plan is to use a 'seed checkpoint' to initialize the
    pipeline parallel model chunks after moving them from meta device to
    cuda/empty.

    Non-persistent buffers are incompatible with this approach, as they are
    missing from the checkpoint and thus require manual init.

    An alternative is to manually run the initializer for just the
    non-persistent buffers after loading a seed checkpoint, but making the
    buffer persistent is nearly equivalent and requires fewer code changes.
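
    A minimal sketch of the change described above (the surrounding module is simplified; only the buffer registration matters here):

    ```python
    # Sketch only: register freqs_cis as a *persistent* buffer so it lands in
    # state_dict() and is restored from the seed checkpoint, instead of needing
    # manual re-initialization after loading.
    import torch
    import torch.nn as nn

    class RotarySketch(nn.Module):
        def __init__(self, freqs_cis: torch.Tensor):
            super().__init__()
            # persistent=True (the default) includes the buffer in checkpoints;
            # persistent=False would leave it out, breaking seed-checkpoint init.
            self.register_buffer("freqs_cis", freqs_cis, persistent=True)
    ```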
    
    ghstack-source-id: b48228488d4c3924fffef4237f4106383c14a934
    Pull Request resolved: pytorch#201
    wconstab committed Apr 5, 2024
    5a0995a
  4. Delete grad scaler, which is unsupported/unused

    The grad scaler currently doesn't work with FSDP2, and isn't enabled anyway
    because bf16 training is the norm and doesn't require it.

    Remove it for simplicity. It will be easier to enable pipeline
    parallelism with a simpler loss function setup, but if desired, it's
    still possible to support pipeline parallelism with the scaler added
    back in.
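
    For context, a hedged sketch of why the scaler can go: with bf16 mixed precision the loss is backpropagated directly, whereas fp16 training would have wrapped the backward and optimizer step in `torch.amp.GradScaler`. This is not the torchtitan train loop, just the shape of the simplification.

    ```python
    # Sketch: bf16 step without a GradScaler (no scaler.scale(loss).backward(),
    # no scaler.step(optimizer), no scaler.update()).
    import torch

    def train_step(model, inputs, labels, loss_fn, optimizer):
        optimizer.zero_grad()
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            pred = model(inputs)
            loss = loss_fn(pred, labels)
        loss.backward()
        optimizer.step()
        return loss
    ```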
    
    ghstack-source-id: 82b0e4324eac88ee62723a6d832182d4e6c76e0f
    Pull Request resolved: pytorch#202
    wconstab committed Apr 5, 2024
    db204f9
  5. Factor out loss_fn to share code with pipeline par

    PP requires feeding a loss_fn into the schedule's step so that loss can
    be computed per microbatch as part of the forward/backward scheduling.
    
    As such, it is nice to define the loss once and use it both in the non-PP
    code that manually calls forward/loss/backward and in the PP step().
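
    A sketch of the kind of shared loss function this refers to (the exact signature in the repo may differ); the same callable can be invoked directly in the non-PP path or handed to a pipeline schedule as its loss_fn:

    ```python
    # Hedged sketch of a loss_fn shared between the non-PP and PP code paths.
    import torch
    import torch.nn.functional as F

    def loss_fn(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # pred: [batch, seq, vocab], labels: [batch, seq]
        return F.cross_entropy(pred.flatten(0, 1).float(), labels.flatten(0, 1))

    # Non-PP path: loss = loss_fn(model(input_ids), labels); loss.backward()
    # PP path (assumed schedule API): schedule = Schedule(stage, n_microbatches, loss_fn=loss_fn)
    # so each microbatch's loss is computed inside step().
    ```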
    
    ghstack-source-id: 9bedd5103e23627d5e268c287d49f0759442ba12
    Pull Request resolved: pytorch#203
    wconstab committed Apr 5, 2024
    859963d
  6. [TorchTrain] Minor fix for pytorch#197 (pytorch#204)

    The changes made in the GitHub editor didn't go in when doing ghstack land.
    wz337 authored Apr 5, 2024
    5d2c148
  7. Add FusedRMSNorm (Triton kernel, +15% eager), Add NPLayerNorm, Enable…

    … config selectable Norm Type (pytorch#181)
    
    This PR has multiple aspects:
    1 - Adds a new Triton-based Fused RMSNorm I wrote. I've verified its
    numerical accuracy on both forward and backward with a unit test.
    It improves MFU by +15% with FSDP2 7B in eager mode, and slightly (+1.2%) when compiled:
    <img width="545" alt="Screenshot 2024-03-29 at 5 18 14 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/8f16fae9-947b-4720-a370-b954779c33a7">
    
    2 - Adds norms.py to house all 4 norm types, and standardizes to
    [layernorm / np_layernorm / rmsnorm / fused_rmsnorm]. Norms.py has a
    create_norms function that creates the appropriate norm (see the sketch at the end of this message).
    
    3 - Adds np_layernorm, which is layernorm with no affine transformation.
    
    4 - Updates model.py to now support plug and play of any supported norm.
    
    Thus instead of this type of if/then logic in the model class:
    <img width="928" alt="Screenshot 2024-03-30 at 1 52 07 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/ba7cb976-580f-4471-a79b-a584f7d20693">
    
    We simply have this:
    <img width="1129" alt="Screenshot 2024-03-30 at 1 52 23 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/aba48b4d-1620-4059-840d-e620468f00f2">
    
    This then allows for easy plug and play of any norm type with no
    fiddling around in the model code.
    
    5 - updates run_llama_train.sh to randomly select a port vs previous
    fixed port number. (thanks @yifuwang for this tip!)
    
    
    6 - Now users can quickly select the norm of their choice via the config
    file:
    <img width="774" alt="Screenshot 2024-03-30 at 3 01 43 PM"
    src="https://github.com/pytorch/torchtrain/assets/46302957/3238b375-dc21-4ee2-a5fa-f6571da79edb">
    
    7 - adds a NotImplementedError if users try to run TP + fused_rmsnorm to avoid
    any confusion (per @tianyu-l feedback):
    ~~~
    NotImplementedError: fused_rmsnorm not yet compatible with TP. Please
    use rmsnorm.
    ~~~
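
    A hedged sketch of what a config-selectable norm factory along these lines can look like (the function name and accepted strings are illustrative and may not match norms.py exactly; the Triton fused kernel is omitted):

    ```python
    # Illustrative norm factory in the spirit of norms.py; not the exact torchtitan code.
    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            rms = torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + self.eps)
            return (x.float() * rms).type_as(x) * self.weight

    def create_norm(norm_type: str, dim: int, eps: float = 1e-6) -> nn.Module:
        norm_type = norm_type.lower()
        if norm_type == "layernorm":
            return nn.LayerNorm(dim, eps=eps)
        if norm_type == "np_layernorm":  # layernorm with no affine transformation
            return nn.LayerNorm(dim, eps=eps, elementwise_affine=False)
        if norm_type == "rmsnorm":
            return RMSNorm(dim, eps=eps)
        if norm_type == "fused_rmsnorm":
            raise NotImplementedError("Triton fused_rmsnorm kernel omitted from this sketch")
        raise ValueError(f"Unknown norm_type: {norm_type}")
    ```

    The model then only needs to call the factory with the configured string, which is what makes the norm type plug-and-play.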
    lessw2020 authored Apr 5, 2024
    3471165
  8. remove .item() per iter

    ghstack-source-id: ab29c214604fd76cefdfe70149ecf07a2e03103e
    Pull Request resolved: pytorch#206
    tianyu-l committed Apr 5, 2024
    5b2bb52

Commits on Apr 10, 2024

  1. Removed cache_k and cache_v comments

    ghstack-source-id: 8bc66c683a801189b152b0ef4301579ec1ec17e7
    Pull Request resolved: pytorch#213
    awgu committed Apr 10, 2024
    7146841
  2. Some more cleanups

    ghstack-source-id: a53cbbecc35eac2a62d8ebc241462ac418666336
    Pull Request resolved: pytorch#212
    awgu committed Apr 10, 2024
    c18d760
  3. avoid record streams and make color printing a config

    ghstack-source-id: 1c7cb2710330ec3fb2384793b5ad77c65b107cbc
    Pull Request resolved: pytorch#195
    tianyu-l committed Apr 10, 2024
    e62573d
  4. fix SAC to use the correct reduce_scatter op (pytorch#215)

    as titled, we migrated to the native functional collective so the SAC
    should capture this instead of the old one
    wanchaol authored Apr 10, 2024
    7419d71
  5. Test runner raises exception on failures (pytorch#216)

    Summary: Test runner should raise an exception on failures.
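
    A minimal sketch of the behavior being added (helper and variable names here are hypothetical; the real test runner's structure may differ): run each flavor's command and raise if the subprocess exits non-zero, instead of silently moving on.

    ```python
    # Hedged sketch of "raise on failure" for an integration test runner.
    import subprocess

    def run_test(flavor: str, command: str) -> None:
        print(f"=====Integration test, flavor : {flavor}, command : {command}=====")
        result = subprocess.run(command, shell=True)
        if result.returncode != 0:
            raise Exception(
                f"Integration test failed, flavor : {flavor}, command : {command}"
            )
    ```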
    
    Test Plan: 
    
    ```
    =====Integration test, flavor : , command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh  =====
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=/home/gnadathur/local/torchtrain
    + NGPU=4
    + LOG_RANK=0
    + CONFIG_FILE=./train_configs/debug_model.toml
    + overrides=
    + '[' 0 -ne 0 ']'
    
    =====Integration test, flavor : 1D compile, command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh --training.compile=====
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=--training.compile
    + NGPU=4
    + LOG_RANK=0
    + CONFIG_FILE=./train_configs/debug_model.toml
    + overrides=
    + '[' 1 -ne 0 ']'
    + overrides=--training.compile
    + torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.compile
    W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757]
    W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] *****************************************
    W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] *****************************************
    [rank0]:2024-04-10 13:32:45,243 - root - INFO - Starting job: LLaMA debug training
    [rank0]:2024-04-10 13:32:45,676 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank0]:2024-04-10 13:32:46,028 - root - INFO - Building 1-D device mesh with ['dp'], [4]
    [rank0]:2024-04-10 13:32:46,030 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-04-10 13:32:46,038 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank0]:2024-04-10 13:32:46,038 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank0]:2024-04-10 13:32:47,813 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='fused_rmsnorm')
    [rank0]:2024-04-10 13:32:47,826 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-04-10 13:32:47,826 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
    [rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied selective activation checkpointing to the model
    [rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied FSDP to the model
    [rank0]:2024-04-10 13:32:48,582 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%)
    [rank0]:2024-04-10 13:32:48,582 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1332
    [rank0]:2024-04-10 13:32:48,584 - root - INFO - Compiling model with torch.compile
    [rank0]:2024-04-10 13:32:49,384 - root - INFO - Training starts at step 1
    [rank0]:2024-04-10 13:32:49,385 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank0]:[rank0]:W0410 13:32:49.487000 139672077292544 torch/_logging/_internal.py:1016] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
    [rank0]:[rank0]: Traceback (most recent call last):
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/torchtitan/train.py", line 394, in <module>
    [rank0]:[rank0]:     main(config)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    [rank0]:[rank0]:     return f(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/torchtitan/train.py", line 287, in main
    [rank0]:[rank0]:     pred = model(input_ids)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    [rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1541, in _call_impl
    [rank0]:[rank0]:     return forward_call(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/eval_frame.py", line 410, in _fn
    [rank0]:[rank0]:     return fn(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    [rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1582, in _call_impl
    [rank0]:[rank0]:     result = forward_call(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 966, in catch_errors
    [rank0]:[rank0]:     return callback(frame, cache_entry, hooks, frame_state, skip=1)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 809, in _convert_frame
    [rank0]:[rank0]:     result = inner_convert(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 404, in _convert_frame_assert
    [rank0]:[rank0]:     return _compile(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_utils_internal.py", line 70, in wrapper_function
    [rank0]:[rank0]:     return function(*args, **kwargs)
    [rank0]:[rank0]:   File "/home/gnadathur/local/a/pytorch-env/lib/python3.10/contextlib.py", line 79, in inner
    [rank0]:[rank0]:     return func(*args, **kwds)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 691, in _compile
    [rank0]:[rank0]:     guarded_code = compile_inner(code, one_graph, hooks, transform)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    [rank0]:[rank0]:     r = func(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 546, in compile_inner
    [rank0]:[rank0]:     out_code = transform_code_object(code, transform)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1103, in transform_code_object
    [rank0]:[rank0]:     transformations(instructions, code_options)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 168, in _fn
    [rank0]:[rank0]:     return fn(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 508, in transform
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2193, in run
    [rank0]:[rank0]:     super().run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
    [rank0]:[rank0]:     self.call_function(fn, args, {})
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/nn_module.py", line 733, in call_function
    [rank0]:[rank0]:     return variables.UserFunctionVariable(fn, source=source).call_function(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    [rank0]:[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
    [rank0]:[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
    [rank0]:[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
    [rank0]:[rank0]:     self.call_function(fn, args, {})
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/user_defined.py", line 719, in call_function
    [rank0]:[rank0]:     return func_var.call_function(tx, [obj_var] + args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    [rank0]:[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
    [rank0]:[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
    [rank0]:[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
    [rank0]:[rank0]:     self.call_function(fn, args, {})
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 339, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    [rank0]:[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
    [rank0]:[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
    [rank0]:[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
    [rank0]:[rank0]:     self.call_function(fn, args, {})
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 339, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    [rank0]:[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
    [rank0]:[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
    [rank0]:[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
    [rank0]:[rank0]:     self.call_function(fn, args, {})
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    [rank0]:[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
    [rank0]:[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
    [rank0]:[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1274, in CALL_FUNCTION_EX
    [rank0]:[rank0]:     self.call_function(fn, argsvars.items, kwargsvars)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
    [rank0]:[rank0]:     return super().call_function(tx, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    [rank0]:[rank0]:     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
    [rank0]:[rank0]:     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
    [rank0]:[rank0]:     return cls.inline_call_(parent, func, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
    [rank0]:[rank0]:     tracer.run()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
    [rank0]:[rank0]:     while self.step():
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
    [rank0]:[rank0]:     self.dispatch_table[inst.opcode](self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
    [rank0]:[rank0]:     return inner_fn(self, inst)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
    [rank0]:[rank0]:     self.call_function(fn, args, {})
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
    [rank0]:[rank0]:     self.push(fn.call_function(self, args, kwargs))
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/misc.py", line 592, in call_function
    [rank0]:[rank0]:     return self.obj.call_method(tx, self.name, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/tensor.py", line 461, in call_method
    [rank0]:[rank0]:     return wrap_fx_proxy(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/builder.py", line 1367, in wrap_fx_proxy
    [rank0]:[rank0]:     return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/builder.py", line 1452, in wrap_fx_proxy_cls
    [rank0]:[rank0]:     example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1780, in get_fake_value
    [rank0]:[rank0]:     raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1712, in get_fake_value
    [rank0]:[rank0]:     ret_val = wrap_fake_exception(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1227, in wrap_fake_exception
    [rank0]:[rank0]:     return fn()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1713, in <lambda>
    [rank0]:[rank0]:     lambda: run_node(tx.output, node, args, kwargs, nnmodule)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1848, in run_node
    [rank0]:[rank0]:     raise RuntimeError(make_error_message(e)).with_traceback(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1832, in run_node
    [rank0]:[rank0]:     return getattr(args[0], node.target)(*args[1:], **kwargs)
    [rank0]:[rank0]: torch._dynamo.exc.TorchRuntimeError: Failed running call_method wait(*(FakeTensor(..., device='cuda:0', size=(852480,), dtype=torch.bfloat16),), **{}):
    [rank0]:[rank0]: 'FakeTensor' object has no attribute 'wait'
    [rank0]:
    [rank0]:[rank0]: from user code:
    [rank0]:[rank0]:    File "/data/users/gnadathur/a/torchtitan/torchtrain/models/llama/model.py", line 446, in forward
    [rank0]:[rank0]:     h = layer(h, freqs_cis)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1561, in _call_impl
    [rank0]:[rank0]:     args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 161, in _pre_forward
    [rank0]:[rank0]:     args, kwargs = self._fsdp_param_group.pre_forward(module, args, kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 280, in pre_forward
    [rank0]:[rank0]:     self.wait_for_unshard()
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 243, in wait_for_unshard
    [rank0]:[rank0]:     foreach_all_gather_copy_out(
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
    [rank0]:[rank0]:     return func(*args, **kwargs)
    [rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_collectives.py", line 82, in foreach_all_gather_copy_out
    [rank0]:[rank0]:     all_gather_work.wait()
    [rank0]:
    [rank0]:[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
    [rank0]:
    [rank0]:
    [rank0]:[rank0]: You can suppress this exception and fall back to eager by setting:
    [rank0]:[rank0]:     import torch._dynamo
    [rank0]:[rank0]:     torch._dynamo.config.suppress_errors = True
    [rank0]:
    E0410 13:32:53.256000 139839630783488 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 1554760) of binary: /home/gnadathur/local/a/pytorch-env/bin/python
    E0410 13:32:53.261000 139839630783488 torch/distributed/elastic/multiprocessing/errors/error_handler.py:136] no error file defined for parent, to copy child error file (/tmp/torchelastic_kyjkblcf/none_kiu1mb22/attempt_0/0/error.json)
    [rank0]:NCCL version 2.20.5+cuda12.0
    Traceback (most recent call last):
      File "/home/gnadathur/local/a/pytorch-env/bin/torchrun", line 33, in <module>
        sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
      File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
        return f(*args, **kwargs)
      File "/data/users/gnadathur/a/pytorch/torch/distributed/run.py", line 879, in main
        run(args)
      File "/data/users/gnadathur/a/pytorch/torch/distributed/run.py", line 870, in run
        elastic_launch(
      File "/data/users/gnadathur/a/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/data/users/gnadathur/a/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    train.py FAILED
    ------------------------------------------------------------
    Failures:
    [1]:
      time      : 2024-04-10_13:32:49
      host      : devvm4378.nao0.facebook.com
      rank      : 1 (local_rank: 1)
      exitcode  : 1 (pid: 1554762)
      error_file: /tmp/torchelastic_kyjkblcf/none_kiu1mb22/attempt_0/1/error.json
      traceback : Traceback (most recent call last):
        File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
          return f(*args, **kwargs)
        File "/data/users/gnadathur/a/torchtitan/train.py", line 287, in main
          pred = model(input_ids)
        File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1541, in _call_impl
          return forward_call(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/eval_frame.py", line 410, in _fn
          return fn(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1582, in _call_impl
          result = forward_call(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 966, in catch_errors
          return callback(frame, cache_entry, hooks, frame_state, skip=1)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 809, in _convert_frame
          result = inner_convert(
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 404, in _convert_frame_assert
          return _compile(
        File "/data/users/gnadathur/a/pytorch/torch/_utils_internal.py", line 70, in wrapper_function
          return function(*args, **kwargs)
        File "/home/gnadathur/local/a/pytorch-env/lib/python3.10/contextlib.py", line 79, in inner
          return func(*args, **kwds)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 691, in _compile
          guarded_code = compile_inner(code, one_graph, hooks, transform)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
          r = func(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 546, in compile_inner
          out_code = transform_code_object(code, transform)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1103, in transform_code_object
          transformations(instructions, code_options)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 168, in _fn
          return fn(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 508, in transform
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2193, in run
          super().run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
          self.call_function(fn, args, {})
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/nn_module.py", line 733, in call_function
          return variables.UserFunctionVariable(fn, source=source).call_function(
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
          return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
          return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
          return cls.inline_call_(parent, func, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
          self.call_function(fn, args, {})
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/user_defined.py", line 719, in call_function
          return func_var.call_function(tx, [obj_var] + args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
          return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
          return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
          return cls.inline_call_(parent, func, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
          self.call_function(fn, args, {})
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 339, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
          return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
          return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
          return cls.inline_call_(parent, func, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
          self.call_function(fn, args, {})
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 339, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
          return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
          return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
          return cls.inline_call_(parent, func, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
          self.call_function(fn, args, {})
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
          return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
          return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
          return cls.inline_call_(parent, func, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1274, in CALL_FUNCTION_EX
          self.call_function(fn, argsvars.items, kwargsvars)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 293, in call_function
          return super().call_function(tx, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
          return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 704, in inline_user_function_return
          return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2353, in inline_call
          return cls.inline_call_(parent, func, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 2469, in inline_call_
          tracer.run()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 848, in run
          while self.step():
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 758, in step
          self.dispatch_table[inst.opcode](self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 482, in wrapper
          return inner_fn(self, inst)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1233, in CALL_FUNCTION
          self.call_function(fn, args, {})
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/symbolic_convert.py", line 698, in call_function
          self.push(fn.call_function(self, args, kwargs))
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/misc.py", line 592, in call_function
          return self.obj.call_method(tx, self.name, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/tensor.py", line 461, in call_method
          return wrap_fx_proxy(
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/builder.py", line 1367, in wrap_fx_proxy
          return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/variables/builder.py", line 1452, in wrap_fx_proxy_cls
          example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1780, in get_fake_value
          raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1712, in get_fake_value
          ret_val = wrap_fake_exception(
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1227, in wrap_fake_exception
          return fn()
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1713, in <lambda>
          lambda: run_node(tx.output, node, args, kwargs, nnmodule)
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1848, in run_node
          raise RuntimeError(make_error_message(e)).with_traceback(
        File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 1832, in run_node
          return getattr(args[0], node.target)(*args[1:], **kwargs)
      torch._dynamo.exc.TorchRuntimeError: Failed running call_method wait(*(FakeTensor(..., device='cuda:1', size=(852480,), dtype=torch.bfloat16),), **{}):
      'FakeTensor' object has no attribute 'wait'
    
      from user code:
         File "/data/users/gnadathur/a/torchtitan/torchtrain/models/llama/model.py", line 446, in forward
          h = layer(h, freqs_cis)
        File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1561, in _call_impl
          args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
        File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 161, in _pre_forward
          args, kwargs = self._fsdp_param_group.pre_forward(module, args, kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 280, in pre_forward
          self.wait_for_unshard()
        File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 243, in wait_for_unshard
          foreach_all_gather_copy_out(
        File "/data/users/gnadathur/a/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
          return func(*args, **kwargs)
        File "/data/users/gnadathur/a/pytorch/torch/distributed/_composable/fsdp/_fsdp_collectives.py", line 82, in foreach_all_gather_copy_out
          all_gather_work.wait()
    
      Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
    
    
      You can suppress this exception and fall back to eager by setting:
          import torch._dynamo
          torch._dynamo.config.suppress_errors = True
    
    
    [2]:
      time      : 2024-04-10_13:32:49
      host      : devvm4378.nao0.facebook.com
      rank      : 2 (local_rank: 2)
      exitcode  : 1 (pid: 1554763)
      error_file: /tmp/torchelastic_kyjkblcf/none_kiu1mb22/attempt_0/2/error.json
      traceback : torch._dynamo.exc.TorchRuntimeError: Failed running call_method wait(*(FakeTensor(..., device='cuda:2', size=(852480,), dtype=torch.bfloat16),), **{}): 'FakeTensor' object has no attribute 'wait' (identical traceback to [1])
    [3]:
      time      : 2024-04-10_13:32:49
      host      : devvm4378.nao0.facebook.com
      rank      : 3 (local_rank: 3)
      exitcode  : 1 (pid: 1554764)
      error_file: /tmp/torchelastic_kyjkblcf/none_kiu1mb22/attempt_0/3/error.json
      traceback : identical traceback to [1] and [2]
    
    ```
    gnadathur authored Apr 10, 2024
    Full SHA: cfdd4af
  6. Revert "Separate TransformerEmbedding layer (pytorch#33)"

    Avoid diverging the model structure (FQNs and checkpoint
    interoperability) from similar models.
    
    This reverts commit f30202c.
    
    ghstack-source-id: 9811f5fa99fdde387efe6018aa00afd28e7e923b
    Pull Request resolved: pytorch#214
    wconstab committed Apr 10, 2024
    Full SHA: 144b229
  7. Fix 2DParallel test (pytorch#219)

    Use `rmsnorm` instead of the fused version, since 2D parallelism does not
    support the fused version yet.
    
    Test:
    
    ```
    + export USE_LIBUV=1
    + USE_LIBUV=1
    + TRAINER_DIR=--training.tensor_parallel_degree
    + NGPU=4
    + LOG_RANK=0
    + CONFIG_FILE=./train_configs/debug_model.toml
    + overrides=
    + '[' 3 -ne 0 ']'
    + overrides='--training.tensor_parallel_degree 2 --model.norm_type=rmsnorm'
    + torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.tensor_parallel_degree 2 --model.norm_type=rmsnorm
    W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] 
    W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] *****************************************
    W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    W0410 15:50:35.615000 140457870033920 torch/distributed/run.py:757] *****************************************
    [rank0]:2024-04-10 15:50:37,794 - root - INFO - Starting job: LLaMA debug training
    [rank0]:2024-04-10 15:50:37,986 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
    [rank0]:2024-04-10 15:50:38,464 - root - INFO - Building 2-D device mesh with ['dp', 'tp'], [2, 2]
    [rank0]:2024-04-10 15:50:38,467 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
    [rank0]:2024-04-10 15:50:38,474 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
    [rank0]:2024-04-10 15:50:38,474 - root - INFO - Preparing alpaca dataset from HuggingFace
    [rank0]:2024-04-10 15:50:40,306 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='rmsnorm')
    [rank0]:2024-04-10 15:50:40,318 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
    [rank0]:2024-04-10 15:50:40,319 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
    [rank0]:2024-04-10 15:50:40,331 - root - INFO - Applied Tensor Parallelism to the model
    [rank0]:2024-04-10 15:50:40,337 - root - INFO - Applied selective activation checkpointing to the model
    [rank0]:2024-04-10 15:50:40,337 - root - INFO - Applied FSDP to the model
    [rank0]:2024-04-10 15:50:40,558 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%)
    [rank0]:2024-04-10 15:50:40,558 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1550
    [rank0]:2024-04-10 15:50:40,562 - root - INFO - Training starts at step 1
    [rank0]:2024-04-10 15:50:40,562 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
    [rank0]:2024-04-10 15:50:41,474 - root - INFO - step:  1  loss: 10.8403  memory:  5.76GiB(6.06%)  wps: 8,988  mfu: 0.11%
    [rank0]:2024-04-10 15:50:41,475 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
    [rank0]:2024-04-10 15:50:41,652 - root - INFO - step:  2  loss: 10.7703  memory:  6.74GiB(7.09%)  wps: 46,364  mfu: 0.57%
    [rank0]:2024-04-10 15:50:41,744 - root - INFO - step:  3  loss: 10.6447  memory:  6.74GiB(7.09%)  wps: 89,916  mfu: 1.10%
    [rank0]:2024-04-10 15:50:41,847 - root - INFO - step:  4  loss: 10.4428  memory:  6.74GiB(7.09%)  wps: 80,467  mfu: 0.99%
    [rank0]:2024-04-10 15:50:41,946 - root - INFO - step:  5  loss: 10.1726  memory:  6.74GiB(7.09%)  wps: 83,747  mfu: 1.03%
    [rank0]:2024-04-10 15:50:42,038 - root - INFO - step:  6  loss:  9.9676  memory:  6.74GiB(7.09%)  wps: 89,380  mfu: 1.09%
    [rank0]:2024-04-10 15:50:42,135 - root - INFO - step:  7  loss:  9.7356  memory:  6.74GiB(7.09%)  wps: 85,526  mfu: 1.05%
    [rank0]:2024-04-10 15:50:42,232 - root - INFO - step:  8  loss:  9.4619  memory:  6.74GiB(7.09%)  wps: 85,349  mfu: 1.05%
    [rank0]:2024-04-10 15:50:42,396 - root - INFO - step:  9  loss:  9.2633  memory:  6.74GiB(7.09%)  wps: 50,402  mfu: 0.62%
    [rank0]:[rank0]:[W410 15:50:42.021475256 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
    [rank0]:2024-04-10 15:50:42,511 - root - INFO - step: 10  loss:  9.2156  memory:  6.74GiB(7.09%)  wps: 71,449  mfu: 0.88%
    [rank0]:NCCL version 2.20.5+cuda12.0
    ```
    gnadathur authored Apr 10, 2024
    Full SHA: 05c181d
  8. Added initial FSDP readme

    ghstack-source-id: a9204c68f2e315c878677be86c509fc8d6290ffd
    Pull Request resolved: pytorch#218
    awgu committed Apr 10, 2024
    Full SHA: b6414aa

Commits on Apr 11, 2024

  1. [TorchTrain][Checkpoint] Add model_weights_only option to train_config (pytorch#220)
    
    With `model_weights_only` set to True, we checkpoint model weights
    only at the end of the training.
    We only consider saving model weights at the end of training, so this
    won't affect preemption and training resume.
    
    With `model_weights_only = True`, the checkpoint is about 1/3 the size
    of a full checkpoint (74M at step 10 when training completes vs.
    212M at step 5). With this, the converted checkpoint (DCP -> torch.save)
    can be loaded with `torch.load(..., weights_only=True)`.
    
    ```
    (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ python -m torch.distributed.checkpoint.format_utils dcp_to_torch step-10 step-10-model-weights-only.pt 
    Converting checkpoint from step-10 to step-10-model-weights-only.pt using method: 'dcp_to_torch'
    (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ ls
    step-10  step-10-model-weights-only.pt  step-5
    (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ ls -h 
    step-10  step-10-model-weights-only.pt  step-5
    (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ du -h
    212M    ./step-5
    74M     ./step-10
    358M    .
    (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$  du -h step-10-model-weights-only.pt
    74M     step-10-model-weights-only.pt
    (pytorch-3.10) [[email protected] ~/local/torchtrain/test_runner_checkpoint_model_weights_only (main)]$ python3
    Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import torch
    >>> torch.load('step-10-model-weights-only.pt', weights_only=True)
    {'model': {'embeddings.freqs_cis': tensor([[ 1.0000+0.0000e+00j,  1.0000+0.0000e+00j,  1.0000+0.0000e+00j,
              ...,  1.0000+0.0000e+00j,  1.0000+0.0000e+00j,
              1.0000+0.0000e+00j],
    ```
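    
    For reference, a minimal Python sketch of the same convert-then-load flow (paths are illustrative, and it assumes the `dcp_to_torch_save` helper in `torch.distributed.checkpoint.format_utils`, the programmatic counterpart of the CLI used above):
    
    ```
    import torch
    from torch.distributed.checkpoint.format_utils import dcp_to_torch_save
    
    # Convert a DCP checkpoint directory into a single torch.save file
    # (same idea as `python -m torch.distributed.checkpoint.format_utils dcp_to_torch ...`).
    dcp_to_torch_save("step-10", "step-10-model-weights-only.pt")
    
    # A model-weights-only checkpoint can then be loaded with weights_only=True.
    state = torch.load("step-10-model-weights-only.pt", weights_only=True)
    print(state["model"].keys())
    ```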
    
    One additional change:
    enable logging on all ranks in `test_runner.py`.
    wz337 authored Apr 11, 2024
    Full SHA: 07a3ec8
  2. Rename to torchtitan (pytorch#221)

    Trying out a full renaming pass from torchtrain -> torchtitan,
    including:
    1. directory structure
    2. all names inside the repo itself.
    wanchaol authored Apr 11, 2024
    Full SHA: c22d1a8

Commits on Apr 12, 2024

  1. Full SHA: 55a0187
  2. Add 1 sec delay to rank 0 cleanup (pytorch#224)

    Add the delay as a short-term workaround for the TCPStore cleanup sync issue
    (pytorch/pytorch#123969)
    Test:
    Ran `TORCH_NCCL_ABORT_IN_DESTROY_PG=1
    CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 LOG_RANK=0,1,2,3
    ./run_llama_train.sh --checkpoint.folder
    ./test_runner_checkpoint_full_checkpoint` 10 times w/o failure.
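    
    A minimal sketch of the workaround's shape (illustrative only, assuming a typical torch.distributed teardown rather than the actual train.py code):
    
    ```
    import time
    
    import torch.distributed as dist
    
    def cleanup(delay_s: float = 1.0) -> None:
        # Sync all ranks, then let rank 0 linger briefly before tearing down the
        # process group, so the other ranks can finish their own shutdown before
        # rank 0 (which typically hosts the TCPStore) exits.
        dist.barrier()
        if dist.get_rank() == 0:
            time.sleep(delay_s)
        dist.destroy_process_group()
    ```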
    gnadathur authored Apr 12, 2024
    Full SHA: 2373509
  3. [Torchtrain][Checkpoint] Add support to allow dtype conversion (pytorch#222)
    
    Adds a checkpoint.export_dtype field: we allow dtype conversion only
    when checkpointing model weights only and the current dtype is not
    the same as the export dtype at the end of the training.
    
    Also adds a change to get rid of the `freqs_cis` buffer when exporting.
    
    
    We can see that with export_dtype=bf16, the model weights are about half
    the size compared to export_dtype=fp32.
    ```
    # model_weights_only=false
    (pytorch-3.10) [[email protected] ~/local/torchtrain (add_export_dtype)]$ du -h test_runner_checkpoint_full_checkpoint
    212M    test_runner_checkpoint_full_checkpoint/step-5
    212M    test_runner_checkpoint_full_checkpoint/step-10
    212M    test_runner_checkpoint_full_checkpoint/step-15
    212M    test_runner_checkpoint_full_checkpoint/step-20
    846M    test_runner_checkpoint_full_checkpoint
    
    # model_weights_only=true and export_dtype = fp32
    (pytorch-3.10) [[email protected] ~/local/torchtrain (add_export_dtype)]$ du -h test_runner_checkpoint_model_weights_only
    212M    test_runner_checkpoint_model_weights_only/step-5
    70M     test_runner_checkpoint_model_weights_only/step-10
    281M    test_runner_checkpoint_model_weights_only
    
    # model_weights_only=true and export_dtype = bf16
    (pytorch-3.10) [[email protected] ~/local/torchtrain (add_export_dtype)]$ du -h test_runner_checkpoint_model_weights_only_bf16
    212M    test_runner_checkpoint_model_weights_only_bf16/step-5
    35M     test_runner_checkpoint_model_weights_only_bf16/step-10
    247M    test_runner_checkpoint_model_weights_only_bf16
    ```
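    
    A rough sketch of the export path described above (a hypothetical helper, not the actual torchtitan code): cast floating-point weights to the export dtype and drop the `freqs_cis` buffer before saving.
    
    ```
    import torch
    
    def export_model_weights(model: torch.nn.Module, export_dtype: torch.dtype = torch.bfloat16) -> dict:
        # Keep model weights only, cast floating-point tensors to the export
        # dtype, and skip the freqs_cis buffer entirely.
        weights = {
            name: tensor.to(export_dtype) if tensor.is_floating_point() else tensor
            for name, tensor in model.state_dict().items()
            if "freqs_cis" not in name
        }
        return {"model": weights}
    ```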
    wz337 authored Apr 12, 2024
    Full SHA: fd5ad5a
  4. Full SHA: 009b14f

Commits on Apr 15, 2024

  1. codebase cleanup

    ghstack-source-id: 33295ce9c9038163e903867cd81799e8848cc749
    Pull Request resolved: pytorch#228
    tianyu-l committed Apr 15, 2024
    Full SHA: c7d5865

Commits on Apr 16, 2024

  1. Update README to reflect positioning (pytorch#229)

    as titled, update README to reflect our positioning for the repo
    wanchaol authored Apr 16, 2024
    Full SHA: f86bfb2
  2. First release readme (pytorch#227)

    Reworked the README to highlight the first release and feature set.
    Q: use our logo? (I think it adds some spark.)
    
    Visual preview:
    <img width="898" alt="Screenshot 2024-04-14 at 7 02 39 PM"
    src="https://github.com/pytorch/torchtitan/assets/46302957/60b4b6a8-c4f3-41a8-8d8d-27b924f8de15">
    lessw2020 authored Apr 16, 2024
    Full SHA: a10262a
  3. Full SHA: a0a7ff7
  4. use permalink for logo image (pytorch#232)

    Update the logo to a permalink to ensure it is viewable by all.
    lessw2020 authored Apr 16, 2024
    Full SHA: d8b7c7f
  5. [TorchTitan][Checkpoint] Move checkpoint folder under dump_folder and a few config updates (pytorch#230)
    
    Let CheckpointManager take the entire job_config as an arg so we can keep
    train.py a little bit cleaner (see the sketch after the list below).
    
    Discussed with @tianyu-l and made a few additional changes, including:
    1. Rename "run_profiler" to "enable_profiling"
    2. Add an "enable_checkpoint" flag so it is consistent to
    "enable_profiling" or "enable_tensorboard". We feel like this is a
    little bit more explicit.
    3. Change the default checkpoint folder to be ".outputs/checkpoint" when
    checkpoint is enabled.
    4. Rename "folder" in [checkpiont]" to be "checkpoint_folder"
    5. Change save_traces_folder to be "./outputs/profile_trace" from
    ".outputs/profiling/traces".
    wz337 authored Apr 16, 2024
    Full SHA: 6596219
  6. use combo of html and local file src for logo (pytorch#234)

    It seems the permalink for the logo is not fully working as expected,
    thus switching to a combo of HTML plus a local file reference for src.
    lessw2020 authored Apr 16, 2024
    Configuration menu
    Copy the full SHA
    1601d35 View commit details
    Browse the repository at this point in the history
  7. add performance -- infra metrics and loss curves (pytorch#237) (pytor…

    …ch#238)
    
    Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
    bottom):
    * __->__ pytorch#237
    
    
    WPS / MFU numbers, and loss curves jobs can be found from this tracking
    [spreadsheet](https://docs.google.com/spreadsheets/d/11kcula5ybuABSZkm2OlFng5NQ9_rnVB-KRyeQq6P7fo/edit#gid=0).
    
    Co-authored-by: tianyu-l <[email protected]>
    lessw2020 and tianyu-l authored Apr 16, 2024
    Configuration menu
    Copy the full SHA
    63d752b View commit details
    Browse the repository at this point in the history
  8. Configuration menu
    Copy the full SHA
    10b572d View commit details
    Browse the repository at this point in the history
  9. Configuration menu
    Copy the full SHA
    7781fd7 View commit details
    Browse the repository at this point in the history
  10. Configuration menu
    Copy the full SHA
    441b33f View commit details
    Browse the repository at this point in the history
  11. Update README (pytorch#242)

    wanchaol authored Apr 16, 2024
    Configuration menu
    Copy the full SHA
    53dc5eb View commit details
    Browse the repository at this point in the history
  12. Add torchtune checkpoint link, modify product position statement loca…

    …tion (pytorch#241)
    
    This PR:
    1 - adds a feature note and link to the checkpoint doc on supporting
    torchtitan weights being saved and loaded into torchtune for fine
    tuning.
    2 - moves the product position info from top of page to bottom.
    lessw2020 authored Apr 16, 2024
    Configuration menu
    Copy the full SHA
    16701c3 View commit details
    Browse the repository at this point in the history
  13. Configuration menu
    Copy the full SHA
    b889f3d View commit details
    Browse the repository at this point in the history
  14. minor doc updates - remove asynch checkpt ref, grammar on prod positi…

    …on, update checkpointing from 5 to 500 (pytorch#243)
    
    3 minor readme / doc updates. 
    1 - remove : and please note from product position statement.
    2 - remove (asynch checkpointing) from current feature listing of dist
    checkpointing (it's noted as pending feature).
    3 - update default checkpoint interval from 5 to 500
    lessw2020 authored Apr 16, 2024
    Configuration menu
    Copy the full SHA
    b60c6bd View commit details
    Browse the repository at this point in the history
  15. Fix multi-line string usage (pytorch#244)

    Summary: use `"""` for multi-line strings instead of tuple syntax, which
    breaks argparse.
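    For reference, a minimal repro of the failure mode (illustrative snippet, not the fixed torchtitan code): parenthesized, comma-separated string literals form a tuple rather than one string, so argparse would receive a tuple for its help text.

    ```python
    # Trailing commas make this a 2-element tuple, not a single string.
    bad_help = (
        "The first line of help text,",
        "and the second line.",
    )
    # A triple-quoted string stays a single string.
    good_help = """The first line of help text,
    and the second line."""

    assert isinstance(bad_help, tuple)
    assert isinstance(good_help, str)
    ```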
    
    Test Plan: ```
    ============================= test session starts
    ============================== platform linux -- Python 3.10.14,
    pytest-8.1.1, pluggy-1.4.0 --
    /home/gnadathur/local/a/pytorch-env/bin/python cachedir: .pytest_cache
    hypothesis profile 'default' ->
    database=DirectoryBasedExampleDatabase(PosixPath('/data/users/gnadathur/a/torchtitan/.hypothesis/examples'))
    benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False
    min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10
    warmup=False warmup_iterations=100000) rootdir:
    /data/users/gnadathur/a/torchtitan
    configfile: pyproject.toml
    plugins: hypothesis-6.100.1, benchmark-4.0.0, typeguard-4.2.1,
    cov-5.0.0, hydra-core-1.3.2 collecting ... collected 6 items
    
    test/test_job_config.py::TestJobConfig::test_command_line_args PASSED [
    16%]
    test/test_job_config.py::TestJobConfig::test_job_config_file PASSED [
    33%]
    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    PASSED [ 50%]
    test/test_job_config.py::TestJobConfig::test_empty_config_file PASSED [
    66%]
    
    test/test_job_config.py::TestJobConfig::test_job_config_file_cmd_overrides
    PASSED [ 83%]
    test/test_job_config.py::TestJobConfig::test_print_help PASSED [100%]
    
    ---------- coverage: platform linux, python 3.10.14-final-0 ----------
    Coverage XML written to file coverage.xml
    
    
    ============================= slowest 20 durations
    =============================
    0.00s call     test/test_job_config.py::TestJobConfig::test_print_help
    0.00s call
    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    0.00s call test/test_job_config.py::TestJobConfig::test_job_config_file
    0.00s call
    test/test_job_config.py::TestJobConfig::test_job_config_file_cmd_overrides
    0.00s call
    test/test_job_config.py::TestJobConfig::test_empty_config_file
    0.00s call
    test/test_job_config.py::TestJobConfig::test_command_line_args
    0.00s setup
    test/test_job_config.py::TestJobConfig::test_command_line_args
    0.00s teardown
    test/test_job_config.py::TestJobConfig::test_command_line_args
    0.00s teardown
    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    0.00s teardown
    test/test_job_config.py::TestJobConfig::test_job_config_file
    0.00s setup
    test/test_job_config.py::TestJobConfig::test_job_config_file_cmd_overrides
    0.00s setup test/test_job_config.py::TestJobConfig::test_job_config_file
    0.00s teardown test/test_job_config.py::TestJobConfig::test_print_help
    0.00s setup
    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
    0.00s setup
    test/test_job_config.py::TestJobConfig::test_empty_config_file
    0.00s setup    test/test_job_config.py::TestJobConfig::test_print_help
    0.00s teardown
    test/test_job_config.py::TestJobConfig::test_job_config_file_cmd_overrides
    0.00s teardown
    test/test_job_config.py::TestJobConfig::test_empty_config_file
    ============================== 6 passed in 0.19s
    ===============================
    ```
    gnadathur authored Apr 16, 2024
    Configuration menu
    Copy the full SHA
    09d0047 View commit details
    Browse the repository at this point in the history
  16. polish toml files

    ghstack-source-id: 287d31e9a14861244f1292f61604a296fb7d4e11
    Pull Request resolved: pytorch#245
    tianyu-l committed Apr 16, 2024
    Configuration menu
    Copy the full SHA
    c9454d3 View commit details
    Browse the repository at this point in the history
  17. Configuration menu
    Copy the full SHA
    9537825 View commit details
    Browse the repository at this point in the history

Commits on Apr 17, 2024

  1. fix default max_seq_len for freq_cis init (pytorch#248)

    as titled, it looks like the llama2 default is 2048 instead of the current
    number (source
    https://github.com/meta-llama/llama/blob/main/llama/model.py#L31)
    wanchaol authored Apr 17, 2024
    Configuration menu
    Copy the full SHA
    7af51cf View commit details
    Browse the repository at this point in the history
  2. set max_seq_len before training to make it align with input data (pyt…

    …orch#249)
    
    as titled, we need to set this to get the accurate seq_length from
    the dataloader config. This ensures max_seq_len is always correct,
    so that the rope init is always correct.
    
    <img width="946" alt="Screenshot 2024-04-17 at 1 00 29 PM"
    src="https://github.com/pytorch/torchtitan/assets/9443650/39942187-cf37-4cef-b380-644a1a9b9d35">
    wanchaol authored Apr 17, 2024
    Configuration menu
    Copy the full SHA
    0c655b8 View commit details
    Browse the repository at this point in the history
  3. fix pypi docs

    ghstack-source-id: e7f7f4d6f1685072ded6da899bac3ed1ba22dffa
    Pull Request resolved: pytorch#247
    tianyu-l committed Apr 17, 2024
    Configuration menu
    Copy the full SHA
    9949284 View commit details
    Browse the repository at this point in the history

Commits on Apr 18, 2024

  1. update dataset to use c4

    ghstack-source-id: 7c390da9d746a75a8c93811c21fb92fb418ae08b
    Pull Request resolved: pytorch#252
    tianyu-l committed Apr 18, 2024
    Configuration menu
    Copy the full SHA
    bfe9998 View commit details
    Browse the repository at this point in the history
  2. Add c4_mini, a local 45K dataset (subset of c4) (pytorch#253)

    This PR adds a 45K (and thus just under the github 100MB limit) local
    dataset.
    This enables:
    a - a ready-to-run dataset for users to run the debug model with
    b - a local dataset for CI
    c - a dataset that does not rely on a HuggingFace connection (recall when
    HF went down and everything came to a halt).
    
    <img width="1275" alt="Screenshot 2024-04-17 at 8 09 13 PM"
    src="https://github.com/pytorch/torchtitan/assets/46302957/89df4ea8-37f4-4705-a6ed-4ca9415409f3">
    lessw2020 authored Apr 18, 2024
    Configuration menu
    Copy the full SHA
    f80223b View commit details
    Browse the repository at this point in the history
  3. remove logo, update pre-release date to 4/18 (pytorch#254)

    as per title - remove logo until we have marketing approval and update
    readme pre-release date from 4/16 to 4/18.
    lessw2020 authored Apr 18, 2024
    Configuration menu
    Copy the full SHA
    6926922 View commit details
    Browse the repository at this point in the history
  4. add intro video (pytorch#233)

    Testing embedding a video into the readme.
    Note that embedded videos are not supported, so the best we can do here
    is mimic one with a thumbnail and play button, which then jumps you to YT
    to play the video.
    lessw2020 authored Apr 18, 2024
    Configuration menu
    Copy the full SHA
    d6f72e2 View commit details
    Browse the repository at this point in the history
  5. add performance file to show convergence with 64 a100s (pytorch#255)

    add performance.md to show the convergence curves (file is from
    @tianyu-l ).
    lessw2020 authored Apr 18, 2024
    Configuration menu
    Copy the full SHA
    395a526 View commit details
    Browse the repository at this point in the history

Commits on Apr 20, 2024

  1. Support Llama3 8b/70b (pytorch#256)

    This PR adds support for Llama3 8b/70b, mainly it:
    - add tiktoken tokenizer, add instructions to download the tokenizer
    - add options for the llama model to support Llama3
    - add Llama3 8b/70b configs
    wanchaol authored Apr 20, 2024
    Configuration menu
    Copy the full SHA
    df2dcc7 View commit details
    Browse the repository at this point in the history

Commits on Apr 22, 2024

  1. polish llama 3 setup

    ghstack-source-id: 4dd1cdb033e840e00cacd98339780424231b595b
    Pull Request resolved: pytorch#257
    tianyu-l committed Apr 22, 2024
    Configuration menu
    Copy the full SHA
    2db26cf View commit details
    Browse the repository at this point in the history

Commits on Apr 23, 2024

  1. reenable integration tests with a test tokenizer (pytorch#259)

    as titled, the test tokenizer borrowed from torchtune
    https://github.com/pytorch/torchtune/blob/main/tests/assets/tiktoken_small.model,
    where this small test model is offline generated from
    https://gist.github.com/ebsmothers/54b133dd87db6679b14318545aaa2de4 so
    it should have no correlation with any specific model/data
    wanchaol authored Apr 23, 2024
    Configuration menu
    Copy the full SHA
    4b60829 View commit details
    Browse the repository at this point in the history

Commits on Apr 24, 2024

  1. Configuration menu
    Copy the full SHA
    b2ee158 View commit details
    Browse the repository at this point in the history
  2. De-dup repeated freqs_cis computation code

    ghstack-source-id: b4fe7f63f15bab367cf00b5d408eb43c640541c2
    Pull Request resolved: pytorch#262
    awgu committed Apr 24, 2024
    Configuration menu
    Copy the full SHA
    3b51460 View commit details
    Browse the repository at this point in the history
  3. update readme.md and performance.md

    ghstack-source-id: a9bd1d33bf7bc9f5055a645c9639bcbe628afbfb
    Pull Request resolved: pytorch#258
    tianyu-l committed Apr 24, 2024
    Configuration menu
    Copy the full SHA
    1ea476e View commit details
    Browse the repository at this point in the history
  4. followup changes to allow unsupported datasets

    ghstack-source-id: 34b380d251e0a80ac5328fdaeb33a1e488f9c735
    Pull Request resolved: pytorch#261
    tianyu-l committed Apr 24, 2024
    Configuration menu
    Copy the full SHA
    f8863bd View commit details
    Browse the repository at this point in the history
  5. fix ac 'checkpointing' spelling, minor spacing tweaks (pytorch#265)

    This PR is mainly to fix the spelling where activation checkpointing is
    missing an n... (**checkpoiting**).
    Not sure how I missed it earlier but it's glaring when you see the
    charts in visual form (vs text).
    
    <img width="578" alt="Screenshot 2024-04-24 at 2 45 25 PM"
    src="https://github.com/pytorch/torchtitan/assets/46302957/a81727b2-07b1-4d69-a0c1-743d74d2aa5a">
    
    fixed:
    <img width="592" alt="Screenshot 2024-04-24 at 3 10 30 PM"
    src="https://github.com/pytorch/torchtitan/assets/46302957/769e51db-4aa6-4dbd-99d8-7e691658e280">
    
    
    Also add a couple line breaks to help with layout, and one or two minor
    grammar updates.
    lessw2020 authored Apr 24, 2024
    Configuration menu
    Copy the full SHA
    157a12c View commit details
    Browse the repository at this point in the history

Commits on Apr 25, 2024

  1. Update legal terms (pytorch#269)

    Update to final legal license terms requested by Meta legal for release.
    lessw2020 authored Apr 25, 2024
    Configuration menu
    Copy the full SHA
    0891fa3 View commit details
    Browse the repository at this point in the history
  2. apply less heavy profiling

    ghstack-source-id: 2b74fe48dbeae0367a41214c6d0e8b1fcd608db8
    Pull Request resolved: pytorch#270
    tianyu-l committed Apr 25, 2024
    Configuration menu
    Copy the full SHA
    aea510d View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    e6d0d08 View commit details
    Browse the repository at this point in the history
  4. Doc Fixes (pytorch#273)

    * Image was very blurry
    * Markdown formatting was off
    * Simplified some sentences
    msaroufim authored Apr 25, 2024
    Configuration menu
    Copy the full SHA
    15057dd View commit details
    Browse the repository at this point in the history

Commits on Apr 26, 2024

  1. fix lr scheduling by checkpointing scheduler

    ghstack-source-id: 606aee2c4815173958b30ca34a3dbf8e90aed8de
    Pull Request resolved: pytorch#275
    tianyu-l committed Apr 26, 2024
    Configuration menu
    Copy the full SHA
    fd01061 View commit details
    Browse the repository at this point in the history
  2. insert barrier to profiler to resolve collectives timeout

    ghstack-source-id: cc29739b147fe1f52bfc5b791330fd7cf1659652
    Pull Request resolved: pytorch#271
    tianyu-l committed Apr 26, 2024
    Configuration menu
    Copy the full SHA
    4333aca View commit details
    Browse the repository at this point in the history
  3. some misc changes (pytorch#278)

    1. update readme
    2. small refactor to loss_parallel part
    wanchaol authored Apr 26, 2024
    Configuration menu
    Copy the full SHA
    a3b529a View commit details
    Browse the repository at this point in the history
  4. inherit stateful protocol where appropriate

    ghstack-source-id: d410f30ec715bfb4206459becb95abeed5a4ae02
    Pull Request resolved: pytorch#281
    tianyu-l committed Apr 26, 2024
    Configuration menu
    Copy the full SHA
    b898545 View commit details
    Browse the repository at this point in the history

Commits on Apr 29, 2024

  1. Fixed docs on HSDP sharding/replication dims

    ghstack-source-id: 77f650e8281dae12f2a7ccdb415be88f9abd88cc
    Pull Request resolved: pytorch#283
    awgu committed Apr 29, 2024
    Configuration menu
    Copy the full SHA
    935b572 View commit details
    Browse the repository at this point in the history
  2. Add more Float8 description (pytorch#284)

    # Summary
    
    Add more the possible options in the configs and add a note on how to
    get the dependency at the top of the file.
    drisspg authored Apr 29, 2024
    Configuration menu
    Copy the full SHA
    f61e0ba View commit details
    Browse the repository at this point in the history
  3. Remove unneeded torchvision/audio deps

    ghstack-source-id: dbd201ad2976537487123fa583c86ddab06a7387
    Pull Request resolved: pytorch#250
    wconstab committed Apr 29, 2024
    Configuration menu
    Copy the full SHA
    8697234 View commit details
    Browse the repository at this point in the history

Commits on Apr 30, 2024

  1. fix 3d mesh order (pytorch#288)

    as titled, fixes pytorch#286
    wanchaol authored Apr 30, 2024
    Configuration menu
    Copy the full SHA
    a6d2625 View commit details
    Browse the repository at this point in the history
  2. unify data loading from HF and from disk

    ghstack-source-id: 932e7cce828a15c788b34f07c264e119068777fe
    Pull Request resolved: pytorch#287
    tianyu-l committed Apr 30, 2024
    Configuration menu
    Copy the full SHA
    258f608 View commit details
    Browse the repository at this point in the history

Commits on May 1, 2024

  1. Add periodic integration test with signal (pytorch#289)

    Runs the integration test hourly and updates signal badge. Tested on
    existing integration test. I will update the badge with periodic test
    signal once workflow has landed in this PR.
    <img width="516" alt="Screenshot 2024-04-30 at 6 12 00 PM"
    src="https://github.com/pytorch/torchtitan/assets/1779702/8adaab3d-df18-483d-a39f-5af316b7edbc">
    gnadathur authored May 1, 2024
    Configuration menu
    Copy the full SHA
    10ef7a6 View commit details
    Browse the repository at this point in the history

Commits on May 2, 2024

  1. exclude embedding in MFU computation

    ghstack-source-id: 9daa99020c76fdfe429b6a9ee6d44fd1dd319fc3
    Pull Request resolved: pytorch#280
    tianyu-l committed May 2, 2024
    Configuration menu
    Copy the full SHA
    0c6ca90 View commit details
    Browse the repository at this point in the history
  2. Add support for seed checkpoint creation for meta-init flow

    Adds a new command, ./create_seed_checkpoint.sh, which largely
    reuses code inside train.py to create the model and then save its
    initial state as a step-0 checkpoint for use with the meta-initialization
    loading flow.
    
    ghstack-source-id: 3e1aa9eab847c1f1341f22772ca8ae3688883454
    Pull Request resolved: pytorch#172
    wconstab committed May 2, 2024
    Configuration menu
    Copy the full SHA
    e34d2ac View commit details
    Browse the repository at this point in the history
  3. remove unnecessary install of torchtitan

    ghstack-source-id: fa9aaf337b5489d88945f15b65a8ba8cc544ded6
    Pull Request resolved: pytorch#295
    tianyu-l committed May 2, 2024
    Configuration menu
    Copy the full SHA
    1480766 View commit details
    Browse the repository at this point in the history
  4. Remove unnecessary .to() inside model forward

    This appears to be a holdover from a previous way the initialization
    worked.
    
    freqs_cis should already be on gpu device after initialization.
    
    ghstack-source-id: 7159320d4ecfb436bd2193277a88c04d136e9ad0
    Pull Request resolved: pytorch#298
    wconstab committed May 2, 2024
    Configuration menu
    Copy the full SHA
    add0261 View commit details
    Browse the repository at this point in the history

Commits on May 3, 2024

  1. Fix the incorrect step log for profiler after resuming from a checkpo…

    …int (pytorch#293)
    
    Summary:
    The profiler currently maintains a counter locally and that counter is
    not synchronized with the checkpointed train step. This PR fixes the
    issue.
    fegin authored May 3, 2024
    Configuration menu
    Copy the full SHA
    3e2fa85 View commit details
    Browse the repository at this point in the history
  2. turn off dynamic shape for torch.compile (pytorch#297)

    as titled. This should make 1-D and 2-D work with the latest main
    build. Thanks @bdhirsh for all the fixes!
    
    We should figure out why dynamic shape gets turned on as a follow up
    wanchaol authored May 3, 2024
    Configuration menu
    Copy the full SHA
    5e84866 View commit details
    Browse the repository at this point in the history
  3. Renamed bsz to bs for consistency; removed dead code

    ghstack-source-id: bbedad3819ab9ef90b233209c34dd1dbc846b06a
    Pull Request resolved: pytorch#299
    awgu committed May 3, 2024
    Configuration menu
    Copy the full SHA
    8996249 View commit details
    Browse the repository at this point in the history

Commits on May 7, 2024

  1. Implement async_checkpoint

    Summary:
    This PR implements 2 different async checkpoint approaches. The first uses
    DCP.async_save; the other uses pinned memory + a separate process
    to avoid GIL issues.
    
    ghstack-source-id: 87fb6c28d7bc3e514c0bee7646be5188f1f66bbd
    Pull Request resolved: pytorch#313
    fegin committed May 7, 2024
    Configuration menu
    Copy the full SHA
    5d63fff View commit details
    Browse the repository at this point in the history

Commits on May 8, 2024

  1. simplify embedding + first transformer block TP (pytorch#314)

    as titled, we can directly specify the rowwise parallel embedding output
    layouts to be sharded on the sequence dim, so that we don't need the
    prepare-input step on the first layer.
    
    Switching to output_layouts = Shard(1) would also trigger reduce_scatter
    instead of allreduce for embedding layer, which could give some small
    perf wins
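    A sketch of the resulting TP plan entry, assuming the `tok_embeddings` module name used in this codebase and recent DTensor import paths (the exact plan may differ):

    ```python
    from torch.distributed._tensor import Replicate, Shard
    from torch.distributed.tensor.parallel import RowwiseParallel

    # Rowwise-parallel embedding whose output is sharded on the sequence dim
    # (Shard(1)), so the collective becomes a reduce_scatter instead of an
    # all_reduce and downstream layers consume sequence-sharded activations.
    embedding_plan = {
        "tok_embeddings": RowwiseParallel(
            input_layouts=Replicate(),
            output_layouts=Shard(1),
        ),
    }
    ```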
    wanchaol authored May 8, 2024
    Configuration menu
    Copy the full SHA
    26ff44f View commit details
    Browse the repository at this point in the history

Commits on May 10, 2024

  1. Only include checkpoints that have .metadata written (pytorch#315)

    .metadata may be missing in some checkpoints if some ranks did not
    checkpoint properly. This PR filters out checkpoints that do not have
    .metadata in them.
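    A minimal sketch of the filtering rule, assuming the step-N checkpoint folder layout (hypothetical helper, not the exact code):

    ```python
    import os
    import re

    def discover_complete_checkpoints(checkpoint_dir: str) -> list[str]:
        # A step-N folder only counts as a usable checkpoint if DCP finished
        # writing it, i.e. the .metadata file exists inside it.
        complete = []
        for name in sorted(os.listdir(checkpoint_dir)):
            path = os.path.join(checkpoint_dir, name)
            if re.fullmatch(r"step-\d+", name) and os.path.isfile(
                os.path.join(path, ".metadata")
            ):
                complete.append(path)
        return complete
    ```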
    liangluofb authored May 10, 2024
    Configuration menu
    Copy the full SHA
    ad46097 View commit details
    Browse the repository at this point in the history

Commits on May 13, 2024

  1. Refactor freqs_cis slice to be safer for PP

    Unchanged: we precompute freqs_cis for max_seqlen, >> seqlen for a given
    batch.
    
    Changed: instead of slicing self.freqs_cis down to seqlen at top level
    transformer based on the input token shape, we slice it down to seqlen
    inside a transformer layer after we have re-expanded to the full seqlen
    in cases where TP has sharded across seqlen.
    
    In the PP case, stage 1's input may be seqlen/TP instead of seqlen, but
    we do not generally know this.  That makes it hard for stage1 to slice
    freqs_cis correctly.  It's easy to do the slicing deeper inside, since
    at that point we do know the full seqlen unambiguously.
    
    Note: the full self.freqs_cis is stored in memory either way, and the
    thing passed into every layer is just a view. This change should not be
    material for memory usage or otherwise.
    
    ghstack-source-id: 20ef05e0734e53260366878dfe0fac5e1ab48f1d
    Pull Request resolved: pytorch#321
    wconstab committed May 13, 2024
    Configuration menu
    Copy the full SHA
    99729e9 View commit details
    Browse the repository at this point in the history
  2. Make Transformer tolerate missing layers for PP

    A few small changes here let the manual PP frontend 'reconfigure' a whole
    transformer model to a stage's portion simply by setting undesired
    layers to None (in cases of top level layers) or deleting them from the
    ModuleDict (for 'layers.*').
    
    These changes don't impact the FQNs of the remaining layers, which is
    critical for checkpoint load/save compatibility.
    
    ghstack-source-id: 48a7aafc89d86c3168f905560a4cd6bf4b5b9a12
    Pull Request resolved: pytorch#322
    wconstab committed May 13, 2024
    Configuration menu
    Copy the full SHA
    14d422f View commit details
    Browse the repository at this point in the history

Commits on May 15, 2024

  1. Use torch generic workflow for CI

    ghstack-source-id: b1fa8d8c1778ecb532ed71792ead9f4dbb067cf4
    Pull Request resolved: pytorch#325
    wconstab committed May 15, 2024
    Configuration menu
    Copy the full SHA
    ac94484 View commit details
    Browse the repository at this point in the history
  2. [checkpointing] import async checkpoint with pinned memory only when …

    …needed
    
    ghstack-source-id: e460a8d6458f191f7f589fc908974f896b514690
    Pull Request resolved: pytorch#333
    tianyu-l committed May 15, 2024
    Configuration menu
    Copy the full SHA
    41d69d2 View commit details
    Browse the repository at this point in the history

Commits on May 16, 2024

  1. Add a workflow to build torchtitan-ubuntu-20.04-clang12 Docker image …

    …for CI (pytorch#338)
    
    Adopt from PyTorch, this workflow will prepare the Docker image
    `torchtitan-ubuntu-20.04-clang12` for the CI.
    
    * Base on
    https://hub.docker.com/layers/nvidia/cuda/12.1.0-cudnn8-runtime-ubuntu20.04/images/sha256-35d5a8eb50ad37fe707a7611a4e20414c5bd2f168adca0cf1700fe2d58411759
    to include NVIDIA dependencies.
    * Install `dev-requirements.txt` and `requirements.txt`. I need to move
    these files from the top level to the `.ci/docker` directory and create
    softlinks for them because the docker build process will only look at
    `.ci/docker`. This is why PyTorch keeps its CI requirements
    files there.
    * Install clang or gcc
    * Install conda (with python 3.11)
    
    `torchtitan-ubuntu-20.04-clang12` can then be used as the input for
    `docker-image`.
    huydhn authored May 16, 2024
    Configuration menu
    Copy the full SHA
    6ed5237 View commit details
    Browse the repository at this point in the history

Commits on May 17, 2024

  1. Make pip install torch quiet

    ghstack-source-id: 55302fd52dd6ee452c795e89170d0b1299218c87
    Pull Request resolved: pytorch#342
    wconstab committed May 17, 2024
    Configuration menu
    Copy the full SHA
    2dca85e View commit details
    Browse the repository at this point in the history
  2. Make test_runner.py warn on non-empty output dir

    also wrap logic into functions and clean up global vars
    
    ghstack-source-id: 815c582011611a71005cc22bbd14310900465377
    Pull Request resolved: pytorch#343
    wconstab committed May 17, 2024
    Configuration menu
    Copy the full SHA
    3baba7b View commit details
    Browse the repository at this point in the history

Commits on May 21, 2024

  1. Expose mixed_precision dtype arguments

    add training.mixed_precision_param and .mixed_precision_reduce options
    
    refactor a util to map strings to torch dtypes
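    A hedged sketch of such a string-to-dtype util (illustrative names; the actual helper may accept more aliases):

    ```python
    import torch

    # Map config strings to torch dtypes in one central place.
    _STR_TO_DTYPE = {
        "float32": torch.float32,
        "bfloat16": torch.bfloat16,
        "float16": torch.float16,
    }

    def string_to_dtype(name: str) -> torch.dtype:
        return _STR_TO_DTYPE[name.lower()]
    ```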
    
    ghstack-source-id: 387e1ca13ad23e859d21d7760f858ee6e269a796
    Pull Request resolved: pytorch#348
    wconstab committed May 21, 2024
    Configuration menu
    Copy the full SHA
    5c69c02 View commit details
    Browse the repository at this point in the history
  2. Use stateful dataloader to checkpoint data iteration order and token …

    …buffer (pytorch#279)
    
    Summary: 
    
    Use the stateful_dataloader from torchdata
    (https://github.com/pytorch/data/tree/main/torchdata/stateful_dataloader)
    for storing the token buffer and iteration data order. It requires a
    dependency on the nightly build of torchdata >= 20240426.
    
    Also make sure the dataloader state has a different key per rank.
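    A minimal sketch of the per-rank keying, assuming torchdata's StatefulDataLoader from the nightly noted above (hypothetical helper name):

    ```python
    from torchdata.stateful_dataloader import StatefulDataLoader

    def dataloader_state_for_checkpoint(dl: StatefulDataLoader, rank: int) -> dict:
        # Store each rank's dataloader state (token buffer + iteration order)
        # under a rank-specific key so states are not collapsed across ranks.
        return {f"dataloader_rank_{rank}": dl.state_dict()}
    ```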
    
    Test Plan:
    
    Tested locally by first running 30 steps (checkpointing every 5 steps)
    and capturing all the loss values. Then deleting the last 3 checkpoints
    and then re-run the training and the loss values from step 16-30 match
    with what we had earlier in the first run. Note that this requires
    changes in the train.py to enable a deterministic run.
    
    Reviewers: @tianyu-l 
    
    Subscribers: @andrewkho 
    
    Tasks:
    
    Tags:
    gokulavasan authored May 21, 2024
    Configuration menu
    Copy the full SHA
    8cc0b38 View commit details
    Browse the repository at this point in the history
  3. Add Pipeline Parallel (and 2D PP+FSDP) support

    runs PP+DP and PP+TP without issue,
    runs PP+TP+DP with decreasing loss, but fails DCP save
    
    Supports only simple schedules currently, gpipe and 1f1b.
    
    Adds a cmdline/toml arg for specifying split points, in a unified
    way between the tracer and manual frontends.

      e.g. the user can specify "layers.2,layers.4" as split points.

    Currently uses the manual frontend by default, but allows specifying the
    tracer frontend. The tracer frontend requires working around additional
    compatibility limitations, indicated by raising assertions, and is
    not ready for wider use yet.
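    A minimal sketch of parsing that unified value (hypothetical helper, not the actual frontend code):

    ```python
    def parse_split_points(arg: str) -> list[str]:
        # "layers.2,layers.4" -> ["layers.2", "layers.4"]
        return [fqn.strip() for fqn in arg.split(",") if fqn.strip()]

    assert parse_split_points("layers.2,layers.4") == ["layers.2", "layers.4"]
    ```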
    
    ghstack-source-id: d7e0a1342bc97d6f1bba9e647234d90688ad708f
    Pull Request resolved: pytorch#318
    wconstab committed May 21, 2024
    Configuration menu
    Copy the full SHA
    aafe0e8 View commit details
    Browse the repository at this point in the history

Commits on May 22, 2024

  1. fix periodic integration test and add helper message on torchdata i…

    …mport failure
    
    ghstack-source-id: 4db9ec111c83f7873253f19f0c95a997800e0f6b
    Pull Request resolved: pytorch#353
    tianyu-l committed May 22, 2024
    Configuration menu
    Copy the full SHA
    60f58b9 View commit details
    Browse the repository at this point in the history
  2. torch.compile each TransformerBlock instead of the whole model (pytor…

    …ch#268)
    
    This way we could temporarily enable 2-D parallel compile, and it might
    make sense to do transformer block compile in the future with PP (which
    we'll see).
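    A minimal sketch of the per-block compile, assuming the model keeps its TransformerBlocks in `model.layers` (a ModuleDict here); this is illustrative, not the exact parallelize code:

    ```python
    import torch

    def compile_each_block(model: torch.nn.Module) -> None:
        # Compile each TransformerBlock individually instead of wrapping the
        # whole model in torch.compile; dynamic shapes are disabled to match
        # the current 2-D parallel limitations.
        for name, block in model.layers.named_children():
            model.layers.register_module(name, torch.compile(block, dynamic=False))
    ```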
    
    We should figure out:
    1. dynamic shape issue when turning on 2D parallel
    2. full model compile issue for 2D parallel compile
    3. cache reusing currently does not work, enable it later
    wanchaol authored May 22, 2024
    Configuration menu
    Copy the full SHA
    9954e19 View commit details
    Browse the repository at this point in the history
  3. Make test_runner use separate logger with default INFO

    previous change to use logging from torchtitan caused stdout not
    to show up.
    
    ghstack-source-id: 30a77c59ba68043ffa844be0443d5351d9584fab
    Pull Request resolved: pytorch#352
    wconstab committed May 22, 2024
    Configuration menu
    Copy the full SHA
    f47f442 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    93a8053 View commit details
    Browse the repository at this point in the history
  5. Fix bug in PP output layer shape

    Mostly harmless bug: since the output shape of the last layer is not used
    for send/recv purposes, the runtime value overrides it no matter what value
    you configured it with.
    
    However, since adding in/out shape validation to pipeline lib in torch,
    this raises an error and has to be fixed.
    
    ghstack-source-id: 950e41529b7b506085ab280d8a492e345eaefd24
    Pull Request resolved: pytorch#354
    wconstab committed May 22, 2024
    Configuration menu
    Copy the full SHA
    0afb276 View commit details
    Browse the repository at this point in the history

Commits on May 23, 2024

  1. Update pipelining import after change on pytorch

    APIs conform to the pytorch rules.  This PR should be able to land
    safely after tonight's nightly pytorch build which includes the above
    PR.
    
    ghstack-source-id: c575bc7835472128c09798544caa38bf1908e5ca
    Pull Request resolved: pytorch#356
    wconstab committed May 23, 2024
    Configuration menu
    Copy the full SHA
    c73a59d View commit details
    Browse the repository at this point in the history

Commits on May 24, 2024

  1. update .gitignore to screen out slew of new temp files (pytorch#359)

    After updating today, I found a whole slew of various new temp files
    clogging up my source tab.
    This PR screens these out so that they don't accidentally get added in a
    PR and keeps your source tab change count correct.
    
    Example of issue without this PR:
    <img width="780" alt="Screenshot 2024-05-23 at 9 21 55 PM"
    src="https://github.com/pytorch/torchtitan/assets/46302957/41b7061a-41a0-4a95-938b-3fd9292a2f38">
    
    vs with this PR:
    <img width="661" alt="Screenshot 2024-05-23 at 10 07 16 PM"
    src="https://github.com/pytorch/torchtitan/assets/46302957/cccf8c5f-368d-40a8-b10f-f11ca37df2bc">
    lessw2020 authored May 24, 2024
    Configuration menu
    Copy the full SHA
    c161119 View commit details
    Browse the repository at this point in the history
  2. Add test for PP tracer frontend

    - switch to using public PipelineStage API
    - clean up some asserts in tracer codepath
    
    ghstack-source-id: 2d069b7d45c4f3c788dec8fc85d8a7e83e463fcd
    Pull Request resolved: pytorch#357
    wconstab committed May 24, 2024
    Configuration menu
    Copy the full SHA
    e593e7d View commit details
    Browse the repository at this point in the history

Commits on May 29, 2024

  1. only produce tensorboard logs on rank 0 by default

    ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243
    Pull Request resolved: pytorch#339
    tianyu-l committed May 29, 2024
    Configuration menu
    Copy the full SHA
    0779207 View commit details
    Browse the repository at this point in the history
  2. replace old torch dependency in requirements.txt

    ghstack-source-id: 8cbd62b97816ae8185b8a7e1aa9a7505f2780525
    Pull Request resolved: pytorch#372
    tianyu-l committed May 29, 2024
    Configuration menu
    Copy the full SHA
    f6ea139 View commit details
    Browse the repository at this point in the history

Commits on May 30, 2024

  1. Add --test option to specify test to run (pytorch#368)

    Usage:
    `--test <test_id>`
    
    Acceptable values: `test_id` in `build_test_list` (default: all)
    
    Example:
    ```
    rm -rf outputs && python test_runner.py outputs --test pp_gpipe
    ```
    kwen2501 authored May 30, 2024
    Configuration menu
    Copy the full SHA
    0fff2d2 View commit details
    Browse the repository at this point in the history
  2. use integration test as the badge shown on the homepage

    ghstack-source-id: 775591945ff5427cb7e5e9fc7592952b4c746341
    Pull Request resolved: pytorch#373
    tianyu-l committed May 30, 2024
    Configuration menu
    Copy the full SHA
    1877738 View commit details
    Browse the repository at this point in the history

Commits on May 31, 2024

  1. keep only latest k checkpoints (pytorch#366)

    Adds a config that purges old checkpoints. Useful for pretraining with
    frequent checkpointing and large step counts.
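    A minimal sketch of such a purge, assuming the step-N folder layout (hypothetical helper, not the actual implementation):

    ```python
    import os
    import re
    import shutil

    def purge_old_checkpoints(checkpoint_dir: str, keep_latest_k: int) -> None:
        # Delete all but the newest k step-N checkpoint folders; in this
        # sketch, keep_latest_k <= 0 means keep everything.
        if keep_latest_k <= 0:
            return
        steps = sorted(
            int(name.split("-")[1])
            for name in os.listdir(checkpoint_dir)
            if re.fullmatch(r"step-\d+", name)
        )
        for step in steps[:-keep_latest_k]:
            shutil.rmtree(os.path.join(checkpoint_dir, f"step-{step}"))
    ```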
    liangluofb authored May 31, 2024
    Configuration menu
    Copy the full SHA
    c48ae39 View commit details
    Browse the repository at this point in the history

Commits on Jun 3, 2024

  1. Make seed checkpoint creation work on CPU

    ghstack-source-id: 4eb7a6e10812a11c5fd8589e2ff86e5bdb36f968
    Pull Request resolved: pytorch#377
    wconstab committed Jun 3, 2024
    Configuration menu
    Copy the full SHA
    3227d50 View commit details
    Browse the repository at this point in the history
  2. Fix start/stop layer parsing

    ghstack-source-id: 9d52af302c797e9ac81f1113506f3bab261bf312
    Pull Request resolved: pytorch#380
    wconstab committed Jun 3, 2024
    Configuration menu
    Copy the full SHA
    fbc4aa0 View commit details
    Browse the repository at this point in the history
  3. Use general way to access and update submodules

    ghstack-source-id: ba1d77e5825a26632fe9b7509a88b44509cac45f
    Pull Request resolved: pytorch#381
    kwen2501 committed Jun 3, 2024
    Configuration menu
    Copy the full SHA
    ff3c6e2 View commit details
    Browse the repository at this point in the history

Commits on Jun 4, 2024

  1. Make metrics logging work for pipeline parallelism

    Avoid complicating the ux and leave the status quo of 2 user-selectable
    behaviors:
     - log from rank 0 (the default)
     - log from all ranks (not the default)
    
    Modify the meaning of 'log from rank 0' to log from rank 0 in
    non-pipeline parallel runs, and log from the local rank 0 within the
    last pipeline-parallel stage group if pp is enabled.  (note: earlier
    pipeline stages still produce some metrics like mfu/memory, but do not
    compute loss.)
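    A minimal sketch of that rank-selection rule (argument names are illustrative):

    ```python
    def is_metrics_rank(dp_rank: int, tp_rank: int, pp_rank: int, pp_degree: int) -> bool:
        # Without PP, log from global rank 0; with PP, log from the local
        # rank 0 of the last pipeline stage group, where loss is computed.
        on_last_stage = (pp_degree <= 1) or (pp_rank == pp_degree - 1)
        return on_last_stage and dp_rank == 0 and tp_rank == 0
    ```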
    
    ghstack-source-id: 7f60d1045f240327ae41ade3a353aff19d2f289a
    Pull Request resolved: pytorch#383
    wconstab committed Jun 4, 2024
    Configuration menu
    Copy the full SHA
    a1f9edb View commit details
    Browse the repository at this point in the history

Commits on Jun 5, 2024

  1. [RFC] Allow ModelWrapper and OptimizerWrapper to accept multiple models

    and optimizers
    
    ghstack-source-id: 190220813ece188728a3c776e6839a323009f719
    Pull Request resolved: pytorch#360
    fegin authored and wconstab committed Jun 5, 2024
    Configuration menu
    Copy the full SHA
    9d25778 View commit details
    Browse the repository at this point in the history
  2. Add 3D support

    Enables PP+DP+TP and adds CI test case that runs on 8-gpu CI runner.
    
    ghstack-source-id: 7e2d6879d39e78fc7e6d46fd775bb6dfe08ff708
    Pull Request resolved: pytorch#344
    wconstab committed Jun 5, 2024
    Configuration menu
    Copy the full SHA
    4eb4bfc View commit details
    Browse the repository at this point in the history

Commits on Jun 6, 2024

  1. [torchtitan][optim] Add fused as an option in train config (pytorch#355)

    With these three PRs landed, we can now support the option fused=True in
    torchtitan for Adam and AdamW optimizer.
    
    pytorch/pytorch#125369
    pytorch/pytorch#126423
    pytorch/pytorch#126750
    
    Run performance evaluation on 8 A100 DevGPU: 1000 steps on 1D DP default
    [llama_8b.toml](https://github.com/pytorch/torchtitan/blob/main/train_configs/llama3_8b.toml).
    
    Observation: 
    For `fused = True` and `fused = False`, we observed similar loss curve
    and memory usage.
    wps is + ~100 and mfu is + 1.5-2% when fused = True. 
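    A minimal sketch of wiring the flag through (illustrative, not the exact torchtitan optimizer builder):

    ```python
    import torch

    def build_optimizer(model: torch.nn.Module, name: str, lr: float, fused: bool):
        # Both torch.optim.Adam and torch.optim.AdamW accept fused=True,
        # which uses the fused CUDA implementation of the step.
        optimizer_cls = {"Adam": torch.optim.Adam, "AdamW": torch.optim.AdamW}[name]
        return optimizer_cls(model.parameters(), lr=lr, fused=fused)
    ```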
    
    Below are the logs for the last 100 steps for both.
    ```
    **Fused = False**
    [rank0]:2024-06-05 12:45:06,227 - root - INFO - Finished dumping traces in 0.37 seconds
    [rank0]:2024-06-05 12:45:37,677 - root - INFO - step: 910  loss:  4.6039  memory: 59.48GiB(75.15%)  wps: 2,217  mfu: 41.16%
    [rank0]:2024-06-05 12:46:08,843 - root - INFO - step: 920  loss:  4.6427  memory: 59.48GiB(75.15%)  wps: 2,632  mfu: 48.85%
    [rank0]:2024-06-05 12:46:40,052 - root - INFO - step: 930  loss:  4.6339  memory: 59.48GiB(75.15%)  wps: 2,628  mfu: 48.78%
    [rank0]:2024-06-05 12:47:11,243 - root - INFO - step: 940  loss:  4.5964  memory: 59.48GiB(75.15%)  wps: 2,631  mfu: 48.84%
    [rank0]:2024-06-05 12:47:42,655 - root - INFO - step: 950  loss:  4.6477  memory: 59.48GiB(75.15%)  wps: 2,611  mfu: 48.47%
    [rank0]:2024-06-05 12:48:13,890 - root - INFO - step: 960  loss:  4.8137  memory: 59.48GiB(75.15%)  wps: 2,626  mfu: 48.75%
    [rank0]:2024-06-05 12:48:45,110 - root - INFO - step: 970  loss:  4.5962  memory: 59.48GiB(75.15%)  wps: 2,628  mfu: 48.78%
    [rank0]:2024-06-05 12:49:16,333 - root - INFO - step: 980  loss:  4.5450  memory: 59.48GiB(75.15%)  wps: 2,627  mfu: 48.76%
    [rank0]:2024-06-05 12:49:47,561 - root - INFO - step: 990  loss:  4.5840  memory: 59.48GiB(75.15%)  wps: 2,627  mfu: 48.76%
    [rank0]:2024-06-05 12:50:18,933 - root - INFO - step: 1000  loss:  4.5351  memory: 59.48GiB(75.15%)  wps: 2,615  mfu: 48.53%
    [rank0]:2024-06-05 12:50:23,692 - root - INFO - Dumping traces at step 1000
    [rank0]:2024-06-05 12:50:24,041 - root - INFO - Finished dumping traces in 0.35 seconds
    [rank0]:2024-06-05 12:50:24,422 - root - INFO - Sleeping 2 seconds for other ranks to complete
    [rank0]:2024-06-05 12:50:26,424 - root - INFO - Training completed
    
    **Fused = True**
    [rank0]:2024-06-05 14:55:42,894 - root - INFO - Finished dumping traces in 0.30 seconds
    [rank0]:2024-06-05 14:56:13,582 - root - INFO - step: 910  loss:  4.6091  memory: 59.48GiB(75.15%)  wps: 2,341  mfu: 43.46%
    [rank0]:2024-06-05 14:56:43,765 - root - INFO - step: 920  loss:  4.6468  memory: 59.48GiB(75.15%)  wps: 2,718  mfu: 50.45%
    [rank0]:2024-06-05 14:57:13,971 - root - INFO - step: 930  loss:  4.6365  memory: 59.48GiB(75.15%)  wps: 2,715  mfu: 50.40%
    [rank0]:2024-06-05 14:57:44,172 - root - INFO - step: 940  loss:  4.6021  memory: 59.48GiB(75.15%)  wps: 2,716  mfu: 50.41%
    [rank0]:2024-06-05 14:58:14,353 - root - INFO - step: 950  loss:  4.6522  memory: 59.48GiB(75.15%)  wps: 2,718  mfu: 50.45%
    [rank0]:2024-06-05 14:58:44,536 - root - INFO - step: 960  loss:  4.8163  memory: 59.48GiB(75.15%)  wps: 2,717  mfu: 50.44%
    [rank0]:2024-06-05 14:59:14,683 - root - INFO - step: 970  loss:  4.6026  memory: 59.48GiB(75.15%)  wps: 2,721  mfu: 50.51%
    [rank0]:2024-06-05 14:59:44,840 - root - INFO - step: 980  loss:  4.5491  memory: 59.48GiB(75.15%)  wps: 2,720  mfu: 50.49%
    [rank0]:2024-06-05 15:00:15,009 - root - INFO - step: 990  loss:  4.5859  memory: 59.48GiB(75.15%)  wps: 2,719  mfu: 50.47%
    [rank0]:2024-06-05 15:00:45,228 - root - INFO - step: 1000  loss:  4.5396  memory: 59.48GiB(75.15%)  wps: 2,714  mfu: 50.38%
    [rank0]:2024-06-05 15:00:49,455 - root - INFO - Dumping traces at step 1000
    [rank0]:2024-06-05 15:00:49,756 - root - INFO - Finished dumping traces in 0.30 seconds
    [rank0]:2024-06-05 15:00:50,336 - root - INFO - Sleeping 2 seconds for other ranks to complete
    [rank0]:2024-06-05 15:00:52,339 - root - INFO - Training completed
    ```
    wz337 authored Jun 6, 2024
    Configuration menu
    Copy the full SHA
    40f8fd0 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    3bbe3d9 View commit details
    Browse the repository at this point in the history

Commits on Jun 7, 2024

  1. Abstract out out optimizer params and update foreach calling conventi…

    …on (pytorch#386)
    
    # Summary
    Updates the behavior to call foreach when we are not using fused for the
    optimizer
    drisspg authored Jun 7, 2024
    Configuration menu
    Copy the full SHA
    d953107 View commit details
    Browse the repository at this point in the history

Commits on Jun 9, 2024

  1. DeviceMesh BC fix (pytorch#387)

    fix BC issues
    
    There's another pipeline bc issue :(
    wanchaol authored Jun 9, 2024
    Configuration menu
    Copy the full SHA
    cf37b61 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    9acdc6f View commit details
    Browse the repository at this point in the history

Commits on Jun 10, 2024

  1. fix missing tb logs

    ghstack-source-id: ac3501485faa093c8b9daacca9917805e2a987b7
    Pull Request resolved: pytorch#389
    tianyu-l committed Jun 10, 2024
    Configuration menu
    Copy the full SHA
    3e5c0aa View commit details
    Browse the repository at this point in the history
  2. add the 8-gpu test badge and use correct links for the integration te…

    …st badges
    
    ghstack-source-id: f198ee40b0d7ee9409feb8fb9539a73b822d756c
    Pull Request resolved: pytorch#390
    tianyu-l committed Jun 10, 2024
    Configuration menu
    Copy the full SHA
    032b9d1 View commit details
    Browse the repository at this point in the history

Commits on Jun 11, 2024

  1. Fix 1D PP tracer test

    forgot to enable tracer for tracer test in the last PR
    
    ghstack-source-id: 1cb137911f88daa47b57757346dad55aa736429e
    Pull Request resolved: pytorch#362
    kwen2501 authored and wconstab committed Jun 11, 2024
    Configuration menu
    Copy the full SHA
    91937ef View commit details
    Browse the repository at this point in the history

Commits on Jun 12, 2024

  1. del logits=(bs, seq_len, vocab_size) to save 3.9G memory (pytorch#391)

    logits has shape (bs, seq_len, vocab_size); call `del logits` to free it
    before backward.
    
    <img width="1607" alt="Screenshot 2024-06-12 at 11 10 36 AM"
    src="https://github.com/pytorch/torchtitan/assets/134637289/82db2792-59a3-40c4-9591-842be3dd9284">
    weifengpy authored Jun 12, 2024
    Configuration menu
    Copy the full SHA
    e29b6b4 View commit details
    Browse the repository at this point in the history
  2. Update contributing.md (pytorch#385)

    small update for contributing.md to include what packages to install and
    how to lint.
    H-Huang authored Jun 12, 2024
    Configuration menu
    Copy the full SHA
    d0b4092 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    000d43f View commit details
    Browse the repository at this point in the history

Commits on Jun 13, 2024

  1. enable TP fp8 allgather with PrepareFloat8ModuleInput (pytorch#393)

    This PR is a follow up PR to enable fp8 allgather in TP after these PR
    landed:
    * pytorch/pytorch#128431
    * pytorch-labs/float8_experimental#275
    
    One needs to update their pytorch/float8_experimental to include those
    changes in order to train with fp8.
    
    Since fp8 is not enabled as part of our integration tests yet, there
    should be no issues on CI or training runs that do not use fp8
    wanchaol authored Jun 13, 2024
    Configuration menu
    Copy the full SHA
    7fcf70d View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    a6b585f View commit details
    Browse the repository at this point in the history
  3. Fix SAC BC breaking and renaming to ac_freq (pytorch#397)

    as titled, SAC moved to a different public API, move to the new API to
    avoid CI breaking
    wanchaol authored Jun 13, 2024
    Configuration menu
    Copy the full SHA
    0bf344c View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    230300b View commit details
    Browse the repository at this point in the history

Commits on Jun 14, 2024

  1. enable TritonFusedRMSNorm with local_map annotation (pytorch#404)

    Summary:
    This PR enables the use of TritonFusedRMSNorm with Tensor Parallel with
    7%-8% performance gain compared to RMSNorm with TP. pytorch#364
    XilunWu authored Jun 14, 2024
    Configuration menu
    Copy the full SHA
    38496a3 View commit details
    Browse the repository at this point in the history
  2. Cosmetic changes to train.py

    ghstack-source-id: ce4a5b0b6b785ce595487c9d565a8af030c9d07b
    Pull Request resolved: pytorch#398
    kwen2501 committed Jun 14, 2024
    Configuration menu
    Copy the full SHA
    e99f237 View commit details
    Browse the repository at this point in the history
  3. Break down parallelize_llama for inference cases

    ghstack-source-id: fc8e221b5047337f59dea31f2c51d6168fe4fe88
    Pull Request resolved: pytorch#402
    kwen2501 committed Jun 14, 2024
    Configuration menu
    Copy the full SHA
    a96fb82 View commit details
    Browse the repository at this point in the history

Commits on Jun 17, 2024

  1. Change debugmodel to have 8 layers

    - make it possible to choose flavor per-test from test_runner.py
    
    This is useful for PP when more layers == more possibilities for
    schedules/num_stages, but we don't care about having a large model in
    terms of #parameters
    
    ghstack-source-id: fd3076ad591b4f51dd195a78bab5dbe2e4276b18
    Pull Request resolved: pytorch#403
    wconstab committed Jun 17, 2024
    Configuration menu
    Copy the full SHA
    ae3d2a9 View commit details
    Browse the repository at this point in the history

Commits on Jun 18, 2024

  1. Prepare train.py for model chunks for pipelining

    When using pipeline parallelism, a common technique  for reducing bubble
    size is to use schedules that specify more than one model chunk per
    physical rank.  e.g. pp degree 4 could have 8 pipeline stages, and rank
    0 could have stage 0 and stage 4.
    
    To generalize this concept without forking too much code in train.py, I
    make 'model_parts' a new container that either contains one model for
    non-PP or simple PP cases, and contains multiple model parts for complex
    PP cases.
    
    In general, this is tractable because we treat each model part the same:
    we create one optimizer per model part, and one lr scheduler per
    optimizer.  We apply spmd and compile individually to each model part.
    The general pattern is to loop over the model parts and perform an
    action on each part, which also works fine if the list size is 1.
    
    The rest of train.py and optimizer/lr_scheduler changes add syntax sugar
    to simplify calling a method on each model part or optimizer part.
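    A minimal sketch of the one-optimizer-per-part pattern (illustrative, not the exact code):

    ```python
    import torch

    def build_optimizers(model_parts: list[torch.nn.Module], lr: float):
        # One optimizer per model part; the same loop works unchanged when the
        # list holds a single model in the non-PP case.
        return [torch.optim.AdamW(part.parameters(), lr=lr) for part in model_parts]
    ```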
    
    ghstack-source-id: fd2982baae0cbeb5dcb695ef6509b7eec3299d6b
    Pull Request resolved: pytorch#406
    wconstab committed Jun 18, 2024
    Configuration menu
    Copy the full SHA
    f8e17f1 View commit details
    Browse the repository at this point in the history

Commits on Jun 19, 2024

  1. dump memory snapshot to analyze OOMs (pytorch#395)

    when setting `enable_memory_snapshot = true` in `.toml`
    * dump memory snapshots in case of OOMs. output folder is
    `memory_snapshot/iteration_x_exit`
    * dump regularly according to `profile_freq`. output folder is
    `memory_snapshot/iteration_x`
    * existing `.toml` works since `enable_memory_snapshot=False` by default
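    A hedged sketch of the underlying PyTorch memory-snapshot hooks this option builds on (output paths and frequency handling in torchtitan may differ; requires a CUDA device):

    ```python
    import torch

    # Start recording allocator events, run some iterations (or catch an OOM),
    # dump a snapshot viewable at pytorch.org/memory_viz, then stop recording.
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    # ... training iterations ...
    torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
    torch.cuda.memory._record_memory_history(enabled=None)
    ```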
    
    snapshot is an example of the dump when OOM happens
    
    <img width="1640" alt="Screenshot 2024-06-12 at 9 26 53 PM"
    src="https://github.com/pytorch/torchtitan/assets/134637289/6420799c-ae68-4b35-b8bb-f5b6ab3dd053">
    weifengpy authored Jun 19, 2024
    Configuration menu
    Copy the full SHA
    71b70b5 View commit details
    Browse the repository at this point in the history

Commits on Jun 20, 2024

  1. whole_model for fp8 (pytorch#414)

    train.py renamed `model` to `whole_model`
    pytorch#406
    
    fp8 still uses `model` and thus reports an error on `model not defined`;
    this PR fixes it:
    
    `build_fp8_linear(whole_model, job_config)`
    weifengpy authored Jun 20, 2024
    Configuration menu
    Copy the full SHA
    6117759 View commit details
    Browse the repository at this point in the history

Commits on Jun 21, 2024

  1. Add train loop support for looped PP schedules

    - refactor some per-model logic into helper functions
    
    ghstack-source-id: a2376627e2864deeb9e4fbf49cecd0990bc434ea
    Pull Request resolved: pytorch#358
    wconstab committed Jun 21, 2024
    Configuration menu
    Copy the full SHA
    04661a6 View commit details
    Browse the repository at this point in the history

Commits on Jun 25, 2024

  1. Set record_shapes=True for profiler

    ghstack-source-id: 6f1ed49d15ce311f1bf118820965cdb5309a8030
    Pull Request resolved: pytorch#419
    awgu committed Jun 25, 2024
    Configuration menu
    Copy the full SHA
    b1340a1 View commit details
    Browse the repository at this point in the history
  2. Improved repeat_kv eager perf

    ghstack-source-id: 39e484954814e61cdfb2ba661f0a98c83bc0ce60
    Pull Request resolved: pytorch#418
    awgu committed Jun 25, 2024
    Configuration menu
    Copy the full SHA
    be126a6 View commit details
    Browse the repository at this point in the history
  3. Adding FSDP Memory Tracking and Estimation

    ghstack-source-id: c8ed20fc585957bd164dd963307616a53991615d
    Pull Request resolved: pytorch#425
    sanketpurandare committed Jun 25, 2024
    Configuration menu
    Copy the full SHA
    342a07e View commit details
    Browse the repository at this point in the history
  4. Adding integration test for FSDP Memory Tracking and Estimation

    ghstack-source-id: cc224db8951ec7a133fd769845a4765cbedc6454
    Pull Request resolved: pytorch#426
    sanketpurandare committed Jun 25, 2024
    Configuration menu
    Copy the full SHA
    134addd View commit details
    Browse the repository at this point in the history

Commits on Jun 26, 2024

  1. by default disable heavy memory profiling

    ghstack-source-id: cad7b3c41fd60ec19c0e6e7d058e8aa00602a187
    Pull Request resolved: pytorch#430
    tianyu-l committed Jun 26, 2024
    Configuration menu
    Copy the full SHA
    f5171cb View commit details
    Browse the repository at this point in the history

Commits on Jun 27, 2024

  1. Add the option to turn on async-TP

    ghstack-source-id: 0a03379eeb3a63b2d1ad4dff84d0e61ca82b1bbf
    Pull Request resolved: pytorch#429
    yifuwang committed Jun 27, 2024
    Configuration menu
    Copy the full SHA
    1ec2ece View commit details
    Browse the repository at this point in the history

Commits on Jul 1, 2024

  1. Modifying memory estimation options and minor changes

    ghstack-source-id: 5f09824cddaed6585cc094095e1e95dd070d76f4
    Pull Request resolved: pytorch#435
    sanketpurandare committed Jul 1, 2024
    Configuration menu
    Copy the full SHA
    64d47fd View commit details
    Browse the repository at this point in the history

Commits on Jul 8, 2024

  1. add comment pointing to Sequence Parallel optimization example

    ghstack-source-id: 6fa0dcd4bca876e10a6a8349283fb940a59ad234
    Pull Request resolved: pytorch#438
    tianyu-l committed Jul 8, 2024
    Configuration menu
    Copy the full SHA
    6655204 View commit details
    Browse the repository at this point in the history
  2. switch float8 logic from Float8DynamicLinear to Float8Linear (pytorch…

    …#436)
    
    Summary:
    
    After pytorch-labs/float8_experimental#300,
    `Float8Linear` with default settings is equivalent to
    `Float8DynamicLinear`. This PR changes `torchtitan` to use
    `Float8Linear`.
    
    To support the new UX of `float8_experimental` better, I also switched
    the `fp8_linear` configuration to be a boolean on whether to swap the
    linears or not. In the future we can add new options on how to configure
    each linear (scaling type, scaling granularity, etc) - saving that for a
    future PR.
    
    Test Plan:
    
    ```
    // run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs,
    // verify performance and loss values do not change meaningfully between
    // baseline and this PR
    
    // baseline (before this PR)
    // 1. compile, bf16
    // 2. compile, float8
    // 3. compile, float8, fdsp_fp8_allgather=True
    // 4. compile, float8, fdsp_fp8_allgather=True, tp=2
    // logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce
    
    // experiment (this PR): repeat all of the above, but with Float8Linear
    // logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    vkuzo authored Jul 8, 2024
    Configuration menu
    Copy the full SHA
    8a1aa06 View commit details
    Browse the repository at this point in the history

Commits on Jul 10, 2024

  1. Removed _experimental_support_context_fn_in_torch_utils_checkpoint

    ghstack-source-id: 50b2d0c2b4c22e2f045cafd8630c16f3a8c6d35f
    Pull Request resolved: pytorch#444
    awgu committed Jul 10, 2024
    Configuration menu
    Copy the full SHA
    28762c8 View commit details
    Browse the repository at this point in the history
  2. Reordered TP parallel plan to follow execution order

    ghstack-source-id: b4924952adeb5f16d08b60faa54690762841c422
    Pull Request resolved: pytorch#445
    awgu committed Jul 10, 2024
    Configuration menu
    Copy the full SHA
    064730a View commit details
    Browse the repository at this point in the history
  3. Made some stylistic changes to apply_dp

    ghstack-source-id: fb78e9eb8aa406ba87d6ad6cf2229c1027dae42f
    Pull Request resolved: pytorch#446
    awgu committed Jul 10, 2024
    Configuration menu
    Copy the full SHA
    3e3a913 View commit details
    Browse the repository at this point in the history
  4. Refactored activation checkpointing

    ghstack-source-id: 785c7e47651cda97ea22d0147d14b8d061ce042d
    Pull Request resolved: pytorch#447
    awgu committed Jul 10, 2024
    347ddc0 View commit details
  5. compiled RMSNorm

    ghstack-source-id: c4efb81ec6acc5442955908cc376df3e6d889af3
    Pull Request resolved: pytorch#442
    tianyu-l committed Jul 10, 2024
    3ff7fbb View commit details

Commits on Jul 11, 2024

  1. Renamed parallel styles for transformer block weights

    ghstack-source-id: 5fb0bf3d08cacf27242ec0f85d5dd3cdc03b739e
    Pull Request resolved: pytorch#448
    awgu committed Jul 11, 2024
    562d7e2 View commit details
  2. Added type annotations and more stylistic changes

    ghstack-source-id: 1bd5b9d5abc8644785132f8eb2baaf8b1cfc5fb5
    Pull Request resolved: pytorch#449
    awgu committed Jul 11, 2024
    0ddf49b View commit details

Commits on Jul 15, 2024

  1. [Cleanup] Remove libuv from run_llama_train.sh

    libuv is now enabled by default.
    
    we can probably do without the educational blurb there, and don't need
    the env either since the default has landed.
    
    ghstack-source-id: 68c8d2abe7eb0777e2add8df7634367c31b7ec06
    Pull Request resolved: pytorch#453
    wconstab committed Jul 15, 2024
    535acf6 View commit details
  2. [Cleanup] Organize run_llama_train.sh options

    Just a little code motion but it looks cleaner to me this way
    
    ghstack-source-id: 055fbd557cd9cf189e6b9bd6a7048f1204e1dc5c
    Pull Request resolved: pytorch#454
    wconstab committed Jul 15, 2024
    ac72078 View commit details
  3. [Cleanup] Split run_llama_train.sh and run_memory_estimation.sh

    Make each script simpler to read
    
    ghstack-source-id: ba3aa65feb6e304736c73daf5bc8ab5fb254f196
    Pull Request resolved: pytorch#455
    wconstab committed Jul 15, 2024
    4b6cdc1 View commit details
  4. [Cleanup] Remove unused TRAINER_DIR

    This argument seems to be left over from older times; it is not used
    anywhere in the codebase.
    
    ghstack-source-id: abbcf82ed4d1b8fbb71c6a6b48acbc1296dbec64
    Pull Request resolved: pytorch#456
    wconstab committed Jul 15, 2024
    8fa11f0 View commit details
  5. Add educational code pointers to top level README

    ghstack-source-id: 522aa2fa0bf1679f55d9f3a8a38fdcd319d5e3df
    Pull Request resolved: pytorch#457
    wconstab committed Jul 15, 2024
    174c44a View commit details

Commits on Jul 16, 2024

  1. enable FSDP2 + fp8 all-gather and fix TP fp8 all-gather (pytorch#413)

    we have landed fp8 all-gather optimizations in float8_experimental
    pytorch-labs/float8_experimental#266
    
    this PR proposes torchtitan changes. also include fp8 in CI
    ```
    from float8_experimental.fsdp_utils import precompute_float8_dynamic_scale_for_fsdp
    # inside the training loop
    model(input).sum().backward()
    optim.step()
    precompute_float8_dynamic_scale_for_fsdp(model)
    ```
    
    FSDP2 fp8 all-gather is added to CI
    ```
    CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear
    CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather
    CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp
    ```
    
    TP fp8 all-gather is locally tested. Will add it to CI after
    uploading a new tokenizer with vocab size 2560 (divisible by 16)
    ```
    CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 1 --training.tensor_parallel_degree 4
    CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 2 --training.tensor_parallel_degree 2
    ```
    
    precompute scales after optimizer.step
    <img width="319" alt="Screenshot 2024-07-12 at 5 11 14 PM"
    src="https://github.com/user-attachments/assets/1c55bd89-9183-42ca-9445-23f3b95e0817">
    
    FSDP2 pre-all-gather does not have any small all-reduces
    <img width="794" alt="Screenshot 2024-07-12 at 5 13 04 PM"
    src="https://github.com/user-attachments/assets/1a00dc70-a8ca-4ce1-a93c-316f22efdb08">
    
    TODO
    * upload tokenizer with vocab size 2560 to enable CI on TP fp8
    all-gather
    * torch.compile complains about fp8
    * add delayed scaling and brainstorm about best config option to express
    fp8
    * compare perf between delayed scaling and dynamic scaling
    https://github.com/pytorch-labs/float8_experimental/pull/312/files
    weifengpy authored Jul 16, 2024
    a4b2ee3 View commit details

Commits on Jul 17, 2024

  1. import float8_experimental only when fp8 is enabled and install it in…

    … CI (pytorch#464)
    
    make sure to only import float8_experimental when fp8 is enabled
    
    for 4 gpu CI, make sure we can import float8_experimental correctly in
    CI
    
    `python -m pip install
    git+https://github.com/pytorch-labs/float8_experimental.git`
    weifengpy authored Jul 17, 2024
    ae8181b View commit details
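
    A minimal sketch of the lazy-import pattern this commit describes; the
    helper name and the `job_config.training.enable_fp8_linear` attribute are
    assumptions based on the flags shown elsewhere in this log, not a copy of
    torchtitan's code.

    ```python
    def maybe_enable_fp8(model, job_config):
        """Only touch float8_experimental when fp8 is actually requested."""
        if not job_config.training.enable_fp8_linear:
            return model  # bf16-only runs never import the package
        try:
            import float8_experimental  # noqa: F401  (deferred, optional dependency)
        except ImportError as exc:
            raise ImportError(
                "enable_fp8_linear is set but float8_experimental is missing; "
                "install it with: pip install "
                "git+https://github.com/pytorch-labs/float8_experimental.git"
            ) from exc
        # ...hand the model to float8_experimental's swap utilities here...
        return model
    ```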
  2. skip fp8 CI on non-H100 GPUs (pytorch#465)

    skip fp8 tests on non-H100 GPUs by checking
    `torch.cuda.get_device_capability() >= (9, 0)`
    
    this makes 4 GPU CI healthy again
    weifengpy authored Jul 17, 2024
    3760bcf View commit details
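
    The capability check mentioned above, sketched as a pytest guard (the
    test name is illustrative); `torch.cuda.get_device_capability()` returns
    a `(major, minor)` tuple, and float8 needs SM90, i.e. H100-class GPUs.

    ```python
    import pytest
    import torch

    def is_sm90_or_newer() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)

    @pytest.mark.skipif(not is_sm90_or_newer(), reason="float8 requires H100 (SM90+)")
    def test_float8_linear_smoke():
        ...  # the fp8 integration test body goes here
    ```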
  3. clean up float8 configs in torchtitan (pytorch#466)

    Summary:
    
    1. standardizes on `float8` instead of `fp8` for config names
    2. removes usage of non-public objects such as `Float8Linear`
    
    Test Plan:
    
    ```
    with-proxy NGPU=1 CUDA_VISIBLE_DEVICES=7 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.compile --training.enable_float8_linear
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    vkuzo authored Jul 17, 2024
    69fe8de View commit details

Commits on Jul 18, 2024

  1. Add support for DDP and experimental CompiledAutograd

    Summary:
    Address the comments in pytorch#319 and resubmit the PR to fit the current code base.
    
    Test Plan:
    ```
    CONFIG_FILE=./train_configs/debug_model.toml ./run_llama_train.sh --comm.train_timeout_seconds=3600   --training.tensor_parallel_degree=1 --training.data_parallel_degree=8 --experimental.data_parallel_type=ddp --training.steps=1000 --metrics.log_freq=10 --profiling.profile_freq=1000
    ```
    
    ghstack-source-id: 81dc85d42df13df4ed727bebd825681879af936b
    Pull Request resolved: pytorch#432
    fegin committed Jul 18, 2024
    2f989b9 View commit details

Commits on Jul 19, 2024

  1. add torch.compile + FSDP2 float8 all-gather in CI (pytorch#468)

    Fixed my bug in float8_experimental. Now we can torch.compile
    transformer blocks with FSDP float8 all-gather
    pytorch-labs/float8_experimental#321
    
    local test: `CONFIG_FILE="./train_configs/debug_model.toml"
    ./run_llama_train.sh --training.enable_float8_linear
    --training.enable_fsdp_float8_all_gather
    --training.precompute_float8_dynamic_scale_for_fsdp --training.compile`
    
    profiler traces: I can see the compiled region in the cpu thread and float8
    matmul `sm90_xmma_gemm_e4m3bf16...` in the cuda stream
    <img width="1468" alt="Screenshot 2024-07-18 at 4 22 17 PM"
    src="https://github.com/user-attachments/assets/0cf58dee-aae1-4582-a3f1-b8aa48b45129">
    weifengpy authored Jul 19, 2024
    71b8eae View commit details
  2. [float8] keep model.output as nn.Linear (high precision, not fp8) (p…

    …ytorch#469)
    
    **keep model.output as nn.Linear**: it's a common practice to NOT apply
    fp8 on final output layer
    * specify `skip_fqn_list` in swapping
    * when applying TP to model.output, use plain `ColwiseParallel` instead
    of `Float8ColwiseParallel`
    
    credit to @awgu, we do not need tokenizer vocab size to be divisible by
    16 pytorch#461
    
    1D TP + float8 all-gather, eager mode:
    `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
    ./run_llama_train.sh --training.enable_float8_linear
    --training.data_parallel_degree 1 --training.tensor_parallel_degree 4`
    
    1D TP + float8 all-gather, compile mode:
    `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
    ./run_llama_train.sh --training.enable_float8_linear
    --training.data_parallel_degree 1 --training.tensor_parallel_degree 4
    --training.compile`
    
    2D FSDP2 + TP + float8 all-gather, eager mode:
    `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
    ./run_llama_train.sh --training.enable_float8_linear
    --training.enable_fsdp_float8_all_gather
    --training.precompute_float8_dynamic_scale_for_fsdp
    --training.tensor_parallel_degree 2`
    
    2D FSDP2 + TP + float8 all-gather, compile mode:
    `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4
    ./run_llama_train.sh --training.enable_float8_linear
    --training.enable_fsdp_float8_all_gather
    --training.precompute_float8_dynamic_scale_for_fsdp
    --training.tensor_parallel_degree 2 --training.compile`
    
    1D TP + float8 all-gather trace: see float8 and all-gather in the trace
    <img width="1611" alt="Screenshot 2024-07-19 at 1 16 59 PM"
    src="https://github.com/user-attachments/assets/9a95dfd9-40e0-4133-b2bb-e22ddf5b8472">
    
    2D + float8 all-gather trace: see float8 and FSDP collectives and TP
    collectives
    <img width="1038" alt="Screenshot 2024-07-19 at 1 29 59 PM"
    src="https://github.com/user-attachments/assets/6a34bcaa-bcae-402b-9994-cc892554fec7">
    weifengpy authored Jul 19, 2024
    0c6f9a2 View commit details
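
    A small sketch of the `skip_fqn_list` idea above: select every
    `nn.Linear` for float8 swapping except the final output projection, which
    stays in high precision. The helper name is illustrative.

    ```python
    import torch.nn as nn

    def linears_to_swap(model: nn.Module, skip_fqns: set[str]) -> list[str]:
        """FQNs of nn.Linear modules to swap to float8, excluding e.g. the LM head."""
        return [
            fqn
            for fqn, mod in model.named_modules()
            if isinstance(mod, nn.Linear) and fqn not in skip_fqns
        ]

    # usage sketch: keep model.output (the final projection) as plain nn.Linear
    # to_swap = linears_to_swap(model, skip_fqns={"output"})
    ```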

Commits on Jul 20, 2024

  1. remove CI for FSDP2 + fp8 all-gather (pytorch#470)

    per discussion from
    pytorch#469 (comment)
    
    we are planning BC-breaking changes in float8_experimental. Remove CI
    for FSDP2 + fp8 all-gather for now. When public APIs are finalized, we
    can discuss bringing it back.
    weifengpy authored Jul 20, 2024
    0a17c26 View commit details

Commits on Jul 21, 2024

  1. dynamically update torch.compile cache config to ensure async tp supp…

    …ort, enhance async tp UX (pytorch#471)
    
    This PR adds some enhancements for supporting async tp:
    
    1 - if async tp is active, auto updates the torch.dynamo cache limit to
    10K. If this is not updated, async tp will not be activated on larger
    models as it will quietly stop compilation due to 'cache limit reached'
    with no info for the user.
    This config update is logged. 
    
    2 - if async tp is enabled, verifies that torch.compile is set to true
    for this job config. If not, it warns and then activates torch.compile
    to ensure user gets working async tp. (see WARNING in below screenshot)
    
    <img width="1345" alt="Screenshot 2024-07-20 at 4 33 04 PM"
    src="https://github.com/user-attachments/assets/26e5a48e-4bb8-4f33-b1b5-8939c1517c1d">
    
    3 - Updates the 'Applied Tensor Parallel' log message to 'Applied
    Async Tensor Parallel' when async tp is active, to make it clear in the
    logs which TP is active. (see above screenshot)
    lessw2020 authored Jul 21, 2024
    0ee573c View commit details
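
    A sketch of the two guards described above. The
    `experimental.enable_async_tensor_parallel` and `training.compile`
    attribute names are assumptions, and the helper itself is illustrative;
    `torch._dynamo.config.cache_size_limit` is the knob that otherwise
    silently caps recompilation on larger models.

    ```python
    import logging

    import torch._dynamo

    logger = logging.getLogger(__name__)

    def apply_async_tp_guards(job_config) -> None:
        """Raise the dynamo cache limit and force compile on when async TP is requested."""
        if not job_config.experimental.enable_async_tensor_parallel:
            return
        torch._dynamo.config.cache_size_limit = 10000
        logger.info("Set torch._dynamo.config.cache_size_limit = 10000 for async TP")
        if not job_config.training.compile:
            logger.warning("Async TP requires torch.compile; enabling training.compile")
            job_config.training.compile = True
    ```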

Commits on Jul 26, 2024

  1. Fix 8gpu PP failure due to 2D DCP disablement

    DCP recently added safeties to avoid using it for 2D/3D since strided
    sharding (a feature needed for safe 2D/3D resharding) is not ready yet.
    
    PP uses DCP to load a seed checkpoint.  Disabling the safety mechanism
    is enough to make 3D/PP still work (for the case where we train from the
    beginning or do not re-shard).
    
    (Resharding refers to saving a checkpoint from one world
    size/parallelism config and loading/resuming under a different one).
    
    ghstack-source-id: c069d2186c79517c72f5b3c99485cebdc15df08f
    Pull Request resolved: pytorch#460
    wconstab committed Jul 26, 2024
    69c9bb2 View commit details
  2. update float8 integration after UX changes (pytorch#484)

    Summary:
    
    float8_experimental landed various BC-breaking UX changes last week.
    This PR updates torchtitan to work with the version of
    float8_experimental after
    pytorch-labs/float8_experimental#332 and
    pytorch-labs/float8_experimental#337
    
    Test Plan:
    
    ```
    with-proxy CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.enable_float8_linear --training.compile
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    vkuzo authored Jul 26, 2024
    90e2070 View commit details
  3. Re-enable FSDP2 Mem Tracker integration tests

    ghstack-source-id: 8344603f7a5596cb2909c9bf04dd1b9e4730c9b8
    Pull Request resolved: pytorch#485
    Sanket Jayant Purandare committed Jul 26, 2024
    42f4ff5 View commit details

Commits on Jul 29, 2024

  1. Used partial instead of global vars for LR scheduling

    ghstack-source-id: 12c4418b0574d93e1441f4ca3d1de79c8aad7a40
    Pull Request resolved: pytorch#487
    awgu committed Jul 29, 2024
    a48de09 View commit details
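
    A sketch of the pattern this commit adopts: bind the schedule's
    hyperparameters into the `lr_lambda` with `functools.partial` instead of
    reading them from module-level globals. The warmup/decay shape here is a
    simplified stand-in, not torchtitan's exact schedule.

    ```python
    from functools import partial

    import torch

    def warmup_then_linear_decay(warmup_steps: int, decay_steps: int, step: int) -> float:
        if step < warmup_steps:
            return (step + 1) / warmup_steps               # linear warmup
        remaining = max(decay_steps - (step - warmup_steps), 0)
        return remaining / decay_steps                     # linear decay to 0

    def build_lr_scheduler(optimizer, warmup_steps: int, decay_steps: int):
        lr_lambda = partial(warmup_then_linear_decay, warmup_steps, decay_steps)
        return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
    ```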

Commits on Jul 30, 2024

  1. [EZ] Add logs for some basic training params so that we can verify in… (

    pytorch#491)
    
    As title, while testing the 405B model, I found that we need logs for
    some basic training params, so I added some here. Tested
    locally and the logging is shown as in the screenshot:
    
    
    <img width="900" alt="image"
    src="https://github.com/user-attachments/assets/b94e34f5-3e88-4c5f-94ed-75f50dde9786">
    fduwjj authored Jul 30, 2024
    b63e209 View commit details
  2. make float8 scaling type configurable (pytorch#489)

    Summary:
    
    Adds config options to configure float8 scaling type for input, weight,
    grad_output.
    
    Performance is not ideal yet, but that's because we have not optimized
    it.
    
    Test Plan:
    
    ```
    // repeat for input, weight, grad_out
    with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.enable_float8_linear --training.float8_scaling_type_weight delayed --training.compile
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    vkuzo authored Jul 30, 2024
    91f075a View commit details
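
    The new options sketched as a config structure; the field names mirror
    the `--training.float8_scaling_type_*` flags in the test plan above, but
    the dataclass itself is illustrative rather than torchtitan's actual
    config schema.

    ```python
    from dataclasses import dataclass
    from typing import Literal

    ScalingType = Literal["dynamic", "delayed"]

    @dataclass
    class Float8Config:
        enable_float8_linear: bool = False
        # per-tensor scaling strategy, one of "dynamic" (default) or "delayed"
        float8_scaling_type_input: ScalingType = "dynamic"
        float8_scaling_type_weight: ScalingType = "dynamic"
        float8_scaling_type_grad_output: ScalingType = "dynamic"
    ```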
  3. [PP] add flexible interleaved 1f1b schedule pytorch#490 (pytorch#493)

    This was approved in pytorch#490, but
    merged into the wrong branch, merging this into main
    H-Huang authored Jul 30, 2024
    9358d70 View commit details
  4. move float8 callsites to torchao.float8 (pytorch#492)

    Summary:
    
    The `float8_experimental` repository moved to `torchao.float8` in
    pytorch/ao#551
    
    This PR updates `torchtitan` to use float8 from the new location.
    
    Test Plan:
    
    ```
    with-proxy CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_float8_linear --training.compile
    ```
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    vkuzo authored Jul 30, 2024
    239d56f View commit details
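
    During a move like this, call sites only need the import path to change.
    A hedged sketch of tolerating both locations during the transition window
    (a finished migration would simply use the new path):

    ```python
    try:
        import torchao.float8 as float8  # new home after pytorch/ao#551
    except ImportError:
        import float8_experimental as float8  # pre-migration fallback
    ```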

Commits on Aug 1, 2024

  1. [BE][1/n] simplify train.py

    ghstack-source-id: 3879e764e7b33afde5d778810c71d1d2a8f82f6d
    Pull Request resolved: pytorch#494
    tianyu-l committed Aug 1, 2024
    3c77e9f View commit details
  2. [BE][2/n] use proper method signatures in parallelize_llama

    ghstack-source-id: 17a1ee9f03f13423a30183c5c8d7ad30f8c8dbfc
    Pull Request resolved: pytorch#495
    tianyu-l committed Aug 1, 2024
    bf90710 View commit details
  3. [BE][3/n] wrap fp8 logic using Float8Handler

    ghstack-source-id: e94c7f6f4fad87c5432262c54beabd02de5541b8
    Pull Request resolved: pytorch#496
    tianyu-l committed Aug 1, 2024
    40f79d7 View commit details
  4. Bring LLaMa 3.1 405B to TorchTitan family (pytorch#481)

    With the official launch of the LLaMa 3.1 model, we want to add its config
    to TorchTitan. Of course, there is more work to be done, but we want to
    take an incremental approach, so more PRs will follow.

    For now, we tried it on 128 GPUs with the current config (TP=8, FSDP=16).
    The perf numbers are wps: 109, mfu: 29%.
    
    Loss curve for 3000 steps with 600 warmup (lr = 0.8e-4).
    <img width="1037" alt="image"
    src="https://github.com/user-attachments/assets/f57dd3fa-07d8-4ef4-8f68-8f7a08e9652e">
    
    
    Loss curve for 3000 steps with 600 warmup (lr = 1.1e-4).
    
    ![image](https://github.com/user-attachments/assets/429b9738-94cb-4b37-90ef-049a5587ddd0)
    fduwjj authored Aug 1, 2024
    4871358 View commit details

Commits on Aug 2, 2024

  1. [TP] Infer local n_heads instead of ad-hoc model changes

    ghstack-source-id: 587e3d6e5270714ca734b8031ce41a962e6394ea
    Pull Request resolved: pytorch#498
    kwen2501 committed Aug 2, 2024
    d41d604 View commit details
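
    The inference this commit relies on, as arithmetic: with attention heads
    sharded over the TP mesh, each rank only sees `n_heads // tp_degree` (and
    `n_kv_heads // tp_degree`) heads locally, so no ad-hoc edits to the model
    definition are needed. A minimal sketch:

    ```python
    def local_head_counts(n_heads: int, n_kv_heads: int, tp_degree: int) -> tuple[int, int]:
        assert n_heads % tp_degree == 0 and n_kv_heads % tp_degree == 0, (
            "head counts must be divisible by the tensor-parallel degree"
        )
        return n_heads // tp_degree, n_kv_heads // tp_degree

    # e.g. Llama3-8B attention (32 heads, 8 KV heads) with TP=4 -> 8 and 2 local heads
    assert local_head_counts(32, 8, 4) == (8, 2)
    ```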

Commits on Aug 3, 2024

  1. some compile-related updates

    ghstack-source-id: 63af8025c184fd5ad34f2f57bf78a37dda2cd33d
    Pull Request resolved: pytorch#443
    tianyu-l committed Aug 3, 2024
    24aef32 View commit details

Commits on Aug 5, 2024

  1. [EZ][405B] Use scientific notation for 405B model lr (pytorch#504)

    As title, use `8e-5` rather than `0.8e-4`.
    fduwjj authored Aug 5, 2024
    c44cca0 View commit details
  2. [BE][4/n] split pipeline_llama into a separate file

    ghstack-source-id: 5ebb4adf3152f413fa33a923c272c9aa3ce1f775
    Pull Request resolved: pytorch#499
    tianyu-l committed Aug 5, 2024
    8849580 View commit details
  3. [fix] float8 should be applied on all model_parts

    ghstack-source-id: 52ed6836de39e82c4c5824a40ecfc1d9ec7ed2bd
    Pull Request resolved: pytorch#500
    tianyu-l committed Aug 5, 2024
    a4d88d1 View commit details

Commits on Aug 6, 2024

  1. Add warning to compile rmsnorm (pytorch#505)

    As titled, add a warning for compiling rmsnorm as it's not fully ready yet,
    i.e. issue pytorch#497
    
    We can remove this warning once we fix the issue
    wanchaol authored Aug 6, 2024
    1a303b3 View commit details

Commits on Aug 7, 2024

  1. add float8 to README (pytorch#509)

    add float8 link in README so we can redirect people from dev-discuss
    post to torchtitan repo
    
    
    README looks like this after rendering
    <img width="518" alt="Screenshot 2024-08-06 at 5 42 10 PM"
    src="https://github.com/user-attachments/assets/50af99d7-93be-459a-89d7-8c08b8fb95d4">
    
    float8.md looks like this
    <img width="563" alt="Screenshot 2024-08-06 at 5 04 17 PM"
    src="https://github.com/user-attachments/assets/06d30aad-4133-4cec-9037-cfcf155b45c4">
    
    I tried the command locally and traces are looking good
    <img width="726" alt="Screenshot 2024-08-06 at 5 00 00 PM"
    src="https://github.com/user-attachments/assets/bdfa3d7e-efe1-4009-92a1-0f5c310013fb">
    weifengpy authored Aug 7, 2024
    b99bc5e View commit details
  2. address TODOs as 2D recompiles is fixed

    ghstack-source-id: 2927f0a8082171da3e9f59a5d04f8325cbdf3653
    Pull Request resolved: pytorch#508
    tianyu-l committed Aug 7, 2024
    fa8cdd4 View commit details

Commits on Aug 8, 2024

  1. [BE][5/n] simplify pp vs. non-pp setup

    ghstack-source-id: 003bfbfbcf1511ddbd18e15d031b39f597d8e7db
    Pull Request resolved: pytorch#510
    tianyu-l committed Aug 8, 2024
    d6e3f77 View commit details
  2. [BE][6/n] replace large c4_mini datasets by c4_test with the first 2K…

    … entries
    
    ghstack-source-id: 319f4961b092778703101b98937803073132afa1
    Pull Request resolved: pytorch#512
    tianyu-l committed Aug 8, 2024
    34fa017 View commit details

Commits on Aug 9, 2024

  1. Create composability.md (pytorch#511)

    Explain the rationale and challenges behind certain changes we made to
    llama model to support 3D parallelism.
    
    ---------
    
    Co-authored-by: tianyu-l <[email protected]>
    wconstab and tianyu-l authored Aug 9, 2024
    9de54a5 View commit details
  2. depend on torchdata 0.8.0 instead of nightly

    ghstack-source-id: 1965d3122885fed3c28e2e058c55581187e7816c
    Pull Request resolved: pytorch#513
    tianyu-l committed Aug 9, 2024
    b41b41b View commit details

Commits on Aug 12, 2024

  1. [PP] Bypass seed checkpoint by init-ing model parts separately (pytor…

    …ch#516)
    
    Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
    bottom):
    * pytorch#473
    * pytorch#517
    * __->__ pytorch#516
    
    Allows PP to be used without a seed checkpoint by calling `init_weight`
    on each model part. This is the solution in step 1 of
    pytorch#514 proposed by @wconstab
    H-Huang authored Aug 12, 2024
    a4bc948 View commit details
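
    A sketch of the step-1 approach referenced above: materialize each
    pipeline-stage model part from the meta device and initialize its weights
    locally, so no seed checkpoint is needed. It assumes each part exposes an
    `init_weights()`-style method; the helper name is illustrative.

    ```python
    import torch

    def init_pp_model_parts(model_parts, device: torch.device) -> None:
        """Initialize pipeline-stage model parts without loading a seed checkpoint."""
        for part in model_parts:
            part.to_empty(device=device)   # allocate real storage for meta-device params
            with torch.no_grad():
                part.init_weights()        # assumed per-part init method (see note above)
    ```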
  2. [small] format composability.md (pytorch#517)

    Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
    bottom):
    * pytorch#473
    * __->__ pytorch#517
    * pytorch#516
    
    Ran `pre-commit run --all-files`
    H-Huang authored Aug 12, 2024
    a47a5a9 View commit details

Commits on Aug 13, 2024

  1. Throw warning if users are using old pytorch version that not includi…

    …ng DTensor strided sharding (pytorch#507)
    
    Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
    bottom):
    * __->__ pytorch#507
    
    **Summary**
    1. check if users are using new nightly-build pytorch that includes
    DTensor strided sharding
    (pytorch/pytorch#130760) when 2D/3D is used.
    Print warning if not.
    2. remove temporary re-enablement added in pytorch#460 .
    
    **Test**
    Command: `python test_runner.py outputs --test pp_dp_tp --ngpu 8`
    GPUs: A100
    Output:
    - without strided sharding:
    ```
    [rank7]:2024-08-06 03:21:26,706 - root - INFO - step:  2  loss:  8.1652  memory:  0.51GiB(0.64%)  wps: 8,250  mfu: 0.25%
    [rank7]:2024-08-06 03:21:27,013 - root - INFO - step:  3  loss:  8.0951  memory:  0.51GiB(0.64%)  wps: 13,358  mfu: 0.41%
    [rank7]:2024-08-06 03:21:27,309 - root - INFO - step:  4  loss:  7.9748  memory:  0.51GiB(0.64%)  wps: 13,865  mfu: 0.42%
    [rank7]:2024-08-06 03:21:27,582 - root - INFO - step:  5  loss:  7.8025  memory:  0.51GiB(0.64%)  wps: 15,057  mfu: 0.46%
    [rank7]:2024-08-06 03:21:28,076 - root - INFO - step:  6  loss:  7.5612  memory:  0.51GiB(0.64%)  wps: 8,300  mfu: 0.25%
    [rank7]:2024-08-06 03:21:28,608 - root - INFO - step:  7  loss:  7.3649  memory:  0.51GiB(0.64%)  wps: 7,705  mfu: 0.23%
    [rank7]:2024-08-06 03:21:28,927 - root - INFO - step:  8  loss:  7.2946  memory:  0.51GiB(0.64%)  wps: 12,832  mfu: 0.39%
    [rank7]:2024-08-06 03:21:29,251 - root - INFO - step:  9  loss:  7.1311  memory:  0.51GiB(0.64%)  wps: 12,669  mfu: 0.38%
    [rank7]:2024-08-06 03:21:29,627 - root - INFO - step: 10  loss:  7.0540  memory:  0.51GiB(0.64%)  wps: 10,918  mfu: 0.33%
    >>>>>>>>>>>>>>>>>Checkpoint save & load<<<<<<<<<<<<<<<<<<<
    [rank7]:2024-08-06 03:21:59,723 - root - INFO - step: 11  loss:  7.0822  memory:  0.51GiB(0.64%)  wps: 1,139  mfu: 0.03%
    [rank7]:2024-08-06 03:22:00,054 - root - INFO - step: 12  loss:  7.0508  memory:  0.51GiB(0.64%)  wps: 12,366  mfu: 0.38%
    [rank7]:2024-08-06 03:22:00,340 - root - INFO - step: 13  loss:  6.9182  memory:  0.51GiB(0.64%)  wps: 14,370  mfu: 0.44%
    [rank7]:2024-08-06 03:22:00,624 - root - INFO - step: 14  loss:  6.8948  memory:  0.51GiB(0.64%)  wps: 14,442  mfu: 0.44%
    [rank7]:2024-08-06 03:22:00,907 - root - INFO - step: 15  loss:  6.8358  memory:  0.51GiB(0.64%)  wps: 14,514  mfu: 0.44%
    [rank7]:2024-08-06 03:22:01,574 - root - INFO - step: 16  loss:  6.7653  memory:  0.51GiB(0.64%)  wps: 6,144  mfu: 0.19%
    [rank7]:2024-08-06 03:22:02,209 - root - INFO - step: 17  loss:  6.7340  memory:  0.51GiB(0.64%)  wps: 6,453  mfu: 0.20%
    [rank7]:2024-08-06 03:22:02,532 - root - INFO - step: 18  loss:  6.6874  memory:  0.51GiB(0.64%)  wps: 12,695  mfu: 0.39%
    [rank7]:2024-08-06 03:22:02,863 - root - INFO - step: 19  loss:  6.6566  memory:  0.51GiB(0.64%)  wps: 12,406  mfu: 0.38%
    [rank7]:2024-08-06 03:22:03,257 - root - INFO - step: 20  loss:  6.6629  memory:  0.51GiB(0.64%)  wps: 10,392  mfu: 0.32%
    ```
    - with strided sharding
    ```
    [rank7]:2024-08-06 03:26:18,288 - root - INFO - step:  1  loss:  8.2069  memory:  0.50GiB(0.63%)  wps: 915  mfu: 0.03%
    [rank7]:2024-08-06 03:26:19,084 - root - INFO - step:  2  loss:  8.1913  memory:  0.51GiB(0.64%)  wps: 5,144  mfu: 0.16%
    [rank7]:2024-08-06 03:26:19,365 - root - INFO - step:  3  loss:  8.1148  memory:  0.51GiB(0.64%)  wps: 14,593  mfu: 0.44%
    [rank7]:2024-08-06 03:26:19,698 - root - INFO - step:  4  loss:  7.9982  memory:  0.51GiB(0.64%)  wps: 12,328  mfu: 0.37%
    [rank7]:2024-08-06 03:26:20,011 - root - INFO - step:  5  loss:  7.8382  memory:  0.51GiB(0.64%)  wps: 13,100  mfu: 0.40%
    [rank7]:2024-08-06 03:26:20,498 - root - INFO - step:  6  loss:  7.6293  memory:  0.51GiB(0.64%)  wps: 8,423  mfu: 0.26%
    [rank7]:2024-08-06 03:26:21,126 - root - INFO - step:  7  loss:  7.4454  memory:  0.51GiB(0.64%)  wps: 6,530  mfu: 0.20%
    [rank7]:2024-08-06 03:26:21,472 - root - INFO - step:  8  loss:  7.3337  memory:  0.51GiB(0.64%)  wps: 11,843  mfu: 0.36%
    [rank7]:2024-08-06 03:26:21,849 - root - INFO - step:  9  loss:  7.1960  memory:  0.51GiB(0.64%)  wps: 10,892  mfu: 0.33%
    [rank7]:2024-08-06 03:26:22,229 - root - INFO - step: 10  loss:  7.1208  memory:  0.51GiB(0.64%)  wps: 10,798  mfu: 0.33%
    >>>>>>>>>>>>>>>>>Checkpoint save & load<<<<<<<<<<<<<<<<<<<
    [rank7]:2024-08-06 03:26:50,306 - root - INFO - step: 11  loss:  7.1222  memory:  0.51GiB(0.64%)  wps: 866  mfu: 0.03%
    [rank7]:2024-08-06 03:26:50,632 - root - INFO - step: 12  loss:  7.1189  memory:  0.51GiB(0.64%)  wps: 12,589  mfu: 0.38%
    [rank7]:2024-08-06 03:26:50,917 - root - INFO - step: 13  loss:  6.9646  memory:  0.51GiB(0.64%)  wps: 14,417  mfu: 0.44%
    [rank7]:2024-08-06 03:26:51,217 - root - INFO - step: 14  loss:  6.9626  memory:  0.51GiB(0.64%)  wps: 13,680  mfu: 0.42%
    [rank7]:2024-08-06 03:26:51,514 - root - INFO - step: 15  loss:  6.8694  memory:  0.51GiB(0.64%)  wps: 13,799  mfu: 0.42%
    [rank7]:2024-08-06 03:26:52,207 - root - INFO - step: 16  loss:  6.7994  memory:  0.51GiB(0.64%)  wps: 5,910  mfu: 0.18%
    [rank7]:2024-08-06 03:26:53,053 - root - INFO - step: 17  loss:  6.7634  memory:  0.51GiB(0.64%)  wps: 4,847  mfu: 0.15%
    [rank7]:2024-08-06 03:26:53,370 - root - INFO - step: 18  loss:  6.7233  memory:  0.51GiB(0.64%)  wps: 12,915  mfu: 0.39%
    [rank7]:2024-08-06 03:26:53,686 - root - INFO - step: 19  loss:  6.7054  memory:  0.51GiB(0.64%)  wps: 12,995  mfu: 0.39%
    [rank7]:2024-08-06 03:26:54,059 - root - INFO - step: 20  loss:  6.7130  memory:  0.51GiB(0.64%)  wps: 10,991  mfu: 0.33%
    ```
    XilunWu authored Aug 13, 2024
    36a0057 View commit details
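
    A sketch of the warning described above; the minimum version string is an
    assumption here (strided sharding landed in a 2024 nightly), so treat the
    cutover value as a placeholder.

    ```python
    import logging

    import torch

    logger = logging.getLogger(__name__)

    def warn_if_no_strided_sharding(using_2d_or_3d: bool) -> None:
        if not using_2d_or_3d:
            return
        # torch.__version__ supports version-aware comparison against strings
        if torch.__version__ < "2.5.0.dev20240801":  # assumed cutover, placeholder only
            logger.warning(
                "This PyTorch build may predate DTensor strided sharding; "
                "resharding 2D/3D checkpoints may be unsafe. Please update to a "
                "recent nightly."
            )
    ```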

Commits on Aug 14, 2024

  1. Update fsdp.md (pytorch#519)

    `torch.nn.Module.to_empty` takes a keyword-only arg "device", according
    to
    https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.to_empty
    crcrpar authored Aug 14, 2024
    1c96a01 View commit details
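
    The doc fix above in runnable form, assuming the common meta-device init
    flow:

    ```python
    import torch
    import torch.nn as nn

    with torch.device("meta"):
        layer = nn.Linear(8, 8)        # parameters allocated on the meta device

    layer.to_empty(device="cpu")       # correct: `device` is keyword-only
    # layer.to_empty("cpu")            # would raise TypeError
    ```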

Commits on Aug 15, 2024

  1. remove old torch dependency in requirements.txt

    ghstack-source-id: 7e1c7071f8126072ab0e25194b75f280bf4277ec
    Pull Request resolved: pytorch#523
    tianyu-l committed Aug 15, 2024
    6c16807 View commit details

Commits on Aug 16, 2024

  1. f339363 View commit details
  2. uniformly use skip for both (map-style) Dataset and IterableDataset

    ghstack-source-id: c8f611742ffbb4859988b97e706b9e0d1b4ad6f1
    Pull Request resolved: pytorch#521
    tianyu-l committed Aug 16, 2024
    81c555f View commit details
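
    A sketch of one uniform "skip the first N samples" path for both dataset
    flavors (the helper name is illustrative, not torchtitan's API):

    ```python
    from itertools import islice

    from torch.utils.data import IterableDataset

    def iter_with_skip(dataset, num_skip: int):
        """Yield samples after skipping the first `num_skip`, for either dataset type."""
        if isinstance(dataset, IterableDataset):
            yield from islice(dataset, num_skip, None)
        else:  # map-style Dataset: index past the skipped prefix
            for idx in range(num_skip, len(dataset)):
                yield dataset[idx]
    ```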

Commits on Aug 20, 2024

  1. 57c3400 View commit details