remove tensor and pipeline parallelism related code #1

Merged
merged 265 commits into from
Aug 20, 2024


philippguevorguian
Collaborator

No description provided.

lessw2020 and others added 30 commits February 26, 2024 10:16
…olumnar display to show both, show avg iter & data loading times at end of training (pytorch#87)

This PR adds basic perf timing, displayed per iter and as a final
average over iters (in part based on Andrew's comment about having
to open the trace to compare iter timing).

1. The tracking list is housed in TrainState, but I do not save it as part
of the state dict, as I view this as useful but not saveable info.
2. Iter times are tracked from after data loading is done each iter until
after the optimizer step. The idea is to make this timing expressly the model
training iter (not data loading or other post-iter metrics calcs).

3. 'time' is now displayed at each iter along with the usual loss and
lr.

4. At the end of training, assuming more than 3 iters were run, the
average iter time is calculated by ignoring the first three iters
(consider these as warmup, esp. as cudaCacheAllocator gets warmed up) and
displayed.
5. Based on @tianyu-l's feedback, I have added data loading times as well.
I used the same timeit.default_timer() from timeit to be consistent
(CPU side, so no syncs needed :)

6. After fiddling with printf width formatting options, added beautiful
aligned columnar display for the per-iter updates:
Now: 
<img width="1282" alt="Screenshot 2024-02-26 at 9 39 25 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/9ee2ea7b-5c28-4d41-ba91-d4176c64fc66">

before: 
<img width="1282" alt="Screenshot 2024-02-26 at 8 39 46 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/37cbfa20-7f1d-4d94-be94-3505ef4498c0">
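A minimal sketch of the averaging described in item 4, assuming a plain list of per-iter times. The helper names `average_after_warmup` and `timed` are hypothetical and not taken from the PR; the sample values are the rank0 `iter:` column from the 2024-02-28 debug run quoted further down in this PR.

```python
import timeit
from typing import Callable, List

WARMUP_ITERS = 3  # the commit message ignores the first three iters as warmup


def average_after_warmup(times: List[float], warmup: int = WARMUP_ITERS) -> float:
    """Mean of `times`, ignoring the first `warmup` entries."""
    if len(times) <= warmup:
        raise ValueError("need more than `warmup` iterations to average")
    steady = times[warmup:]
    return sum(steady) / len(steady)


def timed(fn: Callable[[], object]) -> float:
    """Run fn() and return elapsed seconds using the CPU-side timer."""
    start = timeit.default_timer()
    fn()
    return timeit.default_timer() - start


# Per-iter times printed by rank0 in the 2024-02-28 debug run.
iter_times = [1.9386, 0.0562, 0.0523, 0.0559, 0.0537,
              0.0562, 0.0544, 0.0509, 0.1007, 0.0971]
print(f"Average iter time: {average_after_warmup(iter_times):.4f} seconds")
# prints "Average iter time: 0.0670 seconds"
```

Dropping the first three entries removes the 1.9 s first iteration, which would otherwise dominate the mean.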
Summary:
Follow up on config unification, options not available in config file
are picked from command line defaults.
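The fallback described above could be sketched as follows. This is an illustration only: the option names, their defaults, and the `resolve_config` helper are hypothetical, and the config file is represented as an already-parsed nested dict rather than an actual TOML file.

```python
import argparse
from typing import Any, Dict


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Hypothetical options; names and defaults are for illustration only.
    parser.add_argument("--training.steps", type=int, default=10)
    parser.add_argument("--training.batch_size", type=int, default=8)
    parser.add_argument("--metrics.log_freq", type=int, default=1)
    return parser


def resolve_config(
    file_config: Dict[str, Dict[str, Any]], parser: argparse.ArgumentParser
) -> Dict[str, Dict[str, Any]]:
    """Start from command line defaults, then overlay values from the config file."""
    resolved: Dict[str, Dict[str, Any]] = {}
    # Parsing an empty argv yields every option at its default value.
    for dotted_name, default in vars(parser.parse_args([])).items():
        section, _, key = dotted_name.partition(".")
        resolved.setdefault(section, {})[key] = default
    # Anything present in the config file wins over the default.
    for section, values in file_config.items():
        resolved.setdefault(section, {}).update(values)
    return resolved


config = resolve_config({"training": {"steps": 100}}, build_parser())
print(config)
```

Here `training.steps` comes from the file, while `training.batch_size` and `metrics.log_freq` fall back to the parser defaults.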

Test Plan:
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-8.0.1, pluggy-1.4.0 -- /home/gnadathur/local/a/pytorch-env/bin/python
cachedir: .pytest_cache
rootdir: /data/users/gnadathur/a/torchtrain
configfile: pyproject.toml
plugins: cov-4.1.0
collecting ... collected 3 items

test/test_job_config.py::TestJobConfig::test_command_line_args PASSED [ 33%]
test/test_job_config.py::TestJobConfig::test_job_config_file PASSED [ 66%]
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist PASSED [100%]

---------- coverage: platform linux, python 3.10.13-final-0 ----------
Coverage XML written to file coverage.xml

============================= slowest 20 durations =============================
0.00s call     test/test_job_config.py::TestJobConfig::test_job_config_file
0.00s call     test/test_job_config.py::TestJobConfig::test_command_line_args
0.00s call     test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s setup    test/test_job_config.py::TestJobConfig::test_command_line_args
0.00s teardown test/test_job_config.py::TestJobConfig::test_command_line_args
0.00s setup    test/test_job_config.py::TestJobConfig::test_job_config_file
0.00s setup    test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s teardown test/test_job_config.py::TestJobConfig::test_job_config_file
0.00s teardown test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
============================== 3 passed in 0.06s ===============================

---------

Co-authored-by: gnadathur <[email protected]>
ghstack-source-id: 38cbc277e2a177bc0baf35450a661835b97a7f22
Pull Request resolved: pytorch#92
…g on slurm (pytorch#93)

This PR adds the ability to do colored console outputs in order to
highlight the training data outputs.
It also adds a check to skip this color formatting on slurm, where the
output would otherwise show `33=` instead of the color.

Note that I've just added some color to highlight the main training
data. Users that fork/clone can use it to enhance their outputs as
desired.

<img width="1372" alt="Screenshot 2024-02-26 at 10 20 15 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/44849821-1677-40bf-896c-39344cd661d6">


Note that on slurm it remains plain:
<img width="847" alt="Screenshot 2024-02-26 at 10 46 24 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/172eaa58-4f5c-48f5-8ec1-bc349e3e82f2">

If you don't check this, it would otherwise look like this (this
does not happen with this PR; just showing what we'd get without the
check, with credit to Yifu for noting this would be an issue):
<img width="847" alt="Screenshot 2024-02-26 at 10 39 23 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/4a87fb9a-dd3a-417c-a29e-286ded069358">
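One way such a check could look, as a sketch only. Detecting slurm through the `SLURM_JOB_ID` environment variable is an assumption here (the commit message does not say how the PR detects it), and the `Color`, `NoColor`, and `pick_colors` names are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Union


@dataclass(frozen=True)
class Color:
    # Standard ANSI foreground codes.
    red: str = "\033[31m"
    green: str = "\033[32m"
    yellow: str = "\033[33m"
    blue: str = "\033[34m"
    cyan: str = "\033[36m"
    reset: str = "\033[39m"


@dataclass(frozen=True)
class NoColor:
    # Same attribute names, but empty, so format strings need no changes.
    red: str = ""
    green: str = ""
    yellow: str = ""
    blue: str = ""
    cyan: str = ""
    reset: str = ""


def pick_colors(environ: Optional[Mapping[str, str]] = None) -> Union[Color, NoColor]:
    """Return plain (empty) codes when running under slurm, ANSI codes otherwise."""
    env = os.environ if environ is None else environ
    # Assumption: slurm jobs are detected via SLURM_JOB_ID.
    return NoColor() if "SLURM_JOB_ID" in env else Color()


color = pick_colors()
print(f"{color.cyan}step:  1  {color.green}loss: 10.9229{color.reset}")
```

Keeping the two classes attribute-compatible means the logging call sites stay identical in both environments.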
This PR updates the GPU metrics to be labelled as GiB; we were
calculating GiB but calling it GB.
(credit to @awgu for flagging this - issue
pytorch#94)

Function names and member vars in metrics.py have been updated to _gib
instead of _gb for clarity, and the logging output now labels as GiB:
<img width="851" alt="Screenshot 2024-02-27 at 11 28 23 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/85eb260a-77e9-4c49-be8a-b1aaa10dc3e2">
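For reference, the difference between the two units can be sketched as below. The function names are hypothetical and are not the ones in metrics.py.

```python
def bytes_to_gib(num_bytes: int) -> float:
    """Binary unit: 1 GiB = 1024**3 bytes."""
    return num_bytes / (1024 ** 3)


def bytes_to_gb(num_bytes: int) -> float:
    """Decimal unit: 1 GB = 10**9 bytes."""
    return num_bytes / 1e9


one_gib = 1024 ** 3
print(bytes_to_gib(one_gib))  # 1.0
print(bytes_to_gb(one_gib))   # 1.073741824
```

The same byte count reads about 7% larger in GB than in GiB, which is why the label matters when comparing against a device's advertised capacity.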
ghstack-source-id: 7dc4a80cf9c32f4dca3d00bcef019d256bdf58f7
Pull Request resolved: pytorch#96
Enable libUV for torchtrain.

Test:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] 
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
[rank0]:2024-02-28 09:12:04,581 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank1]:2024-02-28 09:12:04,708 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-28 09:12:05,647 - root - INFO - Building llama
[rank0]:2024-02-28 09:12:05,655 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-28 09:12:05,655 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-02-28 09:12:07,299 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-02-28 09:12:07,299 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-02-28 09:12:07,565 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-02-28 09:12:07,566 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-02-28 09:12:07,566 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-02-28 09:12:07,567 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-02-28 09:12:08,769 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-28 09:12:08,770 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-28 09:12:08,770 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240228-0912.
[rank0]:2024-02-28 09:12:08,977 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-02-28 09:12:10,956 - root - INFO - step:  1  loss: 10.9229  iter:  1.9386  data: 0.0368  lr: 0.00026667
[rank0]:2024-02-28 09:12:11,045 - root - INFO - step:  2  loss: 10.8673  iter:  0.0562  data: 0.0316  lr: 0.00053333
[rank0]:2024-02-28 09:12:11,130 - root - INFO - step:  3  loss: 10.7145  iter:  0.0523  data: 0.0322  lr: 0.0008
[rank0]:2024-02-28 09:12:11,219 - root - INFO - step:  4  loss: 10.5038  iter:  0.0559  data: 0.0319  lr: 0.0007
[rank0]:2024-02-28 09:12:11,304 - root - INFO - step:  5  loss: 10.2228  iter:  0.0537  data: 0.031  lr: 0.0006
[rank0]:2024-02-28 09:12:11,391 - root - INFO - step:  6  loss:  9.9677  iter:  0.0562  data: 0.0302  lr: 0.0005
[rank0]:2024-02-28 09:12:11,478 - root - INFO - step:  7  loss:  9.7762  iter:  0.0544  data: 0.0317  lr: 0.0004
[rank0]:2024-02-28 09:12:11,676 - root - INFO - step:  8  loss:  9.4359  iter:  0.0509  data: 0.0322  lr: 0.0003
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-28 09:12:11,813 - root - INFO - step:  9  loss:  9.2326  iter:  0.1007  data: 0.0321  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-28 09:12:12,195 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-02-28 09:12:12,207 - root - INFO - step: 10  loss:  9.1641  iter:  0.0971  data: 0.031  lr: 0.0001
[rank0]:2024-02-28 09:12:12,207 - root - INFO - Average iter time: 0.0670 seconds
[rank0]:2024-02-28 09:12:12,207 - root - INFO - Average data load time: 0.0314 seconds
[rank0]:2024-02-28 09:12:12,208 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

---------

Co-authored-by: gnadathur <[email protected]>
as titled, we don't want to allow steps == -1 case as it would blow up
the lr scheduler
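To illustrate why a negative step count is a problem, here is a sketch of a linear warmup plus linear decay multiplier with an explicit guard. It is not the PR's scheduler: `make_lr_multiplier` is a hypothetical helper, and the base lr of 8e-4 with 2 warmup steps over 10 steps are assumed values, chosen because they resemble the `lr:` column in the debug logs quoted in this PR.

```python
from typing import Callable


def make_lr_multiplier(training_steps: int, warmup_steps: int) -> Callable[[int], float]:
    """Build a step -> lr multiplier function (linear warmup, then linear decay)."""
    if training_steps <= 0:
        # With steps == -1 the decay span below would be negative, so the
        # "decay" branch would grow the multiplier instead of shrinking it.
        raise ValueError(f"training_steps must be positive, got {training_steps}")
    if not 0 <= warmup_steps < training_steps:
        raise ValueError("warmup_steps must be in [0, training_steps)")
    decay_steps = training_steps - warmup_steps

    def multiplier(current_step: int) -> float:
        if current_step < warmup_steps:
            return (current_step + 1) / (warmup_steps + 1)
        return max(0.0, 1.0 - (current_step - warmup_steps) / decay_steps)

    return multiplier


base_lr = 8e-4  # assumed value
schedule = make_lr_multiplier(training_steps=10, warmup_steps=2)
for step in range(10):
    print(step + 1, round(base_lr * schedule(step), 8))
```

Rejecting the value up front gives a clear error at config time rather than a silently diverging learning rate.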
Add 7b config and adjust options to be more realistic

I didn't add this to the train scripts as the default since it's expensive
to init; whoever uses it can adjust it accordingly.
ghstack-source-id: f7ee3c867bfcdcae5dbb490982920606191e6f40
Pull Request resolved: pytorch#97
Summary:
Adding a description field, useful for integration tests to describe the
test.

Test Plan:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] 
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
[rank1]:2024-02-29 17:05:04,269 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:04,510 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Starting job: debug training
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Building llama
[rank0]:2024-02-29 17:05:05,335 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-29 17:05:05,335 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-02-29 17:05:06,782 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-02-29 17:05:06,782 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-02-29 17:05:07,347 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-02-29 17:05:07,349 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-02-29 17:05:08,375 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240229-1705.
[rank0]:2024-02-29 17:05:08,610 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-02-29 17:05:10,570 - root - INFO - step:  1  loss: 10.9183  iter:  1.9258  data: 0.0303  lr: 0.00026667
[rank0]:2024-02-29 17:05:10,653 - root - INFO - step:  2  loss: 10.8347  iter:  0.0487  data: 0.0336  lr: 0.00053333
[rank0]:2024-02-29 17:05:10,733 - root - INFO - step:  3  loss: 10.6861  iter:   0.045  data: 0.0334  lr: 0.0008
[rank0]:2024-02-29 17:05:10,812 - root - INFO - step:  4  loss: 10.4672  iter:  0.0453  data: 0.0336  lr: 0.0007
[rank0]:2024-02-29 17:05:10,893 - root - INFO - step:  5  loss: 10.2154  iter:  0.0466  data: 0.033  lr: 0.0006
[rank0]:2024-02-29 17:05:10,975 - root - INFO - step:  6  loss:  9.9573  iter:  0.0496  data: 0.0314  lr: 0.0005
[rank0]:2024-02-29 17:05:11,056 - root - INFO - step:  7  loss:  9.7627  iter:  0.0486  data: 0.0321  lr: 0.0004
[rank0]:2024-02-29 17:05:11,201 - root - INFO - step:  8  loss:   9.437  iter:  0.0457  data: 0.0333  lr: 0.0003
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-29 17:05:11,317 - root - INFO - step:  9  loss:  9.2446  iter:  0.0794  data: 0.0324  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-29 17:05:11,748 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-02-29 17:05:11,762 - root - INFO - step: 10  loss:  9.1772  iter:  0.0893  data: 0.0324  lr: 0.0001
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average iter time: 0.0578 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average data load time: 0.0326 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Co-authored-by: gnadathur <[email protected]>
ghstack-source-id: 1c5bf790d7473f6a24124051fcfa1fd2585a56f9
Pull Request resolved: pytorch#105
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] 
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
[rank0]:2024-03-04 17:01:28,834 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank1]:2024-03-04 17:01:28,857 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Starting job: debug training
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Building llama
[rank0]:2024-03-04 17:01:29,719 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-04 17:01:29,719 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-03-04 17:01:31,187 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-03-04 17:01:31,188 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-04 17:01:31,347 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-04 17:01:31,347 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-03-04 17:01:32,502 - root - INFO - Applied FSDP to the model...
[rank0]:2024-03-04 17:01:32,503 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-03-04 17:01:32,504 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240304-1701.
[rank0]:2024-03-04 17:01:32,901 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-03-04 17:01:34,806 - root - INFO - step:  1  loss: 10.8424  iter:  1.8688  data: 0.0316  lr: 0.00026667
[rank0]:2024-03-04 17:01:34,891 - root - INFO - step:  2  loss: 10.7581  iter:  0.0476  data: 0.0357  lr: 0.00053333
[rank0]:2024-03-04 17:01:34,970 - root - INFO - step:  3  loss: 10.6239  iter:   0.045  data: 0.0333  lr: 0.0008
[rank0]:2024-03-04 17:01:35,048 - root - INFO - step:  4  loss: 10.4163  iter:  0.0455  data: 0.0323  lr: 0.0007
[rank0]:2024-03-04 17:01:35,127 - root - INFO - step:  5  loss: 10.1529  iter:  0.0459  data: 0.032  lr: 0.0006
[rank0]:2024-03-04 17:01:35,206 - root - INFO - step:  6  loss:  9.8899  iter:  0.0468  data: 0.0311  lr: 0.0005
[rank0]:2024-03-04 17:01:35,284 - root - INFO - step:  7  loss:  9.7204  iter:  0.0461  data: 0.0312  lr: 0.0004
[rank0]:2024-03-04 17:01:35,425 - root - INFO - step:  8  loss:  9.3757  iter:  0.0457  data: 0.0319  lr: 0.0003
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-03-04 17:01:35,537 - root - INFO - step:  9  loss:  9.1883  iter:  0.0762  data: 0.0318  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-03-04 17:01:35,958 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-03-04 17:01:35,971 - root - INFO - step: 10  loss:  9.1212  iter:  0.0808  data: 0.0319  lr: 0.0001
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average iter time: 0.0553 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average data load time: 0.0317 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Co-authored-by: gnadathur <[email protected]>
This PR enables meta_init functionality to avoid OOM'ing on cpu for
larger models.
The core functionality is in meta_init.py, and a few changes in
parallelization and train.py.
Key items:
1 - this is largely the same as the earlier PR I had for meta_init, but
I did a new one b/c faster than reworking it with all the interim
changes.
2 - to address feedback in previous PR:
a - why do we need meta_init.py, can't we just do:
~~~
with torch.device("meta"):
    model = Model.from_args(...)
~~~
Unfortunately this does not work b/c the rope embeddings are treated
differently (buffer) and thus the simple lambda call from param_init_fn
in FSDP (lambda module: module.to_device('cuda') ) will not invoke or
move the rope embeddings and the model will fail on first forward.
This issue relates to the nn.embeddings not being moved, and to the
device being referenced in the forward pass for the current rope class.
I have opened pytorch#110 to track and investigate this without holding
up the landing of meta init, which is working.

b - per earlier feedback - meta init is now 'not optional' but simply
the default. This should ensure all models leverage it and ensure we
aren't missing things for future meta_init aspects.

3 - misc change - I switched the model_params to just do the normal all
params count instead of 'unique params', b/c the latter does not mesh
with what people perceive model size as.

Testing:
tested both debugmodel and 26B model with and without meta init to
confirm same loss curves.
Note for future reference - if you get a bad init (meta init failure)
you will simply not train (loss is same every iter).
If you fail to call reset params after FSDP, then you will train (b/c we
default to torch.randn_like) but your starting loss will be 5x+ higher
(telling you that you have not properly init'ed the model).
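The two failure signatures in the note above can be turned into a quick sanity check on the first few logged losses. This is a sketch under stated assumptions: `diagnose_init` is a hypothetical helper, and using `ln(vocab_size)` (the cross-entropy of a uniform prediction) as the reference for a healthy starting loss is an assumption of this example, not something stated in the PR.

```python
import math
from typing import Sequence


def diagnose_init(
    losses: Sequence[float], vocab_size: int, ratio: float = 5.0, tol: float = 1e-6
) -> str:
    """Classify early training losses using the two symptoms described above."""
    # Assumption: a healthy starting loss is near ln(vocab_size).
    reference = math.log(vocab_size)
    if len(losses) >= 2 and max(losses) - min(losses) <= tol:
        return "loss is flat: possible meta init failure"
    if losses and losses[0] >= ratio * reference:
        return "starting loss is very high: reset params may not have been called"
    return "ok"


# First three rank0 losses from the 2024-02-28 debug run (vocab size 32000).
print(diagnose_init([10.9229, 10.8673, 10.7145], vocab_size=32000))  # ok
```

With a vocabulary of 32000 the reference is about 10.37, so the debug run's starting loss of about 10.9 passes while a starting loss above roughly 52 would be flagged.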
ghstack-source-id: 5133a8d97ad209b569e0fc528e58daafdd31d80d
Pull Request resolved: pytorch#114
ghstack-source-id: a0c8b4454f75ad1cd9824ac89a1df0182f6a7d8c
Pull Request resolved: pytorch#112
…data' at 40 iters issue) (pytorch#88)

This PR adds the minipile (1M, 6GB) dataset as an option for pretraining
with torchtrain.
It resolves the issue where we run out of data after 40 iterations with
the default alpaca dataset.
Per @tianyu-l's excellent suggestion, I have refactored to a single
hf_datasets.py file that supports both minipile and alpaca, since it
turned out there was no need for a different tokenizer, etc.
Also cleaned up the datasets package so that create_tokenizer is exposed
directly, and thus all public APIs can be used directly from
torchtrain.datasets.
Lastly, added a warning if/when a dataset is being re-looped so users
don't get burned by overfitting:
<img width="1294" alt="Screenshot 2024-03-06 at 5 11 09 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/82480b6f-c677-4794-80c5-5c10b037732a">


Adds a color highlight to showcase what dataloader was built:
<img width="1360" alt="Screenshot 2024-03-05 at 9 19 10 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/4717ec6a-14bb-4283-a3ae-fa40c27deee0">
and
<img width="1360" alt="Screenshot 2024-03-05 at 9 22 01 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/dbf32d51-2dd4-4526-8855-9b33b627559e">


Usage:
just add "minipile" or "alpaca" as the dataset in the training config
toml file.
<img width="439" alt="Screenshot 2024-02-25 at 12 35 26 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/1afbaed1-07f8-4e37-b8cc-80190db7fb27">

Testing:
verified training loss is improving and ran for 100 iters to verify no
issue with out of data any longer with minipile.
reran with alpaca and saw the expected out of data at 40 iters without
infinite loop option, runs to 100 with infinite.
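The re-loop warning and the infinite option can be sketched with a plain generator. This is an illustration, not the code in hf_datasets.py; `iterate_samples` and its parameters are hypothetical names.

```python
import logging
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def iterate_samples(
    dataset: Iterable[T], infinite: bool = False, name: str = "dataset"
) -> Iterator[T]:
    """Yield samples once, or forever (with a warning on each restart) if `infinite`."""
    while True:
        yielded_any = False
        for sample in dataset:
            yielded_any = True
            yield sample
        if not infinite or not yielded_any:
            # Finite mode runs out of data here; an empty dataset also stops,
            # so infinite mode cannot spin without producing anything.
            return
        logger.warning("Dataset %s is being re-looped; repeated data may cause overfitting", name)


stream = iterate_samples(["a", "b", "c"], infinite=True, name="alpaca")
print([next(stream) for _ in range(7)])  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']
```

In finite mode the generator simply ends, which corresponds to the "out of data" behavior described for alpaca at 40 iters.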

Notes:
I did not make this a default dataset since for debugmodel, mostly
running 10 iters is fine and there's 6GB to pull down.
<img width="869" alt="Screenshot 2024-02-25 at 12 30 29 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/1070a80a-ad20-4f0f-a860-e13caa3120a0">
ghstack-source-id: 3c930054d3b04faf3866048740a2ef887d066dd6
Pull Request resolved: pytorch#117
ghstack-source-id: 733bf85716cda3a5b9af780eba79c9b5dd66abad
Pull Request resolved: pytorch#121
ghstack-source-id: d7cd26d84aa2442ac45223992e1766397e52c8d8
Pull Request resolved: pytorch#122
according to suggestions in pytorch#118 (comment)

ghstack-source-id: 357f0872cd1c9bad2c4c256d47adbd3f716a7651
Pull Request resolved: pytorch#123
…t job configs (pytorch#124)

This PR:
1 - adds the english language portion of c4 dataset, which has 177M
entries. (https://huggingface.co/datasets/allenai/c4)

Due to the size, streaming is enabled as the default.  
This is the allen-ai/c4, as apparently the original c4 is being
deprecated and HF advises to use allen-ai now.

For comparison per @tianyu-l request - 40 iterations avg time:
alpaca cached loading: Average data load time: 0.0279 seconds
c4 streaming loading: Average data load time: 0.0290 seconds

There is a longer initial delay during the 'preparing c4' vs alpaca
(i.e. 45 seconds vs 10 seconds), but after that speed is similar.

Dataset sample (not displayed in training, just an excerpt I pulled to
double check the data flow):
<img width="1233" alt="Screenshot 2024-03-08 at 5 31 06 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/94915f80-da70-48d1-8c43-43f874fef121">

2 - I also updated the multi-node slurm file to account for the new job
config.

Test:
verified no looping with 100 iterations, 
sampled data streamed to verify.
…ytorch#130)

This PR adds the openwebtext 1M dataset.
This is a homogeneous dataset, so we are able to train successfully
without any shuffle in our dataset loader.

1 - adds the dataset to hf_datasets
2 - makes openwebtext the default dataset for 13b and 70b, since that
is the preferred choice for larger scale training.

Testing - ran 5K iters (9 nodes) to verify no spiking issues:

<img width="787" alt="Screenshot 2024-03-12 at 9 50 57 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/420fa1fc-50f8-47bc-9b07-02c8fa132e7c">
…pytorch#131)

This fix would temporarily unblock loading. So we won't run into the
issue of:

```
[rank0]:[rank0]:     train_state.losses.append(train_state.current_loss)
[rank0]:[rank0]: AttributeError: 'float' object has no attribute 'append'
```

However, current_loss and losses are still not correct: with the current
setup, losses and current_loss would differ across ranks. Also, we don't
know the size of losses because it depends on the # of steps. So loading
still works, but the values of current_loss and losses are not loaded
correctly.

I will follow up with further fixes.
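The `AttributeError` above is what happens when a list-valued field comes back from a checkpoint as a scalar. The sketch below shows that failure mode and one defensive way to restore the field; it is a simplified, hypothetical `TrainState` and not the fix applied in the PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TrainState:
    step: int = 0
    current_loss: float = -1.0
    losses: List[float] = field(default_factory=list)

    def state_dict(self) -> Dict[str, Any]:
        return {
            "step": self.step,
            "current_loss": self.current_loss,
            "losses": list(self.losses),
        }

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.step = int(state_dict["step"])
        self.current_loss = float(state_dict["current_loss"])
        losses = state_dict["losses"]
        # If a scalar was stored here, assigning it directly would make a
        # later `self.losses.append(...)` fail with the error shown above.
        if isinstance(losses, (int, float)):
            losses = [losses]
        self.losses = [float(x) for x in losses]


state = TrainState()
state.load_state_dict({"step": 3, "current_loss": 9.5, "losses": 9.5})
state.losses.append(9.4)
print(state.losses)  # [9.5, 9.4]
```

This only keeps `append` working after a load; it does not address the cross-rank differences described above.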
ghstack-source-id: de61ec093b43a2ccbf1156c76ba81ecd698a6a8a
Pull Request resolved: pytorch#132
simplify things given we already have SequenceParallel style landed in
main
ghstack-source-id: c13ebb8de8e8e9203624b5dd710a046d17311b0f
Pull Request resolved: pytorch#137
ghstack-source-id: ca6eb8f42bf3c2a59d8e6389e7fe94ed85103099
Pull Request resolved: pytorch#136
fduwjj and others added 18 commits August 1, 2024 16:16
With the official launch of the LLaMa 3.1 model, we want to add the config
to TorchTitan. Of course, there is more work to be done, but we want to
proceed incrementally, so more PRs will be needed.

For now, we tried on 128 GPUs with the current config (TP=8, FSDP=16). The
perf numbers are wps: 109, mfu: 29%.
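The parallelism degrees have to multiply to the GPU count (8 x 16 = 128 here). A small sketch of that arithmetic, using a hypothetical `infer_dp_degree` helper that is not part of TorchTitan:

```python
def infer_dp_degree(world_size: int, tp: int, pp: int = 1) -> int:
    """Return the data-parallel (FSDP) degree left after tensor and pipeline parallelism."""
    model_parallel = tp * pp
    if tp < 1 or pp < 1 or world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp={model_parallel}"
        )
    return world_size // model_parallel


print(infer_dp_degree(world_size=128, tp=8))  # 16
```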

Loss curve for 3000 steps with 600 warmup (lr = 0.8e-4).
<img width="1037" alt="image"
src="https://github.com/user-attachments/assets/f57dd3fa-07d8-4ef4-8f68-8f7a08e9652e">


Loss curve for 3000 steps with 600 warmup (lr = 1.1e-4).

![image](https://github.com/user-attachments/assets/429b9738-94cb-4b37-90ef-049a5587ddd0)
ghstack-source-id: 587e3d6e5270714ca734b8031ce41a962e6394ea
Pull Request resolved: pytorch#498
ghstack-source-id: 63af8025c184fd5ad34f2f57bf78a37dda2cd33d
Pull Request resolved: pytorch#443
ghstack-source-id: 5ebb4adf3152f413fa33a923c272c9aa3ce1f775
Pull Request resolved: pytorch#499
ghstack-source-id: 52ed6836de39e82c4c5824a40ecfc1d9ec7ed2bd
Pull Request resolved: pytorch#500
As titled, add a warning for compiled rmsnorm, since it's not fully
ready yet (see issue pytorch#497).

We can remove this warning once we fix the issue.
Add a float8 link to the README so we can redirect people from the
dev-discuss post to the torchtitan repo.


README looks like this after rendering
<img width="518" alt="Screenshot 2024-08-06 at 5 42 10 PM"
src="https://github.com/user-attachments/assets/50af99d7-93be-459a-89d7-8c08b8fb95d4">

float8.md looks like this
<img width="563" alt="Screenshot 2024-08-06 at 5 04 17 PM"
src="https://github.com/user-attachments/assets/06d30aad-4133-4cec-9037-cfcf155b45c4">

I tried the command locally and traces are looking good
<img width="726" alt="Screenshot 2024-08-06 at 5 00 00 PM"
src="https://github.com/user-attachments/assets/bdfa3d7e-efe1-4009-92a1-0f5c310013fb">
ghstack-source-id: 2927f0a8082171da3e9f59a5d04f8325cbdf3653
Pull Request resolved: pytorch#508
ghstack-source-id: 003bfbfbcf1511ddbd18e15d031b39f597d8e7db
Pull Request resolved: pytorch#510
… entries

ghstack-source-id: 319f4961b092778703101b98937803073132afa1
Pull Request resolved: pytorch#512
Explain the rationale and challenges behind certain changes we made to
the llama model to support 3D parallelism.

---------

Co-authored-by: tianyu-l <[email protected]>
ghstack-source-id: 1965d3122885fed3c28e2e058c55581187e7816c
Pull Request resolved: pytorch#513
…ch#516)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* pytorch#473
* pytorch#517
* __->__ pytorch#516

Allows PP to be used without a seed checkpoint by calling `init_weight`
on each model part. This is the solution in step 1 of
pytorch#514 proposed by @wconstab
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* pytorch#473
* __->__ pytorch#517
* pytorch#516

Ran `pre-commit run --all-files`
…ng DTensor strided sharding (pytorch#507)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* __->__ pytorch#507

**Summary**
1. Check whether users are on a recent PyTorch nightly build that includes
DTensor strided sharding
(pytorch/pytorch#130760) when 2D/3D parallelism is used.
Print a warning if not.
2. Remove the temporary re-enablement added in pytorch#460.
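
One way to implement a check like the one in item 1 is to read the build date embedded in a nightly version string. The sketch below is an assumption, not the code from this PR: the helper names, the version-string format (`<version>.dev<YYYYMMDD>`), and the cutoff date are all hypothetical placeholders.

```python
import re
import warnings
from datetime import date
from typing import Optional

# Hypothetical cutoff; the actual date the feature landed is not stated here.
ASSUMED_CUTOFF = date(2024, 8, 1)


def nightly_build_date(version: str) -> Optional[date]:
    """Extract the build date from a string such as '2.5.0.dev20240806+cu121'."""
    match = re.search(r"\.dev(\d{4})(\d{2})(\d{2})", version)
    if match is None:
        return None  # no nightly-style date in the version string
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


def warn_if_too_old(version: str, cutoff: date = ASSUMED_CUTOFF) -> bool:
    """Warn and return True when the build predates the cutoff or cannot be dated."""
    built = nightly_build_date(version)
    if built is None or built < cutoff:
        warnings.warn(
            f"PyTorch build {version!r} may not include DTensor strided "
            "sharding; consider upgrading to a newer nightly."
        )
        return True
    return False
```

With the assumed cutoff, `warn_if_too_old("2.5.0.dev20240701+cu121")` warns, while a build dated on or after the cutoff does not.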

**Test**
Command: `python test_runner.py outputs --test pp_dp_tp --ngpu 8`
GPUs: A100
Output:
- without strided sharding:
```
[rank7]:2024-08-06 03:21:26,706 - root - INFO - step:  2  loss:  8.1652  memory:  0.51GiB(0.64%)  wps: 8,250  mfu: 0.25%
[rank7]:2024-08-06 03:21:27,013 - root - INFO - step:  3  loss:  8.0951  memory:  0.51GiB(0.64%)  wps: 13,358  mfu: 0.41%
[rank7]:2024-08-06 03:21:27,309 - root - INFO - step:  4  loss:  7.9748  memory:  0.51GiB(0.64%)  wps: 13,865  mfu: 0.42%
[rank7]:2024-08-06 03:21:27,582 - root - INFO - step:  5  loss:  7.8025  memory:  0.51GiB(0.64%)  wps: 15,057  mfu: 0.46%
[rank7]:2024-08-06 03:21:28,076 - root - INFO - step:  6  loss:  7.5612  memory:  0.51GiB(0.64%)  wps: 8,300  mfu: 0.25%
[rank7]:2024-08-06 03:21:28,608 - root - INFO - step:  7  loss:  7.3649  memory:  0.51GiB(0.64%)  wps: 7,705  mfu: 0.23%
[rank7]:2024-08-06 03:21:28,927 - root - INFO - step:  8  loss:  7.2946  memory:  0.51GiB(0.64%)  wps: 12,832  mfu: 0.39%
[rank7]:2024-08-06 03:21:29,251 - root - INFO - step:  9  loss:  7.1311  memory:  0.51GiB(0.64%)  wps: 12,669  mfu: 0.38%
[rank7]:2024-08-06 03:21:29,627 - root - INFO - step: 10  loss:  7.0540  memory:  0.51GiB(0.64%)  wps: 10,918  mfu: 0.33%
>>>>>>>>>>>>>>>>>Checkpoint save & load<<<<<<<<<<<<<<<<<<<
[rank7]:2024-08-06 03:21:59,723 - root - INFO - step: 11  loss:  7.0822  memory:  0.51GiB(0.64%)  wps: 1,139  mfu: 0.03%
[rank7]:2024-08-06 03:22:00,054 - root - INFO - step: 12  loss:  7.0508  memory:  0.51GiB(0.64%)  wps: 12,366  mfu: 0.38%
[rank7]:2024-08-06 03:22:00,340 - root - INFO - step: 13  loss:  6.9182  memory:  0.51GiB(0.64%)  wps: 14,370  mfu: 0.44%
[rank7]:2024-08-06 03:22:00,624 - root - INFO - step: 14  loss:  6.8948  memory:  0.51GiB(0.64%)  wps: 14,442  mfu: 0.44%
[rank7]:2024-08-06 03:22:00,907 - root - INFO - step: 15  loss:  6.8358  memory:  0.51GiB(0.64%)  wps: 14,514  mfu: 0.44%
[rank7]:2024-08-06 03:22:01,574 - root - INFO - step: 16  loss:  6.7653  memory:  0.51GiB(0.64%)  wps: 6,144  mfu: 0.19%
[rank7]:2024-08-06 03:22:02,209 - root - INFO - step: 17  loss:  6.7340  memory:  0.51GiB(0.64%)  wps: 6,453  mfu: 0.20%
[rank7]:2024-08-06 03:22:02,532 - root - INFO - step: 18  loss:  6.6874  memory:  0.51GiB(0.64%)  wps: 12,695  mfu: 0.39%
[rank7]:2024-08-06 03:22:02,863 - root - INFO - step: 19  loss:  6.6566  memory:  0.51GiB(0.64%)  wps: 12,406  mfu: 0.38%
[rank7]:2024-08-06 03:22:03,257 - root - INFO - step: 20  loss:  6.6629  memory:  0.51GiB(0.64%)  wps: 10,392  mfu: 0.32%
```
- with strided sharding:
```
[rank7]:2024-08-06 03:26:18,288 - root - INFO - step:  1  loss:  8.2069  memory:  0.50GiB(0.63%)  wps: 915  mfu: 0.03%
[rank7]:2024-08-06 03:26:19,084 - root - INFO - step:  2  loss:  8.1913  memory:  0.51GiB(0.64%)  wps: 5,144  mfu: 0.16%
[rank7]:2024-08-06 03:26:19,365 - root - INFO - step:  3  loss:  8.1148  memory:  0.51GiB(0.64%)  wps: 14,593  mfu: 0.44%
[rank7]:2024-08-06 03:26:19,698 - root - INFO - step:  4  loss:  7.9982  memory:  0.51GiB(0.64%)  wps: 12,328  mfu: 0.37%
[rank7]:2024-08-06 03:26:20,011 - root - INFO - step:  5  loss:  7.8382  memory:  0.51GiB(0.64%)  wps: 13,100  mfu: 0.40%
[rank7]:2024-08-06 03:26:20,498 - root - INFO - step:  6  loss:  7.6293  memory:  0.51GiB(0.64%)  wps: 8,423  mfu: 0.26%
[rank7]:2024-08-06 03:26:21,126 - root - INFO - step:  7  loss:  7.4454  memory:  0.51GiB(0.64%)  wps: 6,530  mfu: 0.20%
[rank7]:2024-08-06 03:26:21,472 - root - INFO - step:  8  loss:  7.3337  memory:  0.51GiB(0.64%)  wps: 11,843  mfu: 0.36%
[rank7]:2024-08-06 03:26:21,849 - root - INFO - step:  9  loss:  7.1960  memory:  0.51GiB(0.64%)  wps: 10,892  mfu: 0.33%
[rank7]:2024-08-06 03:26:22,229 - root - INFO - step: 10  loss:  7.1208  memory:  0.51GiB(0.64%)  wps: 10,798  mfu: 0.33%
>>>>>>>>>>>>>>>>>Checkpoint save & load<<<<<<<<<<<<<<<<<<<
[rank7]:2024-08-06 03:26:50,306 - root - INFO - step: 11  loss:  7.1222  memory:  0.51GiB(0.64%)  wps: 866  mfu: 0.03%
[rank7]:2024-08-06 03:26:50,632 - root - INFO - step: 12  loss:  7.1189  memory:  0.51GiB(0.64%)  wps: 12,589  mfu: 0.38%
[rank7]:2024-08-06 03:26:50,917 - root - INFO - step: 13  loss:  6.9646  memory:  0.51GiB(0.64%)  wps: 14,417  mfu: 0.44%
[rank7]:2024-08-06 03:26:51,217 - root - INFO - step: 14  loss:  6.9626  memory:  0.51GiB(0.64%)  wps: 13,680  mfu: 0.42%
[rank7]:2024-08-06 03:26:51,514 - root - INFO - step: 15  loss:  6.8694  memory:  0.51GiB(0.64%)  wps: 13,799  mfu: 0.42%
[rank7]:2024-08-06 03:26:52,207 - root - INFO - step: 16  loss:  6.7994  memory:  0.51GiB(0.64%)  wps: 5,910  mfu: 0.18%
[rank7]:2024-08-06 03:26:53,053 - root - INFO - step: 17  loss:  6.7634  memory:  0.51GiB(0.64%)  wps: 4,847  mfu: 0.15%
[rank7]:2024-08-06 03:26:53,370 - root - INFO - step: 18  loss:  6.7233  memory:  0.51GiB(0.64%)  wps: 12,915  mfu: 0.39%
[rank7]:2024-08-06 03:26:53,686 - root - INFO - step: 19  loss:  6.7054  memory:  0.51GiB(0.64%)  wps: 12,995  mfu: 0.39%
[rank7]:2024-08-06 03:26:54,059 - root - INFO - step: 20  loss:  6.7130  memory:  0.51GiB(0.64%)  wps: 10,991  mfu: 0.33%
```
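
To compare the two runs beyond eyeballing the logs, the per-step metrics can be parsed out of lines in the format shown above. This is a minimal sketch: the function name and the regular expression are illustrative and were written against the log lines in this message only.

```python
import re
from statistics import mean
from typing import Dict, List

LINE_RE = re.compile(
    r"step:\s*(?P<step>\d+)\s+loss:\s*(?P<loss>[\d.]+)\s+"
    r"memory:\s*(?P<mem_gib>[\d.]+)GiB\((?P<mem_pct>[\d.]+)%\)\s+"
    r"wps:\s*(?P<wps>[\d,]+)\s+mfu:\s*(?P<mfu>[\d.]+)%"
)


def parse_metrics(log_text: str) -> List[Dict[str, float]]:
    """Return one dict per matching log line; non-matching lines are skipped."""
    rows = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # e.g. the checkpoint save & load banner
        rows.append({
            "step": int(m["step"]),
            "loss": float(m["loss"]),
            "mem_gib": float(m["mem_gib"]),
            "wps": int(m["wps"].replace(",", "")),  # drop thousands separators
            "mfu": float(m["mfu"]),
        })
    return rows


# Two metric lines copied from the "without strided sharding" run, with the
# banner line placed between them only to show that it is skipped.
sample = """\
[rank7]:2024-08-06 03:21:26,706 - root - INFO - step:  2  loss:  8.1652  memory:  0.51GiB(0.64%)  wps: 8,250  mfu: 0.25%
>>>>>>>>>>>>>>>>>Checkpoint save & load<<<<<<<<<<<<<<<<<<<
[rank7]:2024-08-06 03:21:27,013 - root - INFO - step:  3  loss:  8.0951  memory:  0.51GiB(0.64%)  wps: 13,358  mfu: 0.41%
"""
rows = parse_metrics(sample)
avg_wps = mean(row["wps"] for row in rows)
```

Feeding each full log block through `parse_metrics` gives per-run lists that can be averaged or diffed step by step.
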
`torch.nn.Module.to_empty` takes `device` as a keyword-only argument,
according to
https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.to_empty
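
In practice this means the device must be passed by name. The sketch below uses a small stand-in class rather than PyTorch itself, so it runs without any dependencies; the class `FakeModule` is hypothetical and only mirrors the documented `to_empty(*, device, recurse=True)` signature.

```python
class FakeModule:
    """Stand-in that copies only the keyword-only shape of the signature."""

    def __init__(self) -> None:
        self.device = "meta"

    def to_empty(self, *, device, recurse=True):
        # The bare `*` forces callers to write `device=...` explicitly.
        self.device = device
        return self


module = FakeModule()
module.to_empty(device="cpu")  # accepted: keyword argument

try:
    module.to_empty("cuda")  # rejected: positional argument
except TypeError as err:
    positional_error = str(err)
```
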
tianyu-l and others added 5 commits August 15, 2024 16:06
ghstack-source-id: 7e1c7071f8126072ab0e25194b75f280bf4277ec
Pull Request resolved: pytorch#523
ghstack-source-id: c8f611742ffbb4859988b97e706b9e0d1b4ad6f1
Pull Request resolved: pytorch#521
ghstack-source-id: 9894aa1bc6d6026f59d6a4cc28b573dbb87d20d0
Pull Request resolved: pytorch#526
@tigranfah tigranfah left a comment

ok

@philippguevorguian philippguevorguian merged commit 4eb849d into main Aug 20, 2024
0 of 5 checks passed
@philippguevorguian philippguevorguian deleted the remove_unneeded_features branch August 21, 2024 15:29
ProgerDav pushed a commit that referenced this pull request Sep 10, 2024
Co-authored-by: Linsong Chu <[email protected]>