Summary: This PR adds an option `use_for_integration_test`. When set to `True`, it adds the config to the integration test suite. A test runner picks up all the configs marked for integration testing and runs them.

Test Plan:
```
=====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh=====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757]
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
[rank0]:2024-03-27 09:46:32,214 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-03-27 09:46:32,372 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-03-27 09:46:32,375 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-03-27 09:46:32,377 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-27 09:46:32,384 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-03-27 09:46:32,384 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-03-27 09:46:34,015 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-27 09:46:34,024 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-27 09:46:34,025 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied FSDP to the model
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Model fully initialized via reset_parameters
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Gradient scaling not enabled
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-0946
[rank0]:2024-03-27 09:46:34,809 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
[rank0]:  warnings.warn(
[rank0]:2024-03-27 09:46:35,627 - root - INFO - step:  1  loss: 10.9486  memory:  9.42GiB(9.91%)  wps: 20,066  mfu: 0.25%
[rank0]:2024-03-27 09:46:35,627 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-03-27 09:46:35,705 - root - INFO - step:  2  loss: 10.8786  memory: 11.38GiB(11.97%)  wps: 212,046  mfu: 2.60%
[rank0]:2024-03-27 09:46:35,786 - root - INFO - step:  3  loss: 10.7362  memory: 11.38GiB(11.97%)  wps: 204,441  mfu: 2.50%
[rank0]:2024-03-27 09:46:35,863 - root - INFO - step:  4  loss: 10.5094  memory: 11.38GiB(11.97%)  wps: 216,800  mfu: 2.66%
[rank0]:2024-03-27 09:46:35,939 - root - INFO - step:  5  loss: 10.2755  memory: 11.38GiB(11.97%)  wps: 216,527  mfu: 2.65%
[rank0]:2024-03-27 09:46:36,016 - root - INFO - step:  6  loss: 10.0318  memory: 11.38GiB(11.97%)  wps: 214,117  mfu: 2.62%
[rank0]:2024-03-27 09:46:36,093 - root - INFO - step:  7  loss:  9.7929  memory: 11.38GiB(11.97%)  wps: 216,509  mfu: 2.65%
[rank0]:2024-03-27 09:46:36,192 - root - INFO - step:  8  loss:  9.5539  memory: 11.38GiB(11.97%)  wps: 166,639  mfu: 2.04%
[rank0]:2024-03-27 09:46:36,329 - root - INFO - step:  9  loss:  9.3909  memory: 11.38GiB(11.97%)  wps: 120,381  mfu: 1.47%
[rank0]:[rank0]:[W327 09:46:36.744143018 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-03-27 09:46:36,409 - root - INFO - step: 10  loss:  9.2749  memory: 11.38GiB(11.97%)  wps: 207,613  mfu: 2.54%
[rank0]:NCCL version 2.20.5+cuda12.0
```

Reviewers:

Subscribers:

Tasks:

Tags:

---------

Co-authored-by: gnadathur <[email protected]>