Basic integration test infra (#170)
Summary:
This PR adds an option `use_for_integration_test`. When set to `True`, the
config is added to the integration test suite. A test runner picks up all
configs marked for integration testing and runs them.
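
For example, a config opts into the suite with a single line in its `[job]`
table (this mirrors the `train_configs/debug_model.toml` change in this diff):

```
[job]
dump_folder = "./outputs"
description = "LLaMA debug training"
use_for_integration_test = true
```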

Test Plan:
```
=====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh=====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757]
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
[rank0]:2024-03-27 09:46:32,214 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-03-27 09:46:32,372 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-03-27 09:46:32,375 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-03-27 09:46:32,377 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-27 09:46:32,384 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-03-27 09:46:32,384 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-03-27 09:46:34,015 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-27 09:46:34,024 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-27 09:46:34,025 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied FSDP to the model
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Model fully initialized via reset_parameters
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Gradient scaling not enabled
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-0946
[rank0]:2024-03-27 09:46:34,809 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
[rank0]:  warnings.warn(
[rank0]:2024-03-27 09:46:35,627 - root - INFO - step:  1  loss: 10.9486  memory:  9.42GiB(9.91%)  wps: 20,066  mfu: 0.25%
[rank0]:2024-03-27 09:46:35,627 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-03-27 09:46:35,705 - root - INFO - step:  2  loss: 10.8786  memory: 11.38GiB(11.97%)  wps: 212,046  mfu: 2.60%
[rank0]:2024-03-27 09:46:35,786 - root - INFO - step:  3  loss: 10.7362  memory: 11.38GiB(11.97%)  wps: 204,441  mfu: 2.50%
[rank0]:2024-03-27 09:46:35,863 - root - INFO - step:  4  loss: 10.5094  memory: 11.38GiB(11.97%)  wps: 216,800  mfu: 2.66%
[rank0]:2024-03-27 09:46:35,939 - root - INFO - step:  5  loss: 10.2755  memory: 11.38GiB(11.97%)  wps: 216,527  mfu: 2.65%
[rank0]:2024-03-27 09:46:36,016 - root - INFO - step:  6  loss: 10.0318  memory: 11.38GiB(11.97%)  wps: 214,117  mfu: 2.62%
[rank0]:2024-03-27 09:46:36,093 - root - INFO - step:  7  loss:  9.7929  memory: 11.38GiB(11.97%)  wps: 216,509  mfu: 2.65%
[rank0]:2024-03-27 09:46:36,192 - root - INFO - step:  8  loss:  9.5539  memory: 11.38GiB(11.97%)  wps: 166,639  mfu: 2.04%
[rank0]:2024-03-27 09:46:36,329 - root - INFO - step:  9  loss:  9.3909  memory: 11.38GiB(11.97%)  wps: 120,381  mfu: 1.47%
[rank0]:[rank0]:[W327 09:46:36.744143018 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-03-27 09:46:36,409 - root - INFO - step: 10  loss:  9.2749  memory: 11.38GiB(11.97%)  wps: 207,613  mfu: 2.54%
[rank0]:NCCL version 2.20.5+cuda12.0

```

Reviewers:

Subscribers:

Tasks:

Tags:

---------

Co-authored-by: gnadathur <[email protected]>
gnadathur and gnadathur authored Mar 27, 2024
1 parent bb61af0 commit 6500bc6
Showing 4 changed files with 40 additions and 2 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/unit_test_4gpu.yaml
@@ -37,7 +37,7 @@ jobs:
           python -m pip install -r requirements.txt
           python -m pip install -r dev-requirements.txt
           python -m pip install -e .
-      - name: Run NGPU=4 ./run_llama_train.sh
-        run: NGPU=4 ./run_llama_train.sh
+      - name: Run test_runner.py
+        run: python ./test/test_runner.py
       - name: Upload Coverage to Codecov
         uses: codecov/codecov-action@v3
31 changes: 31 additions & 0 deletions test/test_runner.py
@@ -0,0 +1,31 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

import os
import subprocess

# tomllib is in the standard library from Python 3.11; fall back to tomli on older versions.
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib

CONFIG_DIR = "./train_configs"

# Run every config that opts in via [job] use_for_integration_test = true.
for config_file in os.listdir(CONFIG_DIR):
    if config_file.endswith(".toml"):
        full_path = os.path.join(CONFIG_DIR, config_file)
        with open(full_path, "rb") as f:
            config = tomllib.load(f)
        is_integration_test = config["job"].get("use_for_integration_test", False)
        if is_integration_test:
            # Pass the command as a single string with shell=True so the
            # CONFIG_FILE=... NGPU=4 environment prefix is handled by the shell.
            cmd = f"CONFIG_FILE={full_path} NGPU=4 ./run_llama_train.sh"
            print(f"=====Integration test: {cmd}=====")
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                shell=True,
            )
            print(result.stdout)
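
The runner can be invoked locally the same way the CI workflow above does (this
assumes a machine with 4 GPUs, since each selected config is launched with
`NGPU=4`):

```
python ./test/test_runner.py
```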
6 changes: 6 additions & 0 deletions torchtrain/config_manager.py
@@ -61,6 +61,12 @@ def __init__(self):
default="default job",
help="description of the job",
)
self.parser.add_argument(
"--job.use_for_integration_test",
default=False,
action="store_true",
help="add this config to integration test suite",
)
# profiling configs
self.parser.add_argument(
"--profiling.run_profiler",
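Since the option is also registered on the argument parser with
`action="store_true"`, it should be toggleable per run from the command line as
well. A hypothetical invocation, following the `--job.config_file` pattern from
the test plan above:

```
python train.py --job.config_file ./train_configs/debug_model.toml --job.use_for_integration_test
```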
1 change: 1 addition & 0 deletions train_configs/debug_model.toml
@@ -2,6 +2,7 @@
 [job]
 dump_folder = "./outputs"
 description = "LLaMA debug training"
+use_for_integration_test = true

 [profiling]
 run_profiler = true