Summary: This PR adds an option `use_for_integration_test`. When set to `True`, it adds the config to the integration test suite. A test runner picks up all the configs marked for integration testing and runs them.

Test Plan:
```
=====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh=====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757]
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
[rank0]:2024-03-27 09:46:32,214 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-03-27 09:46:32,372 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-03-27 09:46:32,375 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-03-27 09:46:32,377 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-27 09:46:32,384 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-03-27 09:46:32,384 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-03-27 09:46:34,015 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-27 09:46:34,024 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-27 09:46:34,025 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied FSDP to the model
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Model fully initialized via reset_parameters
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Gradient scaling not enabled
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-0946
[rank0]:2024-03-27 09:46:34,809 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
[rank0]:  warnings.warn(
[rank0]:2024-03-27 09:46:35,627 - root - INFO - step:  1  loss: 10.9486  memory:  9.42GiB(9.91%)  wps: 20,066  mfu: 0.25%
[rank0]:2024-03-27 09:46:35,627 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-03-27 09:46:35,705 - root - INFO - step:  2  loss: 10.8786  memory: 11.38GiB(11.97%)  wps: 212,046  mfu: 2.60%
[rank0]:2024-03-27 09:46:35,786 - root - INFO - step:  3  loss: 10.7362  memory: 11.38GiB(11.97%)  wps: 204,441  mfu: 2.50%
[rank0]:2024-03-27 09:46:35,863 - root - INFO - step:  4  loss: 10.5094  memory: 11.38GiB(11.97%)  wps: 216,800  mfu: 2.66%
[rank0]:2024-03-27 09:46:35,939 - root - INFO - step:  5  loss: 10.2755  memory: 11.38GiB(11.97%)  wps: 216,527  mfu: 2.65%
[rank0]:2024-03-27 09:46:36,016 - root - INFO - step:  6  loss: 10.0318  memory: 11.38GiB(11.97%)  wps: 214,117  mfu: 2.62%
[rank0]:2024-03-27 09:46:36,093 - root - INFO - step:  7  loss:  9.7929  memory: 11.38GiB(11.97%)  wps: 216,509  mfu: 2.65%
[rank0]:2024-03-27 09:46:36,192 - root - INFO - step:  8  loss:  9.5539  memory: 11.38GiB(11.97%)  wps: 166,639  mfu: 2.04%
[rank0]:2024-03-27 09:46:36,329 - root - INFO - step:  9  loss:  9.3909  memory: 11.38GiB(11.97%)  wps: 120,381  mfu: 1.47%
[rank0]:[rank0]:[W327 09:46:36.744143018 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-03-27 09:46:36,409 - root - INFO - step: 10  loss:  9.2749  memory: 11.38GiB(11.97%)  wps: 207,613  mfu: 2.54%
[rank0]:NCCL version 2.20.5+cuda12.0
```

Reviewers:

Subscribers:

Tasks:

Tags:

---------

Co-authored-by: gnadathur <[email protected]>