[TorchTitan][Checkpoint] Move checkpoint folder under dump_folder and a few config updates #230
Conversation
train.py
Outdated
checkpoint = CheckpointManager(
    model=model,
    optimizer=optimizer,
    states={"train_state": train_state},
-   folder=job_config.checkpoint.folder,
+   folder=ckpt_folder,
Trying to see if we can further simplify train.py for the checkpoint logic. Can we pass job_config to CheckpointManager, and handle the following inside the CheckpointManager constructor:
- the checkpoint folder logic above
- setting all the options like interval_type/interval/model_weights_only
Sure. I think I can take the entire job_config.checkpoint and handle this inside checkpoint.py. Let me do that.
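As a rough sketch of what "take the entire job_config and handle it inside checkpoint.py" could look like — note that the dataclasses and field names below are illustrative stand-ins, not torchtitan's actual config definitions:

```python
import os
from dataclasses import dataclass, field


# Hypothetical stand-ins for torchtitan's config objects, for illustration only.
@dataclass
class CheckpointConfig:
    enable_checkpoint: bool = True
    checkpoint_folder: str = "checkpoint"
    interval_type: str = "steps"
    interval: int = 500
    model_weights_only: bool = False


@dataclass
class JobSection:
    dump_folder: str = "./outputs"


@dataclass
class JobConfig:
    job: JobSection = field(default_factory=JobSection)
    checkpoint: CheckpointConfig = field(default_factory=CheckpointConfig)


class CheckpointManager:
    def __init__(self, model=None, optimizer=None, states=None, job_config=None):
        ckpt_config = job_config.checkpoint  # alias, since it is read several times
        self.enable_checkpoint = ckpt_config.enable_checkpoint
        if not self.enable_checkpoint:
            return  # no-op manager when checkpointing is disabled
        self.states = states or {}
        self.interval_type = ckpt_config.interval_type
        self.interval = ckpt_config.interval
        self.model_weights_only = ckpt_config.model_weights_only
        # Checkpoints always live under the job's dump_folder.
        self.folder = os.path.join(job_config.job.dump_folder, ckpt_config.checkpoint_folder)


manager = CheckpointManager(states={}, job_config=JobConfig())
print(manager.folder)  # "./outputs/checkpoint" on POSIX
```

This keeps train.py down to a single constructor call, with all option plumbing living in checkpoint.py.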
torchtitan/config_manager.py
Outdated
@@ -234,18 +234,19 @@ def __init__(self):
    type=str,
    default="",
As discussed offline, can we use None as the default and use it to disable checkpointing? An empty string is also a valid relative path.
If this is an empty string, I'll set the ckpt folder to None.
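The "empty string means disabled" mapping could be sketched like this (resolve_ckpt_folder is a hypothetical helper name, not torchtitan code):

```python
import os
from typing import Optional


def resolve_ckpt_folder(dump_folder: str, folder: Optional[str]) -> Optional[str]:
    """Treat an empty string the same as None: checkpointing is disabled.

    This matters because an empty string is also a valid relative path,
    so it cannot be used directly to signal "no checkpointing".
    """
    if not folder:  # catches both None and ""
        return None
    return os.path.join(dump_folder, folder)


print(resolve_ckpt_folder("./outputs", ""))            # None (disabled)
print(resolve_ckpt_folder("./outputs", "checkpoint"))  # joined path under dump_folder
```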
train.py
Outdated
@@ -229,11 +229,18 @@ def loss_fn(pred, labels):
    # train loop
    model.train()

    ckpt_folder = job_config.checkpoint.folder
    ckpt_folder = (
        os.path.join(job_config.job.dump_folder, ckpt_folder)
IIRC one of us proposed that we should support both relative and absolute paths. I'm OK with either way.
Based on the discussion today, we are putting the checkpoints under dump_folder, so the path would always be relative for now.
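Worth noting that os.path.join would support both conventions with no extra code: on POSIX, when the second argument is absolute, the earlier components are discarded, so relative folders land under dump_folder while absolute folders are used as-is.

```python
import os

# A relative checkpoint folder is nested under dump_folder...
print(os.path.join("./outputs", "checkpoint"))  # ./outputs/checkpoint

# ...while an absolute one overrides it entirely (POSIX behavior:
# an absolute second argument discards all preceding components).
print(os.path.join("./outputs", "/mnt/ckpts"))  # /mnt/ckpts
```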
if self.folder:
self.enable_checkpoint = job_config.checkpoint.enable_checkpoint

if self.enable_checkpoint:
Can we have "if not self.enable_checkpoint: return" and then everything else, just like in the save and load functions? Essentially we can make CheckpointManager a no-op class when it is not enabled.
I am going to indent everything under "if self.enable_checkpoint:". For the else case, the constructor simply exits; there is no return value anyway.
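The two styles are equivalent; the early-return variant saves one indentation level. A minimal illustrative sketch (the class body here is a placeholder, not torchtitan's actual implementation):

```python
class CheckpointManager:
    """Sketch: an early return makes the manager a no-op when disabled."""

    def __init__(self, enable_checkpoint: bool, states=None):
        self.enable_checkpoint = enable_checkpoint
        if not self.enable_checkpoint:
            return  # skip all remaining setup; __init__ returns None either way
        self.states = states or {}

    def save(self, curr_step: int) -> None:
        # The same guard appears in save/load, so every public method is
        # safe to call on a disabled manager.
        if not self.enable_checkpoint:
            return
        self.states["last_saved_step"] = curr_step  # placeholder for real save logic


disabled = CheckpointManager(enable_checkpoint=False)
disabled.save(100)  # safe no-op; no state was ever allocated
```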
torchtitan/config_manager.py
Outdated
help="Whether to enable checkpoint",
)
self.parser.add_argument(
    "--checkpoint.checkpoint_folder",
On second thought, I think it's better to name it "checkpoint.folder" rather than "checkpoint.checkpoint_folder", since there is no ambiguity. The other two appearances of "folder" need a prefix because there could be ambiguity there.
torchtitan/profiling.py
Outdated
@@ -12,7 +12,7 @@
@contextlib.contextmanager
def maybe_run_profiler(config: JobConfig, *pos_args, **kwargs):
    # get user defined profiler settings
-   run_profiler = config.profiling.run_profiler
+   run_profiler = config.profiling.enable_profiling
Let's rename "run_profiler" as well, to be consistent.
train_configs/debug_model.toml
Outdated
profile_freq = 10
enable_profiling = true
save_traces_folder = "profile_trace"
# profiling frequency - example: 10 means every 10th iter will be profiled
I think we can remove the comment, as there's no ambiguity here.
train_configs/llama_13b.toml
Outdated
save_traces_folder = "profiling/traces"
enable_profiling = true
save_traces_folder = "profile_trace"
# profiling frequency - example: 10 means every 10th iter will be profiled
ditto: remove
train_configs/llama_70b.toml
Outdated
save_traces_folder = "profiling/traces"
enable_profiling = true
save_traces_folder = "profile_trace"
# profiling frequency - example: 10 means every 10th iter will be profiled
ditto: remove
train_configs/llama_7b.toml
Outdated
save_traces_folder = "profiling/traces"
enable_profiling = true
save_traces_folder = "profile_trace"
# profiling frequency - example: 10 means every 10th iter will be profiled
ditto: remove
self.work = None
self.pg = dist.new_group(backend="gloo")
self.doit = None

def reset(self) -> None:
How about renaming "reset" and "create_checkpoint_id" to "_reset" and "_create_checkpoint_id", since they are helper functions only called internally?
reset() does get used outside the class. I just updated _create_checkpoint_id.
torchtitan/checkpoint.py
Outdated

if self.enable_checkpoint:
    self.folder = os.path.join(
        job_config.job.dump_folder, job_config.checkpoint.checkpoint_folder
Since we are calling "job_config.checkpoint" several times, shall we set "checkpoint_config = job_config.checkpoint" at the beginning?
LGTM! Thanks for improving the checkpointing UX! Had some final inline comments.
torchtitan/config_manager.py
Outdated
help=(
    "Checkpointing interval. The unit of measurement is in seconds or "
    "steps depending on --checkpoint.interval_type."
    "The folder to store the checkpoints."
Nit: need a whitespace between the concatenated sentences; same for the help messages of the other checkpointing options.
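The underlying gotcha: Python's implicit concatenation of adjacent string literals inserts nothing between them, so each literal except the last needs a trailing space.

```python
# Adjacent string literals fuse with no separator...
broken = (
    "Checkpointing interval, in seconds or steps"
    "depending on --checkpoint.interval_type."
)

# ...so every literal but the last needs a trailing space.
fixed = (
    "Checkpointing interval, in seconds or steps "
    "depending on --checkpoint.interval_type."
)

print("stepsdepending" in broken)  # True: the words ran together
print("steps depending" in fixed)  # True
```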
train_configs/debug_model.toml
Outdated
profile_freq = 10
enable_profiling = true
save_traces_folder = "profile_trace"
profile_freq = 100
pls change back to 10 :)
torchtitan/checkpoint.py
Outdated
) -> None:
    self.folder = folder
    self.states = states
do we need self.states when not enabling checkpointing?
Moved under if.
… a few config updates (#230)

Let CheckpointManager take the entire job_config as an arg so we can keep train.py a little cleaner.

Discussed with @tianyu-l and made a few additional changes, including:
1. Rename "run_profiler" to "enable_profiling".
2. Add an "enable_checkpoint" flag so it is consistent with "enable_profiling" and "enable_tensorboard". We feel this is a little more explicit.
3. Change the default checkpoint folder to "./outputs/checkpoint" when checkpointing is enabled.
4. Rename "folder" in [checkpoint] to "checkpoint_folder".
5. Change save_traces_folder from "./outputs/profiling/traces" to "./outputs/profile_trace".