
[TorchTitan][Checkpoint] Move checkpoint folder under dump_folder and a few config updates #230

Merged
merged 5 commits into pytorch:main on Apr 16, 2024

Conversation

wz337 (Contributor) commented Apr 15, 2024

Let CheckpointManager take the entire job_config as an arg so we can keep train.py a little cleaner.
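
A minimal sketch of what the simplified call site in train.py could look like after this change (illustrative, not the exact merged code; the keyword arguments mirror the snippets discussed below):

    # Hypothetical, simplified call site: CheckpointManager now pulls folder,
    # interval, interval_type, etc. from job_config.checkpoint internally.
    checkpoint = CheckpointManager(
        model=model,
        optimizer=optimizer,
        states={"train_state": train_state},
        job_config=job_config,
    )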

Discussed with @tianyu-l and made a few additional changes, including:

  1. Rename "run_profiler" to "enable_profiling"
  2. Add an "enable_checkpoint" flag so it is consistent with "enable_profiling" and "enable_tensorboard". We feel this is a little more explicit.
  3. Change the default checkpoint folder to "./outputs/checkpoint" when checkpointing is enabled.
  4. Rename "folder" in [checkpoint] to "checkpoint_folder".
  5. Change save_traces_folder to "./outputs/profile_trace" from "./outputs/profiling/traces".

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Apr 15, 2024
@wz337 marked this pull request as ready for review on April 15, 2024 23:46
@wz337 changed the title from "[TorchTitan] Move ckpt folder under dump_folder" to "[TorchTitan] Move checkpoint folder under dump_folder" on Apr 15, 2024
@wz337 changed the title from "[TorchTitan] Move checkpoint folder under dump_folder" to "[TorchTitan][Checkpoint] Move checkpoint folder under dump_folder" on Apr 15, 2024
train.py (Outdated)
     checkpoint = CheckpointManager(
         model=model,
         optimizer=optimizer,
         states={"train_state": train_state},
-        folder=job_config.checkpoint.folder,
+        folder=ckpt_folder,
Contributor:

Trying to see if we can further simplify train.py for the checkpoint logic.

Can we pass job_config to CheckpointManager, and then:

  1. handle the checkpoint folder logic above, and
  2. set all the options like interval_type/interval/model_weights_only inside the CheckpointManager constructor?

Contributor Author (wz337):

Sure. I think I can take the entire job_config.checkpoint and handle this inside checkpoint.py. Let me do that.

@@ -234,18 +234,19 @@ def __init__(self):
type=str,
default="",
Contributor:

As discussed offline, can we use None as the default and use it to disable checkpointing? An empty string is also a valid relative path.

Contributor Author (wz337):

If this is an empty string, I'll set the ckpt folder to None.
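
A minimal sketch of the mapping described here (illustrative only, assuming the option still defaults to an empty string):

    # Treat an empty --checkpoint.folder as "no folder", since an empty string
    # is also a valid relative path; None then disables checkpointing downstream.
    ckpt_folder = job_config.checkpoint.folder or None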

train.py (Outdated)
@@ -229,11 +229,18 @@ def loss_fn(pred, labels):
     # train loop
     model.train()

     ckpt_folder = job_config.checkpoint.folder
     ckpt_folder = (
         os.path.join(job_config.job.dump_folder, ckpt_folder)
Contributor:

IIRC one of us proposed that we should support both relative and absolute paths. I'm OK with either way.

Contributor Author (wz337):

I think based on today's discussion, we are putting the checkpoint under dump_folder, so it would always be relative for now.
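
For reference, os.path.join already handles both cases: if the second argument is an absolute path, it is returned unchanged, so always joining with dump_folder keeps relative paths under it without ruling out absolute ones later. A quick illustration (values made up):

    import os

    dump_folder = "./outputs"
    print(os.path.join(dump_folder, "checkpoint"))        # ./outputs/checkpoint
    print(os.path.join(dump_folder, "/mnt/checkpoints"))  # /mnt/checkpoints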

@wz337 requested review from wanchaol and tianyu-l on April 16, 2024 00:41
@wz337 changed the title from "[TorchTitan][Checkpoint] Move checkpoint folder under dump_folder" to "[TorchTitan][Checkpoint] Move checkpoint folder under dump_folder and a few config updates" on Apr 16, 2024
-        if self.folder:
+        self.enable_checkpoint = job_config.checkpoint.enable_checkpoint
+
+        if self.enable_checkpoint:
Contributor:

Can we have "if not self.enable_checkpoint: return" and then everything else, just like in the save and load functions? Essentially we can make CheckpointManager a no-op class when checkpointing is not enabled.

Contributor Author (wz337):

I am going to indent everything under "if self.enable_checkpoint:". In the else case, the constructor simply exits and there is no return value.
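
A minimal sketch contrasting the two structures discussed (not the merged torchtitan code; the rest of the setup is elided):

    # Reviewer's suggestion: early return makes the manager a no-op, mirroring
    # the guards in save() and load().
    class CheckpointManagerEarlyReturn:
        def __init__(self, model, optimizer, states, job_config) -> None:
            self.enable_checkpoint = job_config.checkpoint.enable_checkpoint
            if not self.enable_checkpoint:
                return
            # ... folder, states, and interval setup would go here, unindented

    # Author's approach: keep the setup indented under the flag; when disabled,
    # the constructor simply falls through and exits.
    class CheckpointManagerIndented:
        def __init__(self, model, optimizer, states, job_config) -> None:
            self.enable_checkpoint = job_config.checkpoint.enable_checkpoint
            if self.enable_checkpoint:
                # ... folder, states, and interval setup would go here, indented
                pass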

help="Whether to enable checkpoint",
)
self.parser.add_argument(
"--checkpoint.checkpoint_folder",
Contributor:

On second thought, I think it's better to name it checkpoint.folder rather than checkpoint.checkpoint_folder, since there is no ambiguity. The other two appearances of folder need a prefix because there could be ambiguity there.

@@ -12,7 +12,7 @@
 @contextlib.contextmanager
 def maybe_run_profiler(config: JobConfig, *pos_args, **kwargs):
     # get user defined profiler settings
-    run_profiler = config.profiling.run_profiler
+    run_profiler = config.profiling.enable_profiling
Contributor:

Let's rename run_profiler as well, to be consistent.

profile_freq = 10
enable_profiling = true
save_traces_folder = "profile_trace"
# profiling frequency - example: 10 means every 10th iter will be profiled
Contributor:

I think we can remove the comment, as there's no ambiguity here.

save_traces_folder = "profiling/traces"
enable_profiling = true
save_traces_folder = "profile_trace"
# profiling frequency - example: 10 means every 10th iter will be profiled
Contributor:

ditto: remove

save_traces_folder = "profiling/traces"
enable_profiling = true
save_traces_folder = "profile_trace"
# profiling frequency - example: 10 means every 10th iter will be profiled
Contributor:

ditto: remove

save_traces_folder = "profiling/traces"
enable_profiling = true
save_traces_folder = "profile_trace"
# profiling frequency - example: 10 means every 10th iter will be profiled
Contributor:

ditto: remove

        self.work = None
        self.pg = dist.new_group(backend="gloo")
        self.doit = None

    def reset(self) -> None:
Contributor:

How about renaming reset and create_checkpoint_id to _reset and _create_checkpoint_id, since they are helper functions only called within the class?

Contributor Author (wz337):

reset() does get used outside the class, so I only renamed _create_checkpoint_id.


        if self.enable_checkpoint:
            self.folder = os.path.join(
                job_config.job.dump_folder, job_config.checkpoint.checkpoint_folder
Contributor:

Since we are accessing job_config.checkpoint several times, shall we set checkpoint_config = job_config.checkpoint at the beginning?
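
A small sketch of the suggested alias, wrapped in an illustrative constructor (names follow the snippet above, where the option was still called checkpoint_folder):

    import os

    class CheckpointManager:
        def __init__(self, job_config) -> None:
            # Bind job_config.checkpoint once instead of repeating it below.
            checkpoint_config = job_config.checkpoint
            self.enable_checkpoint = checkpoint_config.enable_checkpoint
            if self.enable_checkpoint:
                self.folder = os.path.join(
                    job_config.job.dump_folder, checkpoint_config.checkpoint_folder
                )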

@wz337 requested a review from tianyu-l on April 16, 2024 03:55
tianyu-l (Contributor) left a comment:

LGTM! Thanks for improving the checkpointing UX! Had some final inline comments.

            help=(
                "Checkpointing interval. The unit of measurement is in seconds or "
                "steps depending on --checkpoint.interval_type."
                "The folder to store the checkpoints."
Contributor:

Nit: need whitespace between the sentences; same for the help messages of the other checkpointing options.
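
For example, with Python's implicit concatenation of adjacent string literals, each piece needs a trailing space before the next sentence; otherwise the --help output renders as "...--checkpoint.interval_type.The folder..." (illustrative, based on the snippet above):

    help=(
        "Checkpointing interval. The unit of measurement is in seconds or "
        "steps depending on --checkpoint.interval_type. "  # trailing space added
        "The folder to store the checkpoints."
    )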

profile_freq = 10
enable_profiling = true
save_traces_folder = "profile_trace"
profile_freq = 100
Contributor:

pls change back to 10 :)

    ) -> None:
        self.folder = folder
        self.states = states
Contributor:

Do we need self.states when checkpointing is not enabled?

Contributor Author (wz337):

Moved it under the if block.

@wz337 merged commit 0f1db56 into pytorch:main on Apr 16, 2024
4 checks passed
lessw2020 pushed a commit that referenced this pull request Apr 18, 2024
… a few config updates (#230)
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
… a few config updates (pytorch#230)
Labels: CLA Signed (managed by the Meta Open Source bot)
4 participants