
dump memory snapshot to analyze OOMs #395

Merged: 9 commits merged into pytorch:main on Jun 19, 2024

Conversation

weifengpy (Contributor) commented Jun 13, 2024

When setting `enable_memory_snapshot = true` in `.toml`:

  • dump memory snapshots in case of OOMs; the output folder is `memory_snapshot/iteration_x_exit`
  • dump regularly according to `profile_freq`; the output folder is `memory_snapshot/iteration_x`
  • existing `.toml` files keep working, since `enable_memory_snapshot=False` by default

The screenshot below is an example of the dump when an OOM happens.

Screenshot 2024-06-12 at 9 26 53 PM
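
For orientation, here is a heavily simplified, hedged sketch of that flow, assuming a CUDA device and the underscored torch.cuda.memory APIs; the helper name and training loop below are illustrative, not the exact torchtitan implementation.

```python
import os
import pickle

import torch

def dump_memory_snapshot(folder: str, step: int, exit_ctx: bool = False) -> None:
    # Write one snapshot per rank into iteration_x or iteration_x_exit.
    dir_name = f"iteration_{step}_exit" if exit_ctx else f"iteration_{step}"
    out_dir = os.path.join(folder, dir_name)
    os.makedirs(out_dir, exist_ok=True)
    rank = int(os.environ.get("RANK", "0"))
    with open(os.path.join(out_dir, f"rank{rank}_memory_snapshot.pickle"), "wb") as f:
        pickle.dump(torch.cuda.memory._snapshot(), f)

def train_step() -> None:
    pass  # stand-in for the real training step

# Start recording allocator history, then dump every profile_freq steps and
# once more if an OOM aborts training.
torch.cuda.memory._record_memory_history(max_entries=100000)
profile_freq, num_steps, step = 10, 20, 0
try:
    while step < num_steps:
        step += 1
        train_step()
        if step % profile_freq == 0:
            dump_memory_snapshot("memory_snapshot", step)
except torch.cuda.OutOfMemoryError:
    dump_memory_snapshot("memory_snapshot", step, exit_ctx=True)
    raise
```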

weifengpy and others added 3 commits June 12, 2024 11:09
@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Jun 13, 2024
@@ -9,6 +9,7 @@ use_for_integration_test = true
enable_profiling = true
save_traces_folder = "profile_trace"
profile_freq = 10
enable_memory_snapshot = false
weifengpy (Contributor, Author) Jun 13, 2024

Existing `.toml` files without `enable_memory_snapshot` still work: the option is read with `getattr(config.profiling, 'enable_memory_snapshot', False)`, so it defaults to False when absent. I am just adding it here so people can start toggling it.
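
For illustration, a tiny self-contained sketch of that fallback, with SimpleNamespace standing in for the parsed profiling config:

```python
from types import SimpleNamespace

# An older profiling config without the new key still resolves to False.
profiling = SimpleNamespace(enable_profiling=True, profile_freq=10)
enable_memory_snapshot = getattr(profiling, "enable_memory_snapshot", False)
print(enable_memory_snapshot)  # False
```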

Contributor

ditto: we should put default option False into config_manager, and remove this option in all the toml config files. Maybe only enable it to True in debug_model.

@@ -15,6 +16,14 @@
# the number of warmup steps before the active step in each profiling cycle
WARMUP = 3

# how much memory allocation/free ops to record in memory snapshots
MEMORY_SNAPSHOT_MAX_ENTRIES = 100000
weifengpy (Contributor, Author) Jun 13, 2024

MEMORY_SNAPSHOT_MAX_ENTRIES controls how large the .pickle file can get. Right now it's 36 MB.
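
For context, a hedged sketch of how such a cap is typically passed to the recorder; the exact call site in this PR may differ:

```python
import torch

# Keep at most this many alloc/free events in the recorded history, which
# bounds the size of the dumped .pickle file (requires a CUDA device).
MEMORY_SNAPSHOT_MAX_ENTRIES = 100000
torch.cuda.memory._record_memory_history(max_entries=MEMORY_SNAPSHOT_MAX_ENTRIES)
```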

with open(
    f"{curr_snapshot_dir}/rank{rank}_memory_snapshot.pickle", "wb"
) as output:
    pickle.dump(torch.cuda.memory._snapshot(), output)
Contributor

Maybe add a threshold so the memory snapshot is only dumped when memory usage exceeds it, to avoid overwhelming amounts of data?

weifengpy (Contributor, Author) Jun 13, 2024

Do you mean a threshold in MB? Right now it's bounded by the number of free/allocate ops via MEMORY_SNAPSHOT_MAX_ENTRIES. For an MB-based threshold, I can google around.

weifengpy (Contributor, Author)

Googled for an MB threshold but did not find anything useful. Currently MEMORY_SNAPSHOT_MAX_ENTRIES=100000 keeps the file size at 36 MB. Let me know if this is still a blocker.
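
For completeness, one possible reading of the threshold idea, not adopted in this PR; the threshold value and variable names are hypothetical:

```python
import torch

# Only dump a snapshot when currently allocated CUDA memory exceeds a cap.
DUMP_THRESHOLD_MIB = 1024  # hypothetical threshold
allocated_mib = torch.cuda.memory_allocated() / (1024 * 1024)
if allocated_mib > DUMP_THRESHOLD_MIB:
    print(f"{allocated_mib:.0f} MiB allocated; a snapshot dump would happen here")
```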

tianyu-l (Contributor) left a comment

This is a great addition to torchtitan! Had some comments on how to structure the configs.

Also, I wonder if it makes sense to have a very short tutorial on how to read/parse the output of memory profiler. Maybe extract part of this tutorial.
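
Until such a tutorial exists, a minimal sketch of inspecting a dump, assuming the private torch.cuda._memory_viz helpers available in recent PyTorch releases; the snapshot path below is hypothetical:

```python
import pickle

from torch.cuda._memory_viz import trace_plot

with open("memory_snapshot/iteration_10/rank0_memory_snapshot.pickle", "rb") as f:
    snapshot = pickle.load(f)

# trace_plot returns a self-contained HTML page; the same .pickle can also be
# dragged into the viewer at https://pytorch.org/memory_viz.
with open("iteration_10_rank0.html", "w") as f:
    f.write(trace_plot(snapshot))
```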

Comment on lines 22 to 25
# default memory snapshot folder
ENABLE_MEMORY_SNAPSHOT_KEY = "enable_memory_snapshot"
MEMORY_SNAPSHOT_FOLDER_KEY = "memory_snapshot_folder"
MEMORY_SNAPSHOT_FOLDER_DEFAULT_VALUE = "memory_snapshot"
Contributor

We should make these into configs. Please refer to how torch_profiler does this part, e.g. put into config_manager.py

weifengpy (Contributor, Author)

Good to know about config_manager.py. I will move the defaults into config_manager.

@@ -9,6 +9,7 @@ use_for_integration_test = true
enable_profiling = true
save_traces_folder = "profile_trace"
profile_freq = 10
enable_memory_snapshot = false
Contributor

ditto: we should put default option False into config_manager, and remove this option in all the toml config files. Maybe only enable it to True in debug_model.

weifengpy (Contributor, Author)

Converting to draft now; will publish again after moving the defaults into config_manager.py. But the current version is good for benchmarking float8 + compile + fsdp2 on MAST.

@weifengpy marked this pull request as draft June 13, 2024 21:34
tianyu-l (Contributor) left a comment

The overhead from torch.profiler is only around the steps where dumping actually happens (warmup steps + actual profiling steps). If _record_memory_history is always enabled for the entire training run, there will constantly be overhead from this memory profiler.

In other words, torch.profiler only profiles one step per freq steps, while MemoryProfiler profiles every step and dumps all freq iterations per freq steps. As a result, adjusting freq only affects how often the snapshots are grouped into one pickle file. If we run a job for 3000 steps, there will be a snapshot for every step, regardless of freq.
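
For illustration, a minimal sketch of that trade-off, assuming the torch.cuda.memory._record_memory_history API where passing enabled=None turns recording off:

```python
import torch

torch.cuda.memory._record_memory_history(max_entries=100000)  # overhead starts here
# ... every training step from this point on is recorded ...
torch.cuda.memory._record_memory_history(enabled=None)        # recording stops here
```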

Comment on lines 108 to 114
if not exit_ctx and self.step_num % self.freq != 0:
    self.step_num += 1
    return
if not exit_ctx:
    curr_step = self.step_num
    self.step_num += 1
    dir_name = f"iteration_{curr_step}"
Contributor

torch.profiler starts from step 0, whereas train.py starts from step 1. In order to make things work as expected, I suggest we do the following, so that if we set profile_freq=10 and run training for 10 steps, there will be memory snapshots for iteration_10 (similar to torch.profiler) and iteration_10_exit. I've tested this offline.

Suggested change
-    if not exit_ctx and self.step_num % self.freq != 0:
-        self.step_num += 1
-        return
-    if not exit_ctx:
-        curr_step = self.step_num
-        self.step_num += 1
-        dir_name = f"iteration_{curr_step}"
+    self.step_num += 1
+    if not exit_ctx and self.step_num % self.freq != 0:
+        return
+    if not exit_ctx:
+        curr_step = self.step_num
+        dir_name = f"iteration_{curr_step}"
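
A tiny standalone illustration of the alignment this suggestion achieves, assuming 1-based steps driven by train.py and profile_freq=10:

```python
# Incrementing before the modulo check makes a 10-step run dump exactly one
# regular snapshot, named iteration_10.
freq, step_num, dumped = 10, 0, []
for _ in range(10):  # train.py drives 10 training steps
    step_num += 1
    if step_num % freq == 0:
        dumped.append(f"iteration_{step_num}")
print(dumped)  # ['iteration_10']
```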

weifengpy (Contributor, Author)

thanks for pointing out the difference. updated accordingly

weifengpy and others added 4 commits June 17, 2024 17:03
@weifengpy marked this pull request as ready for review June 18, 2024 16:14
@weifengpy requested a review from tianyu-l June 18, 2024 16:14
tianyu-l (Contributor) left a comment

great work!! thank you!
please address my nits before merge :)

@@ -9,6 +9,7 @@ use_for_integration_test = true
enable_profiling = true
save_traces_folder = "profile_trace"
profile_freq = 10
enable_memory_snapshot = true
Contributor

nit: let's put the folder here as well to be consistent and informative

weifengpy (Contributor, Author)

added save_memory_snapshot_folder in .toml

    help="Whether to dump memory snapshot",
)
self.parser.add_argument(
    "--profiling.memory_snapshot_folder",
Contributor

nit: please rename it to save_memory_snapshot_folder to be consistent with save_traces_folder and save_tb_folder.

self.parser.add_argument(
    "--profiling.memory_snapshot_folder",
    type=str,
    default="memory_snapshots",
Contributor

nit: let's call it memory_snapshot

weifengpy (Contributor, Author)

> great work!! thank you! please address my nits before merge :)

thanks. will address feedback before merging
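
For reference, a hedged, standalone sketch of the renamed flag once the nits above are addressed; the help text and the surrounding JobConfig wiring are illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--profiling.save_memory_snapshot_folder",
    type=str,
    default="memory_snapshot",
    help="Folder to dump memory snapshots into",
)
args = parser.parse_args([])
print(getattr(args, "profiling.save_memory_snapshot_folder"))  # memory_snapshot
```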

@weifengpy merged commit 8adbfa3 into pytorch:main Jun 19, 2024
5 checks passed
tianyu-l pushed a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
Labels: CLA Signed (This label is managed by the Meta Open Source bot.)

5 participants