Make pp split points optional #604
base: gh/H-Huang/18/base
Conversation
[ghstack-poisoned]
ghstack-source-id: 46a1613fb1921e58be6fce2f85c8718445b24bac Pull Request resolved: #604
[ghstack-poisoned]
ghstack-source-id: ca24cbaab8944cb245931ff9bd6703896a0a91e9 Pull Request resolved: #604
Thanks, this is a life-saver!
It's great to have this! I left some comments on how we should split llama.
Also, apart from this, we still need to finish #450: currently, if one feeds a string to `pipeline_parallel_split_points`, it won't work as expected (a list of strings is necessary).
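A minimal sketch of the normalization mentioned above, with an assumed helper name (`normalize_split_points` is hypothetical, not the PR's actual code): accept either a comma-separated string or a list for `pipeline_parallel_split_points` and always return a list of strings.

```python
def normalize_split_points(value) -> list[str]:
    # Accept "layers.4,layers.8" as well as ["layers.4", "layers.8"].
    if isinstance(value, str):
        return [s.strip() for s in value.split(",") if s.strip()]
    return list(value)
```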
num_layers = model_config.n_layers
if total_stages > num_layers:
    raise ValueError("Total stages cannot be greater than the number of layers")
interval = num_layers // total_stages
# Generate split points
splits = ["layers." + str(i * interval) for i in range(1, total_stages)]
I have some concerns about this auto-splitting plan. For the llama model, we have the `tok_embedding`, final `norm`, and `output` layers, which are not included in `num_layers` here. I understand that they may not necessarily take about the same time as a single `TransformerBlock`, but can we treat `tok_embedding` and the final output as two layers?

E.g. 405B has 126 `TransformerBlock` layers, which, plus embedding and output, gives 128 layers. From the Llama 3.1 paper, page 11:

> To balance the pipeline, we reduce one Transformer layer each from the first and the last stages, respectively. This means that the first model chunk on the first stage has only the embedding, and the last model chunk on the last stage has only output projection and loss calculation.

From empirical tests, this may or may not yield the best throughput or the best memory balance, depending on the specific workload, but the general idea makes sense to me.
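A sketch of the adjustment suggested above, under the assumption that counting `tok_embedding` and `output` as two extra "layers" and shifting the split indices by one is the intended scheme (the function name and exact offset are illustrative, not the PR's final code). The first and last stages then each get one fewer `TransformerBlock`, matching the Llama 3.1 paper's balancing trick.

```python
def generate_split_points(n_layers: int, total_stages: int) -> list[str]:
    # Count tok_embedding and output as two additional layers.
    num_layers = n_layers + 2
    if total_stages > num_layers:
        raise ValueError("Total stages cannot be greater than the number of layers")
    interval = num_layers // total_stages
    # Shift each split index down by one so the first stage absorbs
    # tok_embedding and the last stage absorbs norm + output.
    return ["layers." + str(i * interval - 1) for i in range(1, total_stages)]
```

For 405B with 126 blocks and 16 stages this gives splits at `layers.7`, `layers.15`, ..., `layers.119`: the first stage holds the embedding plus 7 blocks, middle stages hold 8 blocks each, and the last stage holds 7 blocks plus norm and output.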
interval = num_layers // total_stages
# Generate split points
splits = ["layers." + str(i * interval) for i in range(1, total_stages)]
print(splits)
Remove this, or use `logger.info`.
if issubclass(schedule_class, PipelineScheduleSingle):
    num_stages_per_rank = 1
elif issubclass(schedule_class, PipelineScheduleMulti):
    num_stages_per_rank = 2
Out of curiosity: in principle, can the number of stages per rank be higher today? It sounds like setting this higher would further reduce the pipeline bubble, but it requires more layers in the model.
Maybe it would help to add a comment here saying this is just a default choice, not a restriction for manual splits.
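A small sketch of the suggestion above, factoring the snippet into a helper with a clarifying comment. The schedule class names come from the diff; the stub base classes and the function name `default_stages_per_rank` are placeholders standing in for the real pipelining classes.

```python
# Stubs standing in for the actual pipeline schedule base classes.
class PipelineScheduleSingle: ...
class PipelineScheduleMulti: ...

def default_stages_per_rank(schedule_class) -> int:
    # NOTE: this is just a default choice, not a restriction --
    # manual split points may place more stages on a rank.
    if issubclass(schedule_class, PipelineScheduleSingle):
        return 1  # single-stage schedules (e.g. 1F1B)
    elif issubclass(schedule_class, PipelineScheduleMulti):
        return 2  # interleaved schedules use >= 2 stages per rank
    raise ValueError(f"Unsupported schedule class: {schedule_class}")
```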
Stack from ghstack (oldest at bottom):
There is no longer a need to pass in layer names for `pipeline_parallel_split_points`.