Make pp split points optional #604

Open · wants to merge 2 commits into base: gh/H-Huang/18/base

Conversation

@H-Huang (Member) commented Oct 8, 2024

Stack from ghstack (oldest at bottom):

Layer names no longer need to be passed in for pipeline_parallel_split_points.

[ghstack-poisoned]
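A minimal sketch of what this enables (the function and argument names below are illustrative, not the PR's actual API): when no split points are supplied, they are derived from the layer count and the total number of pipeline stages.

# Illustrative sketch only -- resolve_split_points is an assumed name, not the PR's API.
def resolve_split_points(user_split_points, num_layers, total_stages):
    if user_split_points:  # explicitly provided layer names still take priority
        return user_split_points
    if total_stages > num_layers:
        raise ValueError("Total stages cannot be greater than the number of layers")
    interval = num_layers // total_stages
    # otherwise split evenly by layer index
    return ["layers." + str(i * interval) for i in range(1, total_stages)]

# e.g. no user-provided splits, a 32-layer model, 4 pipeline stages
print(resolve_split_points([], 32, 4))  # ['layers.8', 'layers.16', 'layers.24']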
H-Huang added a commit that referenced this pull request Oct 8, 2024
ghstack-source-id: 46a1613fb1921e58be6fce2f85c8718445b24bac
Pull Request resolved: #604
@facebook-github-bot added the CLA Signed label Oct 8, 2024
H-Huang added a commit that referenced this pull request Oct 8, 2024
ghstack-source-id: ca24cbaab8944cb245931ff9bd6703896a0a91e9
Pull Request resolved: #604
@XilunWu (Contributor) left a comment


thx for the life-saver!

@tianyu-l (Contributor) left a comment


It's great to have this! I left some comments on how we should split llama.

Also, apart from this, we still need to finish #450: currently, if one feeds a string to pipeline_parallel_split_points, it won't work as expected (a list of strings is necessary).
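As a rough illustration of the issue tracked in #450 (the helper below is hypothetical, not the actual fix): one way to tolerate both forms is to normalize a comma-separated string into a list of strings before use.

# Hypothetical normalization helper -- not the actual fix for #450.
def normalize_split_points(value):
    if isinstance(value, str):
        # "layers.4,layers.8" -> ["layers.4", "layers.8"]
        return [s.strip() for s in value.split(",") if s.strip()]
    return list(value)

assert normalize_split_points("layers.4,layers.8") == ["layers.4", "layers.8"]
assert normalize_split_points(["layers.4", "layers.8"]) == ["layers.4", "layers.8"]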

Comment on lines +165 to +170
num_layers = model_config.n_layers
if total_stages > num_layers:
    raise ValueError("Total stages cannot be greater than the number of layers")
interval = num_layers // total_stages
# Generate split points
splits = ["layers." + str(i * interval) for i in range(1, total_stages)]

I have some concerns about this auto-splitting plan.
For the llama model, we have the tok_embeddings, final norm, and output layers, which are not included in num_layers here. I understand that they may not take about the same time as a single TransformerBlock, but can we treat tok_embeddings and the final output as two layers?

E.g., 405B has 126 TransformerBlock layers, which together with embedding and output makes 128 layers. From the Llama 3.1 paper, page 11:

To balance the pipeline, we reduce one Transformer layer each from the first and the last stages, respectively. This means that the first model chunk on the first stage has only the embedding, and the last model chunk on the last stage has only output projection and loss calculation.

From empirical tests, this may or may not yield the best throughput or best memory balance, depending on the specific workload, but the general idea makes sense to me.
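A rough sketch of this suggestion, assuming tok_embeddings and output each count as one extra "layer" when spacing the splits (the function name and exact offsets are illustrative, not a proposal for the final code):

# Illustrative only: weight embedding and output as two extra layers.
def generate_balanced_split_points(n_layers, total_stages):
    weighted_layers = n_layers + 2  # + tok_embeddings and output, per the Llama 3.1 recipe
    if total_stages > weighted_layers:
        raise ValueError("Total stages cannot be greater than the weighted number of layers")
    interval = weighted_layers // total_stages
    # shift every split back by one slot because the first stage already
    # "spends" one of its slots on the embedding
    return ["layers." + str(i * interval - 1) for i in range(1, total_stages)]

# 405B: 126 TransformerBlocks over 16 stages -> splits at 7, 15, ..., 119, so the
# first stage holds the embedding + 7 blocks and the last holds 7 blocks + output.
print(generate_balanced_split_points(126, 16))

This reproduces the paper's balancing for the 126-layer / 16-stage case, though as noted above it may not be optimal for every workload.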

interval = num_layers // total_stages
# Generate split points
splits = ["layers." + str(i * interval) for i in range(1, total_stages)]
print(splits)

Remove this, or use logger.info.
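For example (using the standard library logger purely for illustration; in practice this would be the project's existing logger):

import logging

logger = logging.getLogger(__name__)
splits = ["layers.8", "layers.16", "layers.24"]  # placeholder value for illustration
logger.info("Generated pp split points: %s", splits)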

if issubclass(schedule_class, PipelineScheduleSingle):
    num_stages_per_rank = 1
elif issubclass(schedule_class, PipelineScheduleMulti):
    num_stages_per_rank = 2

Curious, in principle: can the number of stages per rank be higher today? It sounds to me that setting this higher would further reduce the pipeline bubble, but it requires more layers in the model.
Maybe it helps if we add a comment here saying this is just a default choice and is not a restriction for manual splits.
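To make that default concrete, a sketch assuming the PipelineScheduleSingle / PipelineScheduleMulti base classes from torch.distributed.pipelining that the snippet above checks against (the helper name and the comment wording are illustrative):

from torch.distributed.pipelining.schedules import (
    PipelineScheduleMulti,
    PipelineScheduleSingle,
)

def default_stages_per_rank(schedule_class):
    # Default choice only -- manual split points may use more stages per rank,
    # provided the model has enough layers to fill them.
    if issubclass(schedule_class, PipelineScheduleSingle):
        return 1  # single-stage schedules, e.g. GPipe / 1F1B
    if issubclass(schedule_class, PipelineScheduleMulti):
        return 2  # multi-stage (interleaved) schedules
    raise ValueError("Unsupported schedule class: " + str(schedule_class))

# total number of pipeline stages = pp degree * stages per rank,
# e.g. pp_degree=4 with an interleaved schedule -> 8 stages overall.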

Labels: CLA Signed
Projects: None yet
4 participants