Make pp split points optional #604
base: gh/H-Huang/18/base
Conversation
[ghstack-poisoned]
ghstack-source-id: 46a1613fb1921e58be6fce2f85c8718445b24bac Pull Request resolved: #604
[ghstack-poisoned]
ghstack-source-id: ca24cbaab8944cb245931ff9bd6703896a0a91e9 Pull Request resolved: #604
Thanks, this is a life-saver!
It's great to have this! I left some comments on how we should split llama.
Also, apart from this, we still need to finish #450: currently, if one feeds a string to `pipeline_parallel_split_points`, it won't work as expected (a list of strings is necessary).
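A minimal sketch of the normalization mentioned above, with an assumed helper name (`normalize_split_points` is hypothetical, not the PR's actual code): accept either a comma-separated string or a list for `pipeline_parallel_split_points` and always return a list of strings.

```python
def normalize_split_points(value) -> list[str]:
    # Accept "layers.4,layers.8" as well as ["layers.4", "layers.8"].
    if isinstance(value, str):
        return [s.strip() for s in value.split(",") if s.strip()]
    return list(value)
```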
num_layers = model_config.n_layers
if total_stages > num_layers:
    raise ValueError("Total stages cannot be greater than the number of layers")
interval = num_layers // total_stages
# Generate split points
splits = ["layers." + str(i * interval) for i in range(1, total_stages)]
I have some concerns about this auto-splitting plan. For the llama model, we have the `tok_embedding`, final `norm`, and `output` layers, which are not included in `num_layers` here. I understand that they may not necessarily take about the same time as a single `TransformerBlock`, but can we treat `tok_embedding` and the final output as two layers?

E.g. 405B has 126 `TransformerBlock` layers, which, plus embedding and output, gives 128 layers. From the Llama 3.1 paper, page 11:

> To balance the pipeline, we reduce one Transformer layer each from the first and the last stages, respectively. This means that the first model chunk on the first stage has only the embedding, and the last model chunk on the last stage has only output projection and loss calculation.

From empirical tests, this may or may not yield the best throughput or the best memory balance, depending on the specific workload, but the general idea makes sense to me.
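A sketch of the adjustment suggested above, under the assumption that counting `tok_embedding` and `output` as two extra "layers" and shifting the split indices by one is the intended scheme (the function name and exact offset are illustrative, not the PR's final code). The first and last stages then each get one fewer `TransformerBlock`, matching the Llama 3.1 paper's balancing trick.

```python
def generate_split_points(n_layers: int, total_stages: int) -> list[str]:
    # Count tok_embedding and output as two additional layers.
    num_layers = n_layers + 2
    if total_stages > num_layers:
        raise ValueError("Total stages cannot be greater than the number of layers")
    interval = num_layers // total_stages
    # Shift each split index down by one so the first stage absorbs
    # tok_embedding and the last stage absorbs norm + output.
    return ["layers." + str(i * interval - 1) for i in range(1, total_stages)]
```

For 405B with 126 blocks and 16 stages this gives splits at `layers.7`, `layers.15`, ..., `layers.119`: the first stage holds the embedding plus 7 blocks, middle stages hold 8 blocks each, and the last stage holds 7 blocks plus norm and output.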
interval = num_layers // total_stages
# Generate split points
splits = ["layers." + str(i * interval) for i in range(1, total_stages)]
print(splits)
Remove this, or use `logger.info`.
if issubclass(schedule_class, PipelineScheduleSingle):
    num_stages_per_rank = 1
elif issubclass(schedule_class, PipelineScheduleMulti):
    num_stages_per_rank = 2
Out of curiosity: in principle, can the number of stages per rank be higher today? It sounds like setting this higher would further reduce the pipeline bubble, but it requires more layers in the model.
Maybe it would help to add a comment here saying this is just a default choice, not a restriction for manual splits.
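A small sketch of the suggestion above, factoring the snippet into a helper with a clarifying comment. The schedule class names come from the diff; the stub base classes and the function name `default_stages_per_rank` are placeholders standing in for the real pipelining classes.

```python
# Stubs standing in for the actual pipeline schedule base classes.
class PipelineScheduleSingle: ...
class PipelineScheduleMulti: ...

def default_stages_per_rank(schedule_class) -> int:
    # NOTE: this is just a default choice, not a restriction --
    # manual split points may place more stages on a rank.
    if issubclass(schedule_class, PipelineScheduleSingle):
        return 1  # single-stage schedules (e.g. 1F1B)
    elif issubclass(schedule_class, PipelineScheduleMulti):
        return 2  # interleaved schedules use >= 2 stages per rank
    raise ValueError(f"Unsupported schedule class: {schedule_class}")
```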
Stack from ghstack (oldest at bottom):
There is no longer a need to pass in layer names for `pipeline_parallel_split_points`.