Retrieving the Trained Model #1094
How can we get our trained model back, as a normal nn.Module, once we have trained it using the pipe object and the GPipe schedule?

Comments
Also interested in this. Did you ever figure it out?
Not as of now.
Sorry for the late reply. In the "Option 2" section you can see that the return object is a [...] or [...].
(Reference: https://pytorch.org/tutorials/beginner/saving_loading_models.html)
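For reference, here is a minimal sketch of the state_dict save/load pattern that tutorial describes, applied to whatever module a pipeline stage holds. The `stage_module` handle and the file name are illustrative stand-ins, not PiPPy API:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the module owned by one pipeline stage;
# with PiPPy this would be the submodule assigned to the local rank.
stage_module = nn.Linear(16, 16)

# Save only the parameters and buffers (the state_dict pattern from the
# referenced tutorial), one file per stage/rank.
torch.save(stage_module.state_dict(), "stage_0.pt")

# Later: rebuild an identically shaped module and load the weights back.
restored = nn.Linear(16, 16)
restored.load_state_dict(torch.load("stage_0.pt"))
restored.eval()
```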
I think the question (at least for me) was whether we could turn the model back into the non-pipelined version for modification and saving.
Hmm, do you mean getting back the full model at the end of training, but before saving the final checkpoint? If we are going to call torch.load later anyway, that would be a good time to glue the model back together, because the only difference at that point is loading from multiple checkpoint files instead of a single one.
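A rough sketch of that "gluing" step, assuming each rank saved its stage as a separate `stage_<i>.pt` file and that the per-stage state_dict keys match the original model's qualified names; both are assumptions to verify for your particular split, not guaranteed behavior:

```python
import glob
import torch
import torch.nn as nn

# Hypothetical original (non-pipelined) model definition.
full_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

# Merge the per-stage checkpoints into one state_dict for the full model.
merged_state = {}
for path in sorted(glob.glob("stage_*.pt")):
    merged_state.update(torch.load(path, map_location="cpu"))

# strict=True surfaces any missing or unexpected keys after the merge.
full_model.load_state_dict(merged_state, strict=True)
```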
OK, so here is what I want to do: obtain the gradients of each layer from each pipeline-stage rank via the pipe object and send them to the CPUs; apply some modifications to the gradients on the CPU; then bring them back to the corresponding ranks of the pipeline stages and update the model with the modified gradients. Is this possible with PiPPy?
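A hedged sketch of that loop for a single rank, using only generic PyTorch. The `stage_module` handle, the stand-in backward pass, and the clipping step are illustrative assumptions; how you obtain the local stage module from the pipe object depends on the PiPPy version:

```python
import torch
import torch.nn as nn

# Illustrative local stage module and optimizer; in practice these would be
# the submodule PiPPy places on this rank and the optimizer over its params.
device = "cuda" if torch.cuda.is_available() else "cpu"
stage_module = nn.Linear(16, 16).to(device)
optimizer = torch.optim.SGD(stage_module.parameters(), lr=0.01)

# Stand-in for running the pipeline schedule so .grad fields are populated.
stage_module(torch.randn(4, 16, device=device)).sum().backward()

# 1) Pull each parameter's gradient to the CPU.
cpu_grads = {
    name: p.grad.detach().to("cpu")
    for name, p in stage_module.named_parameters()
    if p.grad is not None
}

# 2) Modify the gradients on the CPU (example: element-wise clipping).
for name in cpu_grads:
    cpu_grads[name] = cpu_grads[name].clamp(-1.0, 1.0)

# 3) Copy the modified gradients back onto the parameters' devices.
with torch.no_grad():
    for name, p in stage_module.named_parameters():
        if name in cpu_grads:
            p.grad.copy_(cpu_grads[name].to(p.grad.device))

# 4) Apply the update using the modified gradients.
optimizer.step()
optimizer.zero_grad()
```

The same pattern would run on every rank, each operating only on the parameters of its own stage.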