[2D][TP] Enable DDP TP integration with unit test #106583
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106583
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit c1b598f with merge base d8ad748. This comment was automatically generated by Dr. CI and updates every 15 minutes.
[ghstack-poisoned]
ghstack-source-id: 82c02d6cb4119a3eb23fcff7d51740efa9c997bb Pull Request resolved: #106583
_update_model_param(param_list)  # type: ignore[arg-type]

def pre_dp_model_transform(model: nn.Module):
I recommend naming the arg "module" instead of "model" as we push for distributed API composability, since "model" generally refers to the root module, whereas "module" could mean a submodule. In that case, you may also prefer the function name pre_dp_module_transform.
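For illustration only (this is just the rename being suggested, nothing more), the signature would then read:

```python
import torch.nn as nn

def pre_dp_module_transform(module: nn.Module):
    ...
```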
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More broadly, I was wondering: should we build this logic into DDP/FSDP and hide it from the user (avoiding the extra call)?
The natural follow-up question is how users would disable this logic if they are using DTensor in their own way and do not want this conversion done for them. In that case, our "official" TP API parallelize_module() could mark the constructed DTensors specially, and DDP/FSDP would register this special logic only if it detects such marked DTensors in their managed parameters.
What do you think?
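To make the idea concrete, here is a hedged sketch of what that marking could look like; the _tp_managed attribute and both helper functions are hypothetical, not an existing API:

```python
import torch.nn as nn
from torch.distributed._tensor import DTensor

def _mark_tp_dtensors(module: nn.Module) -> None:
    # Hypothetically called at the end of parallelize_module(): tag the
    # DTensors that the TP API itself constructed.
    for param in module.parameters():
        if isinstance(param, DTensor):
            param._tp_managed = True  # hypothetical marker attribute

def _should_register_dtensor_hooks(module: nn.Module) -> bool:
    # DDP/FSDP would register the local<->DTensor conversion logic only if
    # they find parameters carrying the marker.
    return any(
        isinstance(p, DTensor) and getattr(p, "_tp_managed", False)
        for p in module.parameters()
    )
```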
Also, if users want special handling, we could either give them an option to register a customized handler here, or they could choose not to call this API, rather than embedding this into the DDP/FSDP code?
"But this causes issues in state_dict"
Could you guys clarify what the issues with state dict are when using the extensions? (Or point me to the right doc that describes this.)
Actually, I also prefer to put this logic in FSDP and DDP. What I suggested is to implement it with hooks, which is what this PR does. But I think the information should be a state of DDP instead of using _st_info attached to the parameter. Also, registering the hooks should happen inside the constructor of DDP.
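A rough sketch of that alternative, assuming a hypothetical wrapper class and stand-in hook bodies (this is not the actual DDP constructor): the per-parameter DTensor info lives on the DDP object instead of a _st_info attribute on each parameter, and the hooks are registered in the constructor.

```python
import torch.nn as nn

class DDPWithTPState(nn.Module):  # hypothetical stand-in for DDP
    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module
        # DDP-owned state: param name -> DTensor sharding info,
        # instead of attaching _st_info to each parameter.
        self._dtensor_info: dict = {}
        module.register_forward_pre_hook(self._reconstruct_dtensor)
        module.register_forward_hook(self._localize_dtensor)

    def _reconstruct_dtensor(self, module, inputs):
        ...  # rebuild DTensor params from local tensors using self._dtensor_info

    def _localize_dtensor(self, module, inputs, output):
        ...  # swap DTensor params back to local tensors, recording their info

    def forward(self, *args, **kwargs):
        return self.module(*args, **kwargs)
```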
""" | ||
|
||
_localize_dtensor(model, None, None) | ||
model.register_forward_pre_hook(_reconstruct_dtensor) |
How do we expect the hook ordering to compose with other hook-based APIs?
Composing with the composable APIs is out of scope of this PR, but the idea is the same: for replicate, TP needs to convert local tensors to DTensors before forward begins and do the reverse after the forward. For FSDP, the hooks are very complicated; I have not thought more about that, but it follows what we are doing in the extension.
To give an example of what I am wondering: how would this compose with a hook-based activation checkpointing API? Should this registered hook come before AC or after AC? Are we making this registration order clear to the user?
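For reference, a tiny illustration of the default ordering (to my understanding, forward pre-hooks fire in registration order, so whichever of the DTensor-reconstruction hook and an AC wrapper's hook is registered first runs first):

```python
import torch
import torch.nn as nn

m = nn.Linear(4, 4)
order = []
# Hooks return None, so they only record when they ran.
m.register_forward_pre_hook(lambda mod, inp: order.append("registered first"))
m.register_forward_pre_hook(lambda mod, inp: order.append("registered second"))
m(torch.randn(2, 4))
print(order)  # ['registered first', 'registered second']
```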
Reconstruct DTensor parameters from local tensors
"""
param_list = []
for name, t in model.named_parameters():
If we have DDP above FSDP, then will this try to convert all FSDP parameters to DTensor as well?
No, we only convert parameters which are DTensors.
What I meant is: would this convert DTensors under an FSDP-managed module, not a DDP-managed module?
For now, no. We hope that down the road we can merge FSDP-managed modules into this API, too.
My point is that model.named_parameters() recurses into submodules, which may be managed by FSDP. There is no check against that to stop the recursion. This means that if there is a DDP module above FSDP modules, then this will convert the FSDP-managed DTensors to local tensors too. Is that the desired behavior?
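One way this could be addressed, sketched under the assumption that the DDP-level transform should skip FSDP-managed subtrees (the traversal helper below is hypothetical, not part of this PR):

```python
import torch.nn as nn
from torch.distributed._tensor import DTensor
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def _iter_tp_dtensor_params(module: nn.Module, prefix: str = ""):
    # Stop descending when we hit an FSDP-managed subtree, so a DDP wrapper
    # above FSDP does not touch FSDP-managed DTensors.
    if isinstance(module, FSDP):
        return
    for name, param in module.named_parameters(recurse=False):
        if isinstance(param, DTensor):
            yield f"{prefix}{name}", param
    for child_name, child in module.named_children():
        yield from _iter_tp_dtensor_params(child, f"{prefix}{child_name}.")
```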
[ghstack-poisoned]
ghstack-source-id: 06dced7618a6887b143210bd59bf69ad20077d16 Pull Request resolved: #106583
Bunch of trivial doc fixes.
My concerns with this PR are the following:
1. No error checking at all. We don't fail if we run into an FSDP module, for example. Do we support all forms of DTensor sharding?
2. No module traversal caching. We should cache the per-param sharding_info and use that to flatten/unflatten the DTensors. This would be faster and more composable, since it would respect the model decisions at the time we called pre_dp_module_transform (see the sketch below).
3. It's not explicit about how it leaves the model outside of fwd/bwd. This is relevant if we apply more parallelization transforms after it.
Should we have a broader API discussion? Maybe we can land the transform function as private first (i.e. with a leading underscore)?
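A minimal sketch of the caching idea from point 2, assuming a hypothetical ShardingInfo record and helper name (not what this PR implements): record each DTensor parameter's spec once, at pre_dp_module_transform time, and let the hooks reuse the cache instead of re-traversing the module tree every iteration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

import torch.nn as nn
from torch.distributed._tensor import DTensor

@dataclass
class ShardingInfo:  # hypothetical record of what flatten/unflatten needs
    device_mesh: object
    placements: Tuple

def build_sharding_cache(module: nn.Module) -> Dict[str, ShardingInfo]:
    # Built once when the transform is applied; the forward pre/post hooks
    # would read this cache to unflatten/flatten the DTensor parameters.
    cache: Dict[str, ShardingInfo] = {}
    for name, param in module.named_parameters():
        if isinstance(param, DTensor):
            cache[name] = ShardingInfo(param.device_mesh, tuple(param.placements))
    return cache
```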
@awgu I agree that we want a broader API discussion, and there is indeed one ongoing right now among @wanchaol, @fegin, @wz337, @rohan-varma, and you to eventually leverage DeviceMesh for a unified solution for both [functional, composable] × [DDP, FSDP]. Since we already made the 2d_fsdp API public and TP is still in prototype (no backward compatibility guarantee), I think it's still OK to make this API public for now.
@kumpera sure. For 1 and 2, I will send follow-up PRs to address them. For 3, I think we also need a hook for
Accepting to unblock. I agree with @awgu: we should have a broader discussion about how DeviceMesh is used for DDP and FSDP. Then we may want to move the implementation into DDP.
[ghstack-poisoned]
ghstack-source-id: a64981ad1765d6f4332c84bfd456a6d089a28292 Pull Request resolved: #106583
nice work!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #107397 Approved by: https://github.com/wanchaol ghstack dependencies: #107313, #106583
Stack from ghstack (oldest at bottom):