Enable autograd graph to propagate after multi-device syncing (for loss functions in ddp) #2754

Open

cw-tan wants to merge 2 commits into master from all_gather_ad
Conversation

@cw-tan commented Sep 17, 2024

What does this PR do?

This is the single-line enhancement proposed in #2745: enable propagation of the autograd graph after the all_gather operation. This is useful for syncing loss functions in a ddp setting.
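
For context, a minimal sketch of the idea (illustrative only; `gather_with_grad` is a made-up helper name, not the torchmetrics API): the tensors filled in by `all_gather` carry no autograd history, so the original local tensor is put back into its slot of the gathered list.

```python
import torch
import torch.distributed as dist

def gather_with_grad(tensor: torch.Tensor, group=None) -> list:
    """Gather `tensor` from all ranks while keeping the local autograd graph."""
    world_size = dist.get_world_size(group)
    gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor, group=group)
    # the single-line idea: the gathered copies are detached, so re-insert the
    # original local tensor at this rank's slot to keep gradients flowing
    gathered[dist.get_rank(group)] = tensor
    return gathered
```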

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃


📚 Documentation preview 📚: https://torchmetrics--2754.org.readthedocs.build/en/2754/

@Borda (Member) commented Sep 17, 2024

That sounds good to me, but can we add a test for this enhancement?

@cw-tan (Author) commented Sep 17, 2024

That sounds good to me, but can we add a test for this enhancement?

Thanks for the prompt response @Borda.

I'm thinking that _test_ddp_gather_uneven_tensors (here) and _test_ddp_gather_uneven_tensors_multidim (here) in tests/unittests/bases/test_ddp.py already cover the correctness of gather_all_tensors. I'm not sure what other ddp tests there are, but those tests should tell us whether my change breaks existing functionality. Let me know if you had something else in mind.

I can add a unittest in tests/unittests/bases/test_ddp.py that gives a tensor with requires_grad to gather_all_tensors, computes some scalar from the gathered tensors (a proxy for a loss), and then computes grads two ways (one going through the all_gather, one that doesn't) and compares them. This would test that the change achieves the desired effect. How does that sound?
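
For concreteness, a rough sketch of what I have in mind (the test name is a placeholder, and it assumes the process-group setup already used by the existing tests in tests/unittests/bases/test_ddp.py):

```python
import torch
from torchmetrics.utilities.distributed import gather_all_tensors

def _test_gather_all_tensors_autograd(rank: int, worldsize: int) -> None:
    # assumes the ddp process group is already initialized, as in the existing tests
    tensor = torch.ones(5, requires_grad=True)

    # path 1: loss computed from the gathered tensors (goes through all_gather)
    gathered = gather_all_tensors(tensor)
    loss_gathered = torch.stack(gathered).sum()
    (grad_gathered,) = torch.autograd.grad(loss_gathered, tensor)

    # path 2: the same local contribution computed without any all_gather
    loss_local = tensor.sum()
    (grad_local,) = torch.autograd.grad(loss_local, tensor)

    assert gathered[rank].requires_grad
    assert torch.allclose(grad_gathered, grad_local)
```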

codecov bot commented Sep 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69%. Comparing base (748caee) to head (af23080).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #2754    +/-   ##
=======================================
- Coverage      69%     69%    -0%     
=======================================
  Files         329     316    -13     
  Lines       18077   17914   -163     
=======================================
- Hits        12496   12336   -160     
+ Misses       5581    5578     -3     

@Borda (Member) commented Sep 17, 2024

I can add a unittest in tests/unittests/bases/test_ddp.py that gives a tensor with requires_grad to gather_all_tensors, computes some scalar from the gathered tensors (a proxy for a loss), and then computes grads two ways (one going through the all_gather, one that doesn't) and compares them. This would test that the change achieves the desired effect. How does that sound?

yeah, that sounds good to me :)

@Borda added the enhancement (New feature or request) label on Sep 17, 2024
@cw-tan force-pushed the all_gather_ad branch 3 times, most recently from 1ba6fb3 to 6598ab8, on September 18, 2024 at 00:40
@cw-tan (Author) commented Sep 18, 2024

Update: to accommodate both cases, where tensors from different ranks have the same shape and where they have different shapes, the line that puts the original tensor (holding the AD graph) back into the gathered list was added in two places in the code.
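
Schematically, the uneven-shape branch now looks something like this (paraphrased sketch, not the exact torchmetrics source; the helper name and argument handling are made up):

```python
import torch
import torch.distributed as dist

def _gather_uneven(result: torch.Tensor, group=None) -> list:
    world_size = dist.get_world_size(group)

    # exchange the per-rank shapes so every rank knows the maximum size
    local_size = torch.tensor(result.shape)
    all_sizes = [torch.zeros_like(local_size) for _ in range(world_size)]
    dist.all_gather(all_sizes, local_size, group=group)
    max_size = torch.stack(all_sizes).max(dim=0).values

    # pad the local tensor up to the maximum size before gathering
    pad = []
    for dim in range(len(result.shape) - 1, -1, -1):
        pad += [0, int(max_size[dim] - result.shape[dim])]
    padded = torch.nn.functional.pad(result, pad)

    gathered = [torch.zeros_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded, group=group)
    # slice every gathered tensor back to its true shape
    gathered = [
        g[tuple(slice(0, int(d)) for d in all_sizes[i])] for i, g in enumerate(gathered)
    ]
    # second place where the local tensor is re-inserted, so the AD graph survives
    gathered[dist.get_rank(group)] = result
    return gathered
```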

Because of the two cases, I wrote two unittests, one for each. Interestingly, both pass on 2.X stable, but on 1.X LTS the "same shape" test passes while the "different shape" test fails, and on 1.10 (oldest) the "different shape" test passes while the "same shape" test fails 😅. I'll double-check for bugs, but the actual code change is just two lines (and all other tests pass, so existing functionality still works), and the unittests are pretty short. The fact that whether the unittests pass depends on the torch version suggests this might be a torch versioning issue, maybe to do with ddp behavior? Any thoughts, @Borda?

@Borda (Member) commented Sep 19, 2024

I wrote two unittests, one for each. Interestingly, both pass on 2.X stable, but on 1.X LTS the "same shape" test passes while the "different shape" test fails, and on 1.10 (oldest) the "different shape" test passes while the "same shape" test fails 😅.

that is strange and worth some more investigation...
cc: @SkafteNicki
