[Core][Distributed] use device group for all broadcast #5320

Closed

youkaichao
Member

A fix for #4444.

That PR was originally tested on A100, where it showed some speedup.

However, it later appeared to cause a slowdown on H100 and other machines. The hypothesis is that gloo performs poorly on these high-end machines, while NVLink performs better.

Until we find a good solution, we can simply use the device group for all broadcasts.

TODO:

Investigate whether it is possible and beneficial to communicate only CPU data, using a mechanism such as a message queue.
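
For illustration only, the message-queue idea in the TODO could look roughly like the sketch below. It is not code from this PR or from vLLM: threads stand in for worker processes, `queue.Queue` stands in for whatever cross-process queue would actually be used, and the names `broadcast`, `worker`, `run_demo`, and the metadata fields are all hypothetical.

```python
import queue
import threading


def broadcast(queues, obj):
    """Driver side: put the same CPU-side object on every worker's queue."""
    for q in queues:
        q.put(obj)


def worker(rank, q, results):
    """Worker side: block until the driver's message arrives, then record it."""
    results[rank] = q.get()


def run_demo(num_workers=3):
    # One queue per worker, so every worker receives its own copy of the message.
    queues = [queue.Queue() for _ in range(num_workers)]
    results = {}
    threads = [
        threading.Thread(target=worker, args=(rank, q, results))
        for rank, q in enumerate(queues)
    ]
    for t in threads:
        t.start()
    # Hypothetical metadata payload; the field names are made up for illustration.
    metadata = {"num_seqs": 4, "block_size": 16}
    broadcast(queues, metadata)
    for t in threads:
        t.join()
    return results
```

Calling `run_demo(3)` returns a dict mapping each worker rank to the metadata it received. Whether a real cross-process queue would beat a gloo or NCCL broadcast for payloads of this size is exactly what the TODO leaves open.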

@WoosukKwon
Collaborator

@youkaichao Could you add some performance numbers on H100? I'm wondering how this affects the performance.

@youkaichao
Member Author

See https://docs.google.com/spreadsheets/d/1c9xgR0fGvm6SROfk7vrjwOZdYnKQk9oOafWK4_KgOyo/edit#gid=593626425 for reference. In particular, check the gloo and NCCL sections around byte sizes of 1k to 2k. That is roughly the size of the data we broadcast twice.

@youkaichao
Member Author

Closing, as #5399 will be better.

@youkaichao youkaichao closed this Jun 15, 2024
@youkaichao youkaichao deleted the change_broadcast_group branch June 15, 2024 18:11