Support multi-host TPUs #1337
Comments
The most promising solution at the moment seems to be the instance-per-TPU-device model. Provisioning a multi-device TPU Pod creates an instance for each TPU device. For example, provisioning TPU v2-32 would create four TPU v2 instances. To support multi-node TPU tasks, dstack can determine Implementation details:
This issue is stale because it has been open for 30 days with no activity.
#1323 added support for single-device TPU Pods. Multi-device TPU Pods are not yet supported because running multi-node tasks on them may require changes to dstack.
Currently, dstack runs the jobs of a multi-node task on different instances. To run multi-node tasks on TPU Pods, we can create an instance for each device in the Pod. The possible downside is a suboptimal Pod management UX: users won't see TPU Pods in pools, only each Pod's devices listed as separate instances. This can be mitigated by introducing a cluster concept to dstack.
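To make the instance-per-device idea concrete, here is a rough sketch of how a Pod could be split into per-device instances, with one job of a multi-node task placed on each. Everything in it is hypothetical: `TpuInstance`, `split_pod_into_instances`, and `assign_jobs` are illustrative names, not dstack APIs, and the 8-cores-per-device ratio is only inferred from the v2-32 example in this thread (it may not hold for other TPU generations).

```python
from dataclasses import dataclass

# Assumption: 8 cores per TPU device, inferred from "v2-32 -> four instances".
CORES_PER_DEVICE = 8


@dataclass(frozen=True)
class TpuInstance:
    """Hypothetical per-device instance record (not a dstack model)."""
    pod_name: str
    worker_index: int
    accelerator: str


def split_pod_into_instances(pod_name: str, accelerator_type: str) -> list:
    """Return one TpuInstance per device of a Pod such as 'v2-32'."""
    version, cores_str = accelerator_type.rsplit("-", 1)
    cores = int(cores_str)
    if cores <= 0 or cores % CORES_PER_DEVICE != 0:
        raise ValueError(f"unexpected accelerator type: {accelerator_type}")
    num_devices = cores // CORES_PER_DEVICE
    return [
        TpuInstance(pod_name, index, f"{version}-{CORES_PER_DEVICE}")
        for index in range(num_devices)
    ]


def assign_jobs(instances: list) -> dict:
    """Map job number -> instance, one job of the multi-node task per device."""
    return dict(enumerate(instances))


if __name__ == "__main__":
    for job_num, instance in assign_jobs(
        split_pod_into_instances("my-pod", "v2-32")
    ).items():
        print(job_num, instance)
```

A cluster concept, as suggested above, could then group the instances that share a `pod_name` so they are shown to users as a single Pod.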
Another solution would be to have one InstanceModel per TPU Pod but make it possible to run multiple jobs on such an instance simultaneously. This would require no changes to the dstack interface but may lead to significant internal refactoring.
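For comparison, the one-instance-per-Pod alternative could look roughly like the sketch below, where a single record tracks which job occupies which device. `PodInstance` and its fields are hypothetical and are not dstack's actual `InstanceModel`; the sketch only illustrates the bookkeeping that running several jobs on one instance would need.

```python
from dataclasses import dataclass, field


@dataclass
class PodInstance:
    """Hypothetical single record for a whole TPU Pod (not dstack's InstanceModel)."""
    name: str
    num_devices: int
    # device index -> id of the job currently running on that device
    jobs_by_device: dict = field(default_factory=dict)

    def allocate(self, job_id: str) -> int:
        """Place a job on the first free device and return the device index."""
        for device in range(self.num_devices):
            if device not in self.jobs_by_device:
                self.jobs_by_device[device] = job_id
                return device
        raise RuntimeError(f"no free device on pod {self.name}")

    def release(self, job_id: str) -> None:
        """Free every device held by the given job."""
        for device, owner in list(self.jobs_by_device.items()):
            if owner == job_id:
                del self.jobs_by_device[device]


if __name__ == "__main__":
    pod = PodInstance(name="my-pod", num_devices=4)
    for n in range(4):
        print(f"job-{n} -> device {pod.allocate(f'job-{n}')}")
```

The trade-off described above shows up here: the user-facing view stays a single instance, but any scheduling code that assumes one job per instance would have to learn about per-device slots.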