Support multi-host TPUs #1337

Open
Tracked by #1782
r4victor opened this issue Jun 14, 2024 · 2 comments

@r4victor
Collaborator

#1323 added support for single-device TPU Pods. Multi-device TPU Pods are not supported yet because running multi-node tasks on them may require changes to dstack.

Currently, dstack runs the jobs of a multi-node task on different instances. To run multi-node tasks on TPU Pods, we can create an instance for each device in the Pod. The possible downside is that the Pod management UX will be suboptimal: users won't see TPU Pods in pools; instead, they will see every device of a TPU Pod as a separate instance. This can be mitigated by introducing a cluster concept to dstack.

Another solution would be to have one InstanceModel per TPU Pod but make it possible to run multiple jobs on such an instance simultaneously. This would require no changes to the dstack interface but may lead to significant internal refactoring.

@r4victor r4victor added the ux label Jun 14, 2024
@r4victor
Collaborator Author

r4victor commented Jul 3, 2024

The most promising solution at the moment seems to be the instance-per-TPU-device model: provisioning a multi-device TPU Pod creates an instance for each TPU device. For example, provisioning a TPU v2-32 would create four TPU v2 instances. dstack pool ps shows each TPU device as a separate instance, but instances can be grouped, e.g. by TPU Pod name. Users delete a TPU Pod by specifying its name or the name of any instance in the Pod.
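
A minimal sketch of the grouping and deletion behavior described above, assuming each pool instance records the name of the TPU Pod it belongs to. `PoolInstance`, `group_by_pod`, and `resolve_pod_instances` are hypothetical names used for illustration only; they are not part of dstack.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PoolInstance:
    # Hypothetical stand-in for a pool instance record (not dstack's InstanceModel).
    name: str
    tpu_pod: Optional[str] = None  # None for instances that are not part of a TPU Pod


def group_by_pod(instances: List[PoolInstance]) -> Dict[Optional[str], List[PoolInstance]]:
    """Group instances by TPU Pod name, e.g. for a grouped `pool ps` listing."""
    groups: Dict[Optional[str], List[PoolInstance]] = defaultdict(list)
    for inst in instances:
        groups[inst.tpu_pod].append(inst)
    return dict(groups)


def resolve_pod_instances(instances: List[PoolInstance], name: str) -> List[PoolInstance]:
    """Return the instances to delete when `name` is a Pod name or the name
    of any instance in a Pod."""
    pod_names = {i.tpu_pod for i in instances if i.tpu_pod is not None}
    if name in pod_names:
        pod = name
    else:
        # Look up the Pod of the instance with this name, if any.
        pod = next((i.tpu_pod for i in instances if i.name == name), None)
    if pod is None:
        # Not part of a Pod: only the exactly matching instance is affected.
        return [i for i in instances if i.name == name]
    return [i for i in instances if i.tpu_pod == pod]
```

With this shape, deleting by `my-pod` or by `my-pod-2` selects the same four instances, while a regular instance is still deleted on its own.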

To support multi-node TPU tasks, dstack can determine the number of nodes automatically from the number of devices in the TPU Pod. For example, if a user specifies tpu-v2-32, dstack will run four jobs. An alternative would be to ask for a TPU type like tpu-v2 and determine the number of cores from the number of jobs. The downside of the latter is that users won't be able to specify an arbitrary number of nodes, so they'll need to calculate it based on the TPU configuration they want to run.
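
The two options above can be sketched as a pair of conversions. This assumes 8 cores per TPU device (host), which is consistent with the tpu-v2-32 example producing four jobs but may not hold for every TPU generation. The function names and the name pattern are illustrative assumptions, not dstack code.

```python
import re

# Assumption: each TPU device (host) exposes 8 cores, as in the v2-32 -> 4 jobs example.
CORES_PER_DEVICE = 8


def nodes_for_tpu(tpu_name: str, cores_per_device: int = CORES_PER_DEVICE) -> int:
    """Option 1: derive the number of jobs from a full TPU name such as "tpu-v2-32"."""
    match = re.fullmatch(r"(?:tpu-)?(v\d+[a-z]*)-(\d+)", tpu_name)
    if match is None:
        raise ValueError(f"Unrecognized TPU name: {tpu_name!r}")
    cores = int(match.group(2))
    # A configuration smaller than one device still needs a single job.
    return max(1, cores // cores_per_device)


def cores_for_nodes(nodes: int, cores_per_device: int = CORES_PER_DEVICE) -> int:
    """Option 2: derive the core count from a TPU type like "tpu-v2" and a job count."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return nodes * cores_per_device
```

Under the second option, a user who wants a v2-32 has to work out `nodes: 4` themselves, which is the calculation the first option avoids.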

Implementation details:

  • Compute.create_instance() should be able to return List[JobProvisioningData] so that it can return provisioning data for each Pod device.
  • For every JobProvisioningData, an InstanceModel needs to be created. The master job will occupy the newly provisioned device 0 instance. Other jobs will wait for the master job to be provisioned and then occupy idle device instances from the pool.
  • Instances of a Pod must be grouped together. This can probably be done by introducing an InstanceGroupModel with different types, including a "tpu_pod" type.
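
A rough sketch of how these three points could fit together. The classes below are simplified stand-ins that reuse the names from the list above; their fields, and the helper functions `create_pod_instances`, `register_pod`, and `assign_jobs`, are assumptions made for illustration and do not reflect dstack's actual models.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobProvisioningData:
    # Simplified stand-in; the fields are assumptions for this sketch.
    hostname: str
    device_index: int


@dataclass
class InstanceModel:
    name: str
    provisioning_data: JobProvisioningData
    job_num: Optional[int] = None  # None means the instance is idle


@dataclass
class InstanceGroupModel:
    name: str
    type: str  # e.g. "tpu_pod"
    instances: List[InstanceModel] = field(default_factory=list)


def create_pod_instances(hostnames: List[str]) -> List[JobProvisioningData]:
    """Stand-in for Compute.create_instance() returning one entry per Pod device."""
    return [JobProvisioningData(hostname=h, device_index=i) for i, h in enumerate(hostnames)]


def register_pod(pod_name: str, provisioning: List[JobProvisioningData]) -> InstanceGroupModel:
    """Create one InstanceModel per JobProvisioningData and group them as a TPU Pod."""
    group = InstanceGroupModel(name=pod_name, type="tpu_pod")
    for jpd in provisioning:
        group.instances.append(
            InstanceModel(name=f"{pod_name}-{jpd.device_index}", provisioning_data=jpd)
        )
    return group


def assign_jobs(group: InstanceGroupModel, num_jobs: int) -> None:
    """The master job (job 0) takes device 0; other jobs take idle instances."""
    if num_jobs > len(group.instances):
        raise ValueError("more jobs than devices in the Pod")
    master = next(i for i in group.instances if i.provisioning_data.device_index == 0)
    master.job_num = 0
    idle = [i for i in group.instances if i.job_num is None]
    for job_num, inst in zip(range(1, num_jobs), idle):
        inst.job_num = job_num
```

In a real implementation the non-master jobs would pick up idle instances asynchronously after the master job is provisioned; the loop here only shows the resulting assignment.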

@peterschmidt85
Contributor

This issue is stale because it has been open for 30 days with no activity.

@peterschmidt85 peterschmidt85 mentioned this issue Aug 22, 2024
41 tasks
@peterschmidt85 peterschmidt85 changed the title from "Support multi-device TPU Pods" to "Support multi-host TPUs" Sep 10, 2024
@peterschmidt85 peterschmidt85 mentioned this issue Oct 3, 2024
49 tasks