Support multi-host TPUs #1337

r4victor · 2024-06-14T05:42:27Z

#1323 added single-device TPU Pods support. Multi-device TPU Pods have not been supported because running multi-node tasks on them may require changes to dstack.

Currently, dstack runs the jobs of a multi-node task on different instances. To run multi-node tasks on TPU Pods, we can create an instance for each device in the Pod. The possible downside is that the Pod management UX will be suboptimal: users won't see TPU Pods in pools; instead, they'll see each device of a TPU Pod as a separate instance. This can be mitigated by introducing a cluster concept to dstack.

Another solution would be to have one InstanceModel per TPU Pod but make it possible to run multiple jobs on such an instance simultaneously. This would require no changes to the dstack interface but may lead to significant internal refactoring.

r4victor · 2024-07-03T12:09:50Z

The most promising solution at the moment seems to be the instance-per-TPU-device model. Provisioning a multi-device TPU Pod creates an instance for each TPU device. For example, provisioning TPU v2-32 would create four TPU v2 instances. dstack pool ps shows each TPU device as a separate instance, but instances can be grouped, e.g. by TPU Pod name. Users delete a TPU Pod by specifying its name or the name of any instance in the Pod.
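A rough sketch of the grouping and delete-by-name behavior described above. The Instance record and both helper functions are hypothetical stand-ins for illustration, not existing dstack code.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Instance(NamedTuple):
    """Hypothetical pool entry: one per TPU device."""
    name: str  # instance name shown in the pool
    pod: str   # name of the TPU Pod the device belongs to


def group_by_pod(instances: List[Instance]) -> Dict[str, List[str]]:
    """Group instance names by TPU Pod name, e.g. for display in the pool listing."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for inst in instances:
        groups[inst.pod].append(inst.name)
    return dict(groups)


def resolve_pod(instances: List[Instance], name: str) -> str:
    """Return the Pod to delete, given either the Pod name or any of its instance names."""
    for inst in instances:
        if name in (inst.pod, inst.name):
            return inst.pod
    raise KeyError(f"no TPU Pod or instance named {name!r}")


pool = [Instance(f"my-pod-{i}", "my-pod") for i in range(4)]
groups = group_by_pod(pool)
target = resolve_pod(pool, "my-pod-2")
```

Here resolve_pod(pool, "my-pod") and resolve_pod(pool, "my-pod-2") both resolve to the same Pod, so either name could be used to delete it.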

To support multi-node TPU tasks, dstack can determine the number of nodes automatically from the number of devices in the TPU Pod. For example, if a user specifies tpu-v2-32, dstack will run four jobs. An alternative would be to ask for a TPU type like tpu-v2 and determine the number of cores from the number of jobs. The downside of the latter is that users won't be able to specify an arbitrary number in nodes, so they'll need to calculate it based on the TPU configuration they want to run.
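To make the two options concrete, here is a minimal sketch of the arithmetic in both directions. It assumes 8 cores per TPU device, which matches the v2-32 to four instances example above; other TPU generations may need a different value. The function names are made up for illustration.

```python
import re
from typing import Tuple

# Assumption: 8 cores per TPU device (host), consistent with v2-32 -> 4 jobs.
CORES_PER_DEVICE = 8


def parse_tpu_type(name: str) -> Tuple[str, int]:
    """Split a name like 'tpu-v2-32' or 'v2-32' into ('v2', 32)."""
    m = re.fullmatch(r"(?:tpu-)?(v\d+[a-z]*)-(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized TPU type: {name!r}")
    return m.group(1), int(m.group(2))


def nodes_for_tpu(name: str) -> int:
    """Option 1: derive the number of jobs from the full TPU type."""
    _, cores = parse_tpu_type(name)
    return max(1, cores // CORES_PER_DEVICE)


def tpu_type_for_nodes(version: str, nodes: int) -> str:
    """Option 2: derive the full TPU type from a version and the number of jobs."""
    return f"{version}-{nodes * CORES_PER_DEVICE}"


jobs = nodes_for_tpu("tpu-v2-32")
tpu = tpu_type_for_nodes("v2", 4)
```

With option 2, a user wanting a 32-core configuration would have to work out nodes: 4 themselves, which is the downside mentioned above.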

Implementation details:

Compute.create_instance() should be able to return List[JobProvisioningData] so that it can return provisioning data for each Pod device.
For every JobProvisioningData, an InstanceModel needs to be created. The master job will occupy the newly provisioned device 0 instance. Other jobs will wait for the master job to be provisioned and then occupy idle device instances from the pool.
Instances of a Pod must be grouped together. This can probably be done by introducing an InstanceGroupModel with different types, including a "tpu_pod" type.
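The flow above could look roughly like the following sketch. The dataclasses are simplified stand-ins that only borrow the names JobProvisioningData, InstanceModel and InstanceGroupModel; their fields, the free function create_instance(), and the helper functions are assumptions for illustration and not dstack's actual models or signatures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobProvisioningData:  # simplified stand-in, fields are assumptions
    hostname: str
    device_index: int


@dataclass
class InstanceModel:  # simplified stand-in
    name: str
    provisioning_data: JobProvisioningData
    job_name: Optional[str] = None  # None means the instance is idle


@dataclass
class InstanceGroupModel:  # proposed model; the shape is an assumption
    name: str
    type: str
    instances: List[InstanceModel] = field(default_factory=list)


def create_instance(pod_name: str, num_devices: int) -> List[JobProvisioningData]:
    """Return provisioning data for every device of the Pod (no real provisioning)."""
    return [JobProvisioningData(f"{pod_name}-{i}", i) for i in range(num_devices)]


def register_pod(pod_name: str, data: List[JobProvisioningData]) -> InstanceGroupModel:
    """Create one InstanceModel per JobProvisioningData, grouped as a "tpu_pod"."""
    group = InstanceGroupModel(name=pod_name, type="tpu_pod")
    for jpd in data:
        group.instances.append(InstanceModel(f"{pod_name}-{jpd.device_index}", jpd))
    return group


def assign_jobs(group: InstanceGroupModel, job_names: List[str]) -> None:
    """The master job takes device 0; the other jobs take the remaining idle instances."""
    ordered = sorted(group.instances, key=lambda i: i.provisioning_data.device_index)
    master, *workers = job_names
    ordered[0].job_name = master
    for job in workers:
        idle = next((i for i in ordered if i.job_name is None), None)
        if idle is None:
            raise RuntimeError(f"no idle instance left for {job}")
        idle.job_name = job


group = register_pod("my-pod", create_instance("my-pod", 4))
assign_jobs(group, ["job-0", "job-1", "job-2", "job-3"])
```

In a real implementation the worker jobs would wait asynchronously for the master job's provisioning to finish; the sketch assigns them in one pass only to show the resulting mapping.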

peterschmidt85 · 2024-08-03T01:49:12Z

This issue is stale because it has been open for 30 days with no activity.

r4victor added the ux label Jun 14, 2024

peterschmidt85 mentioned this issue Jun 24, 2024

[Roadmap] Q3 2024 #1350

Open

42 tasks

peterschmidt85 added the stale label Aug 3, 2024

peterschmidt85 mentioned this issue Aug 22, 2024

[Roadmap] Q2 2024 #1116

Closed

41 tasks

peterschmidt85 removed the stale label Aug 29, 2024

peterschmidt85 changed the title ~~Support multi-device TPU Pods~~ Support multi-host TPUs Sep 10, 2024

peterschmidt85 mentioned this issue Oct 3, 2024

[Roadmap] Q4 2024 #1782

Open

49 tasks
