0.18.4
Google Cloud TPU
This update introduces initial support for Google Cloud TPU.
To request a TPU, specify the TPU architecture prefixed by tpu- in the gpu property under resources:
type: task
python: "3.11"
commands:
  - pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
  - git clone --recursive https://github.com/pytorch/xla.git
  - python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
resources:
  gpu: tpu-v2-8
Important
Currently, only TPU configurations with exactly 8 cores can be requested, so only workloads that run on a single TPU device are supported. Support for multiple TPU devices is coming soon.
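To make the naming rule and the 8-core limit concrete, here is a minimal sketch of a check you could run on a gpu value before submitting a task. The helper parse_tpu_spec is hypothetical and not part of dstack. The pattern tpu-<version>-<cores> is an assumption inferred from the tpu-v2-8 example above, not a documented grammar.

```python
import re

# Hypothetical helper, not part of dstack. The "tpu-<version>-<cores>" shape
# is an assumption inferred from the "tpu-v2-8" example in these notes.
_TPU_PATTERN = re.compile(r"^tpu-(?P<version>v[0-9a-z]+)-(?P<cores>\d+)$")


def parse_tpu_spec(gpu: str) -> tuple[str, int]:
    """Split a gpu value such as "tpu-v2-8" into (version, cores)."""
    match = _TPU_PATTERN.match(gpu)
    if match is None:
        raise ValueError(f"not a TPU spec: {gpu!r}")
    version, cores = match["version"], int(match["cores"])
    # This release only accepts single-device TPUs, i.e. exactly 8 cores.
    if cores != 8:
        raise ValueError(f"only 8-core TPUs are supported, got {cores}")
    return version, cores


print(parse_tpu_spec("tpu-v2-8"))  # ('v2', 8)
```

A value such as tpu-v3-32 would raise ValueError here, mirroring the single-device restriction described above.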
Private subnets with GCP
Additionally, the update allows configuring the gcp backend to use only private subnets. To achieve this, set public_ips to false.
projects:
- name: main
  backends:
  - type: gcp
    creds:
      type: default
    public_ips: false
Major bug-fixes
Besides TPU, the update fixes a few important bugs.
- Fix cudo backend stuck && Improve docs for cudo by @smokfyz in #1347
- Fix nvidia-smi not available on lambda by @r4victor in #1357
- Respect registry_auth for RunPod by @smokfyz in #1333
- Support multi-node tasks on oci by @jvstme in #1334
Other
- Show warning on required ssh version by @loghijiaha in #1313
- Add OCI packer templates by @jvstme in #1316
- Support oci Bare Metal instances by @jvstme in #1325
- Support oci BM.Optimized3.36 instance by @jvstme in #1328
- [Docs] Update dstack pool docs by @jvstme in #1329
- Add TPU support in gcp by @Bihan in #1323
- Fix failing runner-test workflow by @r4victor in #1336
- Document OCI permissions by @jvstme in #1338
- Limit the gateway's open ports to 22, 80, and 443 by @smokfyz in #1335
- Update serve.dstack.yml - infinity by @michaelfeil in #1340
- Support instances without public IP for GCP by @smokfyz in #1341
- [Internal] Automate OCI images publishing by @jvstme in #1346
- Fix slow /api/pools/list_instances by @r4victor in #1320
- Respect gcp VPC config when provisioning TPUs by @r4victor in #1332
- [Internal] Fix linter errors by @jvstme in #1322
- TPU support enhancements by @r4victor in #1330
- TPU initial release by @Bihan in #1354
- TPUs fixes by @r4victor in #1360
- Minor refactoring to support custom backends in dstack Sky by @r4victor in #1319
- Even more flexible OCI client credentials by @jvstme in #1317
New contributors
- @loghijiaha made their first contribution in #1313
- @smokfyz made their first contribution in #1333
- @michaelfeil made their first contribution in #1340
Full changelog: 0.18.3...0.18.4