Releases · microsoft/DeepSpeed
v0.15.4
What's Changed
- Update version.txt after 0.15.3 release by @loadams in #6652
- Fix expert grad scaling problem with ZeRO optimizer by @wyooyw in #6546
- Add attribute check for language_model when replace last linear module by @Yejing-Lai in #6650
- fix init_device_mesh for torch 2.4 by @Lzhang-hub in #6614
- Fix dynamo issue by @oraluben in #6527
- sequence parallel for uneven heads by @inkcherry in #6392
- Add fallback for is_compiling by @tohtana in #6663 (see the sketch after this list)
- Update profiler registration check by @loadams in #6668
- Add support for H100/sm_90 arch compilation by @loadams in #6669
- Update Gaudi2 docker image by @loadams in #6677
- Update gaudi2 docker version to latest release (1.18) by @raza-sikander in #6648
- Update base docker image for A6000 GPU tests by @loadams in #6681
- Remove packages that no longer need to be updated in the latest container by @loadams in #6682
- Fix training of pipeline based peft's lora model by @xuanhua in #5477
- Update checkout action to latest version by @loadams in #5021
- Add attribute check to support git-base autotp by @Yejing-Lai in #6688
- fix memcpy issue on backward for zero-infinity by @xylian86 in #6670
- Free memory in universal checkpointing tests by @tohtana in #6693
- Explicitly set device when reusing dist env by @tohtana in #6696
- Update URL in README Pipeline Status for Huawei Ascend NPU by @xuedinge233 in #6706
- Pin transformers to 4.45.2 in nv-ds-chat workflow by @loadams in #6710
- [Bug Fix] Support threads_per_head < 64 for wavefront size of 64 by @jagadish-amd in #6622
- Use one param coordinator for both train/inference scenarios by @tohtana in #6662
- Update yapf version by @loadams in #6721
- Update flake8 version by @loadams in #6722
- Switch what versions of python are supported by @loadams in #5676
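A minimal sketch of the version-guarded fallback pattern behind #6663, assuming the intent is to tolerate PyTorch builds that predate `torch.compiler.is_compiling` (illustrative, not DeepSpeed's exact code):

```python
import torch

def is_compiling() -> bool:
    """Best-effort compile-state check across PyTorch versions."""
    # Newer PyTorch exposes the public API.
    if hasattr(torch, "compiler") and hasattr(torch.compiler, "is_compiling"):
        return torch.compiler.is_compiling()
    # Older 2.x builds only have the private dynamo hook.
    try:
        import torch._dynamo
        return torch._dynamo.is_compiling()
    except (ImportError, AttributeError):
        # Very old builds have neither, so we are never "compiling".
        return False
```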
Full Changelog: v0.15.3...v0.15.4
v0.15.3
What's Changed
- Update version.txt after 0.15.2 release by @loadams in #6615
- Clean up prefetched parameters by @tohtana in #6557
- AIO CPU Locked Tensor by @jomayeri in #6592
- reduce setting global variables to reduce torch compile graph breaks by @NirSonnenschein in #6541 (see the sketch after this list)
- Add API to get devices of offload states by @tohtana in #6586
- Ignore reuse_dist_env by @tohtana in #6623
- Add API for updating ZeRO gradients by @tjruwase in #6590
- [compile] Show breakdown of graph break by @delock in #6601
- Accept btl_tcp_if_include option through launcher_args by @diskkid in #6613
- Add first Step in LR Schedulers by @jomayeri in #6597
- Support safetensors export by @xu-song in #6579
- add option to disable logger while compiling to avoid graph breaks by @ShellyNR in #6496
- Lock cache file of HF model list by @tohtana in #6628
- Add README Pipeline Status for Huawei Ascend NPU by @xuedinge233 in #6588
- Update torch version in workflows by @tohtana in #6631
- Use file store for tests by @tohtana in #6632
- Fix Memory Leak In AIO by @jomayeri in #6630
- [XPU] upgrade xpu max1100 CI workflow to pytorch2.3 by @Liangliang-Ma in #6646
- [XPU] host timer check version from Torch 2.5 to Torch 2.6 by @YizhouZ in #6633
- [XPU] [DeepNVMe] use same cpu_op_desc_t with cuda by @Liangliang-Ma in #6645
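As background for #6541 above, a toy illustration (not the DeepSpeed change itself) of why mutating module-level globals inside a compiled region can force TorchDynamo to break the graph, and the usual remedy of hoisting the bookkeeping out:

```python
import torch

_step_count = 0  # module-level global

@torch.compile
def step_with_side_effect(x):
    global _step_count
    _step_count += 1      # writing a Python global inside the traced
    return x * 2          # region is a classic graph-break trigger

@torch.compile
def step_pure(x):         # pure compute traces into a single graph
    return x * 2

x = torch.randn(4)
y = step_pure(x)
_step_count += 1          # keep bookkeeping outside the compiled region
```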
Full Changelog: v0.15.2...v0.15.3
v0.15.2 Patch release
What's Changed
- Update version.txt after 0.15.1 release by @loadams in #6493
- HPU: add required ENV vars to accelerator init by @nelyahu in #6495
- Op_builder->is_compatible quiet warning by @terry-for-github in #6093
- fix pipeline eval_batch micro_batches argument for schedule by @nelyahu in #6484
- Fix the broken url link by @rogerxfeng8 in #6500
- fix environment variable export bug for MultiNodeRunner by @TideDra in #5878
- Revert "BF16 optimizer: Clear lp grads after updating hp grads in hook" by @nelyahu in #6508
- wrap include cuda_bf16.h with ifdef BF16_AVAILABLE by @oelayan7 in #6520
- Avoid security issues of subprocess shell by @tjruwase in #6498
- Add conditional on torch version for scaled_dot_product_attention by @loadams in #6517
- Added Intel Gaudi to Accelerator Setup Guide by @ShifaAbu in #6543
- Skip failing newly added tests in accelerate by @loadams in #6574
- Use msgpack for p2p comm by @tohtana in #6547
- DeepNVMe perf tuning by @tjruwase in #6560
- [Accelerator] Cambricon MLU support by @Andy666G in #6472
- Fix gradient accumulation for Z2+offload by @tohtana in #6550
- fix errors when setting zero3 leaf modules with torch.compile by @NirSonnenschein in #6564
- [XPU] Support DeepNVMe new code structure by @Liangliang-Ma in #6532
- Add APIs to offload states of model, optimizer, and engine by @tohtana in #6011 (see the sketch after this list)
- add bfloat16 to inference support dtypes by @nelyahu in #6528
- [COMPILE] workflow for deepspeed + torch.compile by @YizhouZ in #6570
- Fixes on the accelerate side mean we do not need to skip this test by @loadams in #6583
- Fix torch include in `op_builder/mlu/fused_adam.py` and update no-torch workflow triggers by @loadams in #6584
- [ROCm] Fix subprocess error by @jagadish-amd in #6587
- Cleanup CODEOWNERS file to be valid by @loadams in #6603
- Add SSF Best practices badge by @loadams in #6604
- Move V100 workflows from cuda 11.1/11.7 to 12.1 by @loadams in #6607
- Fix SD workflow by @loadams in #6609
- Pin accelerate to fix CI failures/issues by @loadams in #6610
- Add llama3.2 vision autotp by @Yejing-Lai in #6577
- Improve DS logging control by @tjruwase in #6602
- Fix device selection using CUDA_VISIBLE_DEVICES by @tohtana in #6530
- Handle when `backend` is also in compile_kwargs by @oraluben in #6502
- Rearrange inference OPS and stop using builder.load by @oelayan7 in #5490
- Unpin accelerate tests, update lightning with node16 removal. by @loadams in #6611
- Enabled Qwen2-MoE Tensor Parallelism (TP) inference by @gyou2021 in #6551
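For the offload-states APIs from #6011, a hedged usage sketch; the `offload_states()`/`reload_states()` method names follow the PR's description, but treat the exact signatures as assumptions and check the DeepSpeed docs:

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)
ds_config = {
    "train_batch_size": 1,
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)

engine.offload_states()   # assumed API: move engine/optimizer states to host
# ... the freed accelerator memory is available to other work here ...
engine.reload_states()    # assumed API: restore states before training resumes
```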
New Contributors
- @TideDra made their first contribution in #5878
- @ShifaAbu made their first contribution in #6543
- @jagadish-amd made their first contribution in #6587
- @gyou2021 made their first contribution in #6551
Full Changelog: v0.15.1...v0.15.2
v0.15.1 Patch release
What's Changed
- Update version.txt after 0.15.0 release by @loadams in #6403
- Fix Type Mismatch by @jomayeri in #6410
- Fix redundant seq data parallel grp argument in Z3/MiCS by @samadejacobs in #5352
- add Huawei Ascend NPU setup guide by @xuedinge233 in #6445
- Add documentation for launcher without SSH by @dogacancolak-kensho in #6455
- Dtype support check for accelerator in UTs by @raza-sikander in #6360
- Store/Load CIFAR from local/offline by @raza-sikander in #6390
- Add the accelerator setup guide link in Getting Started page by @rogerxfeng8 in #6452
- Allow triton==3.0.x for fp_quantizer by @siddartha-RE in #6447
- Change GDS to 1 AIO thread by @jomayeri in #6459
- [CCL] fix condition issue in ccl.py by @YizhouZ in #6443
- Avoid gds build errors on ROCm by @rraminen in #6456
- TestLowCpuMemUsage UT get device by device_name by @raza-sikander in #6397
- Add workflow to build DS without torch to better test before releases by @loadams in #6450
- Fix patch for parameter partitioning in zero.Init() by @tohtana in #6388
- Add default value to "checkpoint_folder" in "load_state_dict" of bf16_optimizer by @ljcc0930 in #6446
- DeepNVMe tutorial by @tjruwase in #6449
- bf16_optimizer: fixes to different grad acc dtype by @nelyahu in #6485
- print warning if actual triton cache dir is on NFS, not just for default by @jrandall in #6487
- DS_BUILD_OPS should build only compatible ops by @tjruwase in #6489
- Safe usage of popen by @tjruwase in #6490
- Handle an edge case where `CUDA_HOME` is not defined on ROCm systems by @amorehead in #6488 (see the sketch after this list)
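For #6488 above, a minimal sketch of the kind of guarded lookup involved, assuming the goal is to avoid crashing when `CUDA_HOME` is absent on ROCm machines (illustrative only, not the patch itself):

```python
import os
from typing import Optional

def resolve_cuda_home() -> Optional[str]:
    # On ROCm systems CUDA_HOME is frequently unset; fall back to a ROCm
    # location instead of failing on a missing environment variable.
    cuda_home = os.environ.get("CUDA_HOME")
    if cuda_home is None:
        cuda_home = os.environ.get("ROCM_HOME")
    if cuda_home is None and os.path.isdir("/opt/rocm"):
        cuda_home = "/opt/rocm"
    return cuda_home
```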
New Contributors
- @xuedinge233 made their first contribution in #6445
- @siddartha-RE made their first contribution in #6447
- @ljcc0930 made their first contribution in #6446
- @jrandall made their first contribution in #6487
- @amorehead made their first contribution in #6488
Full Changelog: v0.15.0...v0.15.1
DeepSpeed v0.15.0
What's Changed
- Update version.txt after 0.14.5 release by @loadams in #5982
- move pynvml install to setup.py by @Rohan138 in #5840
- add moe topk(k>2) gate support by @inkcherry in #5881
- Move inf_or_nan_tracker to cpu for cpu offload by @BacharL in #5826
- Enable dynamic shapes for pipeline parallel engine inputs by @tohtana in #5481
- Add and Remove ZeRO 3 Hooks by @jomayeri in #5658
- DeepNVMe GDS by @jomayeri in #5852
- Pin transformers version on nv-nightly by @loadams in #6002
- DeepSpeed on Windows blog by @tjruwase in #6364
- Bug Fix 5880 by @jomayeri in #6378
- Update linear.py compatible with torch 2.4.0 by @terry-for-github in #5811
- GDS Swapping Fix by @jomayeri in #6386
- Long sequence parallelism (Ulysses) integration with HuggingFace by @samadejacobs in #5774
- reduce cpu host overhead when using moe by @ranzhejiang in #5578
- fix fp16 Qwen2 series model to DeepSpeed-FastGen by @ZonePG in #6028
- Add Japanese translation of Windows support blog by @tohtana in #6394
- Correct op_builder path to xpu files for trigger XPU tests by @loadams in #6398
- add pip install cutlass version check by @GuanhuaWang in #6393
- [XPU] API align with new intel pytorch extension release by @YizhouZ in #6395
- Pydantic v2 migration by @mrwyattii in #5167 (see the sketch after this list)
- Fix torch check by @loadams in #6402
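The Pydantic v2 migration in #5167 touches patterns like the following; this is a generic before/after sketch of the v1 → v2 API changes, not DeepSpeed's actual config classes:

```python
from pydantic import BaseModel, ConfigDict, field_validator

class TrainConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")  # v2 replacement for `class Config`

    lr: float = 1e-3

    @field_validator("lr")       # v2 replacement for the v1 @validator
    @classmethod
    def _lr_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError("lr must be positive")
        return v

cfg = TrainConfig(lr=0.01)
print(cfg.model_dump())          # v2 replacement for the v1 .dict()
```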
New Contributors
- @Rohan138 made their first contribution in #5840
- @terry-for-github made their first contribution in #5811
- @ranzhejiang made their first contribution in #5578
Full Changelog: v0.14.5...v0.15.0
v0.14.5 Patch release
What's Changed
- Update version.txt after 0.14.4 release by @mrwyattii in #5694
- Fixed Windows inference build. by @costin-eseanu in #5609
- Fix memory leak from _hp_mapping by @chiragjn in #5643
- Bug fix for the "Link bit16 and fp32 parameters in partition" by @U-rara in #5681
- [CPU] add fp16 support to shm inference_all_reduce by @delock in #5669
- Universal checkpoint for zero stage 3 by @xylian86 in #5475
- inference unit test injectionPolicy split world_size to multiple tests by @oelayan7 in #5687
- ENV var added for recaching in INF Unit tests by @raza-sikander in #5688
- Disable nvtx decorator to avoid graph break by @tohtana in #5697
- Add an argument to enable the injection of missing state during the conversion of universal checkpoints by @xylian86 in #5608
- Change source of CPUAdam for xpu accelerator by @Liangliang-Ma in #5703
- Add additional paths to trigger xpu tests by @loadams in #5707
- Update XPU docker version by @loadams in #5712
- update xpu fusedadam opbuilder for pytorch 2.3 by @baodii in #5702
- DeepSpeed Universal Checkpointing: Blog and Tutorial by @samadejacobs in #5711
- UCP Chinese Blog by @HeyangQin in #5713
- Fix tutorial links by @samadejacobs in #5714
- Update node16 check on self-hosted runners and remove python 3.6 by @loadams in #5756
- fix the missing argument in test and typo by @xylian86 in #5730
- [INF] Enable torch compile for inference by @oelayan7 in #5612
- Update checkout action for nv-human-eval workflow by @loadams in #5757
- Add Windows scripts (deepspeed, ds_report). by @costin-eseanu in #5699
- Unit Test: Add error handling for rate limit exceeded in model list by @HeyangQin in #5715
- Fix memory leak for pipelined optimizer swapper by @mauryaavinash95 in #5700
- Remove duplicated variable by @xu-song in #5727
- Fix phi3 mini 128k load error by @Yejing-Lai in #5765
- [CPU] Allow deepspeed.comm.inference_all_reduce in torch.compile graph by @delock in #5604
- Added wrappers for hpu tensors based on dtype by @deepcharm in #5771
- [bugfix] promote state in bf16_optimizer by @billishyahao in #5767
- Launcher mode with SSH bypass by @dogacancolak-kensho in #5728
- Update the list of supported models in the Chinese README of fastgen by @beep-bebop in #5773
- Add support for Microsoft Phi-3 model to DeepSpeed-FastGen by @adk9 in #5559
- Misplaced global variable `warned` by @anferico in #5725
- Fixes for latest Huggingface_hub changes on modelId -> id by @loadams in #5789
- reduce all-to-all communication volume when both expert and non-expert are tensor-parallel by @taozhiwei in #5626
- Update Ubuntu version for running python tests by @loadams in #5783
- fix: quantization with DeepSpeed HE by @Atry in #5624
- [INF] Add Qwen2RMSNorm to loaded layers in auto_tp by @oelayan7 in #5786
- Add chatglm2 & chatglm3 autotp by @Yejing-Lai in #5540
- Add new autotp supported model in doc by @Yejing-Lai in #5785
- Fix accuracy error of NPUFusedAdam by @penn513 in #5777
- Update torch version in cpu-torch-latest and nv-torch-latest-v100 tests to 2.4 by @loadams in #5797
- move is_checkpointable call to reduce torch.compile graph breaks by @NirSonnenschein in #5759
- Unpin transformers version by @loadams in #5650
- Update other workflows to run on Ubuntu 22.04 by @loadams in #5798
- [XPU] Use host time to replace xpu time when the IPEX version is lower than 2.5 by @ys950902 in #5796
- Update MII tests to pull correct torchvision by @loadams in #5800
- Add fp8-fused gemm kernel by @sfc-gh-reyazda in #5764
- Add doc of compressed backend in Onebit optimizers by @Liangliang-Ma in #5782
- fix: handle exception when loading cache file in test_inference.py by @HeyangQin in #5802
- Pin transformers version for MII tests by @loadams in #5807
- Fix op_builder for CUDA 12.5 by @keshavkowshik in #5806
- Find ROCm on Fedora by @trixirt in #5705
- Fix CPU Adam JIT compilation by @lekurile in #5780
- GDS AIO Blog by @jomayeri in #5817
- [ROCm] Get rocm version from /opt/rocm/.info/version by @rraminen in #5815
- sequence parallel with communication overlap by @inkcherry in #5691
- Update to ROCm6 by @loadams in #5491
- Add fp16 support of Qwen1.5MoE models (A2.7B) to DeepSpeed-FastGen by @ZonePG in #5403
- Use accelerator to replace cuda in setup and runner by @Andy666G in #5769
- Link GDS blog to site by @tjruwase in #5820
- Non-reentrant checkpointing hook fix by @ic-synth in #5781
- Fix NV references by @tjruwase in #5821
- Fix docs building guide by @tjruwase in #5825
- Update clang-format version from 16 to 18. by @loadams in #5839
- Add Japanese translation of DeepNVMe blog by @tohtana in #5845
- Fix the bug of deepspeed sequence parallel working with batch size larger than 1 by @YJHMITWEB in #5823
- Upgrade HPU image to v1.16.2. by @vshekhawat-hlab in #5610
- OptimizedLinear updates by @jeffra in #5791
- Log operator warnings only in verbose mode by @tjruwase in #5917
- Use `torch.nan_to_num` to replace the numpy wrapper by @jinyouzhi in #5877 (see the sketch after this list)
- [Zero2] Reduce the unnecessary all-reduce when tensor size is 0 by @ys950902 in #5868
- Update container version for Gaudi2 CI by @raza-sikander in #5937
- Fix missing ds_id bug by @tjruwase in #5824
- Update LR scheduler configuration by @xiyang-aads-lilly in #5846
- HPUAccelerator: remove support in set_visible_devices_envs by @nelyahu in #5929
- Z3: optimizations for grad norm calculation and gradient clipping by @nelyahu in #5504
- Update xpu-max1100.yml with new config and add some tests by @Liangliang-Ma in #5668
- Add accelerator setup guides by @delock in #5827
- Allow accelerator to instantiate the device by @nelyahu in #5255
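For #5877 above, the native PyTorch call that replaces a numpy round-trip; `torch.nan_to_num` sanitizes non-finite values on the tensor's own device:

```python
import torch

x = torch.tensor([float("nan"), float("inf"), -float("inf"), 1.0])
# NaN -> 0.0, +inf -> 1e4, -inf -> -1e4, all without leaving the device.
y = torch.nan_to_num(x, nan=0.0, posinf=1e4, neginf=-1e4)
print(y)
```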
New Contributors
- @U-rara made their first contribution in #5681
- @xylian86 made their first contribution in #5475
- @mauryaavinash95 made their first contribution in #5700
- @billishyahao made their first contribution in #5767
- @dogacancolak-kensho made their first contribution in #5728
- @beep-bebop made their first contribution in #5773
- @anferico made their first contribution in #5725
- @Atry made their first contribution in #5624
- @sfc-gh-reyazda made their first contribution in https://github.com/...
v0.14.4 Patch release
What's Changed
- Update version.txt after 0.14.3 release by @mrwyattii in #5651
- [CPU] SHM based allreduce improvement for small message size by @delock in #5571
- _exec_forward_pass: place zeros(1) on the same device as the param by @nelyahu in #5576
- [XPU] adapt lazy_call func to different versions by @YizhouZ in #5670
- fix IDEX dependence in xpu accelerator by @Liangliang-Ma in #5666
- Remove compile wrapper to simplify access to model attributes by @tohtana in #5581
- Fix hpZ with zero element by @samadejacobs in #5652
- Fixing the reshape bug in sequence parallel alltoall, which corrupted all QKV data by @YJHMITWEB in #5664
- enable yuan autotp & add conv tp by @Yejing-Lai in #5428
- Fix latest pytorch '_get_socket_with_port' import error by @Yejing-Lai in #5654
- Fix numpy upgrade to 2.0.0 BUFSIZE import error by @Yejing-Lai in #5680
- Update BUFSIZE to come from autotuner's constants.py, not numpy by @loadams in #5686
- [XPU] support op builder from intel_extension_for_pytorch kernel path by @YizhouZ in #5425
New Contributors
- @YJHMITWEB made their first contribution in #5664
Full Changelog: v0.14.3...v0.14.4
v0.14.3 Patch release
What's Changed
- Update version.txt after 0.14.2 release by @mrwyattii in #5458
- Add getter and setter methods for compile_backend across accelerators. by @vshekhawat-hlab in #5299
- Fix torch.compile error for PyTorch v2.3 by @tohtana in #5463
- Revert "stage3: efficient compute of scaled_global_grad_norm (#5256)" by @lekurile in #5461
- Update ds-chat CI workflow paths to include zero stage 1-3 files by @lekurile in #5462
- Update with ops not supported on Windows by @loadams in #5468
- fix: swapping order of parameters in create_dir_symlink method. by @alvieirajr in #5465
- Un-pin torch version in nv-torch-latest back to latest and skip test_compile_zero tests on v100 by @loadams in #5459
- re-introduce: stage3: efficient compute of scaled_global_grad_norm by @nelyahu in #5493
- Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() by @harygo2 in #5464
- Fix compile wrapper by @BacharL in #5455
- enable phi3_mini autotp by @Yejing-Lai in #5501
- Fused adam for HPU by @BacharL in #5500
- [manifest] update manifest to add hpp file in csrc. by @ys950902 in #5522
- enable phi2 autotp by @Yejing-Lai in #5436
- Switch pynvml to nvidia-ml-py by @loadams in #5529
- Switch from double quotes to match single quotes by @loadams in #5530
- [manifest] update manifest to add hpp file in deepspeed. by @ys950902 in #5533
- New integration - CometMonitor by @alexkuzmik in #5466
- Improve _configure_optimizer() final optimizer log by @nelyahu in #5528
- Enhance testing: Skip fused_optimizer tests if not supported. by @vshekhawat-hlab in #5159
- Skip the UT cases that use unimplemented op builders. by @foin6 in #5372
- rocblas -> hipblas changes for ROCm by @rraminen in #5401
- Rocm warp size fix by @rraminen in #5402
- CPUAdam fp16 and bf16 support by @BacharL in #5409
- Optimize zero3 fetch params using all_reduce by @deepcharm in #5420
- Fix the TypeError for XPU Accelerator by @shiyang-weng in #5531
- Fix RuntimeError for moe on XPU: tensors found at least two devices by @shiyang-weng in #5519
- Remove synchronize calls from allgather params by @BacharL in #5516
- Avoid overwrite of compiled module wrapper attributes by @deepcharm in #5549
- Small typos in functions set_none_gradients_to_zero by @TravelLeraLone in #5557
- Adapt doc for #4405 by @oraluben in #5552
- Update to HF_HOME from TRANSFORMERS_CACHE by @loadams in #4816
- [INF] DSAttention allow input_mask to have false as value by @oelayan7 in #5546
- Add throughput timer configuration by @deepcharm in #5363
- Add Ulysses DistributedAttention compatibility by @Kwen-Chen in #5525
- Add hybrid_engine.py as path to trigger the DS-Chat GH workflow by @lekurile in #5562
- Update HPU docker version by @loadams in #5566
- Rename files in fp_quantize op from quantize.* to fp_quantize.* by @loadams in #5577
- [MiCS] Remove the handle print on DeepSpeed side by @ys950902 in #5574
- Update to fix sidebar over text by @loadams in #5567
- DeepSpeedCheckpoint: support custom final ln idx by @nelyahu in #5506
- Update minor CUDA version compatibility by @adk9 in #5591
- Add slide deck for meetup in Japan by @tohtana in #5598
- Fixed the Windows build. by @costin-eseanu in #5596
- estimate_zero2_model_states_mem_needs: fixing memory estimation by @nelyahu in #5099
- Fix cuda hardcode for inference woq by @Liangliang-Ma in #5565
- fix sequence parallel(Ulysses) grad scale for zero0 by @inkcherry in #5555
- Add Compressedbackend for Onebit optimizers by @Liangliang-Ma in #5473
- Updated hpu-gaudi2 tests content. by @vshekhawat-hlab in #5622
- Pin transformers version for MII tests by @loadams in #5629
- WA for Torch-compile-Z3-act-apt accuracy issue from the Pytorch repo by @NirSonnenschein in #5590
- stage_1_and_2: optimize clip calculation to use clamp by @nelyahu in #5632 (see the sketch after this list)
- Fix overlap communication of ZeRO stage 1 and 2 by @penn513 in #5606
- fixes in _partition_param_sec function by @mmhab in #5613
- Fix incorrect assumption that torch.initial_seed accepts a seed arg in the DeepSpeedAccelerator abstract class by @polisettyvarma in #5569
- pipe/_exec_backward_pass: fix immediate grad update by @nelyahu in #5605
- Monitor was always enabled causing performance degradation by @deepcharm in #5633
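For #5632 above, a sketch of the clamp-based form of the gradient-clipping coefficient (the idea under assumed names, not the exact DeepSpeed code); clamp caps the scale without branching on a device tensor, so no host-device sync is needed to pick a branch:

```python
import torch

def clip_coeff(total_norm: torch.Tensor, max_norm: float) -> torch.Tensor:
    # Branch-free: clamp caps the scale at 1.0, replacing an
    # if total_norm > max_norm: ... else: ... on a device tensor.
    return (max_norm / (total_norm + 1e-6)).clamp(max=1.0)

grads = [torch.randn(16) for _ in range(3)]
total_norm = torch.stack([g.norm() for g in grads]).norm()
scale = clip_coeff(total_norm, max_norm=1.0)
for g in grads:
    g.mul_(scale)
```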
New Contributors
- @alvieirajr made their first contribution in #5465
- @harygo2 made their first contribution in #5464
- @alexkuzmik made their first contribution in #5466
- @foin6 made their first contribution in #5372
- @shiyang-weng made their first contribution in #5531
- @TravelLeraLone made their first contribution in #5557
- @oraluben made their first contribution in #5552
- @Kwen-Chen made their first contribution in #5525
- @adk9 made their first contribution in #5591
- @costin-eseanu made their first contribution in #5596
- @NirSonnenschein made their first contribution in #5590
- @penn513 made their first contribution in #5606
Full Changelog: v0.14.2...v0.14.3
v0.14.2 Patch release
What's Changed
- Update version.txt after 0.14.1 release by @mrwyattii in #5413
- Remove dtype(fp16) condition check for residual_add unit test by @raza-sikander in #5329
- [XPU] Use non_daemonic_proc by default on XPU device by @ys950902 in #5412
- Fix convergence issues in TP topology caused by incorrect grad_norm by @inkcherry in #5411
- Update 'create-pr' action in release workflow to latest by @loadams in #5415
- Update engine.py to avoid torch warning by @etiennebonnafoux in #5408
- Update _sidebar.scss by @fasterinnerlooper in #5293
- Add more tests into XPU CI by @Liangliang-Ma in #5427
- [CPU] Support SHM based inference_all_reduce in TorchBackend by @delock in #5391
- Add required paths to trigger AMD tests on PRs by @loadams in #5406
- Bug fix in `split_index` method by @bm-synth in #5292
- Parallel map step for `DistributedDataAnalyzer` map-reduce by @bm-synth in #5291
- Selective dequantization by @RezaYazdaniAminabadi in #5375
- Fix sorting of shard optimizer states files for universal checkpoint by @tohtana in #5395
- add device config env for the accelerator by @shiyuan680 in #5396
- 64bit indexing fused adam by @garrett4wade in #5187
- Improve parallel process of universal checkpoint conversion by @tohtana in #5343
- Set the default to use set_to_none for clearing gradients in the BF16 optimizer by @inkcherry in #5434 (see the sketch after this list)
- OptimizedLinear implementation by @jeffra in #5355
- Update README.md by @Jhonso7393 in #5453
- Update PyTest torch version to match PyTorch latest official (2.3.0) by @loadams in #5454
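For #5434 above, the standard PyTorch pattern the new default mirrors; `set_to_none=True` frees gradient tensors rather than zero-filling them:

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(2, 4)).sum()
loss.backward()
opt.step()

# Freeing .grad (set_to_none=True) skips the per-parameter zero-fill
# kernels and lets the allocator reuse the memory immediately.
opt.zero_grad(set_to_none=True)
```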
New Contributors
- @etiennebonnafoux made their first contribution in #5408
- @fasterinnerlooper made their first contribution in #5293
- @shiyuan680 made their first contribution in #5396
- @garrett4wade made their first contribution in #5187
- @Jhonso7393 made their first contribution in #5453
Full Changelog: v0.14.1...v0.14.2
v0.14.1 Patch release
What's Changed
- Update version.txt after 0.14.0 release by @mrwyattii in #5238
- Fp6 blog chinese by @xiaoxiawu-microsoft in #5239
- Add contributed HW support into README by @delock in #5240
- Set tp world size to 1 in ckpt load, if MPU is not provided by @samadejacobs in #5243
- Make op builder detection adapt to accelerator change by @delock in #5206
- Replace HIP_PLATFORM_HCC with HIP_PLATFORM_AMD by @rraminen in #5264
- Add CI for Habana Labs HPU/Gaudi2 by @loadams in #5244
- Fix attention mask handling in the Hybrid Engine Bloom flow by @deepcharm in #5101
- Skip 1Bit Compression and sparsegrad tests for HPU. by @vshekhawat-hlab in #5270
- Enabled LMCorrectness inference tests on HPU. by @vshekhawat-hlab in #5271
- Added HPU backend support for torch.compile tests. by @vshekhawat-hlab in #5269
- Average only valid part of the ipg buffer. by @BacharL in #5268
- Add HPU accelerator support in unit tests. by @vshekhawat-hlab in #5162
- Fix loading a universal checkpoint by @tohtana in #5263
- Add Habana Gaudi2 CI badge to the README by @loadams in #5286
- Add intel gaudi to contributed HW in README by @BacharL in #5300
- Fixed Accelerate Link by @wkaisertexas in #5314
- Enable mixtral 8x7b autotp by @Yejing-Lai in #5257
- support bf16_optimizer moe expert parallel training and moe EP grad_scale/grad_norm fix by @inkcherry in #5259
- fix comms dtype by @mayank31398 in #5297
- Modified regular expression by @igeni in #5306
- Docs typos fix and grammar suggestions by @Gr0g0 in #5322
- Added Gaudi2 CI tests. by @vshekhawat-hlab in #5275
- Improve universal checkpoint by @tohtana in #5289
- Increase coverage for HPU by @loadams in #5324
- Add NFS path check for default deepspeed triton cache directory by @HeyangQin in #5323
- Correct typo in checking on bf16 unit test support by @loadams in #5317
- Make NFS warning print only once by @HeyangQin in #5345
- resolve KeyError: 'PDSH_SSH_ARGS_APPEND' by @Lzhang-hub in #5318
- BF16 optimizer: Clear lp grads after updating hp grads in hook by @YangQun1 in #5328
- Fix sort of zero checkpoint files by @tohtana in #5342
- Add `distributed_port` for `deepspeed.initialize` by @LZHgrla in #5260
- [fix] fix typo s/simultanenously/simultaneously by @digger-yu in #5359
- Update container version for Gaudi2 CI by @raza-sikander in #5360
- compute global norm on device by @BacharL in #5125 (see the sketch after this list)
- logger update with torch master changes by @rogerxfeng8 in #5346
- Ensure capacity does not exceed number of tokens by @jeffra in #5353
- Update workflows that use cu116 to cu117 by @loadams in #5361
- FP [6,8,12] quantizer op by @jeffra in #5336
- CPU SHM based inference_all_reduce improve by @delock in #5320
- Auto convert moe param groups by @jeffra in #5354
- Support MoE for pipeline models by @mosheisland in #5338
- Update pytest and transformers with fixes for pytest>= 8.0.0 by @loadams in #5164
- Increase CI coverage for Gaudi2 accelerator. by @vshekhawat-hlab in #5358
- Add CI for Intel XPU/Max1100 by @Liangliang-Ma in #5376
- Update path name on xpu-max1100.yml, add badge in README by @loadams in #5386
- Update checkout action on workflows on ubuntu 20.04 by @loadams in #5387
- Cleanup required_torch_version code and references. by @loadams in #5370
- Update README.md for intel XPU support by @Liangliang-Ma in #5389
- Optimize the fp-dequantizer to get high memory-BW utilization by @RezaYazdaniAminabadi in #5373
- Removal of cuda hardcoded string with get_device function by @raza-sikander in #5351
- Add custom reshaping for universal checkpoint by @tohtana in #5390
- fix pageable h2d memcpy by @GuanhuaWang in #5301
- stage3: efficient compute of scaled_global_grad_norm by @nelyahu in #5256
- Fix the FP6 kernels compilation problem on non-Ampere GPUs. by @JamesTheZ in #5333
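For #5125 above, a sketch of the on-device accumulation pattern (illustrative, with assumed helper names); returning a device tensor instead of calling .item() avoids forcing a host sync every step:

```python
import torch

def global_grad_norm(params) -> torch.Tensor:
    # Accumulate squared norms on the accelerator; no .item()/.cpu(),
    # so the host never blocks waiting for the device.
    grads = [p.grad for p in params if p.grad is not None]
    total = torch.zeros((), device=grads[0].device)
    for g in grads:
        total += g.float().pow(2).sum()
    return total.sqrt()

model = torch.nn.Linear(8, 8)
model(torch.randn(2, 8)).sum().backward()
norm = global_grad_norm(model.parameters())
```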
New Contributors
- @vshekhawat-hlab made their first contribution in #5270
- @wkaisertexas made their first contribution in #5314
- @igeni made their first contribution in #5306
- @Gr0g0 made their first contribution in #5322
- @Lzhang-hub made their first contribution in #5318
- @YangQun1 made their first contribution in #5328
- @raza-sikander made their first contribution in #5360
- @rogerxfeng8 made their first contribution in #5346
- @JamesTheZ made their first contribution in #5333
Full Changelog: v0.14.0...v0.14.1