Ray-2.36.0
Ray Libraries
Ray Data
💫 Enhancements:
- Remove limit on number of tasks launched per scheduling step (#47393)
- Allow user-defined Exception to be caught. (#47339)
🔨 Fixes:
- Display pending actors separately in the progress bar and not count them towards running resources (#46384)
- Fix bug where `arrow_parquet_args` aren't used (#47161)
- Skip empty JSON files in `read_json()` (#47378)
- Remove remote call for initializing `Datasource` in `read_datasource()` (#47467)
- Remove dead `from_*_operator` modules (#47457)
- Release test fixes
  - Add `AWS ACCESS_DENIED` as retryable exception for multi-node Data+Train benchmarks (#47232)
  - Get AWS credentials with boto (#47352)
  - Use worker node instead of head node for `read_images_comparison_microbenchmark_single_node` release test (#47228)
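The `read_json()` fix (#47378) makes empty input files get skipped instead of failing the read. The idea can be shown in plain Python with the standard `json` module. This is a minimal sketch, not Ray Data's implementation: it assumes newline-delimited JSON input, and the helper `read_jsonl_dir` is hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def read_jsonl_dir(directory: str) -> List[Dict[str, Any]]:
    """Read every *.json file in `directory` as newline-delimited JSON.

    Hypothetical helper: zero-byte files are skipped rather than handed
    to the parser, mirroring the behavior described in #47378.
    """
    rows: List[Dict[str, Any]] = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not name.endswith(".json") or not os.path.isfile(path):
            continue
        if os.path.getsize(path) == 0:
            # Empty file: nothing to parse, so skip it instead of erroring.
            continue
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # ignore blank lines inside a non-empty file
                    rows.append(json.loads(line))
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "empty.json"), "w").close()
        with open(os.path.join(tmp, "data.json"), "w") as f:
            f.write('{"id": 1}\n{"id": 2}\n')
        print(read_jsonl_dir(tmp))  # [{'id': 1}, {'id': 2}]
```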
📖 Documentation:
- Add docstring to explain `Dataset.deserialize_lineage` (#47203)
- Add a comment explaining the bundling behavior for `map_batches` with default `batch_size` (#47433)
Ray Train
💫 Enhancements:
- Decouple device-related modules and add Huawei NPU support to Ray Train (#44086)
🔨 Fixes:
- Update TORCH_NCCL_ASYNC_ERROR_HANDLING env var (#47292)
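Recent PyTorch releases deprecate `NCCL_ASYNC_ERROR_HANDLING` in favor of the `TORCH_`-prefixed `TORCH_NCCL_ASYNC_ERROR_HANDLING`. The following is a minimal sketch of setting the new variable before an NCCL process group is created while honoring a value the user already set under either name. The helper name, the precedence rules, and the default of `"1"` are assumptions for illustration, not Ray Train's actual code.

```python
import os
from typing import MutableMapping, Optional

NEW_VAR = "TORCH_NCCL_ASYNC_ERROR_HANDLING"
LEGACY_VAR = "NCCL_ASYNC_ERROR_HANDLING"  # deprecated spelling


def set_nccl_async_error_handling(
    env: Optional[MutableMapping[str, str]] = None, default: str = "1"
) -> str:
    """Hypothetical helper: make sure the TORCH_-prefixed variable is set.

    Precedence (assumed): an existing TORCH_-prefixed value wins, then a
    legacy value is carried over, and only otherwise is `default` used.
    Call it before torch.distributed.init_process_group(backend="nccl").
    """
    if env is None:
        env = os.environ
    if NEW_VAR not in env:
        env[NEW_VAR] = env.get(LEGACY_VAR, default)
    return env[NEW_VAR]
```

Passing an explicit mapping keeps the helper easy to test. Calling it with no arguments updates `os.environ` in the current (worker) process.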
📖 Documentation:
- Add missing Train public API reference (#47134)
Ray Tune
📖 Documentation:
- Add missing Tune public API references (#47138)
Ray Serve
💫 Enhancements:
- Mark proxy as unready when its routers are aware of zero replicas (#47002)
- Set up default serve logger (#47229)
🔨 Fixes:
- Allow get_serve_logs_dir to run outside of Ray's context (#47224)
- Use serve logger name for logs in serve (#47205)
📖 Documentation:
- [HPU] [Serve] [experimental] Add vllm HPU support in vllm example (#45893)
🏗 Architecture refactoring:
- Remove support for nested DeploymentResponses (#47209)
RLlib
🎉 New Features:
- New API stack: Add CQL algorithm. (#47000, #47402)
- New API stack: Enable GPU and multi-GPU support for DQN/SAC/CQL. (#47179)
💫 Enhancements:
- New API stack: Offline RL enhancements: #47195, #47359
- Enhance new API stack stability: #46324, #47196, #47245, #47279
- Fix large batch size for synchronous algos (e.g. PPO) after EnvRunner failures. (#47356)
- Add torch.compile config options to old API stack. (#47340)
- Add kwargs to torch.nn.parallel.DistributedDataParallel (#47276)
- Enhanced CI stability: #47197, #47249
📖 Documentation:
- New API stack example scripts:
- Remove "new API stack experimental" hint from docs. (#47301)
🏗 Architecture refactoring:
- Remove 2nd Learner ConnectorV2 pass from PPO (#47401)
- Add separate learning rates for policy and alpha to SAC. (#47078)
🔨 Fixes:
Ray Core
💫 Enhancements:
- [ADAG] Raise proper error message for nccl within the same actor (#47250)
- [ADAG] Support multi-read of the same shm channel (#47311)
- Log why core worker is not idle during HandleExit (#47300)
- Add PREPARED state for placement groups in GCS for better fault tolerance. (#46858)
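Ray reserves placement group bundles across nodes with a prepare-then-commit (two-phase) protocol. The release note says that recording a PREPARED state in the GCS (#46858) is for fault tolerance. One plausible reading is that a recovering controller can then distinguish "resources reserved, not yet committed" from "not started". The following is a simplified, hypothetical model of those transitions in plain Python. The class, its methods, and the one-bundle-per-node rule are assumptions, not Ray's GCS code.

```python
from enum import Enum
from typing import Dict, List

Resources = Dict[str, float]


class PGState(Enum):
    PENDING = "PENDING"
    PREPARED = "PREPARED"
    CREATED = "CREATED"


class PlacementGroupModel:
    """Toy prepare/commit model of placement group creation (hypothetical)."""

    def __init__(self, bundles: List[Resources]):
        self.bundles = bundles
        self.state = PGState.PENDING

    def prepare(self, node_free: List[Resources]) -> bool:
        """Phase 1: reserve bundle i on node i for every bundle, or reserve nothing."""
        if self.state is not PGState.PENDING:
            raise RuntimeError(f"cannot prepare from {self.state.name}")
        if len(node_free) != len(self.bundles):
            raise ValueError("expected one node per bundle")
        fits = all(
            free.get(res, 0.0) >= amount
            for bundle, free in zip(self.bundles, node_free)
            for res, amount in bundle.items()
        )
        if not fits:
            return False  # stay PENDING; nothing was reserved
        for bundle, free in zip(self.bundles, node_free):
            for res, amount in bundle.items():
                free[res] = free.get(res, 0.0) - amount
        # Recording PREPARED marks that resources are reserved on the nodes
        # but the group is not usable yet.
        self.state = PGState.PREPARED
        return True

    def commit(self) -> None:
        """Phase 2: turn the prepared reservation into a usable group."""
        if self.state is not PGState.PREPARED:
            raise RuntimeError(f"cannot commit from {self.state.name}")
        self.state = PGState.CREATED


if __name__ == "__main__":
    nodes = [{"CPU": 4.0}, {"CPU": 2.0}]
    pg = PlacementGroupModel([{"CPU": 2.0}, {"CPU": 2.0}])
    print(pg.prepare(nodes), pg.state.name)  # True PREPARED
    pg.commit()
    print(pg.state.name)  # CREATED
```

In this model, a crash between `prepare` and `commit` leaves a recorded PREPARED state. Recovery logic then knows that reservations exist and must be either committed or rolled back.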
🔨 Fixes:
- Fix ray_unintentional_worker_failures_total to only count unintentional worker failures (#47368)
- Fix runtime env race condition when uploading the same package concurrently (#47482)
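The runtime env fix (#47482) targets a race that occurs when the same package is uploaded concurrently. A common way to avoid this class of race is to serialize uploads per package key, so that only the first caller uploads and later callers reuse the result. The following sketch shows that general pattern with `threading`. `PackageUploader` and its `upload_fn` callback are hypothetical, and this is not necessarily how Ray fixed the bug.

```python
import threading
from typing import Callable, Dict, Set


class PackageUploader:
    """Runs at most one upload per package URI, even under concurrency (hypothetical)."""

    def __init__(self, upload_fn: Callable[[str], None]):
        self._upload_fn = upload_fn
        self._guard = threading.Lock()  # protects _locks and _done
        self._locks: Dict[str, threading.Lock] = {}
        self._done: Set[str] = set()

    def _lock_for(self, uri: str) -> threading.Lock:
        # One lock per URI, so uploads of different packages don't block each other.
        with self._guard:
            return self._locks.setdefault(uri, threading.Lock())

    def upload_if_needed(self, uri: str) -> bool:
        """Upload `uri` unless already uploaded; True only for the caller that uploaded."""
        with self._lock_for(uri):
            with self._guard:
                if uri in self._done:
                    return False
            # Only one thread per URI can reach this point at a time.
            self._upload_fn(uri)
            with self._guard:
                self._done.add(uri)
            return True
```

If `upload_fn` raises, the URI is not marked as done, so the next caller retries the upload instead of assuming it succeeded.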
Dashboard
🔨 Fixes:
- Performance optimizations for dashboard backend logic (#47392) (#47367) (#47160) (#47213)
- Refactor to simplify dashboard backend logic (#47324)
Docs
💫 Enhancements:
- Add sphinx-autobuild and documentation for `make local` (#47275): speeds up local docs builds with `make local`.
- Add Algolia search to docs (#46477)
- Update PyTorch Mnist Training doc for KubeRay 1.2.0 (#47321)
- Life-cycle of documentation policy of Ray APIs
Thanks
Many thanks to all those who contributed to this release!
@GeneDer, @Bye-legumes, @nikitavemuri, @kevin85421, @MortalHappiness, @LeoLiao123, @saihaj, @rmcsqrd, @bveeramani, @zcin, @matthewdeng, @raulchen, @mattip, @jjyao, @ruisearch42, @scottjlee, @can-anyscale, @khluu, @aslonnie, @rynewang, @edoakes, @zhanluxianshen, @venkatram-dev, @c21, @allenyin55, @alexeykudinkin, @snehakottapalli, @BitPhinix, @hongchaodeng, @dengwxn, @liuxsh9, @simonsays1980, @peytondmurray, @KepingYan, @bryant1410, @woshiyyya, @sven1977