Ray-2.11.0

aslonnie released this 17 Apr 23:31


Release Highlights

[data] Support reading Avro files with ray.data.read_avro
[train] Added experimental support for AWS Trainium (Neuron) and Intel HPU.

Ray Libraries

Ray Data

🎉 New Features:

Support reading Avro files with ray.data.read_avro (#43663)
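A minimal sketch of how the new reader might be used. The helper name preview_avro and its limit parameter are hypothetical and only for illustration; ray.data.read_avro is the call named in this release. Ray is imported lazily so the snippet can be loaded without a Ray installation, and the reader may require an additional Avro library (such as fastavro) to be installed.

```python
from typing import Any, Dict, List


def preview_avro(path: str, limit: int = 5) -> List[Dict[str, Any]]:
    """Read Avro data at `path` into a Ray Dataset and return the first rows.

    `preview_avro` is a hypothetical helper; only `ray.data.read_avro`
    comes from the release notes.
    """
    # Imported lazily so this module loads even where Ray is not installed.
    import ray

    # `path` is assumed to point at an existing Avro file or directory.
    ds = ray.data.read_avro(path)

    # take() materializes up to `limit` rows as a list of dicts.
    return ds.take(limit)
```

Calling preview_avro with the path to an existing Avro file would return up to five rows, one dict per record.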

💫 Enhancements:

Pin ipywidgets==7.7.2 to enable Data progress bars in VSCode Web (#44398)
Change log level for ignored exceptions (#44408)

🔨 Fixes:

Change Parquet encoding ratio lower bound from 2 to 1 (#44470)
Fix throughput time calculations for metrics (#44138)
Fix nested ragged numpy.ndarray (#44236)
Fix Ray debugger incompatibility caused by trimmed error stack trace (#44496)

📖 Documentation:

Update "Data Loading and Preprocessing" doc (#44165)
Move imports into TFPredictor in batch inference example (#44434)

Ray Train

🎉 New Features:

Add experimental support for AWS Trainium (Neuron) (#39130)
Add experimental support for Intel HPU (#43343)

💫 Enhancements:

Log a deprecation warning for local_dir and related environment variables (#44029)
Enforce xgboost>=1.7 for XGBoostTrainer usage (#44269)

🔨 Fixes:

Fix ScalingConfig(accelerator_type) to request an appropriate resource amount (#44225)
Fix maximum recursion issue when serializing exceptions (#43952)
Remove base config deepcopy when initializing the trainer actor (#44611)

🏗 Architecture refactoring:

Remove deprecated BatchPredictor (#43934)

Ray Tune

💫 Enhancements:

Add support for new style lightning import (#44339)
Log a deprecation warning for local_dir and related environment variables (#44029)

🏗 Architecture refactoring:

Remove scikit-optimize search algorithm (#43969)

Ray Serve

🔨 Fixes:

Dynamically-created applications will no longer be deleted when a config is PUT via the REST API (#44476).
Fix _to_object_ref memory leak (#43763)
Log warning to reconfigure max_ongoing_requests if max_batch_size is less than max_ongoing_requests (#43840)
Fix deployments failing to start with ModuleNotFoundError in Ray 2.10 (#44329)
- This was fixed by reverting the original core change to the sys.path behavior: Revert "[core] If there's working_dir, don't set _py_driver_sys_path." (#44435)
The batch_queue_cls parameter is removed from the @serve.batch decorator (#43935)

RLlib

🎉 New Features:

New API stack: DQN Rainbow is now available for single-agent (#43196, #43198, #43199)
PrioritizedEpisodeReplayBuffer is available for off-policy learning using the EnvRunner API (SingleAgentEnvRunner) and supports random n-step sampling (#42832, #43258, #43458, #43496, #44262)

💫 Enhancements:

Restructured examples/ folder; started moving example scripts to the new API stack (#44559, #44067, #44603)
Evaluation do-over: Deprecate enable_async_evaluation option (in favor of existing evaluation_parallel_to_training setting). (#43787)
Add module_for API to MultiAgentEpisode (analogous to the policy_for API of the old Episode classes). (#44241)
All rllib_contrib old stack algorithms have been removed from rllib/algorithms (#43656)

🔨 Fixes:

New API stack: Multi-GPU + multi-agent has been fixed. This completes support for any combination of the following on the new API stack: [single-agent, multi-agent] vs [0 GPUs, 1 GPU, >1 GPUs] vs [any number of EnvRunners] (#44420, #44664, #44594, #44677, #44082, #44669, #44622)
Various other bug fixes: #43906, #43871, #44000, #44340, #44491, #43959, #44043, #44446, #44040

📖 Documentation:

Re-announced new API stack in alpha stage (#44090).

Ray Core and Ray Clusters

🎉 New Features:

Added the Ray check-open-ports CLI command for detecting ports that may be open to the public (#44488)

💫 Enhancements:

Support nodes sharing the same spilling directory without conflicts. (#44487)
Create two subclasses of RayActorError to distinguish between actor died (ActorDiedError) and actor temporarily unavailable (ActorUnavailableError) cases.
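The split between the two error types lets callers retry only when an actor is temporarily unavailable and fail fast when it has died. Below is a minimal sketch of that pattern. The three exception classes are local stand-ins that only mirror the hierarchy described above so the snippet runs without Ray; with Ray 2.11 installed, the real classes would be imported from ray.exceptions instead. The helper call_with_retry and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


# Stand-ins mirroring the hierarchy from the release notes (assumption: used
# here only so the sketch runs without Ray; use ray.exceptions in real code).
class RayActorError(Exception):
    pass


class ActorDiedError(RayActorError):
    pass


class ActorUnavailableError(RayActorError):
    pass


def call_with_retry(fn: Callable[[], T], max_attempts: int = 3,
                    backoff_s: float = 0.0) -> T:
    """Hypothetical helper: retry `fn` only on ActorUnavailableError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ActorUnavailableError:
            # Temporary condition: retry unless attempts are exhausted.
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)
        # ActorDiedError is deliberately not caught: the actor is gone,
        # so the error propagates to the caller immediately.
    raise ValueError("max_attempts must be at least 1")
```

In an application, fn would typically wrap a ray.get call on an actor method; code that catches RayActorError continues to catch both subclasses.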

🔨 Fixes:

Fixed the ModuleNotFoundError issue introduced in 2.10 (#44435)
Fixed an issue where the agent process was using too much CPU (#44348)
Fixed race condition in multi-threaded actor creation (#44232)
Fixed several streaming generator bugs (#44079, #44257, #44197)
Fixed an issue where user exceptions raised from tasks could not be subclassed (#44379)

Dashboard

💫 Enhancements:

Add serve controller metrics to serve system dashboard page (#43797)
Add Serve Application rows to Serve top-level deployments details page (#43506)
[Actor table page enhancements] Include "NodeId", "CPU", "Memory", "GPU", "GRAM" columns in the actor table page. Add sort functionality to resource utilization columns. Enable searching table by "Class" and "Repr". (#42588) (#42633) (#42788)

🔨 Fixes:

Fix default sorting of nodes in Cluster table page to first be by "Alive" nodes, then head nodes, then alphabetical by node ID. (#42929)
Fix bug where the Serve Deployment detail page fails to load if the deployment is in "Starting" state (#43279)
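The default node ordering described in the Cluster table fix (alive nodes first, then head nodes, then alphabetical by node ID) can be expressed as a single composite sort key. The sketch below illustrates that ordering only; it is not the dashboard's implementation, and the NodeRow type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeRow:
    # Hypothetical row shape, for illustration only.
    node_id: str
    alive: bool
    is_head: bool


def sort_nodes(rows: List[NodeRow]) -> List[NodeRow]:
    """Order rows: alive first, then head nodes, then by node ID."""
    # False sorts before True, so negate the flags that should come first.
    return sorted(rows, key=lambda r: (not r.alive, not r.is_head, r.node_id))
```

Because the key is a tuple, each later criterion only breaks ties left by the earlier ones.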

Docs

💫 Enhancements:

Refreshed the look and feel of the landing page. (#44251)

Thanks

Many thanks to all those who contributed to this release!

@aslonnie, @brycehuang30, @MortalHappiness, @astron8t-voyagerx, @edoakes, @sven1977, @anyscalesam, @scottjlee, @hongchaodeng, @slfan1989, @hebiao064, @fishbone, @zcin, @GeneDer, @shrekris-anyscale, @kira-lin, @chappidim, @raulchen, @c21, @WeichenXu123, @marian-code, @bveeramani, @can-anyscale, @mjd3, @justinvyu, @jackhumphries, @Bye-legumes, @ashione, @alanwguo, @Dreamsorcerer, @KamenShah, @jjyao, @omatthew98, @autolisis, @Superskyyy, @stephanie-wang, @simonsays1980, @davidxia, @angelinalg, @architkulkarni, @chris-ray-zhang, @kevin85421, @rynewang, @peytondmurray, @zhangyilun, @khluu, @matthewdeng, @ruisearch42, @pcmoritz, @mattip, @jerome-habana, @alexeykudinkin
