Ray-2.2.0
Release Highlights
Ray 2.2 is a stability-focused release, featuring improvements across many Ray components.
- Ray Jobs API is now GA. The Ray Jobs API allows you to submit locally developed applications to a remote Ray cluster for execution, simplifying the experience of packaging, deploying, and managing a Ray application (see the sketch after this list).
- Ray Dashboard has received a number of improvements, such as the ability to see CPU flame graphs of your Ray workers and new metrics for memory usage.
- The Out-Of-Memory (OOM) Monitor is now enabled by default. This will increase the stability of memory-intensive applications on top of Ray.
- [Ray Data] Many users have reported out-of-memory or performance issues when input files are too large. In this release, we're enabling dynamic block splitting by default, which addresses these issues by avoiding holding too much data in memory at once.
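As a quick illustration of the now-GA Jobs API, here is a minimal sketch of submitting a locally developed script to a remote cluster; the dashboard address, script name, and working directory are placeholders.

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # address of the Ray dashboard
job_id = client.submit_job(
    entrypoint="python my_script.py",      # command executed on the cluster
    runtime_env={"working_dir": "./"},     # package and ship the local project
)
print(client.get_job_status(job_id))
```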
Ray Libraries
Ray AIR
🎉 New Features:
- Add a NumPy first path for Torch and TensorFlow Predictors (#28917)
💫Enhancements:
- Suppress "NumPy array is not writable" error in torch conversion (#29808)
- Add node rank and local world size info to session (#29919)
🔨 Fixes:
- Fix MLflow database integrity error (#29794)
- Fix ResourceChangingScheduler dropping PlacementGroupFactory args (#30304)
- Fix bug passing 'raise' to FailureConfig (#30814)
- Fix reserved CPU warning if no CPUs are used (#30598)
📖Documentation:
- Fix examples and docs to specify batch_format in BatchMapper (#30438)
🏗 Architecture refactoring:
- Deprecate Wandb mixin (#29828)
- Deprecate Checkpoint.to_object_ref and Checkpoint.from_object_ref (#30365)
Ray Data Processing
🎉 New Features:
- Support all PyArrow versions released by Apache Arrow (#29993, #29999)
- Add `select_columns()` to select a subset of columns (#29081); see the short sketch after this list
- Add `write_tfrecords()` to write TFRecord files (#29448)
- Support MongoDB data source (#28550)
- Enable dynamic block splitting by default (#30284)
- Add `from_torch()` to create a dataset from a Torch dataset (#29588)
- Add `from_tf()` to create a dataset from a TensorFlow dataset (#29591)
- Allow setting `batch_size` in `BatchMapper` (#29193)
- Support read/write from/to local node file system (#29565)
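A minimal sketch of two of the new Dataset APIs, `select_columns()` and `write_tfrecords()`; the toy data and the output path are illustrative only.

```python
import ray

# Toy in-memory dataset; real workloads would typically use a read_* API instead.
ds = ray.data.from_items([{"id": i, "value": i * 2, "extra": "unused"} for i in range(8)])

# Keep only a subset of columns (#29081).
ds = ds.select_columns(["id", "value"])

# Write the (integer-valued) records out as TFRecord files (#29448).
ds.write_tfrecords("/tmp/ray_tfrecords")
```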
💫Enhancements:
- Add `include_paths` in `read_images()` to return the image file path (#30007); see the sketch after this list
- Print out Dataset statistics automatically after execution (#29876)
- Cast tensor extension type to opaque object dtype in `to_pandas()` and `to_dask()` (#29417)
- Encode number of dimensions in variable-shaped tensor extension type (#29281)
- Fuse AllToAllStage and OneToOneStage with compatible remote args (#29561)
- Change `read_tfrecords()` output from Pandas to Arrow format (#30390)
- Handle all Ray errors in task compute strategy (#30696)
- Allow nested Chain preprocessors (#29706)
- Warn user if columns are missing and support `str` exclude in `Concatenator` (#29443)
- Raise ValueError if preprocessor column doesn't exist (#29643)
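A hedged sketch of `include_paths` in `read_images()`; the directory is a placeholder and the exact output column names may differ.

```python
import ray

# Read a directory of images and keep each image's originating file path.
ds = ray.data.read_images("/tmp/images", include_paths=True)
ds.show(1)  # each record carries the decoded image plus its source path
```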
🔨 Fixes:
- Support custom resource with remote args for `random_shuffle()` (#29276)
- Support custom resource with remote args for `random_shuffle_each_window()` (#29482)
- Add PublicAPI annotation to preprocessors (#29434)
- Tensor extension column concatenation fixes (#29479)
- Fix `iter_batches()` to not return empty batches (#29638)
- Change `map_batches()` to fetch input blocks on-demand (#29289)
- Change `take_all()` to not accept a limit argument (#29746)
- Convert between block and batch correctly for `map_groups()` (#30172)
- Fix `stats()` call causing Dataset schema to be unset (#29635)
- Raise error when `batch_format` is not specified for `BatchMapper` (#30366)
- Fix ndarray representation of single-element ragged tensor slices (#30514)
📖Documentation:
- Improve `map_batches()` documentation about the execution model and UDF pickle-ability requirement (#29233)
- Improve `to_tf()` docstring (#29464)
Ray Train
🎉 New Features:
💫Enhancements:
🔨 Fixes:
- Propagate DatasetContext to training workers (#29192)
- Show correct error message on training failure (#29908)
- Fix prepare_data_loader with enable_reproducibility (#30266)
- Fix usage of NCCL_BLOCKING_WAIT (#29562)
📖Documentation:
- Deduplicate Train examples (#29667)
🏗 Architecture refactoring:
- Hard deprecate train.report (#29613)
- Remove deprecated Train modules (#29960)
- Deprecate old `prepare_model` DDP args (#30364)
Ray Tune
🎉 New Features:
- Make `Tuner.restore` work with relative experiment paths (#30363); see the sketch after this list
- `Tuner.restore` from a local directory that has moved (#29920)
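A minimal sketch of restoring a previous Tune run from its experiment directory; the path is a placeholder.

```python
from ray import tune

# Re-attach to an interrupted or finished experiment and continue it.
tuner = tune.Tuner.restore("~/ray_results/my_experiment")
results = tuner.fit()
```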
💫Enhancements:
- `with_resources` takes in a `ScalingConfig` (#30259); see the sketch after this list
- Keep resource specifications when nesting `with_resources` in `with_parameters` (#29740)
- Add `trial_name_creator` and `trial_dirname_creator` to `TuneConfig` (#30123)
- Add option to not override the working directory (#29258)
- Only convert a `BaseTrainer` to `Trainable` once in the Tuner (#30355)
- Dynamically identify PyTorch Lightning Callback hooks (#30045)
- Make `remote_checkpoint_dir` work with query strings (#30125)
- Make cloud checkpointing retry configurable (#30111)
- Sync experiment-checkpoints more often (#30187)
- Update generate_id algorithm (#29900)
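A hedged sketch of passing a `ScalingConfig` to `tune.with_resources()` (#30259); the training function and resource numbers are illustrative only.

```python
from ray import tune
from ray.air import ScalingConfig, session


def train_fn(config):
    # Placeholder training loop that reports a single metric.
    session.report({"score": 1.0})


# Attach resource requirements via a ScalingConfig instead of a raw dict;
# this reserves resources for the trial as if it were a 2-worker trainer.
trainable = tune.with_resources(
    train_fn,
    ScalingConfig(trainer_resources={"CPU": 1}, num_workers=2, use_gpu=False),
)
tuner = tune.Tuner(trainable)
results = tuner.fit()
```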
🔨 Fixes:
- Catch SyncerCallback failure with dead node (#29438)
- Do not warn in BayesOpt w/ Uniform sampler (#30350)
- Fix `ResourceChangingScheduler` dropping PGF args (#30304)
- Fix Jupyter output with Ray Client and `Tuner` (#29956)
- Fix tests related to the `TUNE_ORIG_WORKING_DIR` env variable (#30134)
📖Documentation:
- Add user guide for analyzing results (using `ResultGrid` and `Result`) (#29072)
- Tune checkpointing and Tuner restore docfix (#29411)
- Fix and clean up PBT examples (#29060)
- Fix TrialTerminationReporter in docs (#29254)
🏗 Architecture refactoring:
- Remove hard deprecated SyncClient/Syncer (#30253)
- Deprecate Wandb mixin, move to `setup_wandb()` function (#29828)
Ray Serve
🎉 New Features:
💫Enhancements:
🔨 Fixes:
- Fix log format error (#28760)
- Inherit previous deployment num_replicas (#29686)
- Polish serve run deploy message (#29897)
- Remove usage of `get_event_loop` for Python 3.10 compatibility
RLlib
🎉 New Features:
- Fault-tolerant, elastic WorkerSets: An asynchronous Ray actor manager class is now used inside all of RLlib’s Algorithms, adding fully flexible fault tolerance to rollout workers and workers used for evaluation. If one or more workers (which are Ray actors) fail, e.g. due to a spot instance going down, the RLlib Algorithm will now flexibly wait it out and periodically try to recreate the failed workers. In the meantime, only the remaining healthy workers are used for sampling and evaluation. (#29938, #30118, #30334, #30252, #29703, #30183, #30327, #29953)
💫Enhancements:
- RLlib CLI: A new and enhanced RLlib command line interface (CLI) has been added, allowing for automatic downloading of example configuration files, Python-based config files (defining an AlgorithmConfig object to use), better interoperability between training and evaluation runs, and more. For a detailed overview of what has changed, check out the new CLI documentation. (#29204, #29459, #30526, #29661, #29972)
- Checkpoint overhaul: Algorithm checkpoints and Policy checkpoints are now more cohesive and transparent. All checkpoints are now characterized by a directory (with files and possibly sub-directories), rather than a single pickle file. Both Algorithm and Policy classes now have a utility static method (`from_checkpoint()`) for directly instantiating instances from a checkpoint directory, without knowing the original configuration used or any other information (having the checkpoint is sufficient); a short sketch follows this list. For a detailed overview, see here. (#28812, #29772, #29370, #29520, #29328)
- A new metric for APPO/IMPALA/PPO has been added that measures off-policy-ness: the difference between the number of grad-updates the sampler policy has received so far and the number of grad-updates the trained policy has received so far. (#29983)
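A hedged sketch of the new `from_checkpoint()` utility; the checkpoint directory is a placeholder.

```python
from ray.rllib.algorithms.algorithm import Algorithm

# Re-instantiate a fully configured algorithm directly from its checkpoint
# directory, without needing the original config.
algo = Algorithm.from_checkpoint("/tmp/my_rllib_checkpoint")
print(algo.train()["episode_reward_mean"])
```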
🏗 Architecture refactoring:
- AlgorithmConfig classes: All of RLlib’s Algorithms, RolloutWorkers, and other important classes now use AlgorithmConfig objects under the hood, instead of Python config dicts. It is no longer recommended (however, still supported) to create a new algorithm (or a Tune+RLlib experiment) using a Python dict as configuration; a short sketch of the new workflow follows this list. For more details on how to convert your scripts to the new AlgorithmConfig design, see here. (#29796, #30020, #29700, #29799, #30096, #29395, #29755, #30053, #29974, #29854, #29546, #30042, #29544, #30079, #30486, #30361)
- Major progress was made on the new Connector API and making sure it can be used (tentatively) with the “config.rollouts(enable_connectors=True)” flag. Will be fully supported, across all of RLlib’s algorithms, in Ray 2.3. (#30307, #30434, #30459, #30308, #30332, #30320, #30383, #30457, #30446, #30024, #29064, #30398, #29385, #30481, #30241, #30285, #30423, #30288, #30313, #30220, #30159)
- Progress was made on the upcoming RLModule/RLTrainer/RLOptimizer APIs. (#30135, #29600, #29599, #29449, #29642)
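A hedged sketch of the AlgorithmConfig-based workflow that replaces plain config dicts; the algorithm and settings shown are illustrative only.

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Build an Algorithm from a typed config object instead of a Python dict.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rollouts(num_rollout_workers=2)
    .framework("torch")
)
algo = config.build()
print(algo.train()["episode_reward_mean"])
```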
🔨 Fixes:
- Various bug fixes: #25925, #30279, #30478, #30461, #29867, #30099, #30185, #29222, #29227, #29494, #30257, #29798, #30176, #29648, #30331
📖Documentation:
- RLlib CLI, Checkpoint overhaul, AlgorithmConfigs
- Minor fixes: #29261, #29752
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- Out-of-memory monitor is now Beta and is enabled by default.
💫Enhancements:
- The Ray Jobs API has graduated from Beta to GA. This means Ray Jobs will maintain API backward compatibility.
- Run Ray job entrypoint commands (“driver scripts”) on worker nodes by specifying `entrypoint_num_cpus`, `entrypoint_num_gpus`, or `entrypoint_resources` (#28564, #28203); see the sketch after this list.
- (Beta) OpenAPI spec for the Ray Jobs REST API (#30417)
- Improved Ray health-checking mechanism. This reduces how often the GCS mistakenly marks raylets as failed when it is overloaded. (#29346, #29442, #29389, #29924)
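A hedged sketch of reserving resources for the job entrypoint ("driver script") itself, assuming the new keyword arguments are exposed on `JobSubmissionClient.submit_job()`; the address and script are placeholders.

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
client.submit_job(
    entrypoint="python heavy_driver.py",
    entrypoint_num_cpus=2,  # assumed kwarg: reserve 2 CPUs for the driver script itself
)
```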
🔨 Fixes:
- Various fixes for hanging / deadlocking (#29491, #29763, #30371, #30425)
- Set `OMP_NUM_THREADS` to the `num_cpus` required by the task/actor by default (#30496)
- Set workers as non-recyclable by default if a GPU is involved (#30061)
📖Documentation:
- General improvements of Ray Core docs, including design patterns and tasks.
Ray Clusters
💫Enhancements:
Dashboard
🎉 New Features:
- Additional improvements to the default metrics dashboard. We now have actor, placement group, and per-component memory usage breakdowns. You can see details in the docs.
- New profiling feature using py-spy under the hood. You can click buttons to see stack traces or CPU flame graphs of your workers.
- Autoscaler and job events are available from the dashboard. You can also access the same data using `ray list cluster-events`.
🔨 Fixes:
- Stability improvements to the dashboard
- The dashboard now works on large-scale clusters! It has been tested with 250 nodes and 10K+ actors (which matches the Ray scalability envelope).
- Smarter API fetching logic. When polling for new data, we now wait for the previous API request to finish before sending a new one.
- Fix agent memory leak and high CPU usage.
💫Enhancements:
- General improvements to the progress bar. You can now see progress bars for each task name if you drill into the job details.
- More metadata is available in the jobs and actors tables.
- There is now a feedback button embedded into the dashboard. Please submit any bug reports or suggestions!
Many thanks to all those who contributed to this release!
@shrekris-anyscale, @rickyyx, @scottjlee, @shogohida, @liuyang-my, @matthewdeng, @wjrforcyber, @linusbiostat, @clarkzinzow, @justinvyu, @zygi, @christy, @amogkam, @cool-RR, @jiaodong, @EvgeniiTitov, @jjyao, @ilee300a, @jianoaix, @rkooo567, @mattip, @maxpumperla, @ericl, @cadedaniel, @bveeramani, @rueian, @stephanie-wang, @lcipolina, @bparaj, @JoonHong-Kim, @avnishn, @tomsunelite, @larrylian, @alanwguo, @VishDev12, @c21, @dmatrix, @xwjiang2010, @thomasdesr, @tiangolo, @sokratisvas, @heyitsmui, @scv119, @pcmoritz, @bhavika, @yzs981130, @andraxin, @Chong-Li, @clarng, @acxz, @ckw017, @krfricke, @kouroshHakha, @sijieamoy, @iycheng, @gjoliver, @peytondmurray, @xcharleslin, @DmitriGekhtman, @andreichalapco, @vitrioil, @architkulkarni, @simon-mo, @ArturNiederfahrenhorst, @sihanwang41, @pabloem, @sven1977, @avivhaber, @wuisawesome, @jovany-wang, @Yard1