Ray-2.3.0
Release Highlights
- The streaming backend for Ray Datasets is in Developer Preview. It is designed to enable terabyte-scale ML inference and training workloads. Please contact us if you'd like to try it out on your workload, or you can find the preview guide here: https://docs.google.com/document/d/1BXd1cGexDnqHAIVoxTnV3BV0sklO9UXqPwSdHukExhY/edit
- New Information Architecture (Beta): We’ve restructured the Ray dashboard to be organized around user personas and workflows instead of entities.
- Ray-on-Spark is now available (Preview): You can launch Ray clusters on Databricks and Spark clusters and run Ray applications there. Check out the documentation to learn more.
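As a quick illustration of the Ray-on-Spark preview, here is a minimal sketch. It assumes the code runs on the Spark driver (e.g., a Databricks notebook) and that the preview `ray.util.spark` API exposes `setup_ray_cluster`/`shutdown_ray_cluster` with a `num_worker_nodes` argument; treat the values as illustrative.

```python
# Minimal Ray-on-Spark sketch (Preview); assumes ray.util.spark is available
# and that this code runs on the Spark driver.
import ray
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

# Start Ray worker nodes on the Spark cluster's executors.
setup_ray_cluster(num_worker_nodes=2)

ray.init()  # Connect to the Ray-on-Spark cluster that was just started.

@ray.remote
def square(x):
    return x * x

print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]

shutdown_ray_cluster()  # Release the Spark resources when done.
```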
Ray Libraries
Ray AIR
💫Enhancements:
- Add `set_preprocessor` method to `Checkpoint` (#31721)
- Rename Keras callback and its parameters to be more descriptive (#31627)
- Deprecate MlflowTrainableMixin in favor of setup_mlflow() function (#31295)
- W&B
  - Have train_loop_config logged as a config (#31901)
  - Allow users to exclude config values with WandbLoggerCallback (#31624)
  - Rename WandB `save_checkpoints` to `upload_checkpoints` (#31582)
  - Add hook to get project/group for W&B integration (#31035, #31643)
  - Use Ray actors instead of multiprocessing for WandbLoggerCallback (#30847)
  - Update `WandbLoggerCallback` example (#31625)
- Predictor
- Checkpoints
🔨 Fixes:
- Fix and improve support for HDFS remote storage. (#31940)
- Use specified Preprocessor configs when using stream API. (#31725)
- Support nested Chain in BatchPredictor (#31407)
📖Documentation:
- Restructure API References (#32535)
- API Deprecations (#31777, #31867)
- Various fixes to docstrings, documentation, and examples (#30782, #30791)
🏗 Architecture refactoring:
- Use NodeAffinitySchedulingPolicy for scheduling (#32016)
- Internal resource management refactor (#30777, #30016)
Ray Data Processing
🎉 New Features:
- Lazy execution by default (#31286)
- Introduce streaming execution backend (#31579)
- Introduce DatasetIterator (#31470); see the sketch after this list
- Add per-epoch preprocessor (#31739)
- Add TorchVisionPreprocessor (#30578)
- Persist Dataset statistics automatically to log file (#30557)
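A minimal sketch of the new DatasetIterator, assuming `Dataset.iterator()` is the entry point and using a tiny synthetic dataset for illustration:

```python
# Hedged sketch: iterate over a Dataset via the new DatasetIterator.
import ray

ds = ray.data.range(1000)

# A DatasetIterator can be handed to training workers and re-iterated per epoch.
it = ds.iterator()

for epoch in range(2):
    for batch in it.iter_batches(batch_size=128):
        pass  # replace with the actual training step
```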
💫Enhancements:
- Async batch fetching for map_batches (#31576)
- Add informative progress bar names to map_batches (#31526)
- Provide a size-in-bytes estimate for MongoDB blocks (#31930)
- Add support for dynamic block splitting to actor pool (#31715)
- Improve str/repr of Dataset to include execution plan (#31604)
- Deal with nested Chain in BatchPredictor (#31407)
- Allow MultiHotEncoder to encode arrays (#31365)
- Allow specifying batch_size when reading Parquet files (#31165)
- Add zero-copy batch API for `ds.map_batches()` (#30000); see the sketch after this list
- Text dataset should save texts in Arrow Table format (#30963)
- Return ndarray dicts for single-column tabular datasets (#30448)
- Execute randomize_block_order eagerly if it's the last stage for ds.schema() (#30804)
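A minimal sketch of the zero-copy batch API mentioned above, assuming the `zero_copy_batch` flag on `map_batches()` and the NumPy batch format; the data and UDF are illustrative:

```python
# Hedged sketch: zero-copy batches avoid copying blocks into the UDF, so the
# provided arrays may be read-only and should not be mutated in place.
import ray

ds = ray.data.from_items([{"value": float(i)} for i in range(8)])

def add_one(batch):
    # Write into a new array rather than mutating the (possibly read-only) input.
    return {"value": batch["value"] + 1}

out = ds.map_batches(add_one, batch_format="numpy", zero_copy_batch=True)
print(out.take(3))
```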
🔨 Fixes:
- Don't drop first dataset when peeking DatasetPipeline (#31513)
- Handle np.array(dtype=object) constructor for ragged ndarrays (#31670)
- Emit warning when starting Dataset execution with no CPU resources available (#31574)
- Fix a bug where input blocks were eagerly cleared (#31459)
- Fix Imputer failing with categorical dtype (#31435)
- Fix schema unification for Datasets with ragged Arrow arrays (#31076)
- Fix Discretizers transforming ignored cols (#31404)
- Fix to_tf when the input feature_columns is a list. (#31228)
- Raise an error message if the user calls `Dataset.__iter__` (#30575)
📖Documentation:
Ray Train
🎉 New Features:
- Add option for per-epoch preprocessor (#31739)
💫Enhancements:
- Change default `NCCL_SOCKET_IFNAME` to blacklist `veth` (#31824)
- Introduce DatasetIterator for bulk and streaming ingest (#31470)
- Clarify which `RunConfig` is used when there are multiple places to specify it (#31959)
- Change `ScalingConfig` to be optional for `DataParallelTrainer`s if already in Tuner `param_space` (#30920); see the sketch below
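A minimal sketch of the optional `ScalingConfig` behavior: the scaling configuration lives only in the Tuner `param_space`, so the trainer itself can omit it. The trainer choice and search values here are illustrative.

```python
# Hedged sketch: ScalingConfig supplied via the Tuner param_space instead of
# on the DataParallelTrainer itself.
from ray import tune
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    pass  # training logic goes here

trainer = TorchTrainer(train_loop_per_worker)  # note: no scaling_config here

tuner = tune.Tuner(
    trainer,
    param_space={
        # Searched over by Tune; applied to the trainer at fit time.
        "scaling_config": ScalingConfig(num_workers=tune.grid_search([2, 4])),
    },
)
results = tuner.fit()
```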
🔨 Fixes:
- Use specified `Preprocessor` configs when using stream API. (#31725)
- Fix off-by-one AIR Trainer checkpoint ID indexing on restore (#31423)
- Force GBDTTrainer to use distributed loading for Ray Datasets (#31079)
- Fix bad case in ScalingConfig->RayParams (#30977)
- Don't raise TuneError on `fail_fast="raise"` (#30817)
- Report only once in `SklearnTrainer` (#30593)
- Ensure GBDT PGFs match passed ScalingConfig (#30470)
📖Documentation:
- Restructure API References (#32535)
- Remove Ray Client references from Train docs/examples (#32321)
- Various fixes to docstrings, documentation, and examples (#29463, #30492, #30543, #30571, #30782, #31692, #31735)
🏗 Architecture refactoring:
- API Deprecations (#31763)
Ray Tune
💫Enhancements:
- Improve trainable serialization error (#31070)
- Add support for Nevergrad optimizer with extra parameters (#31015)
- Add timeout for experiment checkpoint syncing to cloud (#30855)
- Move `validate_upload_dir` to Syncer (#30869)
- Enable experiment restore from a moved cloud URI (#31669)
- Save and restore stateful callbacks as part of experiment checkpoint (#31957)
🔨 Fixes:
- Do not default to reuse_actors=True when mixins are used (#31999)
- Only keep cached actors if search has not ended (#31974)
- Fix best trial in ProgressReporter with nan (#31276)
- Make ResultGrid return cloud checkpoints (#31437)
- Wait for final experiment checkpoint sync to finish (#31131)
- Fix CheckpointConfig validation for function trainables (#31255)
- Fix checkpoint directory assignment for new checkpoints created after restoring a function trainable (#31231)
- Fix `AxSearch` save and nan/inf result handling (#31147)
- Fix `AxSearch` search space conversion for fixed list hyperparameters (#31088)
- Restore searcher and scheduler properly on `Tuner.restore` (#30893)
- Fix progress reporter `sort_by_metric` with nested metrics (#30906)
- Don't raise TuneError on `fail_fast="raise"` (#30817)
- Fix duplicate printing when trial is done (#30597)
📖Documentation:
- Restructure API references (#32449)
- Remove Ray Client references from Tune docs/examples (#32321)
- Various fixes to docstrings, documentation, and examples (#29581, #30782, #30571, #31045, #31793, #32505)
🏗 Architecture refactoring:
- Deprecate passing a custom trial executor (#31792)
- Move signal handling into separate method (#31004)
- Update staged resources in a fixed counter for faster lookup (#32087)
- Rename `overwrite_trainable` argument in `Tuner.restore` to `trainable` (#32059)
Ray Serve
🎉 New Features:
- Serve Python API now supports multiple applications (#31589)
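A minimal sketch of the multi-application Python API; it assumes `serve.run()` accepts a per-application `name` and `route_prefix`, and the deployments themselves are illustrative:

```python
# Hedged sketch: two independent Serve applications running side by side.
from ray import serve

@serve.deployment
class Hello:
    def __call__(self, request):
        return "hello"

@serve.deployment
class World:
    def __call__(self, request):
        return "world"

# Each application gets its own name and HTTP route prefix.
serve.run(Hello.bind(), name="app_hello", route_prefix="/hello")
serve.run(World.bind(), name="app_world", route_prefix="/world")
```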
💫Enhancements:
- Add exponential backoff when retrying replicas (#31436)
- Enable Log Rotation on Serve (#31844)
- Use tasks/futures for asyncio.wait (#31608)
- Change target_num_ongoing_requests_per_replica to positive float (#31378)
🔨 Fixes:
- Upgrade deprecated calls (#31839)
- Change Gradio integration to take a builder function to avoid serialization issues (#31619)
- Add initial health check before marking a replica as RUNNING (#31189)
📖Documentation:
RLlib
🎉 New Features:
- Gymnasium is now supported; see the sketch after this list. (Notes)
- Connectors are now activated by default (#31693, #30388, #31618, #31444, #31092)
- Contribution of LeelaChessZero algorithm for playing chess in a MultiAgent env. (#31480)
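Since environments now follow the Gymnasium API (an `(obs, info)` tuple from `reset()` and a five-tuple from `step()`), a minimal custom-env sketch looks roughly like the following; the environment logic is purely illustrative:

```python
# Hedged sketch: a custom env written against the gymnasium API that RLlib expects.
import gymnasium as gym
import numpy as np

class ConstantEnv(gym.Env):
    def __init__(self, config=None):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self._steps = 0

    def reset(self, *, seed=None, options=None):
        self._steps = 0
        return np.zeros(1, dtype=np.float32), {}  # (obs, info)

    def step(self, action):
        self._steps += 1
        terminated = self._steps >= 10
        truncated = False
        # (obs, reward, terminated, truncated, info)
        return np.zeros(1, dtype=np.float32), 1.0, terminated, truncated, {}

# The env class can then be passed to an RLlib config, e.g.:
#   from ray.rllib.algorithms.ppo import PPOConfig
#   config = PPOConfig().environment(ConstantEnv)
```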
💫Enhancements:
- [RLlib] Error out if action_dict is empty in MultiAgentEnv. (#32129)
- [RLlib] Upgrade tf eager code to no longer use `experimental_relax_shapes` (but `reduce_retracing` instead). (#29214)
- [RLlib] Reduce SampleBatch counting complexity (#30936)
- [RLlib] Use PyTorch vectorized max() and sum() in `SampleBatch.__init__` when possible (#28388)
- [RLlib] Support multi-gpu CQL for torch (tf already supported). (#31466)
- [RLlib] Introduce IMPALA off_policyness test with GPU (#31485)
- [RLlib] Properly serialize and restore StateBufferConnector states for policy stashing (#31372)
- [RLlib] Clean up deprecated concat_samples calls (#31391)
- [RLlib] Better support MultiBinary spaces by treating Tuples as superset of them in ComplexInputNet. (#28900)
- [RLlib] Add backward compatibility to MeanStdFilter to restore from older checkpoints. (#30439)
- [RLlib] Clean up some signatures for compute_actions. (#31241)
- [RLlib] Simplify logging configuration. (#30863)
- [RLlib] Remove native Keras Models. (#30986)
- [RLlib] Convert PolicySpec to a readable format when converting to_dict(). (#31146)
- [RLlib] Issue 30394: Add proper `__str__()` method to PolicyMap. (#31098)
- [RLlib] Issue 30840: Option to only checkpoint policies that are trainable. (#31133)
- [RLlib] Deprecate (delete) `contrib` folder. (#30992)
- [RLlib] Better behavior if user does not specify a stopping condition in the RLlib CLI. (#31078)
- [RLlib] PolicyMap LRU cache enhancements: Swap out policies (instead of GC'ing and recreating) + use Ray object store (instead of file system). (#29513)
- [RLlib] `AlgorithmConfig.overrides()` to replace `multiagent->policies->config` and `evaluation_config` dicts (see the sketch after this list). (#30879)
- [RLlib] `deprecation_warning(.., error=True)` should raise `ValueError`, not `DeprecationWarning`. (#30255)
- [RLlib] Add `gym.spaces.Text` serialization. (#30794)
- [RLlib] Convert `MultiAgentBatch` to `SampleBatch` in offline_rl.py. (#30668)
- [RLlib; Tune] Make `Algorithm.train()` return a Tune-style config dict (instead of an AlgorithmConfig object). (#30591)
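A minimal sketch of the `AlgorithmConfig.overrides()` usage referenced above, with PPO and CartPole used purely for illustration:

```python
# Hedged sketch: use AlgorithmConfig.overrides() in place of a raw
# evaluation_config dict.
from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .evaluation(
        evaluation_interval=1,
        # Previously a plain dict; now an overrides object.
        evaluation_config=AlgorithmConfig.overrides(explore=False),
    )
)
algo = config.build()
result = algo.train()
```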
🔨 Fixes:
- [RLlib] Fix waterworld example and test (#32117)
- [RLlib] Change Waterworld v3 to v4 and reinstate indep. MARL test case w/ pettingzoo. (#31820)
- [RLlib] Fix OPE checkpointing. Save method name in configuration dict. (#31778)
- [RLlib] Fix worker state restoration. (#31644)
- [RLlib] Replace ordinary pygame imports by `try_import_..()`. (#31332)
- [RLlib] Remove crude VR checks in agent collector. (#31564)
- [RLlib] Fixed the 'RestoreWeightsCallback' example script. (#31601)
- [RLlib] Issue 28428: QMix not working w/ GPUs. (#31299)
- [RLlib] Fix using yaml files with empty stopping conditions. (#31363)
- [RLlib] Issue 31174: Move all checks into AlgorithmConfig.validate() (even simple ones) to avoid errors when using tune hyperopt objects. (#31396)
- [RLlib] Fix `tensorflow_probability` imports. (#31331)
- [RLlib] Issue 31323: BC/MARWIL/CQL do work with multi-GPU (but config validation prevents them from running in this mode). (#31393)
- [RLlib] Issue 28849: DT fails with num_gpus=1. (#31297)
- [RLlib] Fix `PolicyMap.__del__()` to also remove a deleted policy ID from the internal deque. (#31388)
- [RLlib] Use `get_model_v2()` instead of `get_model()` with MADDPG. (#30905)
- [RLlib] Policy mapping fn can not be called with keyword arguments. (#31141)
- [RLlib] Issue 30213: Appending RolloutMetrics to sampler outputs should happen after(!) all callbacks (such that custom metrics for last obs are still included). (#31102)
- [RLlib] Make convert_to_torch tensor adhere to docstring. (#31095)
- [RLlib] Fix convert to torch tensor (#31023)
- [RLlib] Issue 30221: random policy does not handle nested spaces. (#31025)
- [RLlib] Fix crashing remote envs example (#30562)
- [RLlib] Recursively look up the original space from obs_space (#30602)
📖Documentation:
- [RLlib; docs] Change links and references in code and docs to "Farama foundation's gymnasium" (from "OpenAI gym"). (#32061)
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- Task Events Backend: Ray aggregates all submitted task information to provide better observability (#31840, #31761, #31278, #31247, #31316, #30934, #30979, #31207, #30867, #30829, #31524, #32157). This backs features like the task state API, the advanced progress bar, and Ray timeline.
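A short sketch of what this backend powers, assuming the experimental state API module path used in this release (`ray.experimental.state.api`); the task itself is illustrative:

```python
# Hedged sketch: inspect submitted tasks via the (experimental) state API.
import ray
from ray.experimental.state.api import list_tasks, summarize_tasks

ray.init()

@ray.remote
def work(i):
    return i

ray.get([work.remote(i) for i in range(8)])

# List recently submitted tasks and their states.
for task in list_tasks(limit=5):
    print(task)

# Aggregated view, similar to `ray summary tasks` on the CLI.
print(summarize_tasks())
```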
💫Enhancements:
- Remote generators now work for Ray actors and Ray Client (#31700, #31710).
- Revamp default scheduling strategy, improving worker startup performance by up to 8x for embarrassingly parallel workloads (#31934, #31868).
- Clean up worker code and allow workers to lazily bind to jobs (#31836, #31846, #30349, #31375).
- A single Ray cluster can now scale up to 2000 nodes and 20k actors (#32131, #30131, #31939, #30166, #30460, #30563).
- Out-of-memory prevention is now GA, with more robust worker-killing policies and a better user experience (#32217, #32361, #32219, #31768, #32107, #31976, #31272, #31509, #31230).
🔨 Fixes:
- Improve garbage collection upon job termination (#32127, #31155)
- Fix opencensus protobuf bug (#31632)
- Support python 3.10 for runtime_env conda (#30970)
- Fix crashes and memory leaks (#31640, #30476, #31488, #31917, #30761, #31018)
📖Documentation:
Ray Clusters
🎉 New Features:
💫Enhancements:
- [observability] Better memory formatting for `ray status` and autoscaler (#32337)
- [autoscaler] Add flag to disable periodic cluster status log. (#31869)
🔨 Fixes:
- [observability][autoscaler] Ensure pending nodes is reset to 0 after scaling (#32085)
- Make ~/.bashrc optional in cluster launcher commands (#32393)
📖Documentation:
- Improvements to job submission
- Remove references to Ray Client
Dashboard
🎉 New Features:
- New Information Architecture (beta): We’ve restructured the Ray dashboard to be organized around user personas and workflows instead of entities. For developers, the jobs and actors tab will be most useful. For infrastructure engineers, the cluster tab may be more valuable.
- Advanced progress bar: A task visualization that lets you see the progress of all your Ray tasks.
- Timeline view: We’ve added a button to download detailed timeline data about your Ray job. You can then follow a link and use the Perfetto open-source visualization tool to view the timeline data.
- More metadata tables. You can now see placement groups, tasks, actors, and other information related to your jobs.
📖Documentation:
- We’ve restructured the documentation to make the dashboard documentation more prominent.
- We’ve improved the documentation around setting up Prometheus and Grafana for metrics.
Many thanks to all those who contributed to this release!
@minerharry, @scottsun94, @iycheng, @DmitriGekhtman, @jbedorf, @krfricke, @simonsays1980, @eltociear, @xwjiang2010, @ArturNiederfahrenhorst, @richardliaw, @avnishn, @WeichenXu123, @Capiru, @davidxia, @andreapiso, @amogkam, @sven1977, @scottjlee, @kylehh, @yhna940, @rickyyx, @sihanwang41, @n30111, @Yard1, @sriram-anyscale, @Emiyalzn, @simran-2797, @cadedaniel, @harelwa, @ijrsvt, @clarng, @pabloem, @bveeramani, @lukehsiao, @angelinalg, @dmatrix, @sijieamoy, @simon-mo, @jbesomi, @YQ-Wang, @larrylian, @c21, @AndreKuu, @maxpumperla, @architkulkarni, @wuisawesome, @justinvyu, @zhe-thoughts, @matthewdeng, @peytondmurray, @kevin85421, @tianyicui-tsy, @cassidylaidlaw, @gvspraveen, @scv119, @kyuyeonpooh, @Siraj-Qazi, @jovany-wang, @ericl, @shrekris-anyscale, @Catch-Bull, @jianoaix, @christy, @MisterLin1995, @kouroshHakha, @pcmoritz, @csko, @gjoliver, @clarkzinzow, @SongGuyang, @ckw017, @ddelange, @alanwguo, @Dhul-Husni, @Rohan138, @rkooo567, @fzyzcjy, @chaokunyang, @0x2b3bfa0, @zoltan-fedor, @Chong-Li, @crypdick, @jjyao, @emmyscode, @stephanie-wang, @starpit, @smorad, @nikitavemuri, @zcin, @tbukic, @ayushthe1, @mattip