Ray-2.1.0
Release Highlights
- Ray AI Runtime (AIR):
- Better support for image-based workloads.
- Ray Datasets read_images() API for loading data.
- NumPy-based API for user-defined functions in Preprocessor.
- Ability to read TFRecord input.
- Ray Datasets read_tfrecords() API to read TFRecord files.
- Ray Serve:
- Add support for gRPC endpoint (alpha release). Instead of using an HTTP server, Ray Serve supports gRPC protocol and users can bring their own schema for their use case.
- RLlib:
- Introduce decision transformer (DT) algorithm.
- New hook for callbacks with on_episode_created().
- Learning rate schedule to SimpleQ and PG.
- Ray Core:
- Ray OOM prevention (alpha release).
- Support dynamic generators as task return values.
- Dashboard:
- Time series metrics support.
- Export default configuration files that can be used in Prometheus or Grafana instances.
- New progress bar in job detail view.
Ray Libraries
Ray AIR
💫Enhancements:
- Improve readability of training failure output (#27946, #28333, #29143)
- Auto-enable GPU for Predictors (#26549)
- Add ability to create TorchCheckpoint from state dict (#27970)
- Add ability to create TensorflowCheckpoint from saved model/h5 format (#28474)
- Add attribute to retrieve URI from Checkpoint (#28731)
- Add all allowable types to WandB Callback (#28888)
🔨 Fixes:
- Handle nested metrics properly as scoring attribute (#27715)
- Fix serializability of Checkpoints (#28387, #28895, #28935)
📖Documentation:
- Miscellaneous updates to documentation and examples (#28067, #28002, #28189, #28306, #28361, #28364, #28631, #28800)
🏗 Architecture refactoring:
- Deprecate Checkpoint.to_object_ref and Checkpoint.from_object_ref (#28318)
- Deprecate legacy train/tune functions in favor of Session (#28856)
Ray Data Processing
🎉 New Features:
- Add read_images (#29177)
- Add read_tfrecords (#28430)
- Add NumPy batch format to Preprocessor and BatchMapper (#28418)
- Ragged tensor extension type (#27625)
- Add KBinsDiscretizer Preprocessor (#28389)
💫Enhancements:
- Simplify to_tf interface (#29028)
- Add metadata override and inference in Dataset.to_dask() (#28625)
- Prune unused columns before aggregate (#28556)
- Add Dataset.default_batch_format (#28434)
- Add partitioning parameter to read_ functions (#28413)
- Deprecate "native" batch format in favor of "default" (#28489)
- Support None partition field name (#28417)
- Re-enable Parquet sampling and add progress bar (#28021)
- Cap the number of stats kept in StatsActor and purge in FIFO order if the limit exceeded (#27964)
- Customized serializer for Arrow JSON ParseOptions in read_json (#27911)
- Optimize groupby/mapgroups performance (#27805)
- Improve size estimation of image folder data source (#27219)
- Use detached lifetime for stats actor (#25271)
- Pin _StatsActor to the driver node (#27765)
- Better error message for partition filtering if no file found (#27353)
- Make Concatenator deterministic (#27575)
- Change FeatureHasher input schema to expect token counts (#27523)
- Avoid unnecessary reads when truncating a dataset with ds.limit() (#27343)
- Hide tensor extension from UDFs (#27019)
- Add repr to AIR classes (#27006)
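To illustrate the FIFO purge described for the StatsActor cap (#27964), here is a minimal sketch in plain Python. It is not Ray's implementation: the class name BoundedStatsStore, its methods, and the example keys are hypothetical. It only shows how a capped store can evict its oldest entries once the limit is exceeded.

```python
from collections import OrderedDict


class BoundedStatsStore:
    """Hypothetical capped store; NOT Ray's StatsActor implementation."""

    def __init__(self, max_stats):
        if max_stats < 1:
            raise ValueError("max_stats must be at least 1")
        self.max_stats = max_stats
        self._stats = OrderedDict()  # remembers insertion order

    def record(self, key, value):
        # Updating an existing key keeps its original position in the queue.
        self._stats[key] = value
        # Evict the oldest entries (first inserted) while over the cap.
        while len(self._stats) > self.max_stats:
            self._stats.popitem(last=False)

    def keys(self):
        return list(self._stats)


store = BoundedStatsStore(max_stats=2)
for dataset_id in ("ds-1", "ds-2", "ds-3"):
    store.record(dataset_id, {"rows": 100})
print(store.keys())  # ['ds-2', 'ds-3']
```

Evicting by insertion order keeps memory bounded without tracking access times, at the cost of dropping the oldest stats even if they are still of interest.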
🔨 Fixes:
- Add upper bound to pyarrow version check (#29674) (#29744)
- Fix map_groups to work with different output type (#29184)
- Make read_csv not filter out files by default (#29032)
- Check columns when adding rows to TableBlockBuilder (#29020)
- Fix the peak memory usage calculation (#28419)
- Change sampling to use same API as read Parquet (#28258)
- Fix column assignment in Concatenator for Pandas 1.2. (#27531)
- Do partition filtering in reader constructor (#27156)
- Fix split ownership (#27149)
📖Documentation:
- Clarify dataset transformation. (#28482)
- Update map_batches documentation (#28435)
- Improve docstring and doctest for read_parquet (#28488)
- Activate dataset doctests (#28395)
- Document using a different separator for read_csv (#27850)
- Convert custom datetime column when reading a CSV file (#27854)
- Improve preprocessor documentation (#27215)
- Improve limit() and take() docstrings (#27367)
- Reorganize the tensor data support docs (#26952)
- Fix nyc_taxi_basic_processing notebook (#26983)
Ray Train
🎉 New Features:
- Add FullyShardedDataParallel support to TorchTrainer (#28096)
💫Enhancements:
- Add rich notebook repr for DataParallelTrainer (#26335)
- Fast fail if training loop raises an error on any worker (#28314)
- Use torch.encode_data with HorovodTrainer when torch is imported (#28440)
- Automatically set NCCL_SOCKET_IFNAME to use ethernet (#28633)
- Don't add Trainer resources when running on Colab (#28822)
- Support large checkpoints and other arguments (#28826)
🔨 Fixes:
- Fix and improve HuggingFaceTrainer (#27875, #28154, #28170, #28308, #28052)
- Maintain dtype info in LightGBMPredictor (#28673)
- Fix prepare_model (#29104)
- Fix train.torch.get_device() (#28659)
📖Documentation:
- Clarify LGBM/XGB Trainer documentation (#28122)
- Improve Hugging Face notebook example (#28121)
- Update Train API reference and docs (#28192)
- Mention FSDP in HuggingFaceTrainer docs (#28217)
🏗 Architecture refactoring:
- Improve Trainer modularity for extensibility (#28650)
Ray Tune
🎉 New Features:
- Add Tuner.get_results() to retrieve results after restore (#29083)
💫Enhancements:
- Exclude files in sync_dir_between_nodes, exclude temporary checkpoints (#27174)
- Add rich notebook output for Tune progress updates (#26263)
- Add logdir to W&B run config (#28454)
- Improve readability for long column names in table output (#28764)
- Add functionality to recover from latest available checkpoint (#29099)
- Add retry logic for restoring trials (#29086)
🔨 Fixes:
- Re-enable progress metric detection (#28130)
- Add timeout to retry_fn to catch hanging syncs (#28155)
- Correct PB2’s beta_t parameter implementation (#28342)
- Ignore directory exists errors to tackle race conditions (#28401)
- Correctly overwrite files on restore (#28404)
- Disable pytorch-lightning multiprocessing per default (#28335)
- Raise error if scheduling an empty PlacementGroupFactory (#28445)
- Fix trial cleanup after x seconds, set default to 600 (#28449)
- Fix trial checkpoint syncing after recovery from other node (#28470)
- Catch empty hyperopt search space, raise better Tuner error message (#28503)
- Fix and optimize sample search algorithm quantization logic (#28187)
- Support tune.with_resources for class methods (#28596)
- Maintain consistent Trial/TrialRunner state when pausing and resuming trial with PBT (#28511)
- Raise better error for incompatible gcsfs version (#28772)
- Ensure that exploited in-memory checkpoint is used by trial with PBT (#28509)
- Fix Tune checkpoint tracking for minimizing metrics (#29145)
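Several of the fixes above concern retries and hanging syncs (#28155, #29086). The sketch below shows one general way to combine a retry loop with a per-attempt timeout, using only the standard library. It is an illustration under assumptions, not Tune's retry_fn: the function name retry_with_timeout and its parameters are hypothetical.

```python
import threading
import time


def retry_with_timeout(fn, num_retries=3, sleep_time=0.0, timeout=None):
    """Hypothetical helper: call fn up to num_retries times.

    An attempt counts as failed if fn raises or does not finish within
    `timeout` seconds. Returns True on the first success, else False.
    """

    def _run(result):
        try:
            fn()
            result["ok"] = True
        except Exception:
            result["ok"] = False

    for _ in range(num_retries):
        result = {}  # fresh per attempt, so a late thread cannot affect a later one
        worker = threading.Thread(target=_run, args=(result,), daemon=True)
        worker.start()
        worker.join(timeout=timeout)  # stop waiting once the timeout elapses
        if result.get("ok"):
            return True
        time.sleep(sleep_time)
    return False


attempts = []


def flaky_sync():
    # Fails twice, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")


print(retry_with_timeout(flaky_sync, num_retries=3, timeout=1.0))  # True
```

A daemon thread is used so that an attempt that never returns does not block interpreter shutdown. The hanging call itself is abandoned, not cancelled.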
📖Documentation:
- Miscellaneous documentation fixes (#27117, #28131, #28210, #28400, #28068, #28809)
- Add documentation around trial/experiment checkpoint (#28303)
- Add basic parallel execution guide for Tune (#28677)
- Add example PBT notebook (#28519)
🏗 Architecture refactoring:
- Store SyncConfig and CheckpointConfig in Experiment and Trial (#29019)
Ray Serve
🎉 New Features:
- Added gRPC direct ingress support [alpha version] (#28175)
- Serve CLI can provide Kubernetes-formatted output (#28918)
- Serve CLI can provide user config output without default values (#28313)
💫Enhancements:
- Add more benchmarks:
- image objection with resnet50 model with image preprocessing (#29096)
- gRPC vs HTTP inference performance (#28175)
- Add health check metrics to reflect the replica health status (#29154)
🔨 Fixes:
- Fix memory leak issues during inference (#29187)
- Fix unexpected warning about omitted HTTP options when using the Serve CLI to start Ray Serve (#28257)
- Fix unexpected long poll exceptions (#28612)
📖Documentation:
- Add e2e fault tolerance instructions (#28721)
- Add Direct Ingress instructions (#29149)
- Various doc improvements on “dev workflow”, “custom resources”, “serve cli”, etc. (#29147, #28708, #28529, #28527)
RLlib
🎉 New Features:
- Decision Transformer (DT) Algorithm added (#27890, #27889, #27872, #27829).
- Callbacks now have a new hook on_episode_created(). (#28600)
- Added learning rate schedule to SimpleQ and PG. (#28381)
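A learning rate schedule such as the one added to SimpleQ and PG (#28381) can be pictured as piecewise-linear interpolation between [timestep, learning rate] pairs. The sketch below assumes that general form and is written in plain Python. The function name interpolate_lr and the example schedule are hypothetical, and this is not RLlib code.

```python
def interpolate_lr(schedule, timestep):
    """Return the learning rate at `timestep` (hypothetical helper).

    `schedule` is a list of [timestep, lr] pairs with strictly
    increasing timesteps. Values are clamped outside the range.
    """
    if not schedule:
        raise ValueError("schedule must contain at least one point")
    # Before the first point: use the first learning rate.
    if timestep <= schedule[0][0]:
        return schedule[0][1]
    # Inside a segment: interpolate linearly between its endpoints.
    for (t0, lr0), (t1, lr1) in zip(schedule, schedule[1:]):
        if t0 <= timestep <= t1:
            fraction = (timestep - t0) / (t1 - t0)
            return lr0 + fraction * (lr1 - lr0)
    # After the last point: keep the final learning rate.
    return schedule[-1][1]


# Decay from 1e-3 to 1e-4 over the first 1000 timesteps, then hold.
example_schedule = [[0, 1e-3], [1000, 1e-4]]
for step in (0, 500, 1000, 5000):
    print(step, interpolate_lr(example_schedule, step))
```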
💫Enhancements:
- Soft target network update is now supported by all off-policy algorithms (e.g., DQN, DDPG) (#28135)
- Stop RLlib from "silently" selecting atari preprocessors. (#29011)
- Improved offline RL and off-policy evaluation performance (#28837, #28834, #28593, #28420, #28136, #28013, #27356, #27161, #27451).
- Escalated old deprecation warnings to errors (#28807, #28795, #28733, #28697).
- Others: #27619, #27087.
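The soft target network update mentioned above (#28135) is commonly formulated as Polyak averaging, where the target weights move a fraction tau toward the online weights on each update. The sketch below assumes that standard formulation and uses plain Python lists. The function name soft_update and its arguments are hypothetical, not RLlib API.

```python
def soft_update(target_weights, online_weights, tau):
    """Blend online weights into target weights (hypothetical helper).

    new_target = tau * online + (1 - tau) * target
    tau = 1.0 is a hard copy; tau = 0.0 leaves the target unchanged.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must be in [0, 1]")
    if len(target_weights) != len(online_weights):
        raise ValueError("weight lists must have the same length")
    return [
        tau * online + (1.0 - tau) * target
        for target, online in zip(target_weights, online_weights)
    ]


target = [0.0, 1.0]
online = [1.0, 1.0]
print(soft_update(target, online, tau=0.5))  # [0.5, 1.0]
```

A small tau makes the target network change slowly, which is the usual motivation for soft updates over periodic hard copies.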
🔨 Fixes:
- Various bug fixes: #29077, #28811, #28637, #27785, #28703, #28422, #28405, #28358, #27540, #28325, #28357, #28334, #27090, #28133, #27981, #27980, #26666, #27390, #27791, #27741, #27424, #27544, #27459, #27572, #27255, #27304, #26629, #28166, #27864, #28938, #28845, #28588, #28202, #28201, #27806
📖Documentation:
Ray Workflows
🔨 Fixes:
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- Ray OOM prevention feature alpha release! If your Ray jobs suffer from OOM issues, please give it a try.
- Support dynamic generators as task return values. (#29082 #28864 #28291)
💫Enhancements:
- Fix spread scheduling imbalance issues (#28804 #28551)
- Widen the range of allowed grpcio versions (#28623)
- Support encrypted redis connection. (#29109)
- Upgrade redis from 6.x to 7.0.5. (#28936)
- Batch ScheduleAndDispatchTasks calls (#28740)
🔨 Fixes:
- More robust spilled object deletion (#29014)
- Fix the initialization/destruction order between reference_counter_ and node change subscription (#29108)
- Suppress the logging error when python exits and actor not deleted (#27300)
- Mark run_function_on_all_workers as deprecated until we get rid of this (#29062)
- Remove unused args for default_worker.py (#28177)
- Don't include script directory in sys.path if it's started via python -m (#28140)
- Handling edge cases of max_cpu_fraction argument (#27035)
- Fix out-of-band deserialization of actor handle (#27700)
- Allow reuse of cluster address if Ray is not running (#27666)
- Fix an uncaught exception upon deallocation for actors (#27637)
- Support placement_group=None in PlacementGroupSchedulingStrategy (#27370)
📖Documentation:
- Ray 2.0 white paper is published.
- Revamp ray core docs (#29124 #29046 #28953 #28840 #28784 #28644 #28345 #28113 #27323 #27303)
- Fix cluster docs (#28056 #27062)
- CLI Reference Documentation Revamp (#27862)
Ray Clusters
💫Enhancements:
- Distinguish Kubernetes deployment stacks (#28490)
📖Documentation:
- State intent to remove legacy Ray Operator (#29178)
- Improve KubeRay migration notes (#28672)
- Add FAQ for cluster multi-tenancy support (#29279)
Dashboard
🎉 New Features:
- Time series metrics are now built into the dashboard
- Ray now exports some default configuration files which can be used for your Prometheus or Grafana instances. This includes default metrics which show common information important to your Ray application.
- New progress bar is shown in the job detail view. You can see how far along your Ray job is.
🔨 Fixes:
- Fix the Prometheus exporter producing a slightly incorrect format.
- Fix several performance issues and memory leaks
📖Documentation:
- Added additional documentation on the new time series and the metrics page
Many thanks to all those who contributed to this release!
@sihanwang41, @simon-mo, @avnishn, @MyeongKim, @markrogersjr, @christy, @xwjiang2010, @kouroshHakha, @zoltan-fedor, @wumuzi520, @alanwguo, @Yard1, @liuyang-my, @charlesjsun, @DevJake, @matteobettini, @jonathan-conder-sm, @mgerstgrasser, @guidj, @JiahaoYao, @Zyiqin-Miranda, @jvanheugten, @aallahyar, @SongGuyang, @clarng, @architkulkarni, @Rohan138, @heyitsmui, @mattip, @ArturNiederfahrenhorst, @maxpumperla, @vale981, @krfricke, @DmitriGekhtman, @amogkam, @richardliaw, @maldil, @zcin, @jianoaix, @cool-RR, @kira-lin, @gramhagen, @c21, @jiaodong, @sijieamoy, @tupui, @ericl, @anabranch, @se4ml, @suquark, @dmatrix, @jjyao, @clarkzinzow, @smorad, @rkooo567, @jovany-wang, @edoakes, @XiaodongLv, @klieret, @rozsasarpi, @scottsun94, @ijrsvt, @bveeramani, @chengscott, @jbedorf, @kevin85421, @nikitavemuri, @sven1977, @acxz, @stephanie-wang, @PaulFenton, @WangTaoTheTonic, @cadedaniel, @nthai, @wuisawesome, @rickyyx, @artemisart, @peytondmurray, @pingsutw, @olipinski, @davidxia, @stestagg, @yaxife, @scv119, @mwtian, @yuanchi2807, @ntlm1686, @shrekris-anyscale, @cassidylaidlaw, @gjoliver, @ckw017, @hakeemta, @ilee300a, @avivhaber, @matthewdeng, @afarid, @pcmoritz, @Chong-Li, @Catch-Bull, @justinvyu, @iycheng