Skip to content

Ray-2.2.0

Compare
Choose a tag to compare
@xwjiang2010 xwjiang2010 released this 13 Dec 19:56
· 1 commit to releases/2.2.0 since this release
840215b

Release Highlights

Ray 2.2 is a stability-focused release, featuring stability improvements across many Ray components.

  • Ray Jobs API is now GA. The Ray Jobs API allows you to submit locally developed applications to a remote Ray Cluster for execution. It simplifies the experience of packaging, deploying, and managing a Ray application.
  • Ray Dashboard has received a number of improvements, such as the ability to see cpu flame graphs of your Ray workers and new metrics for memory usage.
  • The Out-Of-Memory (OOM) Monitor is now enabled by default. This will increase the stability of memory-intensive applications on top of Ray.
  • [Ray Data] we’ve heard numerous users report that when files are too large, Ray Data can have out-of-memory or performance issues. In this release, we’re enabling dynamic block splitting by default, which will address the above issues by avoiding holding too much data in memory.

Ray Libraries

Ray AIR

🎉 New Features:

  • Add a NumPy first path for Torch and TensorFlow Predictors (#28917)

💫Enhancements:

  • Suppress "NumPy array is not writable" error in torch conversion (#29808)
  • Add node rank and local world size info to session (#29919)

🔨 Fixes:

  • Fix MLflow database integrity error (#29794)
  • Fix ResourceChangingScheduler dropping PlacementGroupFactory args (#30304)
  • Fix bug passing 'raise' to FailureConfig (#30814)
  • Fix reserved CPU warning if no CPUs are used (#30598)

📖Documentation:

  • Fix examples and docs to specify batch_format in BatchMapper (#30438)

🏗 Architecture refactoring:

  • Deprecate Wandb mixin (#29828)
  • Deprecate Checkpoint.to_object_ref and Checkpoint.from_object_ref (#30365)

Ray Data Processing

🎉 New Features:

  • Support all PyArrow versions released by Apache Arrow (#29993, #29999)
  • Add select_columns() to select a subset of columns (#29081)
  • Add write_tfrecords() to write TFRecord files (#29448)
  • Support MongoDB data source (#28550)
  • Enable dynamic block splitting by default (#30284)
  • Add from_torch() to create dataset from Torch dataset (#29588)
  • Add from_tf() to create dataset from TensorFlow dataset (#29591)
  • Allow to set batch_size in BatchMapper (#29193)
  • Support read/write from/to local node file system (#29565)

💫Enhancements:

  • Add include_paths in read_images() to return image file path (#30007)
  • Print out Dataset statistics automatically after execution (#29876)
  • Cast tensor extension type to opaque object dtype in to_pandas() and to_dask() (#29417)
  • Encode number of dimensions in variable-shaped tensor extension type (#29281)
  • Fuse AllToAllStage and OneToOneStage with compatible remote args (#29561)
  • Change read_tfrecords() output from Pandas to Arrow format (#30390)
  • Handle all Ray errors in task compute strategy (#30696)
  • Allow nested Chain preprocessors (#29706)
  • Warn user if missing columns and support str exclude in Concatenator (#29443)
  • Raise ValueError if preprocessor column doesn't exist (#29643)

🔨 Fixes:

  • Support custom resource with remote args for random_shuffle() (#29276)
  • Support custom resource with remote args for random_shuffle_each_window() (#29482)
  • Add PublicAPI annotation to preprocessors (#29434)
  • Tensor extension column concatenation fixes (#29479)
  • Fix iter_batches() to not return empty batch (#29638)
  • Change map_batches() to fetch input blocks on-demand (#29289)
  • Change take_all() to not accept limit argument (#29746)
  • Convert between block and batch correctly for map_groups() (#30172)
  • Fix stats() call causing Dataset schema to be unset (#29635)
  • Raise error when batch_format is not specified for BatchMapper (#30366)
  • Fix ndarray representation of single-element ragged tensor slices (#30514)

📖Documentation:

  • Improve map_batches() documentation about execution model and UDF pickle-ability requirement (#29233)
  • Improve to_tf() docstring (#29464)

Ray Train

🎉 New Features:

💫Enhancements:

  • Fast fail upon single worker failure (#29927)
  • Optimize checkpoint conversion logic (#29785)

🔨 Fixes:

  • Propagate DatasetContext to training workers (#29192)
  • Show correct error message on training failure (#29908)
  • Fix prepare_data_loader with enable_reproducibility (#30266)
  • Fix usage of NCCL_BLOCKING_WAIT (#29562)

📖Documentation:

  • Deduplicate Train examples (#29667)

🏗 Architecture refactoring:

  • Hard deprecate train.report (#29613)
  • Remove deprecated Train modules (#29960)
  • Deprecate old prepare_model DDP args #30364

Ray Tune

🎉 New Features:

  • Make Tuner.restore work with relative experiment paths (#30363)
  • Tuner.restore from a local directory that has moved (#29920)

💫Enhancements:

  • with_resources takes in a ScalingConfig (#30259)
  • Keep resource specifications when nesting with_resources in with_parameters (#29740)
  • Add trial_name_creator and trial_dirname_creator to TuneConfig (#30123)
  • Add option to not override the working directory (#29258)
  • Only convert a BaseTrainer to Trainable once in the Tuner (#30355)
  • Dynamically identify PyTorch Lightning Callback hooks (#30045)
  • Make remote_checkpoint_dir work with query strings (#30125)
  • Make cloud checkpointing retry configurable (#30111)
  • Sync experiment-checkpoints more often (#30187)
  • Update generate_id algorithm (#29900)

🔨 Fixes:

  • Catch SyncerCallback failure with dead node (#29438)
  • Do not warn in BayesOpt w/ Uniform sampler (#30350)
  • Fix ResourceChangingScheduler dropping PGF args (#30304)
  • Fix Jupyter output with Ray Client and Tuner (#29956)
  • Fix tests related to TUNE_ORIG_WORKING_DIR env variable (#30134)

📖Documentation:

  • Add user guide for analyzing results (using ResultGrid and Result) (#29072)
  • Tune checkpointing and Tuner restore docfix (#29411)
  • Fix and clean up PBT examples (#29060)
  • Fix TrialTerminationReporter in docs (#29254)

🏗 Architecture refactoring:

  • Remove hard deprecated SyncClient/Syncer (#30253)
  • Deprecate Wandb mixin, move to setup_wandb() function (#29828)

Ray Serve

🎉 New Features:

  • Guard for high latency requests (#29534)
  • Java API Support (blog)

💫Enhancements:

  • Serve K8s HA benchmarking (#30278)
  • Add method info for http metrics (#29918)

🔨 Fixes:

  • Fix log format error (#28760)
  • Inherit previous deployment num_replicas (29686)
  • Polish serve run deploy message (#29897)
  • Remove calling of get_event_loop from python 3.10

RLlib

🎉 New Features:

  • Fault tolerant, elastic WorkerSets: An asynchronous Ray Actor manager class is now used inside all of RLlib’s Algorithms, adding fully flexible fault tolerance to rollout workers and workers used for evaluation. If one or more workers (which are Ray actors) fails - e.g. due to a SPOT instance going down - the RLlib Algorithm will now flexibly wait it out and periodically try to recreate the failed workers. In the meantime, only the remaining healthy workers are used for sampling and evaluation. (#29938, #30118, #30334, #30252, #29703, #30183, #30327, #29953)

💫Enhancements:

  • RLlib CLI: A new and enhanced RLlib command line interface (CLI) has been added, allowing for automatically downloading example configuration files, python-based config files (defining an AlgorithmConfig object to use), better interoperability between training and evaluation runs, and many more. For a detailed overview of what has changed, check out the new CLI documentation. (#29204, #29459, #30526, #29661, #29972)
  • Checkpoint overhaul: Algorithm checkpoints and Policy checkpoints are now more cohesive and transparent. All checkpoints are now characterized by a directory (with files and maybe sub-directories), rather than a single pickle file; Both Algorithm and Policy classes now have a utility static method (from_checkpoint()) for directly instantiating instances from a checkpoint directory w/o knowing the original configuration used or any other information (having the checkpoint is sufficient). For a detailed overview, see here. (#28812, #29772, #29370, #29520, #29328)
  • A new metric for APPO/IMPALA/PPO has been added that measures off-policy’ness: The difference in number of grad-updates the sampler policy has received thus far vs the trained policy’s number of grad-updates thus far. (#29983)

🏗 Architecture refactoring:

🔨 Fixes:

📖Documentation:

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

💫Enhancements:

  • The Ray Jobs API has graduated from Beta to GA. This means Ray Jobs will maintain API backward compatibility.
  • Run Ray job entrypoint commands (“driver scripts”) on worker nodes by specifying entrypoint_num_cpus, entrypoint_num_gpus, or entrypoint_resources. (#28564, #28203)
  • (Beta) OpenAPI spec for Ray Jobs REST API (#30417)
  • Improved Ray health checking mechanism. The fix will reduce the frequency of GCS marking raylets fail mistakenly when it is overloaded. (#29346, #29442, #29389, #29924)

🔨 Fixes:

  • Various fixes for hanging / deadlocking (#29491, #29763, #30371, #30425)
  • Set OMP_NUM_THREADS to num_cpus required by task/actors by default (#30496)
  • set worker non recyclable if gpu is envolved by default (#30061)

📖Documentation:

  • General improvements of Ray Core docs, including design patterns and tasks.

Ray Clusters

💫Enhancements:

  • Stability improvements for Ray Autoscaler / KubeRay Operator integration. (#29933 , #30281, #30502)

Dashboard

🎉 New Features:

  • Additional improvements from the default metrics dashboard. We now have actor, placement group, and per component memory usage breakdown. You can see details from the doc.
  • New profiling feature using py-spy under the hood. You can click buttons to see stack trace or cpu flame graphs of your workers.
  • Autoscaler and job events are available from the dashboard. You can also access the same data using ray list cluster-events.

🔨 Fixes:

  • Stability improvements from the dashboard
  • Dashboard now works at large scale cluster! It is tested with 250 nodes and 10K+ actors (which matches the Ray scalability envelope).
    • Smarter api fetching logic. We now wait for the previous API to finish before sending a new API request when polling for new data.
    • Fix agent memory leak and high CPU usage.

💫Enhancements:

  • General improvements to the progress bar. You can now see progress bars for each task name if you drill into the job details.
  • More metadata is available in the jobs and actors tables.
  • There is now a feedback button embedded into the dashboard. Please submit any bug reports or suggestions!

Many thanks to all those who contributed to this release!

@shrekris-anyscale, @rickyyx, @scottjlee, @shogohida, @liuyang-my, @matthewdeng, @wjrforcyber, @linusbiostat, @clarkzinzow, @justinvyu, @zygi, @christy, @amogkam, @cool-RR, @jiaodong, @EvgeniiTitov, @jjyao, @ilee300a, @jianoaix, @rkooo567, @mattip, @maxpumperla, @ericl, @cadedaniel, @bveeramani, @rueian, @stephanie-wang, @lcipolina, @bparaj, @JoonHong-Kim, @avnishn, @tomsunelite, @larrylian, @alanwguo, @VishDev12, @c21, @dmatrix, @xwjiang2010, @thomasdesr, @tiangolo, @sokratisvas, @heyitsmui, @scv119, @pcmoritz, @bhavika, @yzs981130, @andraxin, @Chong-Li, @clarng, @acxz, @ckw017, @krfricke, @kouroshHakha, @sijieamoy, @iycheng, @gjoliver, @peytondmurray, @xcharleslin, @DmitriGekhtman, @andreichalapco, @vitrioil, @architkulkarni, @simon-mo, @ArturNiederfahrenhorst, @sihanwang41, @pabloem, @sven1977, @avivhaber, @wuisawesome, @jovany-wang, @Yard1